When we're Training on Power BI a frequent question we get is "I need to make changes to my data for reporting, where is the best place to do that, in the Data Model with DAX, or in Power Query?" The answer is, as ever "It depends". To answer the question properly, we need to understand where we can make data changes in Power BI, then I can introduce a maxim that explains the best place to make those changes.
There are a bewildering number of places we can make changes to our data for use in Power BI, we need to look at what and where these places are. Let's take a couple of scenarios, in one we have a SQL database and we want to join two tables together and report on the total values, and another where we have several CSV files we need to join together and report on as one large list of data.
Once we've imported the data into Power BI, we can use DAX to make these changes in our Data Model, here we can create relationships and use these to join the data elements together. Then we can use DAX to aggregate the values, add up the numbers and calculate the values we'll report on. In the Data Model our scenarios are pretty much the same, the data is just presented as tables in Power BI which DAX can manipulate with commands like "USERELATIONSHIP" and "LOOKUPVALUE".
Power Query Transformations
One alternative is to make these transforms in Power Query. In the SQL scenario we can "join" two tables together in Power Query and create a single table that may easier to work with. In the CSV files scenario we can connect to a folder of files with Power Query and concatenate them to present the files as one table in the Data Model.
Another place we could make these transforms is in a Dataflow, if you haven't looked at Dataflows yet, I'd encourage you to do so. A Dataflow is essentially Power Query Online, a Cloud-based tool that lets you transform your data, from SQL databases, Excel or CSV files etc., and store that changed data in the Cloud, this can then be then loaded into a Data Model pre-cleaned and ready to go. We can take the queries from Power Query for our SQL or CSV scenario above and create them as Dataflows which can then be re-used for many other Data Models.
Original Data Transformations
Of course, we don't have to make these transformations on Power BI, we could create a View in SQL to join our two tables or change the Database to hold them differently. In the CSV scenario we could get the data owner to provide one big table rather than keep them split as many smaller separate tables.
With many different ways to shape data, where is the best place to do it? Probably the finest answer to this question is given by Matthew Roche. He's a Program Manager on the Power BI team at Microsoft, with his own blog over at BI Polar – Business Intelligence, Data Governance, Mental Health, Diversity, Martial Arts, and Heavy Metal. (ssbipolar.com). Outside of Power BI Matthew is passionate about Historical European Martial Arts and a considerable swordsman. Maybe this is why his observations are so sharp and pointed. In a recent "Guy in a Cube" livestream session he made a brilliant comment, which I've started to call "Roche's Maxim".
Data transformations should be made "As far upstream as possible, as far downstream as necessary".
What does this mean? Well, "Upstream" means near to the original data, the database or data or files you're reporting on. "Downstream" is away from this data, nearer to the report or analysis you're creating from the data. What Roche's Maxim succinctly says is that you should make data transformation as near to the original data as you can. In our SQL example you could make the transforms in the database or as a SQL View. In our CSV example you could join the original files together at the start, before you load them into Power BI. But it recognises that we can't always make the transformation that far "upstream" near the original Data. Maybe we don't have access to the Database, or can't use SQL, because we don't know the language, or don't have access to create Views. Perhaps the CSV files arrive monthly, not all at once, or they are on a file system we can't write to. In these cases its necessary to make the transformations further "downstream" away from the original Data. This could be in a Dataflow, or in Power Query in Power BI desktop. Finally, if none of these options are available, we can make the changes with DAX. Its always going to be more efficient to make the transformations as near to the original data as possible.
Why is this important ?
Knowing where to make the transforms is important because these changes all have a cost. It may be a cost in time, or in computing power. Making lots of changes through DAX will take time to execute and may slow your report, and that will happen whenever the report is viewed. Making the changes in Power Query moves the time cost to whenever data refresh happens, which may be less frequently than when the report is viewed. Moving the Power Query changes to a Dataflow, both allows the Query to be re-used multiple times, but also moves the time cost to when the Dataflow is refreshed, which may be different from when the Data Model is refreshed. Finally making changes in the Database itself as a database modification or a View puts the time cost to the Database server (which may be the best place for it) making Reports and Data Model refresh more efficient and hence faster. But the thing is, there's always a cost, so here's my corollary to Roche's Maxim..
You've got to pay the cost of data transformations; your only choice is when and where you pay that cost.
When you're designing your Power BI reports, or more often, when you're trying to work our why your Report or data refresh is slow, remember Roche's Maxim, and maybe think about moving your data transforms further "upstream".
Burningsuit provide Training and Consultancy in Power BI. We've helped many companies implement, use and learn Power BI for business benefit. Contact us to see how we can help you.