A SAQL Cogrouping Hack
Some of my blogs are inspired by questions or use cases that I encounter when working with Tableau CRM. In the past month, I’ve had the same question around joining two datasets in a query; one dataset is grouped by two dimensions, but the other dataset is missing one of those dimensions, hence the question becomes “how can you do a cogrouping when one dataset is missing one of the dimensions to group and join on?”. In this blog, I will try to give you a SAQL workaround or hack to accomplish this.
Hold on, wait a minute!
Before we get deep into the SAQL, let’s just get a few things straight.
When we join two datasets in SAQL we are joining on the aggregates. In other words, we first decide on a grouping and use that in both datasets thereafter we join using cogrouping, and hence we will lose the detail or grain of the data. I’ve explained this before in my simply SAQL blog on cogrouping, so if you want more information on this, please have a look at that blog.
But why is this relevant? Well, in short, you might very well be better off joining your dataset in the data layer instead of the design layer. That way you don’t need to do any cogrouping or SAQL because your data will already be joined on the grain level. Using data prep this is actually really simple to do:
- add both datasets or sources as an input node,
- join the two nodes using lookup, left, right, inner or outer join,
- add an output node and create a new dataset.
What are we trying to achieve?
Alright, if you must do it in SAQL, let’s have a look at that workaround. For illustration, I’ve created two sample datasets. One dataset is called Fruit and contains a dimension Fruit and a measure Number. The second dataset is called FruitColor and has two dimensions Fruit and Color and a measure Number.
The idea is to group by Fruit and Color and join the two tables showing numbers from both datasets. Normally we would:
- load both datasets
- cogroup on Fruit and Color
- project groupings and measures
It would look something like this:
q1 = load "FruitColor";
q2 = load "Fruit";
q = cogroup q1 by ('Fruit', 'Color') full, q2 by ('Fruit', 'Color');
q = foreach q generate coalesce(q1.'Fruit', q2.'Fruit') as 'Fruit', coalesce(q1.'Color', q2.'Color') as 'Color', sum(q1.'Number') as 'sum_Numberq1', sum(q2.'Number') as 'sum_Numberq2';
q = order q by 'Fruit' asc;
q = limit q 2000;
However, the Fruit dataset is as mentioned missing the dimension Color, making it impossible to use the above approach.
The workaround
The approach I will take is for the Fruit dataset to generate a default value and call it Color before I join the two datasets to account for the missing dimension. The steps would be:
- load the first dataset, group and project it with the two dimensions and measure,
- load the second dataset, group it on the one common dimension, project the dimension and measure plus generate the missing dimension,
- cogroup and project dimensions and measures.
Let’s have a look at the fruit example.
--I'll make comments in the query as it progresses by starting with two dashes as done here.
First, we will load the dataset FruitColor, then group by Fruit and Color and finally project the dimensions and take the sum of Number.
--Stream q1
q1 = load "FruitColor";
q1 = group q1 by ('Fruit', 'Color');
q1 = foreach q1 generate q1.'Fruit' as 'Fruit', q1.'Color' as 'Color', sum(q1.'Number') as 'sum_Number';
We will now introduce our second stream and call it q2 using the Fruit dataset. Let’s start with just grouping and projecting the Fruit dimension but else do the same as before.
--Stream q1
q1 = load "FruitColor";
q1 = group q1 by ('Fruit', 'Color');
q1 = foreach q1 generate q1.'Fruit' as 'Fruit', q1.'Color' as 'Color', sum(q1.'Number') as 'sum_Number';
--Stream q2
q2 = load "Fruit";
q2 = group q2 by 'Fruit';
q2 = foreach q2 generate q2.'Fruit' as 'Fruit', sum(q2.'Number') as 'sum_Number';
In the q2 stream, we are missing the Color dimension, so we will create the dimension for each fruit and give it the value None. This is easily done in the foreach statement as seen below.
q2 = foreach q2 generate q2.'Fruit' as 'Fruit', "None" as 'Color', sum(q2.'Number') as 'sum_Number';
You will now have a query that looks similar to this:
--Stream q1
q1 = load "FruitColor";
q1 = group q1 by ('Fruit', 'Color');
q1 = foreach q1 generate q1.'Fruit' as 'Fruit', q1.'Color' as 'Color', sum(q1.'Number') as 'sum_Number';
--Stream q2: adding the missing dimension
q2 = load "Fruit";
q2 = group q2 by 'Fruit';
q2 = foreach q2 generate q2.'Fruit' as 'Fruit', "None" as 'Color', sum(q2.'Number') as 'sum_Number';
We can now add our cogrouping on Fruit and Color from both streams. In the example we are doing a full outer join.
--Stream q1
q1 = load "FruitColor";
q1 = group q1 by ('Fruit', 'Color');
q1 = foreach q1 generate q1.'Fruit' as 'Fruit', q1.'Color' as 'Color', sum(q1.'Number') as 'sum_Number';
--Stream q2: adding the missing dimension
q2 = load "Fruit";
q2 = group q2 by 'Fruit';
q2 = foreach q2 generate q2.'Fruit' as 'Fruit', "None" as 'Color', sum(q2.'Number') as 'sum_Number';
--Joining q1 and q2
q = cogroup q1 by ('Fruit', 'Color') full, q2 by ('Fruit', 'Color');
Next, we will project the dimensions and measures as done below. To avoid null values in dimensions and duplicate columns we are using coalesce so we use the q1 value as the default value and if null the q2 value is used. Note that you cannot have two measures that are called the same, so below you’ll see NumberQ1 and NumberQ2, Q1 and Q2 representing the stream the measure is originating from.
--Stream q1
q1 = load "FruitColor";
q1 = group q1 by ('Fruit', 'Color');
q1 = foreach q1 generate q1.'Fruit' as 'Fruit', q1.'Color' as 'Color', sum(q1.'Number') as 'sum_Number';
--Stream q2: adding the missing dimension
q2 = load "Fruit";
q2 = group q2 by 'Fruit';
q2 = foreach q2 generate q2.'Fruit' as 'Fruit', "None" as 'Color', sum(q2.'Number') as 'sum_Number';
--Joining q1 and q2
q = cogroup q1 by ('Fruit', 'Color') full, q2 by ('Fruit', 'Color');
q = foreach q generate coalesce(q1.'Fruit', q2.'Fruit') as 'Fruit', coalesce(q1.'Color', q2.'Color') as 'Color', sum(q1.'sum_Number') as 'NumberQ1', sum(q2.'sum_Number') as 'NumberQ2';
Since the Number measure, in this case, symbolize the same, we can add the two values together as one number. Below you can see how this can be done.
sum(q1.'sum_Number') + sum(q2.'sum_Number') as 'Number'
But since we are having null values we do need to take those into consideration, which we can do with a case statement.
case when sum(q1.'sum_Number') is null then 0 else sum(q1.'sum_Number') end + case when sum(q2.'sum_Number') is null then 0 else sum(q2.'sum_Number') end as 'Number'
The query should now look similar to this:
--Stream q1
q1 = load "FruitColor";
q1 = group q1 by ('Fruit', 'Color');
q1 = foreach q1 generate q1.'Fruit' as 'Fruit', q1.'Color' as 'Color', sum(q1.'Number') as 'sum_Number';
--Stream q2: adding the missing dimension
q2 = load "Fruit";
q2 = group q2 by 'Fruit';
q2 = foreach q2 generate q2.'Fruit' as 'Fruit', "None" as 'Color', sum(q2.'Number') as 'sum_Number';
--Joining q1 and q2
q = cogroup q1 by ('Fruit', 'Color') full, q2 by ('Fruit', 'Color');
q = foreach q generate coalesce(q1.'Fruit', q2.'Fruit') as 'Fruit', coalesce(q1.'Color', q2.'Color') as 'Color', case when sum(q1.'sum_Number') is null then 0 else sum(q1.'sum_Number') end + case when sum(q2.'sum_Number') is null then 0 else sum(q2.'sum_Number') end as 'Number';
Now we just need to add our order and limit statement and the final query looks like this:
--Stream q1
q1 = load "FruitColor";
q1 = group q1 by ('Fruit', 'Color');
q1 = foreach q1 generate q1.'Fruit' as 'Fruit', q1.'Color' as 'Color', sum(q1.'Number') as 'sum_Number';
--Stream q2: adding the missing dimension
q2 = load "Fruit";
q2 = group q2 by 'Fruit';
q2 = foreach q2 generate q2.'Fruit' as 'Fruit', "None" as 'Color', sum(q2.'Number') as 'sum_Number';
--Joining q1 and q2
q = cogroup q1 by ('Fruit', 'Color') full, q2 by ('Fruit', 'Color');
q = foreach q generate coalesce(q1.'Fruit', q2.'Fruit') as 'Fruit', coalesce(q1.'Color', q2.'Color') as 'Color', case when sum(q1.'sum_Number') is null then 0 else sum(q1.'sum_Number') end + case when sum(q2.'sum_Number') is null then 0 else sum(q2.'sum_Number') end as 'Number';
--Order and limit results
q = order q by 'Fruit' asc;
q = limit q 2000;
That’s that… We can now run your query and switch to a chart to see the result.
OHHHH SNAPPPPP!!!! This is awesome!!!
Thanks as always!!!
Awesome workaround very helpful