SAQL simply explained – Part 4
So far in this SAQL series we have covered some basic features, but SAQL can be a lot more powerful. Let's say you have two different datasets you want to use in a single query; it could be cases and opportunities for an account. With SAQL we can leverage co-grouping to join datasets, or streams, in a single query. So let's have a look at this function.
Co-grouping or data blending
Imagine you want to look at how many opportunities and cases an account has in the same chart in Einstein Analytics. Generally, you would not have this information in a single dataset, as opportunities and cases are not related to each other except via their relation to the account object. Yes, with your dataflow skills you would be able to create such a dataset, but for the purpose of illustrating how co-grouping works we will assume you have two datasets, opportunities and cases, both generated from the Sales Analytics Templated App.
But before we get to the actual example, let me explain what co-grouping does. Looking at the two datasets mentioned before, we have two tables, one for opportunities and one for cases, with one common field: the Account Name. Using co-grouping we can group each table by the Account Name and join the two tables together; check out the illustration below.
The illustration uses the expression 'full', which is the same as an outer join. It makes sense in this case since we are interested in all opportunities and cases, and we cannot be sure that all accounts have both opportunities and cases; hence an inner, left or right join wouldn't work for this purpose.
Personally, when I started working with SQL (which in some ways is similar to SAQL) and SAQL, I found it a little hard to grasp the concept of joins. Back then, the illustration below helped me understand what happens when joining two tables.
Depending on what you are aiming to achieve with your co-grouping, you can use the expressions inner, full, left and right; more about this later in this blog. However, it is important to note that co-grouping is not a true join as you may know it from SQL. In SAQL we group each result first and then join the results; in other words, we do an aggregated join, hence we lose the grain level, unlike with a true join.
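To give you an idea of the shape we are working towards, below is a bare-bones sketch of a co-grouped query. The dataset and field names are placeholders; we will swap in real ones in the walkthrough that follows.
-- load each dataset into its own stream (names are placeholders)
a = load "datasetA";
b = load "datasetB";
-- group both streams on the shared field and join them with a full (outer) join
c = cogroup a by 'Key.Field' full, b by 'Key.Field';
-- project the grouping plus one aggregate per stream
c = foreach c generate a.'Key.Field' as 'Key.Field', count('a') as 'Count A', count('b') as 'Count B';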
If you are still unsure what the different types of joins mean, I can recommend using the recipe feature in the data manager to illustrate the concept. Simply create a new recipe, choose a left table (opportunities) and add a right table (cases); when you choose the different joins, you will see a preview of the data. In the gif below you can see a very basic example using the same data as in the sample tables above.
Let’s try this co-grouping
When it comes to SAQL I always let the tool do as much as possible, hence I manually write as little SAQL as possible. So let's first create a query with the opportunities dataset from the Sales Analytics app, with 'Count of Rows' grouped by 'Account Name'.
This is as much as we can do in the UI, so we need to switch to SAQL by clicking on the 'Query Mode' button in the top right corner of the explorer. You should now see something like below.
1 q = load "opportunity";
2 q = group q by 'Account.Name';
3 q = foreach q generate 'Account.Name' as 'Account.Name', count() as 'count';
4 q = order q by 'Account.Name' asc;
5 q = limit q 2000;
The first thing we need to do is add in our second dataset ‘cases’. This dataset can also be found in the Sales Analytics App. To do this click on the ‘Add New Dataset’ link just above the query and find the dataset. When you find it click on it and you’ll see a new line 2.
In SAQL it looks like below. Notice line 2, which can be hard to see in the gif above.
1 q = load "opportunity";
2 Case = load "case";
3 q = group q by 'Account.Name';
4 q = foreach q generate 'Account.Name' as 'Account.Name', count() as 'count';
5 q = order q by 'Account.Name' asc;
6 q = limit q 2000;
Alright, we have the additional load statement (line 2), but we are not yet calling any fields from the dataset. As described in the previous blogs in this series, we need a 'group' and a 'foreach' statement to define our grouping and measure. Remember, the goal is to join the two datasets together based on the 'Account Name'. In order to do this, we will use a 'cogroup' statement instead of the current 'group' statement.
Looking at my ‘cases’ dataset, I also have a field called ‘Account Name’ that I can use for my grouping. So I can change my line 3 to the following:
q = cogroup q by 'Account.Name' full, Case by 'Account.Name';
As you can see in the example, I have added two things: the expression 'full', defining the type of join I want to use, and the field 'Account.Name' I want to group by, including which stream or dataset that field can be found in. Remember, in the section above we discussed why it makes sense to use a full or outer join: we cannot be sure that every account has both one or more opportunities and one or more cases. Also, notice that 'Account.Name' from the 'q' stream (opportunities) is the left table and 'Account.Name' from the 'Case' stream (cases) is the right table.
With the changes to line 3 you should now have a SAQL query like below, but don’t hit ‘Run Query’ yet, it won’t work.
1 q = load "opportunity";
2 Case = load "case";
3 q = cogroup q by 'Account.Name' full, Case by 'Account.Name';
4 q = foreach q generate 'Account.Name' as 'Account.Name', count() as 'count';
5 q = order q by 'Account.Name' asc;
6 q = limit q 2000;
As mentioned earlier in this blog series, it is not enough to group by a field; we also need to project the fields we want to use in the chart using the foreach statement. Let's have a look at how we can do that with our example.
In the ‘foreach’ statement we need to do two things:
- Add the ‘Account.Name’ and ‘Count of Rows’ from the ‘Case’ dataset.
- Define from which stream or dataset the fields can be found.
The ‘foreach’ statement in line 4 should look something like below.
q = foreach q generate q.'Account.Name' as 'Account.Name', Case.'Account.Name' as 'Case.Account.Name', count('q') as 'Oppty', count('Case') as 'Case';
So what does it all mean? Well, I've kept my original 'Account.Name', but we now have to define which dataset or stream to grab the field from; to do this you add the stream name just before calling the field. Below you can see that we are referencing the 'q' stream, which as you know is opportunities.
q.'Account.Name' as 'Account.Name'
We need to do the same with the measure, but instead of putting the stream name before the field, it is defined inside the aggregate function, in this case count(). I've also renamed the field from 'count' to 'Oppty'. This is not strictly necessary; however, when I call the 'Count of Rows' from the 'Case' dataset I cannot call that one 'count' as well, since I cannot have two fields with the same name. Also, when you have two 'Count of Rows' measures, you want the labels to clearly indicate what each one is counting. At the end of the day, your original count() should reference your opportunities like below.
count('q') as 'Oppty'
Note that if you had another aggregation than count(), you would add the stream name just before the field, for example sum(q.'Amount').
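For example, if you wanted to sum the opportunity amount instead of counting rows, the foreach could look like the sketch below. This assumes your opportunities dataset has an 'Amount' field, so treat the field name as illustrative.
-- hypothetical variant: sum the 'Amount' measure from the opportunity stream instead of counting its rows
q = foreach q generate q.'Account.Name' as 'Account.Name', Case.'Account.Name' as 'Case.Account.Name', sum(q.'Amount') as 'Sum of Amount', count('Case') as 'Case';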
It's time to add the grouping from the 'Case' dataset, which is defined in the stream called 'Case'; hence the projection of 'Account.Name' looks like below. Again, remember that we cannot have two fields with the same name, so I have changed the name to 'Case.Account.Name'.
Case.'Account.Name' as 'Case.Account.Name'
And the measure looks like this referencing the ‘Case’ stream:
count('Case') as 'Case'
The whole SAQL query should look like below with the new changes to line 4. You can go ahead and click ‘Run Query’ to see the result.
1 q = load "opportunity";
2 Case = load "case";
3 q = cogroup q by 'Account.Name' full, Case by 'Account.Name';
4 q = foreach q generate q.'Account.Name' as 'Account.Name', Case.'Account.Name' as 'Case.Account.Name', count('q') as 'Oppty', count('Case') as 'Case';
5 q = order q by 'Account.Name' asc;
6 q = limit q 2000;
If you run the query, you will notice that you have two columns containing the Account Name. If you look further (and scroll down), you will see missing values in both columns. Remember, an account may or may not have an opportunity, just as it may or may not have a case; whenever one of them is missing, the account name from that stream will be blank.
Adding a bit of coalesce
Now, I don't want to have the same column in there twice, nor am I interested in relying on the Account Name from a single stream/dataset, as it will result in some missing Account Names. To fix this I can use a simple coalesce() function, which allows us to take the Account Name value from opportunities and, if it is null, take the one from cases instead.
The syntax of the coalesce() function can be seen below. You will notice that inside the parentheses we have the API names of the fields to use in prioritized order.
coalesce('first_value', 'second_value', 'third_value', ...)
The fix is done in line 4, which you can see below. I still have to give my field a label, so after my coalesce function I call it ‘Account.Name’.
q = foreach q generate coalesce(q.'Account.Name', Case.'Account.Name') as 'Account.Name', count('q') as 'Oppty', count('Case') as 'Case';
The whole SAQL query should now look like below.
1 q = load "opportunity";
2 Case = load "case";
3 q = cogroup q by 'Account.Name' full, Case by 'Account.Name';
4 q = foreach q generate coalesce(q.'Account.Name', Case.'Account.Name') as 'Account.Name', count('q') as 'Oppty', count('Case') as 'Case';
5 q = order q by 'Account.Name' asc;
6 q = limit q 2000;
Switching back to the chart mode you should have a similar chart as below showing how many opportunities and cases each account has.
Note that from Spring '20 it is possible to do co-grouping, aka data blending, in the UI. You can read more about it in the blog Co-grouping with clicks not code.
Other things to know about co-grouping
With the above example, hopefully you can see how you can use co-grouping in your SAQL queries. There are many other use cases, but this example should allow you to understand the principles and apply them to your own scenario. I do, however, want to leave you with a few additional notes that are worth knowing.
Group vs. cogroup
In this blog we have specifically used 'cogroup' as the statement for joining our datasets. However, you can also just use 'group', as 'cogroup' and 'group' are interchangeable. Hence the two statements below will give the same result.
q = cogroup q by 'Account.Name' full, Case by 'Account.Name';
q = group q by 'Account.Name' full, Case by 'Account.Name';
Co-group expressions
Earlier I promised we would get back to the co-group expressions. The example we went through in this blog used a full or outer join as we wanted all the values from opportunities and cases. But we have other options; we can do left, right, inner and full joins (remember the illustration in the beginning).
When you use the 'cogroup' statement, note that the first stream and field you call will be your left table and the second will be your right table. Where we used 'full' in our example, you can replace it with 'left', 'right' or 'inner' as shown below.
q = cogroup q by 'Account.Name' left, Case by 'Account.Name';
q = cogroup q by 'Account.Name' right, Case by 'Account.Name';
q = cogroup q by 'Account.Name' inner, Case by 'Account.Name';
q = cogroup q by 'Account.Name' full, Case by 'Account.Name';
However, you don’t actually have to use ‘inner’ as leaving out the expression is the same as an ‘inner’ join, consequently, the two examples below produce the same result.
q = cogroup q by 'Account.Name' inner, Case by 'Account.Name';
q = cogroup q by 'Account.Name', Case by 'Account.Name';
Streams
First of all, you can have multiple streams/datasets in your query; it doesn't have to be just two as in the example above. In fact, you can have as many as makes sense, just remember to load each dataset. When you create your cogroup statement, you add the additional groupings separated by a comma, and don't forget the expression (left, right or full). Below you can see an example where I have included opportunities, cases and activities for an account.
1 q1 = load "OpportunitySample";
2 q2 = load "CaseSample";
3 q3 = load "ActivitySample";
4 result = cogroup q1 by 'Account_Name' full, q2 by 'Account_Name' full, q3 by 'Account_Name';
5 result = foreach result generate coalesce(q1.'Account_Name', q2.'Account_Name', q3.'Account_Name') as 'Account_Name', count('q1') as 'Oppty', count('q2') as 'Case', count('q3') as 'Activity';
The example we have gone through in this blog uses two different datasets, but you may have use cases where you want to apply co-grouping to a single dataset. If that is relevant for you, be aware that you cannot call the same stream twice in your cogroup statement; the example below illustrates what this mistake would look like.
q = cogroup q by 'Account.Name' full, q by 'Account.Name';
Instead, you need to load the same dataset twice creating two streams.
1 q = load "opportunity";
2 q1 = load "opportunity";
3 q = cogroup q by 'Account.Name' full, q1 by 'Account.Name';
4 ...
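From there the query follows the same pattern as before. As a minimal sketch, you could filter one of the streams and compare it to the full dataset; the 'StageName' filter below is purely illustrative and assumes such a field exists in your opportunity dataset.
-- load the same dataset twice to get two independent streams
q = load "opportunity";
q1 = load "opportunity";
-- illustrative only: narrow one stream down, assuming a 'StageName' field exists
q1 = filter q1 by 'StageName' == "Closed Won";
-- co-group the two streams on the common field and project one count per stream
q = cogroup q by 'Account.Name' full, q1 by 'Account.Name';
q = foreach q generate coalesce(q.'Account.Name', q1.'Account.Name') as 'Account.Name', count('q') as 'All Opptys', count('q1') as 'Won Opptys';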
Multiple datasets
As you have seen in the example used in this blog, it is possible to use two or more datasets in a query. This is really powerful, but there are a few things to be aware of. The first is that the first dataset called is the primary dataset; hence it controls faceting and XMD. The second is that in order for your query to work, both datasets need to be referenced in the datasets section of the dashboard JSON. If you write your load statement manually rather than adding the dataset via the button as we did earlier in the example, the dataset will not automatically be added to the datasets section of the JSON, and if you do not have other steps (queries) that use this dataset as their primary dataset, you will get an error when running the query.
Comments
Hi, this is a great place to learn SAQL. I have an issue figuring out the code that calculates an average of a distinct count. Any hack to solve this?
Business requirement: To display Average number of policies (a dimension field) not grouped by any category. Just average number of policies as a single row.
Soooo… we get this type of "aggregate over an entire result set" question all the time, but unfortunately it's not in our compare table aggregate functions yet. Here's how you can do it with your dataset. The key to doing any type of aggregate calculation across every row is to use a windowing function with partitioning. Here's the SAQL you'd use in the Query Editor. I'm getting a distinct count of policies with the unique('PolicyId') part. Since I'm already grouped by the account name, the avg() of that count over a partition of all gives you what you're looking for. HTH.
policy = load "policiesdataset";
policy = group policy by 'Account.Name';
policy = foreach policy generate 'Account.Name' as 'Account.Name', avg(unique('PolicyId')) over ([..] partition by all) as 'Avg of distinct count of policies';
policy = order policy by ('Account.Name' asc);
How can we get dimensions from the Case stream, and not just aggregations of measures?
Were you able to get dimensions data without aggregates?
This blog is my go-to place for SAQL and EA for learning new things 🙂
I ran into a problem and can’t come up with a solution… Cogroup with mismatched number of columns?
For example, I have an activity table that is grouped by 5 dimensions (month (text), user_id, manager_id, dept_id, role_id). I have another stream that generates months in role from a hierarchy dataset that has only user_id and role_id.
How do I display months in role in the first table? Basically, I need to do a "vlookup" by user_id and role_id…