A SAQL use case with dates
Lately, I’ve been tasked to build some new dashboards, getting my hands dirty so to speak. I thought I would share one of the use cases I’ve worked on as it involves SAQL and dates. It’s often something I get questions on, so I figured let me provide this example and hopefully, it may give some support or inspiration to an approach in using dates with SAQL.
Let’s say you need to create a chart reporting new and existing records per month but you may not have any new records each month, yet you want your chart to not have gaps in the dates. I’ll use my blogs as an example. While I strive to post something new every month, sometimes I don’t have time, or simply I don’t have a use case to cover, how can I make sure I have a bar for that month even if I didn’t publish anything? Also, how can I calculate how many blogs I’ve already published prior to that month? This is the use case I will try to cover in this blog – you can see the end result in the image below.
Getting the basic query from the UI
Before I go to SAQL I always try to get as much done via the UI, it serves two things:
- make sure I’m not using SAQL when I can use the UI, thus making it easier to maintain after,
- and avoid writing everything from scratch and having to remember the right syntax for every part of my query.
I’ll go ahead and add my filters and group by published date (Y-M).
Next, I’m going to create a running total formula to see what is the total number of blogs. I’ll use the compare table/formula function to do this.
Let the SAQL begin
Now it’s time to switch to SAQL. You may be thinking why? Well, the first thing I want to do is use the fill statement to fill in missing dates, we often use this in connection with timeseries. But I also want to include a fake start and end date this will be helpful when we add filters and when I haven’t published anything in the current month. Plus later on, we need to create a calculation using a derived field. All of these things are not possible to do in the UI.
My starting SAQL looks like the following.
q = load "Blogs";
q = filter q by 'Status' == "publish";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate q.'Date_Year' as 'Date_Year', q.'Date_Month' as 'Date_Month', count(q) as 'A';
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year', 'Date_Month', first(A) as 'A', sum(sum(A)) over ([..0] partition by all order by ('Date_Year','Date_Month')) as 'Total';
q = order q by ('Date_Year' asc, 'Date_Month' asc);
q = limit q 2000;
Step 1: Create a fake start and end date
The first thing we need to do is to create three foreach
statements aka streams (see query below):
- The UI provided us with the first stream and it is looking at all the dates from the dataset – note I renamed the stream to q1 (line 3).
- The second stream is a dynamic date that always looks at the current month (line 7). I do this by using the function
date_to_string()
which allows me to convert a date part into a string. The date part I use is dynamic as I always want to use the current month since there is a chance I have yet to publish something that month, thus I use the functionnow()
. As for the measure, I just want to add the value 0 as it illustrated nothing has been published. Make sure to use the exact same API names and structure as the previousforeach
statement (line 3). - The third stream is static and provides a starting date we want to use (line 10). It’s debatable if you need to use this, but if your chart should show the past 12 months but your first date is 6 months ago, you may still want to show the previous 6 months. To get your static date you need to put the year and month value in double quotes and reference them as the API names we used in q1 (line 3). Again, as with the second stream, we want to include the value 0 as a measure.
Finally, it’s time to append the three streams (q1, q2, and q3) by using the union
statement (line 13).
q = load "Blogs";
q = filter q by 'Status' == "publish";
q = group q by ('Date_Year', 'Date_Month');
q1 = foreach q generate q.'Date_Year' as 'Date_Year', q.'Date_Month' as 'Date_Month', count(q) as 'A';
--create a dynamic end date
q2 = foreach q generate date_to_string(now(),"yyyy") as 'Date_Year', date_to_string(now(),"MM") as 'Date_Month', 0 as 'A';
--create a static start date
q3 = foreach q generate "2016" as 'Date_Year', "01" as 'Date_Month', 0 as 'A';
--append all 3 date streams
u1 = union q1, q2, q3;
Step 2: Fill in missing dates
Now we are ready to fill in the missing dates from our start (static) date to our last (current month) date. Here we can use the fill
statement that is often used in connection with timeseries. In line 16, you will see how this is used. The dateCols()
need to match the date combination you are looking to use in your grouping, which in this case is Y-M.
Note: See the different date combinations available and how to use it with a partition in the SAQL developer guide.
When we are using the fill
statement, it will look to insert a row for each missing date, however, the measure will just be a null value. To account for this we can use a case
statement to say when 'A' is null
then use the value 0
else use the value in 'A'
. You can find this in line 19 below.
q = load "Blogs";
q = filter q by 'Status' == "publish";
q = group q by ('Date_Year', 'Date_Month');
q1 = foreach q generate q.'Date_Year' as 'Date_Year', q.'Date_Month' as 'Date_Month', count(q) as 'A';
--create a dynamic end date
q2 = foreach q generate date_to_string(now(),"yyyy") as 'Date_Year', date_to_string(now(),"MM") as 'Date_Month', 0 as 'A';
--create a static start date
q3 = foreach q generate "2016" as 'Date_Year', "01" as 'Date_Month', 0 as 'A';
--append all 3 date streams
u1 = union q1, q2, q3;
--fill in missing dates
f = fill u1 by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
--Account for null values in dates that are missing
f = foreach f generate 'Date_Year', 'Date_Month', case when 'A' is null then 0 else 'A' end as 'A';
Step 3: Running total and new type dimension
The next part of the query is referenced as c (line 22 and 23). This part originates from what we created through the UI, only I’ve renamed the streams. The foreach
statement in line 23 has the date (Y-M), the number of blogs posted in a given month, and a windowing function calculating the running total. Thus we get two measures as an output.
Now I want to be able to show this in a stacked bar chart where I in one bar can see how many existing blogs I have and how many new ones were published in that month. Thus I need to convert those two measures into a dimension and also calculate the existing value by subtracting new aka 'A'
from running total aka 'Total'
.
In line 26, we are projecting the date parts followed by a static value "New"
that is referenced as 'Type'
and finally the 'Number of Blogs'
measure earlier referenced as 'A'
.
In line 27, we are doing the same thing, however, here we want the static value to be "Existing"
and subtract 'A'
from 'Total'
to get the blogs written in previous months.
Next, we want to append c1
and c2
using union in line 30. Following that in line 31, we are grouping by our date parts and also the new dimension 'Type'
.
q = load "Blogs";
q = filter q by 'Status' == "publish";
q = group q by ('Date_Year', 'Date_Month');
q1 = foreach q generate q.'Date_Year' as 'Date_Year', q.'Date_Month' as 'Date_Month', count(q) as 'A';
--create a dynamic end date
q2 = foreach q generate date_to_string(now(),"yyyy") as 'Date_Year', date_to_string(now(),"MM") as 'Date_Month', 0 as 'A';
--create a static start date
q3 = foreach q generate "2016" as 'Date_Year', "01" as 'Date_Month', 0 as 'A';
--append all 3 date streams
u1 = union q1, q2, q3;
--fill in missing dates
f = fill u1 by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
--Account for null values in dates that are missing
f = foreach f generate 'Date_Year', 'Date_Month', case when 'A' is null then 0 else 'A' end as 'A';
--Create running total by Y-M
c = group f by ('Date_Year', 'Date_Month');
c = foreach c generate 'Date_Year', 'Date_Month', first(A) as 'A', sum(sum(A)) over ([..0] partition by all order by ('Date_Year','Date_Month')) as 'Total';
--Convert the two measures into two dimensions for stacked bar chart and calculate existing number of blogs
c1 = foreach c generate 'Date_Year', 'Date_Month', "New" as 'Type', 'A' as 'Number of Blogs';
c2 = foreach c generate 'Date_Year', 'Date_Month', "Existing" as 'Type', 'Total' - 'A' as 'Number of Blogs';
--Append the new and existing blog streams
u2 = union c1, c2;
u2 = group u2 by ('Date_Year', 'Date_Month', 'Type');
Step 4: Add a post-projection filter
Now I have been blogging since 2016, which will be a very long chart if we include all months, so I say let’s just show the last 12 months. Normally we have filters right after the load
statement, however, that means we will exclude blogs written prior to the date defined in the filter, and the running total and thereby existing blog value would be significantly lower (and wrong). This is why want to add this filter after the running total calculation also known as post-projection.
When we add a date filter, we need to use the full date (Y-M-D), and combine all the date parts in the date()
function – for example: date('Date_Year', 'Date_Month', 'Date_Month')
. However, we grouped by Y-M, so we need to add a static value for the day.
In line 34, we are projecting the year and the month, adding the static value "01"
as 'Date_Day'
followed by the 'Type'
and 'Number of Blogs'
. We are using first()
as we already summed the date in a previous step.
Once we have this we can add our filter as done in line 37.
q = load "Blogs";
q = filter q by 'Status' == "publish";
q = group q by ('Date_Year', 'Date_Month');
q1 = foreach q generate q.'Date_Year' as 'Date_Year', q.'Date_Month' as 'Date_Month', count(q) as 'A';
--create a dynamic end date
q2 = foreach q generate date_to_string(now(),"yyyy") as 'Date_Year', date_to_string(now(),"MM") as 'Date_Month', 0 as 'A';
--create a static start date
q3 = foreach q generate "2016" as 'Date_Year', "01" as 'Date_Month', 0 as 'A';
--append all 3 date streams
u1 = union q1, q2, q3;
--fill in missing dates
f = fill u1 by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
--Account for null values in dates that are missing
f = foreach f generate 'Date_Year', 'Date_Month', case when 'A' is null then 0 else 'A' end as 'A';
--Create running total by Y-M
c = group f by ('Date_Year', 'Date_Month');
c = foreach c generate 'Date_Year', 'Date_Month', first(A) as 'A', sum(sum(A)) over ([..0] partition by all order by ('Date_Year','Date_Month')) as 'Total';
--Convert the two measures into two dimensions for stacked bar chart and calculate existing number of blogs
c1 = foreach c generate 'Date_Year', 'Date_Month', "New" as 'Type', 'A' as 'Number of Blogs';
c2 = foreach c generate 'Date_Year', 'Date_Month', "Existing" as 'Type', 'Total' - 'A' as 'Number of Blogs';
--Append the new and existing blog streams
u2 = union c1, c2;
u2 = group u2 by ('Date_Year', 'Date_Month', 'Type');
--Create a fake day to use in the following date filter
u2 = foreach u2 generate 'Date_Year', 'Date_Month', "01" as 'Date_Day','Type', first('Number of Blogs') as 'Number of Blogs';
--Add post projection filter
u2 = filter u2 by date('Date_Year', 'Date_Month', 'Date_Day') in ["12 months ago".."current day"];
Step 5: Final projection
The remaining thing to do is drop the date part for the day in a new foreach
statement but else keep the previous dimensions and measure (line 40). And as always add your order
(line 42) and limit
(line 43) statement.
Note: I have renamed my streams throughout, you want to make sure that references that appear after being updated to avoid errors as I’ve done in my order
and limit
statement.
Here’s the final query with commentary:
q = load "Blogs";
q = filter q by 'Status' == "publish";
q = group q by ('Date_Year', 'Date_Month');
q1 = foreach q generate q.'Date_Year' as 'Date_Year', q.'Date_Month' as 'Date_Month', count(q) as 'A';
--create a dynamic end date
q2 = foreach q generate date_to_string(now(),"yyyy") as 'Date_Year', date_to_string(now(),"MM") as 'Date_Month', 0 as 'A';
--create a static start date
q3 = foreach q generate "2016" as 'Date_Year', "01" as 'Date_Month', 0 as 'A';
--append all 3 date streams
u1 = union q1, q2, q3;
--fill in missing dates
f = fill u1 by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
--Account for null values in dates that are missing
f = foreach f generate 'Date_Year', 'Date_Month', case when 'A' is null then 0 else 'A' end as 'A';
--Create running total by Y-M
c = group f by ('Date_Year', 'Date_Month');
c = foreach c generate 'Date_Year', 'Date_Month', first(A) as 'A', sum(sum(A)) over ([..0] partition by all order by ('Date_Year','Date_Month')) as 'Total';
--Convert the two measures into two dimensions for stacked bar chart and calculate existing number of blogs
c1 = foreach c generate 'Date_Year', 'Date_Month', "New" as 'Type', 'A' as 'Number of Blogs';
c2 = foreach c generate 'Date_Year', 'Date_Month', "Existing" as 'Type', 'Total' - 'A' as 'Number of Blogs';
--Append the new and existing blog streams
u2 = union c1, c2;
u2 = group u2 by ('Date_Year', 'Date_Month', 'Type');
--Create a fake day to use in the following date filter
u2 = foreach u2 generate 'Date_Year', 'Date_Month', "01" as 'Date_Day','Type', first('Number of Blogs') as 'Number of Blogs';
--Add post projection filter
u2 = filter u2 by date('Date_Year', 'Date_Month', 'Date_Day') in ["12 months ago".."current day"];
--Project final fields
u2 = foreach u2 generate 'Date_Year', 'Date_Month', 'Type', 'Number of Blogs';
u2 = order u2 by ('Date_Year' asc, 'Date_Month' asc);
u2 = limit u2 2000;
Once you run your query and switch back to your chart you should see something like this – mine has a few changes to the formatting.
Hello,
A very nice post!
I think this line:
u2 = filter u2 by date(‘Date_Year’, ‘Date_Month’, ‘Date_Month’) in [“12 months ago”..”current day”];
should read:
u2 = filter u2 by date(‘Date_Year’, ‘Date_Month’, ‘Date_Day’) in [“12 months ago”..”current day”];
Well spotted, Jiri! I’ve corrected the SAQL to have the correct date reference. Thanks.
Powerful !
I think i should ‘ve used this sum() window function when i wrote 600 line saql code
showing stock amount : )
Date filter[“12 months ago”..”current day”] do not consider current month and records falling under current month though we provide end date as “current day” here. It will only consider last day of the previous month. What should be the filter criteria in that case?