A Look to the Future with Timeseries – Part 2
In my last post, I introduced one of my highlights from the Einstein Analytics Winter 19 release timeseries – the ability to predict future trends based on your data. We looked at the basics around the SAQL ‘timeseries’ statement as well as the ‘fill’ statement. Now I am not done with this topic, because there are more we can do! So let’s get to it!
The difference between organic and conventional avocados
In the first part of this blog series, I introduced a dataset around avocado prices, in this example, we will continue looking into the trending prices of this amazing fruit! So make sure you have the avocado dataset available in Einstein Analytics. And if you don’t have it just yet, check out the link in the first blog series, regardless I can recommend you read that post first as there are certain concepts I have explained there that I will skip in this post.
Okay, so we have already looked at how are the average avocado prices are developing and we have also seen the trend with a prediction interval. Now wouldn’t it be great to see the trend of prices for organic and conventional avocados? Regardless of your answer, that’s what I will be showing now.
First, of course, we need the basics, so go ahead and explore your avocado dataset and make sure you choose Average of ‘AveragePrice’ as a measure and as a grouping we need ‘Date’ (Year-Month) plus ‘type’. You will now end up with a chart like below. It is worth knowing that currently, you can only group by one additional field beside your date when you are using time series.
Just as we did in the previous blog post on timeseries we need to switch to SAQL mode to use this new statement. So in the top right corner select ‘SAQL Mode’
You should now have something like below which basically just loads the dataset, group by ‘Date’ and ‘type’ whereafter it’s projected and ordered by ‘Date’ and ‘type’.
q = load "avocado"; q = group q by ('Date_Year', 'Date_Month', 'type'); q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', 'type' as 'type', avg('AveragePrice') as 'avg_AveragePrice'; q = order q by ('Date_Year~~~Date_Month' asc, 'type' asc); q = limit q 2000;
The first thing we want to add is a new ‘foreach’ statement after our grouping (line 2). It’s almost a copy of the existing ‘foreach’ statement (line 3), however, we are not joining our date together.
q = load "avocado";
q = group q by ('Date_Year', 'Date_Month', 'type');
q = foreach q generate 'Date_Year', 'Date_Month', 'type', avg('AveragePrice') as 'avg_AveragePrice';
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', 'type' as 'type', avg('AveragePrice') as 'avg_AveragePrice';
q = order q by ('Date_Year~~~Date_Month' asc, 'type' asc);
q = limit q 2000;
As I mentioned in the previous blog, we may realize that there are gaps in our dates so we can use the ‘fill’ statement to fill in those missing dates. Please feel free to look at the documentation here.
The ‘fill’ statement will be added after our new ‘foreach’ statement. Looking back to when we used the ‘fill’ statement in part 1, there is one major difference, we now want to group by ‘type’. This means we need to add the partition into out ‘fill’ statement as you see below.
q = load "avocado";
q = group q by ('Date_Year', 'Date_Month', 'type');
q = foreach q generate 'Date_Year', 'Date_Month', 'type', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type');
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', 'type' as 'type', avg('AveragePrice') as 'avg_AveragePrice';
q = order q by ('Date_Year~~~Date_Month' asc, 'type' asc);
q = limit q 2000;
With that done we can now add the actual ‘timeseries’ statement to our SAQL query. I’ve kept my ‘timeseries’ statement simple and only included the parameters ‘length’, ‘dateCols’ and of course ‘partition’. We need ‘partition’ since we want to differentiate between organic and conventional avocado prices. Note that you can, of course, add more parameters should you wish to – have a look at the documentation for possible parameters.
q = load "avocado";
q = group q by ('Date_Year', 'Date_Month', 'type');
q = foreach q generate 'Date_Year', 'Date_Month', 'type', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type');
q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type');
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', 'type' as 'type', avg('AveragePrice') as 'avg_AveragePrice';
q = order q by ('Date_Year~~~Date_Month' asc, 'type' asc);
q = limit q 2000;
Next, we need to modify the last ‘foreach’ statement. All I want to do in this scenario is change the projection of the measure to either look at the AveragePrice from the data or if that is null look at the predicted value from the timeseries calculation. I have used the coalesce function for this – exactly the same as in part 1 of this blog series.
q = load "avocado";
q = group q by ('Date_Year', 'Date_Month', 'type');
q = foreach q generate 'Date_Year', 'Date_Month', 'type', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type');
q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type');
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', 'type', coalesce('avg_AveragePrice', 'Forecasted_Avg_Avocado_Price') as 'Avg Price';
q = order q by ('Date_Year~~~Date_Month' asc, 'type' asc);
q = limit q 2000;
You can now go ahead and click ‘Run Query’ and see the result update. If you look at your chart it should look similar to mine. Pretty cool, right? We can now see that organic prices are generally higher than conventional and you can see the prediction for organic and conventional avocados are slightly different one goes slightly up where the other goes slightly down.
Filtering out old data
Now, this particular chart is fine to look at but what if you had data for the last 8 years? Your chart would be rather compact and you may even have to scroll to see all the data points. But do you really care to see data that is 8 years old? Probably not, you are really just interested in what’s going to happen. So let’s try and filter out data that is older than last year.
Normally when we work with filters we would add them right after the load statement – you can try and explore a dataset, add a filter and switch to SAQL mode. The reason we do this is that we don’t want to group data we don’t need as it has an impact on the query performance. But in this case, if we filter right after our ‘load’ statement then our ‘timeseries’ statement will not use all the data available and the confidence in the prediction would be less hence the prediction would be less accurate. So in this particular scenario, we want to apply a post projection filter. This basically just means we will add this filter after our last ‘foreach’ statement.
Now there is a little trick to this. If you notice a date filter contains all date part, like this:
q = filter q by date('Date_Year', 'Date_Month', 'Date_Day') in ["current year" .. "current month"];
However, timeseries do not support the day part. This means we have to be a little creative in our query and create a static day part that we can use. We do this by modifying the last ‘foreach’ statement by first separating our date parts, since we cannot use ‘Date_Year~~~Date_Month’ and we also add the static day part “01” and call it ‘Date_Day’.
q = load "avocado";
q = group q by ('Date_Year', 'Date_Month', 'type');
q = foreach q generate 'Date_Year', 'Date_Month', 'type', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type');
q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type');
q = foreach q generate 'Date_Year', 'Date_Month', "01" as 'Date_Day', 'type', coalesce('avg_AveragePrice', 'Forecasted_Avg_Avocado_Price') as 'Avg Price';
q = order q by ('Date_Year~~~Date_Month' asc, 'type' asc);
q = limit q 2000;
With that done we can now filter out anything older than last year. We also want to make sure that in our filter we include our predictions, so if you in the ‘timeseries’ statement have defined ‘length’ as 12 you want to add that in the filter. Meaning if you modified the length to be 6 or 20 you, of course, want your filter to reflect that.
q = load "avocado";
q = group q by ('Date_Year', 'Date_Month', 'type');
q = foreach q generate 'Date_Year', 'Date_Month', 'type', avg('AveragePrice') as 'avg_AveragePrice'; q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type');
q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type');
q = foreach q generate 'Date_Year', 'Date_Month', "01" as 'Date_Day', 'type', coalesce('avg_AveragePrice', 'Forecasted_Avg_Avocado_Price') as 'Avg Price';
q = filter q by date('Date_Year', 'Date_Month', 'Date_Day') in ["1 year ago".."12 months ahead"];
q = order q by ('Date_Year~~~Date_Month' asc, 'type' asc);
q = limit q 2000;
Now we want to make one final projection where we combine the year and the month but throw away the day, since we don’t need it anymore. We also don’t need to repeat the coalesce function, instead we can just take the ‘Avg Price’ we projected previously. The final ‘foreach’ statement and query should look like below.
q = load "avocado";
q = group q by ('Date_Year', 'Date_Month', 'type');
q = foreach q generate 'Date_Year', 'Date_Month', 'type', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type');
q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type');
q = foreach q generate 'Date_Year', 'Date_Month', "01" as 'Date_Day', 'type', coalesce('avg_AveragePrice', 'Forecasted_Avg_Avocado_Price') as 'Avg Price';
q = filter q by date('Date_Year', 'Date_Month', 'Date_Day') in ["1 year ago".."12 months ahead"];
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', 'type', 'Avg Price';
q = order q by ('Date_Year~~~Date_Month' asc, 'type' asc);
q = limit q 2000;
If you run your query and switch back to SAQL mode you would have something like this.
Before you go
You can take the learning from part 1 and part 2 of this blog series, combine the examples and have one awesome timeseries example.
q = load "avocado"; q = group q by ('Date_Year', 'Date_Month', 'type'); q = foreach q generate 'Date_Year', 'Date_Month', 'type', avg('AveragePrice') as 'avg_AveragePrice'; q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type'); q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"), partition='type', predictionInterval=95); q = foreach q generate 'Date_Year', 'Date_Month', "01" as 'Date_Day', 'type', coalesce('avg_AveragePrice', 'Forecasted_Avg_Avocado_Price') as 'Avg Price', 'Forecasted_Avg_Avocado_Price_high_95' as 'High Forecast', 'Forecasted_Avg_Avocado_Price_low_95' as 'Low Forecast'; q = filter q by date('Date_Year', 'Date_Month', 'Date_Day') in ["1 year ago".."12 months ahead"]; q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', 'type', 'Avg Price', 'High Forecast', 'Low Forecast'; q = order q by ('Date_Year~~~Date_Month' asc, 'type' asc); q = limit q 2000;
I do have to say many thanks to Pete Lyons and Antonio Scaramuzzino for brainstorming on the post projection filter.