A Look to the Future with Timeseries – Part 1
It’s Winter 19 release time! And that means we get some new awesome features in our production environment! One of the things I am excited about is the new timeseries SAQL function. Why? Well, if you look at trends when it comes to big data it is all about machine learning and AI. Einstein Analytics haven’t really had any way to look at predictions – that’s where Einstein Discovery comes in. With timeseries we can actually start predicting what is going to happen based on patterns in our existing data. Let’s have a look at an example.
Prepping your data
You can, of course, use any existing datasets that you have or create a new one with your data from Salesforce. However, I decided to look elsewhere for some data that I could use, so I turned to Kaggle and found some data on avocado prices, cause who does not love avocado? Anyway if you want to do the same as me you can find the data here. We are not going to use all the measures and dimensions and frankly, I don’t know how accurate this dataset is, but it doesn’t matter. Let’s have some fun.
Make sure you download the dataset, unzip it and upload it to Einstein Analytics. Notice that the UI has changed a little bit.
I am sure you are a smart cookie so I won’t go through how you upload the csv file, but once it’s uploaded hit “Explore” in the top right corner.
Let’s look to the future
The very first thing we want to do is to choose the measure average of AveragePrice and group by Date (Year-Month).
Now it’s time to the fun part, we will move away from chart mode and into SAQL mode. So click the SAQL icon in the top right corner. At any point you can always refer to the timeseries documentation here, there are more parameters available than I will be using in my example.
Alright, if you are new to SAQL this may look a little scary, but it really isn’t. All it’s doing is what you told it to load the avocado dataset, group by the date field and showing the average of AveragePrice. Yes, there is also an order and limit. Anyway, your query should look something like this:
q = load "avocado"; q = group q by ('Date_Year', 'Date_Month'); q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', avg('AveragePrice') as 'avg_AveragePrice'; q = order q by 'Date_Year~~~Date_Month' asc; q = limit q 2000;
The first thing we want to add is a new ‘foreach’ statement after our grouping (line 2). It’s almost a copy of the existing ‘foreach’ statement (line 3), however, we are not joining our date together.
q = load "avocado";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year', 'Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = order q by 'Date_Year~~~Date_Month' asc;
q = limit q 2000;
Now when I group by date there may be months where we do not have any data, in order to avoid these gaps we can use the new fill function to fill in the missing dates. Please feel free to look at the documentation here.
The ‘fill’ statement will be added after our new ‘foreach’ statement. For this to work make sure if you have grouped by Year-Month then you use the same combination in the ‘fill’ statement and not switch to let’s say Year-Quarter. The syntax is highlighted below.
q = load "avocado";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year', 'Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = order q by 'Date_Year~~~Date_Month' asc;
q = limit q 2000;
With the gaps filled we have now come to adding in our ‘timeseries’ statement – yay! So what we need to do is after our new fill statement, make room for the timeseries by pressing enter. The syntax is to first call the timeseries but then define all the parameters that we want to use, not all are mandatory to use to please refer to the documentation for that – I will be using ‘length’ and ‘dateCols’. ‘dateCols’ is the time period and ‘length’ is how many of those time periods do I want to predict. And of course the measure you see before all the parameters is the measure that is being predicted.
q = load "avocado";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year', 'Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = order q by 'Date_Year~~~Date_Month' asc;
q = limit q 2000;
So we got this far. One more thing to do before we hit run is to make sure the last ‘foreach’ statement has included our new field ‘Forecasted_Avg_Avocado_Price’, also we don’t need to call the aggregation again. So we just want to delete the aggregation and add the new measure to the last part of the ‘foreach’ statement.
q = load "avocado";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year', 'Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month','avg_AveragePrice', 'Forecasted_Avg_Avocado_Price';
q = order q by 'Date_Year~~~Date_Month' asc;
q = limit q 2000;
Now you can hit run. Notice you have your aggregation of the data from the data in the ‘avg_AveragePrice’ and the 12 months in the future has a blank value. However, these are filled in the new ‘Forecasted_Avg_Avocado_Price’.
This is not the end, we can do more. Maybe we want to see how the prediction interval is looking like. We can add that to our ‘timeseries’ and last ‘foreach’ statement. The parameter is ‘predictionInterval’ and we can make it between 80 and 95, which will be the confidence for the prediction. Let’s try to add it to our query in the ‘timeseries’ statement as well as our ‘foreach’ statement, so it’s being projected.
q = load "avocado"; q = group q by ('Date_Year', 'Date_Month'); q = foreach q generate 'Date_Year', 'Date_Month', avg('AveragePrice') as 'avg_AveragePrice'; q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M")); q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"), predictionInterval=95); q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', 'avg_AveragePrice', 'Forecasted_Avg_Avocado_Price', 'Forecasted_Avg_Avocado_Price_high_95', 'Forecasted_Avg_Avocado_Price_low_95'; q = order q by 'Date_Year~~~Date_Month' asc; q = limit q 2000;
Notice that in my projection (the ‘foreach’ statement) I am calling my predicted measure again but adding ‘_high_95’ and ‘_low_95’. This is just the syntax for these fields. If I had called my measure ‘Price’ and I had used 80 in my prediction interval then it would have been ‘Price_high_80’ and ‘Price_low_80’. Once you have the new parameter added and the two extra fields projected feel free to run the query. You will now see the high and low values as well.
If I switch over to a timeline chart, it would look something like this – if you made the same changes to the chart of course.
We are not done yet. Let’s clean up the chart a little bit, for instance, why not use ‘avg_AveragePrice’ when we have data and ‘Forecasted_Avg_Avocado_Price’ when we don’t have any actuals? Also, the naming could be a little better. So switch back to SAQL mode.
We will be using the ‘coalesce’ function to take the avg_AveragePrice when we have data and when we don’t when that column is blank, we will take the predicted value instead. The syntax for ‘coalesce’ is simple:
coalesce('avg_AveragePrice', 'Forecasted_Avg_Avocado_Price') as 'Avg Price'
The first part before the comma is the default value and the value after the comma is the value we will be using if the first turned out to be null. Now the latter “as ‘Avg Price'” is the new name for our column.
We are also renaming the high and the low forecast the be more meaningful as you can see below.
q = load "avocado";
q = group q by ('Date_Year', 'Date_Month');
q = foreach q generate 'Date_Year', 'Date_Month', avg('AveragePrice') as 'avg_AveragePrice';
q = fill q by (dateCols=('Date_Year', 'Date_Month', "Y-M"));
q = timeseries q generate 'avg_AveragePrice' as 'Forecasted_Avg_Avocado_Price' with (length=12, dateCols=('Date_Year', 'Date_Month', "Y-M"), predictionInterval=95);
q = foreach q generate 'Date_Year' + "~~~" + 'Date_Month' as 'Date_Year~~~Date_Month', coalesce('avg_AveragePrice', 'Forecasted_Avg_Avocado_Price') as 'Avg Price', 'Forecasted_Avg_Avocado_Price_high_95' as 'High Forecast', 'Forecasted_Avg_Avocado_Price_low_95' as 'Low Forecast';
q = order q by 'Date_Year~~~Date_Month' asc;
q = limit q 2000;
You can now hit run and switch back to your chart mode and you will have a chart that is similar to mine – I have made a few more adjustments to the formatting of the chart.
If you want to do more with the new ‘timeseries’ statement, then have a look at the second part of this blog series.
super cool. thanks for sharing. when trying to use my own data, i get this error:
Expecting Field of type IdToken. Got FUNC.
the SAQL is:
q = load “Oppty_w_Offering_2”;
q = filter q by date(‘Opportunity.CloseDate_Year’, ‘Opportunity.CloseDate_Month’, ‘Opportunity.CloseDate_Day’) in [dateRange([2016,5,1], [2241,6,8])];
q = filter q by ‘Opportunity.ForecastCategoryName’ == “Closed”;
q = group q by (‘Opportunity.CloseDate_Year’, ‘Opportunity.CloseDate_Month’);
q = foreach q generate ‘Opportunity.CloseDate_Year’ , ‘Opportunity.CloseDate_Month’ , sum(‘TotalPrice’) as ‘sum_TotalPrice’;
q = fill q by (dateCols=(‘Opportunity.CloseDate_Year’, ‘Opportunity.CloseDate_Month’, “Y-M”));
q = timeseries q generate sum(‘TotalPrice’) as ‘Forecasted_TotalPrice’ with (length=12, dateCols=(‘Opportunity.CloseDate_Year’, ‘Opportunity.CloseDate_Month’, “Y-M”));
q = foreach q generate ‘Opportunity.CloseDate_Year’ + “~~~” + ‘Opportunity.CloseDate_Month’ as ‘Opportunity.CloseDate_Year~~~Opportunity.CloseDate_Month’, ‘Forecasted_TotalPrice’, ‘sum_TotalPrice’;
q = order q by ‘Opportunity.CloseDate_Year~~~Opportunity.CloseDate_Month’ asc;
q = limit q 2000;
Quickly looking at it, it looks alright. But try and remove the filters. If it doesn’t find enough good data to predict from it should return nothing.
How do I format my chart to look like yours at the end?
Thanks.
Hi Rikke,
I played with it a lot and one final thing I cannot achieve is to have dashed line only for the prediction (time series) part. As I used coalesce, the dashed line can be only selected for the final consolidated “Avg Price” (and that makes whole line dashed, not only the predicted part).
How did you manage to do that?
Thanks,
Lukas
Thank you for the very useful time series analysis capabilities and limitations of the Einstein platform. Looking forward to the next article.
Hi Rikkie,
Thanks for this great article and many others. I have one simple question. the way predicted values are generated for forecasting based on historical data present in the dataset it also creates prediction values for them. shouldn’t it match with the historical values present? for instance, I have a sum of amounts as the result column with the actual historical value of 1099305878.59 for 2015 – 1 and the forecasted value turned out to be 1107645328.765194. although the forecasted values are pretty much close to the actual still it strikes whether it is ideal or not.
I couldn’t get the query to run in my TableauCRM version although I copied (in a desperate last try). I have this error. “Undefined identifier: “AveragePrice”. Make sure the “AveragePrice” identifier exists and is spelled correctly without spaces.”
Why would it be?
In order to clarify my previous comment, the part that I can´t get to run is this:
q = load “avocado”;
q = group q by (‘Date_Year’, ‘Date_Month’);
q = foreach q generate ‘Date_Year’, ‘Date_Month’, avg(‘AveragePrice’) as ‘avg_AveragePrice’;
q = fill q by (dateCols=(‘Date_Year’, ‘Date_Month’, “Y-M”));
q = foreach q generate ‘Date_Year’ + “~~~” + ‘Date_Month’ as ‘Date_Year~~~Date_Month’, avg(‘AveragePrice’) as ‘avg_AveragePrice’;
q = order q by ‘Date_Year~~~Date_Month’ asc;
q = limit q 2000;
But I don’t understand why, could you please explain the reason?