Build Forecasting models using Einstein Studio’s Model Builder

This guide walks you through the process of building a predictive model for a forecasting use case using Model Builder in Data Cloud.

If you're new to building predictive models and are exploring the no-code Model Builder, check out the excellent Salesforce blog post Build an AI Model With Clicks In Data Cloud for a comprehensive introduction.

What can you do with the No-code Model Builder?

Before diving into our forecasting use case, let's explore the algorithms supported by the no-code Model Builder.

The no-code Model Builder currently supports two types of algorithms:

Regression algorithms: These are ideal for predicting numeric outcomes.

Use cases

  • Predict a numeric value, e.g., deal amount or quantity.
  • Score entities to prioritize records, e.g., lead score, opportunity score.
  • Predict the probability of an event’s occurrence, e.g., customer churn, loan approval.

Binary classification algorithms: These are useful for predicting binary (true/false) outcomes.

Use cases

  • Predict the probability of one of two events occurring (positive/negative, successful/unsuccessful outcomes), e.g., deal closure or loan approval.

What is Forecasting?

Forecasting involves estimating the future value of a variable based on historical data and patterns. It’s commonly used across various industries to predict trends, make informed decisions, and plan for the future.

Popular use cases across industries include forecasting sales revenue, production volume, and demand.

What is the difference between Forecasting and Prediction?

While prediction and forecasting are often used interchangeably, they have distinct meanings. For example, estimating sales in a particular month sounds like a prediction but is actually forecasting. Let's look at what makes them different using a use case.

Use case: Forecasting Savings

Let's consider forecasting a customer's savings for the next three months as the use case to explain the difference.

The goal: forecast the savings for the next 3 months given past data on a customer's savings, along with some profile attributes (e.g., age, gender, marital status, number of dependents).

Refer to the image below. The training dataset contains savings data on 30 clients for a few months, and just 2 of the many variables (age and gender) for illustrative purposes.

  • Prediction involves estimating the value of a variable for a new or unseen data point based on existing relationships or patterns. If you want to estimate the savings for new customers (different IDs) based on patterns observed in existing customers (the training dataset), that's a prediction.
  • Forecasting involves estimating the future value of an outcome variable based on historical data and identified patterns. If you want to estimate the savings for the same customer in future months, using their past savings data, that’s forecasting.
Image 1: Forecasting vs Prediction

Other key differences summarized:

| | Prediction | Forecasting |
|---|---|---|
| Goal | Estimate values for new data | Estimate future values |
| Examples | Predicting customer churn, loan approval | Forecasting sales revenue, production volume |
| Horizon | More immediate or specific | Broader trends over a longer period |
| Usage of historical data | To train the model to learn patterns | To learn statistical measures for future values |
| Degree of certainty on predictions | High | Lower |
| Aspect of time | Not necessarily time-series based | Mostly time-series based |
Table 1: Forecasting vs Prediction

What algorithms are used for forecasting?

Forecasting problems are typically addressed using a variety of algorithms, including:

  1. Autoregressive algorithms:
    1. ARIMA (Autoregressive Integrated Moving Average): A popular choice that models time series data by considering past values, lagged differences, and moving averages.
  2. Smoothing algorithms:
    1. Holt-Winters: A family of methods that use exponential smoothing to forecast time series data with seasonal components.
    2. Exponential Smoothing: A simpler method that assigns exponentially decreasing weights to past observations.
  3. Regression algorithms:
    1. XGBoost (Extreme Gradient Boosting): A powerful ensemble learning technique that can be used for forecasting by treating time series data as a regression problem.

To reiterate, XGBoost is the key focus of this article, i.e., building forecasting models using regression algorithms.
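For contrast, here is how a classical autoregressive model fits a single series directly. This is a minimal statsmodels sketch on toy data (the series values and the ARIMA order are illustrative assumptions, not from this use case):

```python
from statsmodels.tsa.arima.model import ARIMA

# One client's monthly savings (toy data)
savings = [100, 120, 110, 130, 125, 140, 135, 150, 145, 160, 155, 170]

# ARIMA(1, 1, 1): one autoregressive term, one difference, one moving-average term
fit = ARIMA(savings, order=(1, 1, 1)).fit()
print(fit.forecast(steps=3))  # estimate the next three months
```

The regression approach used in the rest of this post instead reframes the same series as rows of (lagged values → current value), which lets a model like XGBoost learn from many clients and extra attributes at once.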

Solving Forecasting use cases in Model Builder

The approach:

With just qualitative variables like age and gender, forecasting savings won't be accurate. The alternative approach is to use the previous months' savings as input variables to the model.

While savings are available against each month, we want to bring the previous months' values onto the same row so they can be used as input variables to the model. This is where we create lagged variables (Savings_Lag1, Savings_Lag2, Savings_Lag3) using the LAG function in Data Transforms.

What are lagged variables?

To predict April's savings, we'll use March, February, and January data as the past 3 months' data. Notice how the data moves diagonally downward, offset by one row for each month.

Image 2: Converting data into a Lag-based training dataset
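Outside Data Cloud, the same reshaping can be sketched in a few lines of pandas. The column names (Client_ID, Month, Savings) mirror the example above, and the values are made up:

```python
import pandas as pd

# One row per client per month (toy values)
df = pd.DataFrame({
    "Client_ID": ["C1"] * 4 + ["C2"] * 4,
    "Month": ["2024-01", "2024-02", "2024-03", "2024-04"] * 2,
    "Savings": [100, 120, 110, 130, 200, 210, 190, 220],
})

# Sort so that "previous row" means "previous month" within each client
df = df.sort_values(["Client_ID", "Month"])

# shift(k) plays the role of LAG over k rows, per client
for k in (1, 2, 3):
    df[f"Savings_Lag{k}"] = df.groupby("Client_ID")["Savings"].shift(k)

print(df)  # the earliest months have NaN lags, like the nulls in Data Transforms
```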

How do you create these lagged variables in Data Transforms?

  • Data Transforms allow you to manipulate and change the structure, format, or values of your data. They are essential in data processing and preparation to ensure data is in the right form for further usage in Data Cloud.
  • Data Transforms offer window functions, a powerful SQL feature that performs calculations across a set of table rows related to the current row. Window functions are often used for running totals, moving averages, and other cumulative calculations. One common window function is LAG.

What is the LAG function in Data Transforms, and how do you use it?

  • The LAG function helps find the previous month's value in the same column.
  • Lag functions are essential for constructing predictive models. By understanding how past values influence current values, you can create models that forecast future values more accurately.
  • Lag functions are also often used to create new features from existing time series data. These features can improve the performance of your model, and that is exactly what we plan to do to forecast savings.
  • Before we go further, it is helpful to understand the syntax of the LAG function.

LAG(column1) OVER (PARTITION BY dimension_column ORDER BY sort_column) AS previous_value

  • PARTITION BY: Use this when data is split across dimensions. In our case, it is Client_ID.
  • ORDER BY: Use this to order the data so the correct previous value is fetched. Here, the data needs to be sorted by month to get the previous month's value. This clause is optional but recommended, as the order affects which row is fetched.
  • In this example, the Lag function would be:

LAG(Savings) OVER (PARTITION BY Client_id ORDER BY Month) AS Savings_Lag1

  • This creates a new column, Savings_Lag1, that contains the previous month's Savings value for each client.

Steps to create a Forecasting model in Data Cloud’s Model Builder:

Follow these detailed steps to build a multivariate forecasting model in Model Builder using a lag-based training dataset.

Step 1: Create a lag-based training dataset using Data transforms

The first step is to create the lag-based training dataset, as shown in Image 2: Converting data into a Lag-based training dataset. Our goal is to create a Data Transform that looks similar to the image below.

Image 3: Creating a lag-based training dataset using Data transforms
  • Open Data transforms → Click on New Data transform → Select Batch data transform as type
  • Pull in the Data model object containing the original training dataset.
  • Add a transform node
  • Now use the LAG function to create Savings_Lag1 based on the current month's Savings
    • Select Savings as the base column and click the Date and Time function icon to reach Custom formula
      • Image 4: Lag function in Data transforms
    • Enable the Multiple row formula toggle to view functions supported across rows (Window Functions).
    • Insert the formula as lag(Savings__c) using the Columns tab. Specify the PARTITION BY and ORDER BY clauses
      • Image 5: Lag function properties in Data transforms
    • The newly created column will be appended to the end. Note that some rows might be null due to the absence of prior data for the first time period.
      • Image 6: Preview with Lagged column
      • Note: The data in the Preview won't show up ordered, but the lag will be based on the ORDER BY parameter provided.
    • Remember: Batch transforms are sequential in nature. To get the t-2 value, create a lag function on the t-1 value, as lag("Savings_Lag1__c"). A quick sanity check of this equivalence is sketched after this list.
    • Repeat for the remaining lag-based variable, i.e., lag("Savings_Lag2__c") for Lag 3, either as separate nodes (as shown in Image 4) or in the same transformation node.
  • Store the data in an Output Data Model Object (DMO).
  • Don’t forget to click on Run Now to get the outputs stored in the Output DMO.
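As a quick sanity check of the sequential note above: lagging an already-lagged column is equivalent to a two-step lag on the original series. A tiny pandas sketch (toy values):

```python
import pandas as pd

s = pd.Series([100, 120, 110, 130])  # one client's savings, in month order

# Lagging the lag (what chained transform nodes do) equals shifting twice
assert s.shift(1).shift(1).equals(s.shift(2))
```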

How many lagged variables is ideal?
It's always good to have a minimum of 3 lagged variables. You can decide what your look-back period is; for a 6-month look-back, you need to create 6 variables (Lag1 to Lag6).

Step 2: Validating the data

  • While the Preview section of the Output DMO in the Data Transform can help validate data across rows, it's best to verify for a specific customer (the same dimension as your PARTITION BY value).
  • Use the Query Editor in Data Cloud to check. (A pandas equivalent of this check is sketched after this list.)
    • Image 7: Query Editor results to validate the Lagged function
    • Since the data moves diagonally downward as expected, we can now start training the model 😄
    • Note: You can also cross-validate this result by performing SQL based Lags on the original training dataset.
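For completeness, here is a pandas equivalent of that Query Editor check, on made-up values (the same toy schema as the earlier sketch):

```python
import pandas as pd

# Rebuild a tiny lagged dataset for one client and verify the offset
df = pd.DataFrame({
    "Client_ID": ["C1"] * 4,
    "Month": ["2024-01", "2024-02", "2024-03", "2024-04"],
    "Savings": [100, 120, 110, 130],
}).sort_values("Month")
df["Savings_Lag1"] = df.groupby("Client_ID")["Savings"].shift(1)

# The lag column should equal the savings value one row earlier --
# the "diagonal downward" pattern you look for in the Query Editor output
assert df["Savings_Lag1"].iloc[1:].tolist() == df["Savings"].iloc[:-1].tolist()
```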

Step 3: Training the model

  • Now that you have the data in a single DMO with past data, follow the usual process of training a model.
  • As an optional step, you can filter out records where any of the lagged variables is NULL, or choose period >= '2023-04'.
  • In the Set Goal step, choose to Maximize the Savings column.
  • Select XGBoost as the algorithm.
    • Time series forecasting generally benefits from models that adapt well to non-linearity and deep interactions between features; while our GLMs work well for non-linearity, they are limited in their ability to learn deep interactions. Tree-based models (e.g. Random Forests, Gradient Boosting Machines, and Extreme Gradient Boosting) are inherently capable of handling both of these characteristics of time series data, and do so much more efficiently.
    • Boosting algorithms (Gradient Boosting Machines and Extreme Gradient Boosting, a.k.a. XGBoost) improve performance by successively fitting models on the errors of the previous models, so they tend to perform well in general and XGBoost is among the best of them.
  • You could let Autopilot decide the important columns for the first version, or decide yourself which set of variables is fed into the model.
  • Take a look at the image below for a snapshot of the configurations used to build this model. (A code sketch of the equivalent setup follows this list.)

    • Image 8: Forecasting model’s configuration
  • Once trained, view the training metrics and the top factors.
    • Image 9: Forecasting model’s top predictors
  • Iterate on the models if need be.
  • When the metrics look good, activate the model and move on to inferencing.
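To make the configuration concrete, here is a rough code equivalent of what this step sets up. It is a sketch only: the CSV file, column names, and hyperparameters are illustrative assumptions, and Model Builder tunes its own settings:

```python
import pandas as pd
from xgboost import XGBRegressor

# Hypothetical export of the lag-based Output DMO
train = pd.read_csv("savings_training.csv")

# Drop the first months of each client, whose lags are null
lag_cols = ["Savings_Lag1", "Savings_Lag2", "Savings_Lag3"]
train = train.dropna(subset=lag_cols)

features = ["Age"] + lag_cols  # illustrative feature set
X, y = train[features], train["Savings"]

# XGBoost regression on lagged features; the goal column is Savings
model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)
```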

Step 4: Inferencing / Predicting for the next few months

To forecast the next 3 months of Savings for a given client, we need to build the inferencing (prediction) setup.

  • We will use Batch transforms (similar to our Training Transform illustrated in the previous section) to build a setup similar to the image below
    • Image 10: Forecasting model’s Inferencing Data transform
  • First, prepare the inference dataset
    • Bring in the data of the latest month (2024-09 / September in this case)
    • To get started with the first few predictions, add 3 months of historical data, since we used 3 months of lagged variables, as seen below.
    • Image 11: Forecasting model’s inference dataset
  • Steps to create the Data transforms:
    • First, bring the inference dataset as an Input node
    • Similar to the training setup, add 3 transform nodes (or do 3 lag-based calculations in a single transformation node) to populate the Savings_Lag1, Savings_Lag2, Savings_Lag3 variables.
    • Now add an AI node and select the model you trained and activated in the previous step.
      • The goal is to map the columns from your previous nodes to the model variables, which are shown on the left
      • For the 1st AI node, we are predicting the one-month-out forecast. In our case, it is for the month of 2024-10.
        • We have all 3 lagged variables available to make the one-month-out forecast from Sep 2024, i.e., for Oct 2024.
          • Map Actual Savings (Sep data) to the variable: Lag1
          • Map Savings_Lag1 of Sep month to the variable: Lag2
          • Map Savings_Lag2 of Sep month to the variable: Lag3

            • Image 12: Mapping for 1st AI node in the Forecasting model’s inference setup
      • In the 2nd AI node, we are predicting the two-months-out forecast, i.e., the month of 2024-11.
        • We offset the lagged variables to make the two-months-out forecast from Sep 2024, i.e., for Nov 2024.
          • Map Predicted Savings i.e. One-month out forecast (Oct prediction) to the variable: Lag1
          • Map Actual Savings (Sep data) to the variable: Lag2
          • Map Savings_Lag1 of Sep month to the variable: Lag3

            • Image 13: Mapping for 2nd AI node in the Forecasting model’s inference setup
      • In the 3rd AI node, we are predicting the three-months-out forecast, i.e., the month of 2024-12.
        • We offset the lagged variables again to make the three-months-out forecast from Sep 2024, i.e., for Dec 2024.
          • Map Predicted Savings 1 i.e Two-months out forecast (Nov prediction) to the variable: Lag1
          • Map Predicted Savings i.e. One-month out forecast (Oct prediction) to the variable: Lag2
          • Map Actual Savings (Sep data) to the variable: Lag3

            • Image 14: Mapping for 3rd AI node in the Forecasting model’s inference setup
      • You can extend this logic to keep offsetting the predictions across the model's lag variables for each additional month. (See the recursive sketch after this list.)
    • Store the results of these in an Output DMO. You can rename the fields as needed.
      • Image 15: Renaming the fields
    • Save and Run the transform to post the predictions into the Output DMO. This will give you forecasts for the next 3 months.
  • You can validate the predictions by checking via Query Editor.
    • Image 16: 3 months’ forecasting output data validated via Query editor
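The three AI nodes implement a recursive (step-ahead) forecast: each month's prediction becomes the Lag1 input for the next month, and the older lags shift down. A minimal Python sketch of the same mapping logic, reusing the hypothetical model and feature names from the training sketch:

```python
import pandas as pd

def forecast_3_months(model, age, sep_savings, sep_lag1, sep_lag2):
    """Forecast Oct, Nov, and Dec from September's row, mirroring the 3 AI nodes."""
    feature_names = ["Age", "Savings_Lag1", "Savings_Lag2", "Savings_Lag3"]
    lags = [sep_savings, sep_lag1, sep_lag2]  # Lag1, Lag2, Lag3 for October
    preds = []
    for _ in range(3):  # one-, two-, and three-months out
        x = pd.DataFrame([[age, *lags]], columns=feature_names)
        p = float(model.predict(x)[0])
        preds.append(p)
        lags = [p] + lags[:2]  # the new prediction becomes next month's Lag1
    return preds
```

Each iteration shifts the lag window exactly as the node mappings above do: the newest prediction moves into Lag1 and the oldest lag drops off.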

Best practices to improve your forecasting model’s outputs:

This document is an illustrative approach to solving a specific forecasting use case with just 2 qualitative variables and 1 quantitative variable (Savings).

To improve the results of your model, you can do a combination of the following.

  • Data collection:
    • Use a lot of historical data to capture patterns and trends. The more data you have, the higher the chances of an accurate forecasting model.
  • Data preparation:
    • Add more qualitative variables / profile attributes that can contribute to different patterns, like demographic, firmographic, and geographic variables.
    • Add more quantitative variables and other metrics that can influence the outcome variable.
    • Add more lagged variables: Instead of just 3 lagged variables, you could extend to 12 or 24 variables, i.e., one or two years of data. You just need to create the same number of transforms in the training Data Transform and in the inferencing Data Transform. You could apply lags on multiple variables as well.
    • Handle seasonality with dummy variables
      • Seasonal variables capture repeating patterns / recurring events (e.g., holiday seasons, quarterly sales spikes) and help improve forecast accuracy and reduce residuals (errors).
      • Model Builder provides a capability on date columns to extract the day of the week or date of the month and automatically add it as a feature. This should help with handling seasonality.
      • You could create holiday-based variables, e.g., is_holiday = Yes/No, based on the granularity of the data, or have a column called holiday with the name of the holiday as its values.
    • Handle trends with dummy variables
      • You could capture weekly / quarterly patterns as a variable indicating the day of the week or quarter number, such as Week_01, Week_02 or Q1, Q2.
  • Model training:
    • Try other algorithms available in Model Builder, like GLM or GBM
      • Predictive modeling and forecasting is never a one-and-done process. Data scientists experiment with many techniques, and with revisions of how data is encoded, to find the process that works best for a given task.
    • Try an ensemble of models: Build models based on multiple algorithms and compute an aggregate like an average or weighted average of their predictions using Data Transforms.
  • Model validation:
    • Because of the number of features generated in the model-building process, it's not uncommon to see very high fit metrics on the training data (even cross-validated). However, these will not accurately reflect the true forecasting accuracy of the model. If you have enough data, the best way to estimate model performance is to separate the last several observations in the series, generate forecasts for them using Data Transforms, and calculate the metrics on them. In the No-code Model Builder, you can define these observations as a validation holdout. (A sketch of this holdout check follows this list.)
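As a sketch of that holdout idea, you could score the last three months with a simple error metric such as MAPE. The actuals and forecasts below are made-up numbers:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error across the holdout months."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

# Made-up holdout: actual Oct-Dec savings vs the model's recursive forecasts
print(f"MAPE: {mape([130, 135, 140], [128, 138, 133]):.1f}%")
```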

The key to achieving a great model is deploying version 1 (possibly on a smaller set of data) to get forecasts out, creating workflows to action on them right in Data Cloud, and then iterating.

Remember, a good-enough model in production beats waiting for perfection—deliver value and insights now!

Conclusion

In this blog post, we explored how to forecast using regression techniques. We created training datasets from past months' data by leveraging the LAG function in Batch Transforms, and we built the model in Model Builder. We also learned how to build inferences on live data by mapping variables, even when some of them are missing at prediction time.

The setup explained here is extensible and should solve most time-series forecasting use cases, helping you derive actionable insights in Data Cloud. If you have any complex or unique use cases, we would love to hear about them.

Special thanks to Bobby Brill for his review and Randy Sherwood for his data science inputs.

