(Updated 2/10/21)
The primary goal of this blog post is to provide technical information on the addition of tree-based machine learning (ML) algorithms to Einstein Discovery. This article helps readers understand the capabilities, benefits, and details of ensemble (tree-based) algorithms in Einstein Discovery. Tree-based algorithms go beyond Einstein Discovery’s traditional focus on regression-based modeling. The intended audiences for this document are Salesforce and Einstein admins, business analysts, data scientists, machine learning professionals, and technical managers/executives.
This article covers the high-level details of how and why the tree-based techniques are used in Einstein Discovery. It also provides a “harbor-tour cruise” of the statistical methods involved. However, this article is not intended to be a comprehensive tutorial on the mathematical techniques of the algorithms themselves. That task has been adequately addressed in many books, research papers, and online articles. For readers who want to delve further into the mathematics, this article provides links to related resources.
Note: If you are not already familiar with Einstein Discovery, I strongly recommend you take the Einstein Discovery Basics Trailhead and read the following articles and whitepaper:
- Gain Insight with Einstein Discovery
- Einstein Discovery: How good is my model really?
- Understanding the Differentiating Capabilities and Unique Features of Salesforce Einstein Discovery within the Machine Learning Space
- The Delicacy of Accuracy: A Deep Dive on Classification Performance
- Preparing Your Data For Einstein Discovery
The approach we take in this article is to divide the implementation of tree-based ensemble algorithms in Einstein Discovery into three sections in order to ease the consumption of dense topics:
- What Are the Basic Concepts and Components of Tree-Based Algorithms
- What Specific Algorithms Are New to Einstein Discovery
- Why Are Some Einstein Discovery Models Better Served By Tree-Based Algorithms
Decision-Tree Algorithms
Similar to linear and logistic regression, decision-tree algorithms are part of the supervised learning branch of machine learning. Like linear and logistic regression, decision-tree algorithms minimize or maximize against an outcome (dependent) variable.
The foundations of the tree-based algorithms that are now part of Einstein Discovery have existed for decades and are proven effective for many types of machine learning challenges. In recent years, however, with the advent of large-scale machine learning, tree-based algorithms have evolved into a family of “ensemble” (aka additive) algorithms. More on that later.
Decision trees are a family of supervised learning algorithms that address both Boolean (yes/no, churn/not churn) and continuous variable (forecast, CSAT) target outcomes. Tree-based algorithms empower machine learning models with accuracy, stability, and – especially for Einstein Discovery – model interpretability and explainability.
Decision-Tree Components
A decision tree consists of the following key components:
- Root Node: Represents the topmost decision node and corresponds to the best predictor in the dataset
- Splitting: The process of dividing a node into two or more sub-nodes.
- Decision Node: When a sub-node splits into further sub-nodes.
- Leaf/ Terminal Node: A node that does not split.
- Pruning: The process of removing sub-nodes from a decision node.
- Branch / Sub-Tree: A sub–section of entire tree.
You can think of a decision tree as a representation of a mathematical function that maps a vector of values to a single output. That single output value is the “decision”. The tree reaches the decision by conducting a series of statistical tests, starting from the top of the tree (root), and following the branches until the most suitable leaf is found. The branches are labeled with values from the respective attribute, and leaf nodes actually determine which value should be returned by the mathematical function.
Example Decision Tree
As a real example of a simple and fun tree, below is an image of a single decision tree that was created in RStudio with the Antarctica Penguin dataset from Kaggle. This example tree is built using a variety of physical features that have been measured on each penguin by a team of scientists. The quantitative values of these physical characteristics are used to train a tree-based ML model – with the goal of becoming adept at predicting whether any given penguin is a boy or a girl.
The tree shows you how a rudimentary decision-tree algorithm can be used to classify a penguin as a male or female based on features in the dataset, such as body mass and flipper length. In the business world, this same basic approach can be applied to numerous common classification and regression machine learning use cases for sales, service, marketing, commerce, etc.
Ensembles, Boosting, and Bagging
For modern machine learning, decision tree algorithms are generally implemented as an “ensemble”. What does this mean? The algorithm creates a large group of trees and then uses mathematical techniques to make the best choices from the group. Ensemble learning does not attempt to create one high-quality model. Instead, it generates multiple, relatively low-accuracy models from many decision trees. Then, it combines the predictions from those “weak” models to create its final, more accurate model that learned from the errors of all its predecessors. This type of ensemble learning approach leads to better accuracy and model stability. Most importantly, it addresses the bias-variance tradeoff challenge that plagues many machine learning models.
The types of decision-tree algorithms that have been added to Einstein Discovery use a few optimization techniques – gradient descent, weak learners, and – most critically – boosting and bagging. The origins of “boosting” go back at least to the 1980s. One of the most influential papers was “Thoughts on Hypothesis Boosting” by Kearns. Bagging has a similar rich history.
Boosting – The core idea for gradient boosted tree algorithms is relatively straightforward: start with a ‘weak’ statistical learner and help it adapt to challenging data by creating a succession of models that essentially focus on difficult examples (that its predecessor had mis-classified), and try to improve it. Put another way, boosting is an algorithmic technique that creates a model that is fit on a sequence of models – with larger weighted values (penalizing) given to records that have more substantial errors in successive iterations. As the algorithm iterates further into the collection of trees, it attempts to improve on these records using gradient descent to minimize the loss. Gradient descent is a common mathematical method for optimizing neural networks and other machine learning algorithms. If you’re not familiar with Gradient Descent, this article is a helpful introduction to the concept: Gradient Descent For Machine Learning.
Gradient descent allows ensemble algorithms to improve the overall predictive power of the model and obtain higher accuracy by generating a large number of tree-based models and creating a final model that is an ensemble of the multiple, previous weaker models. This is obviously a vast over-simplification, and there is a great deal of mathematics involved to make it happen. But this explanation should suffice for our purposes here. As with all things machine learning, if you want to dive deeper, you can use Google search to find a variety of articles and papers on this topic.
Bagging – This technique, also known as “bootstrap aggregation”, is a general-purpose procedure for reducing the variance of a model’s performance. It is particularly powerful when used with decision tree algorithms for helping with bias vs. variance trade-off.
To help understand bagging, think about the idea that whenever you create an average on a set of observations this will naturally reduce variance. So one way you could theoretically reduce the variance (and increase the predictive power) of a machine learning method is to take many training datasets from your data, create unique prediction models utilizing each training dataset, and average all those predictions. Sounds great, right? The problem is that most times, we don’t have multiple training datasets that are of sufficient quality.
Because we may not have the luxury of unlimited training data, we can “bootstrap“, by iterating over the samples from a lone training dataset. Therefore, you create some large number of bootstrapped training datasets. After you have those, you can use a bagging algorithm to train the model on the averaged predictions of all the bootstraps. This helps get around the problem of high variance that would exist in a single tree. It is not uncommon to bootstrap thousands of trees to produce a final model. While bagging is useful in many algorithm types, it’s particularly powerful for decision trees, and most commonly associated with Random Forest.
New Algorithms in Einstein Discovery
In recent releases of Einstein Discovery, three new, tree-based, ensemble, algorithms were implemented within the platform: Gradient Boost Machine (GBM), XGBoost, and Random Forest. XGBoost and GBM are based on boosting while Random Forest utilizes bagging techniques. So both of the primary decision tree algorithm methods (boosting and bagging) are now covered in Einstein Discovery.
Gradient Boost Machine
Among the types of boosting algorithms available to the modern machine learning engineer, Gradient Boost Machine (GBM) is one of the most popular. In Einstein Discovery, GBM supports regression and binary classification. GBM uses an ensemble approach, with decision trees and gradient-based optimization acting as the learning mechanisms. When a model is in the process of being created, trees are added one at a time, and the previous trees are maintained. Gradient descent is then used to minimize loss (error) as more trees are added. Gradient descent is beyond the scope of this article, but it is a commonly used method in machine learning algorithms for finding optimal models. One of the early seminal papers on gradient boost can be found here: Boosting Algorithms as Gradient Descent.
As the GBM algorithm processes through a dataset, it creates a number of trees that are typically defined by the data scientist. Within the range of trees that are created, the algorithm creates a model with the best performance on the training data provided. Of course, Einstein Discovery handles this for the user behind the scenes. Similar to the GLM algorithms, the new tree-based options in Einstein Discovery use the H2O library. The details and the mathematics used in the algorithm are located here: H2O Gradient Boosting Machine.
You can understand individual decision trees by simply visualizing the tree structure as described earlier in this post. Gradient boosting models, however, are typically built with hundreds of decision trees. Therefore, they cannot be easily interpreted by inspection of the individual trees. Fortunately, Einstein Discovery uses advanced techniques to summarize and interpret gradient boosting models so that a ‘Story’ created with GBM (or XGBoost) is human-readable and interpretable. Even though there are a large number of tree models being created, Einstein Discovery still surfaces the important variables, values, and insights found within the story. This is a unique and powerful capability within Einstein Discovery!
XGBoost
XGBoost (“eXtreme Gradient Boosting”) is arguably the most popular of the modern boosting algorithms. It has reigned supreme with many recent machine learning challenges on the highly competitive Kaggle.com site. It has become a leading algorithm choice when the competitors are tasked with ML problems that use structured or tabular datasets on classification and regression predictive modeling problems.
Like GBM, XGBoost is an ensemble algorithm based on gradient boosting techniques, but with a few key differences. Notably, XGBoost uses a more regularized model formalization to control over-fitting, which often results in better performance. The creator of the XGBoost algorithm, Tianqi Chen, said that the algorithm name “actually refers to the engineering goal to push the limit of computations resources for boosted tree algorithms”. For those of you who want all the details, you’ll find them in the paper XGBoost: A Scalable Tree Boosting System.
Random Forest
Unlike XGBoost and GBM, Random Forest is a bagging-based algorithm. It has a long and rich history now spanning nearly three decades. When applying Random Forest to a dataset, it generates a “forest” of classification or regression trees, rather than a single decision tree. These large number of individual decision trees then operate as ensemble. It has proven to be a very effective and flexible algorithm – perhaps one of the most widely used algorithms in machine learning aside from regression.
The real “secret sauce” in how Random Forest achieves such high effectiveness is one minor (but critical) extension on top of standard bagging algorithms. In bagging, the learning algorithm looks through all variables and values in a dataset when determining how to select the most optimal split-point. The Random Forest algorithm changes this procedure so that the learning algorithm is limited to a random sample of features for each split (tree). This allows Random Forest to avoid common problems such as overfitting, and undesirably high correlation between individual trees. Aggregating predictions from multiple models with ensemble methods works better if the predictions from the individual trees are (mostly) uncorrelated. Therefore, in Random Forest, the sub-trees are created with random subsets of the features (columns) in each tree. This results in models that have higher predictive power and better accuracy.
How are the New Algorithms Surfaced in Einstein Discovery
- When creating a new Story (or on the Story Settings page), you’ll see an Algorithm drop down menu.
- If you know which algorithm you want to use, select it from the menu before creating your story.
- If you are unsure, select “Model Tournament.” This will run a model tournament and surface the best model and story to the user. Note that this option does take longer to run because it generates multiple models to test.
- If you do not specify an algorithm, GLM is selected as the default because it is generally the quickest to finish.
When Are Tree-Based Algorithms More Useful and Effective?
Last but not least, when and why should you use these new algorithms? There is no clear-cut method to know ahead of time which algorithm will perform optimally, so testing and experimentation are sometimes required in order to compare model performance. Some datasets are naturally better suited to linear regression, while for others, tree-based methods yield better results. If you don’t want to try and decipher which algorithm is optimal, no worries! As mentioned earlier, Einstein Discovery handles this automatically via a “model tournament,” so you don’t need to run the comparison manually.
As we said previously, it’s difficult to always know which type of algorithm will perform better on any given dataset, but at a high level, there are several circumstances in which tree-based ensemble algorithms have unique advantages over other supervised machine learning algorithms. These scenarios include (but are not limited to):
- Datasets with target outcomes that are a continuous variable (numeric) have proven to benefit the most from the new algorithms in our experience
- Datasets where there are significant non-linearity and unusual statistical distributions of data (non-Gaussian)
- Datasets where the relationship between dependent & independent variables is unusually complex
- Datasets where the class distribution of the target variable is skewed/imbalanced
- Large datasets with many columns
- Datasets that require careful hyper-parameter tuning
- Datasets that are prone to overfitting (XGBoost provides regularization similar to GLM)
- Datasets with several different data types such as numerical/continuous, date/time, categorical, boolean, etc.
Conclusion
The latest version of Einstein Discovery brings us another incredible collection of new capabilities and improvements in Einstein Discovery, and the new tree-based algorithms are certainly one of the most exciting! Hopefully, this brief article gave you a working understanding of – and an appreciation for – the new algorithms that have been introduced in Einstein Discovery. Now go forth and create some awesome trees!
Boom! Great content, as always.