Preparing Your Data For Einstein Discovery
Einstein Discovery is AI-powered analytics that enables business users to automatically discover relevant patterns based on their data – without having to build sophisticated data models. However, Einstein Discovery is only as good as the historical data it is analyzing. Wrangling the data, which is always the hardest part, involves cleaning and consolidating your historical data. Sometimes this will involve creating new derived or transformed fields. The process of preparing your data doesn’t have to be overwhelming. Whether you are working with the variety of data preparation features within Einstein Analytics, or you are using external tools, as long as you focus on how Einstein Discovery will analyze the data, you will be successful.
Before preparing your data, it’s important to come up with the right use case. Einstein Discovery is entirely outcome focused, which means you need to provide historical data which has already reached the outcome (think opportunities that have already closed rather than open pipeline). Lastly, Einstein Discovery can analyze two types of outcomes: 1. Binary (true/false, yes/no, win/loss) 2. Numeric (Amount, Quantity, Time, etc). Once you have selected your use case, you can start to collect your data.
This process can be broken down into five steps assuming the right use case:
- Identify training dataset
- Join your data
- Choose the columns to include
- Filter Data
- Calculate fields
Let’s deep dive into an example – predicting opportunity time-to-close
What if sales executives had more confidence that deals would be closed on time? What if machine learning could help predict the total time it will take for an opportunity to close, based on historical data? Because we are predicting number of days to close an opportunity, we can apply a Linear Regression machine learning algorithm, but don’t worry if you don’t know what any of that means, Einstein Discovery will apply it automatically when you choose the outcome to analyze.
While not all sales process times can be predicted, this example assumes sales reps are tracking their entire deal cycle in Salesforce. In order to train Einstein Discovery to be able to predict how long it will take an opportunity to close, we need to feed it historical data that shows how long it took for actual deals to close. Not all historical data will be useful however, and it will be important to clean and prepare your data before Einstein Discovery is smart enough to make predictions. For example, if your sales process changed drastically in the last year, then historical data from two years ago will not surface relevant insights to improving today’s sales cycle.
Step 1. Identify data for training
First, it’s important to identify where all your historical data lives. This will be the data used to train Einstein Discovery about what has happened in the past, so it can identify all of the patterns. Opportunity data, for instance, should live in Salesforce, but more specifically, it lives in the Opportunity object. Information about who the rep is selling to should be tracked in the Account Object. Related activities to the Opportunity should live in the Task and Event objects, various members of the selling team live in the Opportunity Team Member object and details about the Sales Rep should live in the User Object. This is all important because Einstein Discovery will need to learn from each and every historical opportunity, which should include Account, and Owner information and may also include counts of tasks, events and/or team members. If the data is already in Salesforce, Einstein Analytics Data Manager makes it very easy to sync your objects and Recipes can be used to join your data.
Step 2. Join your data
To predict opportunity time-to-close, we will want to end up with opportunities as the finest grain. i.e. all rows will be individual opportunities. Using an Einstein Analytics Recipe we can easily join parent objects such as account and opportunity owner. For child objects like tasks, events and opportunity team members, we will have to aggregate these values. The simplest approach is to create separate datasets using the aggregate transformation (Eg. count number of task rows by whatId), then join these back to the opportunity dataset.
The following objects were joined for this example
- Opportunity (this is the base object)
- Account
- User (to extract some details of the owner)
- UserRole (further extract details of the owner)
- Record Type (to get the record type name)
- Tasks and Events (as aggregate counts)
- Opportunity Team Members (as aggregate counts)
Step 3. Choosing the columns to include
After joining 6 objects together, you will undoubtedly have a ton of columns to choose from. Should you include them all in the final dataset? Most columns won’t actually provide any useful insights. and you will find that when analyzing your data using Einstein Discovery, the leaner the dataset the better. This part can be tricky, and if you are not sure whether or not to include the field, it’s better to just include it and let Einstein Discovery decide if it’s useful or not. Below are some considerations to take when choosing the fields to analyze:
Avoid fields which are not relevant to the outcome
When analyzing data that has already reached the outcome, it’s important to eliminate columns where there are values that do not occur when the record has reached its final state. For example if we are only looking at closed opportunities we should ignore Stage, Forecast Category or Probability, because we will only be analyzing data which is either Closed Won or Closed Lost. The predictive model won’t know what to do if it sees Stage = Qualification because it was never trained for this value.
Choose fields with less unique values
Einstein Discovery will be looking for patterns. Fields with more unique values (high cardinality) such as Id or unique name fields as well as fields that are mostly constant (or null) will not contain any signal and can be removed. High cardinality is not just limited to Id fields, Einstein Discovery can detect patterns in fields with no more than 100 unique values. If a field has more than 100 unique values, Einstein Discovery will only look at the top 100 values by occurrence and all others will be automatically grouped together in an “other” bucket. A good example of this is the account name on an opportunity. If you are analyzing 50K opportunities there might be 10K different accounts associated. Einstein Discovery can only identify patterns from the top 100 Accounts by occurrence, and the other 9.9K will be grouped together. Rather than looking at the account name, you should consider including account fields such as industry, segment, annual revenue and size that can be used as an account profile because those are the patterns you will be are more interested in anyway.
Be careful with Dates
Date fields are tricky. Patterns in date fields usually occur by extracting the month of year or day of week, this is already part of the Einstein Discovery Story creation, so we do not need to worry about this now. Additionally, the time between dates like Close Date – Created Date is interesting and we will look at this shortly, but the dates themselves are less exciting. Dates like Last Modified Date, Last Viewed Date and Last Activity Dates probably won’t have any interesting information, so you should probably just remove them.
Some examples of fields which were selected by the object (This example only looks at standard fields for simplicity purposes, but all custom fields are supported)
- Opportunity – Lead Source, Opportunity Type, Created Date, Close Date (so we can calculate the difference between the two dates)
- Account – Industry, Annual Revenue, Number of Employees, Account Source, Account Type, Billing Country
- User – Created Date (for calculation purposes later), (Opportunity Owner Manager) Full Name
- UserRole – (Opportunity Owner) Role Name, (Account Owner) Role Name
- RecordType – (Opportunity) Record Type Name, (Account) Record Type Name
- Task and Events – Row Counts
- Opportunity Team Members – Row Counts
Some examples of fields which were not selected
- LastModifiedDate – Knowing when the record was last modified doesn’t tell me anything about how long it took to close. Opportunities very well could have been updated after they were closed.
- AccountId – Too many unique values
- StageName – This is a field which designates the sales process stage. Since we are training when IsWon=true, the model will only know the stage “Closed Won”. When it sees stages like “Qualification” the model won’t know what to do that value and consider it out of bounds for the model.
- Billing City – This was removed because there was a high frequency of null occurrences in this particular dataset
- IsClosed – This was removed because it only ever has the value of true (since we filtered on IsWon = true)
Step 4. Filter your data
We are only interested in opportunities which have already reached the outcome. For Opportunities, it typically means when the Opportunity is closed (IsClosed= true). However, to get a good sense of deals that go through the entire sales cycle, we’ll only look at deals that were won (IsWon = true). Also, not all historical data will be useful, especially as sales processes change. It’s generally a good idea to look at more recent data which is more aligned to the current process, so in this example, we are going to filter for won opportunities closed after Jan 1, 2017. You may also consider filtering for a specific segment of your data as insights which drive faster sales for your enterprise customers may be completely different from your small business customers.
Step 5. Calculate additional fields
This step can be especially useful to build derived fields, but in this example we first need to create a field which shows the opportunity age otherwise Einstein Discovery has no way to analyze the sales process time. Using a recipe, you must first extract the “Day Epoch” value from both the Close Date and the Opportunity Created Date, then using a formula field, subtract the Epoch value of Created Date from Close Date.
Additional derived fields may provide some additional signals into the model. For example, with this model, we believe that a more seasoned sales reps will be able to sell faster. If the sales rep tenure does not already exist in Salesforce, it could be derived by subtracting the opportunity owner’s user record created date from the closed date on the opportunity (since we want to know how long they were an employee at the time the opportunity was closed).
Some ideas for calculated fields which may help
- Difference between dates
- Extract week or quarter from date field (Einstein Discovery can only extract day of week or month of year)
- Bucket large groups of text values into smaller categories
- Bucket fields into categories
Once the five steps have been completed you are ready to start analyzing with Einstein Discovery.
Final Thoughts
Data wrangling can be the most time consuming part for getting started with AI-augmented analytics. Einstein Analytics provides some great tools to help with the process, but any tool that can output a CSV file will be sufficient for working with Einstein Discovery. It’s important that you put some thought into the data before trying to analyze it. Try to keep the dataset lean as you don’t want to get lost looking at data columns that are not providing any value and remember that Einstein Discovery works best on historical data that has already reached either a binary or numerical outcome.
Wonderful post. Much appreciate the screenshots and animated gif’s to show the detail. Perhaps a future post on Stories including non-Salesforce data?
Definitely planning on doing more deep dives like this. Handling of external data will be coming soon.
Thanks, Bobby, very helpful. Agree with Steve on perhaps Salesforce data augmented with other data like log files, perhaps Churn? Thanks again.
Great blog! Thanks a ton for the deep dive and will be very useful for a lot of admins struggling to figure out where to start!