Einstein Discovery – Bring Your Own Model Deep Dive
Einstein Discovery machine learning helps you build powerful predictive models on your data using clicks, not code. With a simple, wizard-driven interface, you can rapidly create insights and predictive models. Einstein Discovery uses a number of industry-standard algorithms to build models, including:
- GLM (linear and logistic regression)
- GBM (Gradient Boost Machine)
- XGBoost
- RandomForest
- Model Tournament – Einstein Discovery automatically selects whichever of the above algorithms delivers the best model performance on your data
These algorithms are specifically chosen to cover many common business use cases – while also providing rich explainability and interpretability. As the field of machine learning continues to grow rapidly, robust new algorithms will certainly emerge from academia and industry. Some of these will likely be added to the list of “out of the box” algorithms that Einstein Discovery supports as we continue to innovate for our customers. If you would like to understand more about how predictive modeling works in Einstein Discovery, please refer to this whitepaper.
There are certain situations where it makes sense to move away from the ‘native’ modeling in Einstein Discovery and leverage an external model. The Einstein Discovery feature that enables this is called “Bring Your Own Model” (BYOM). BYOM allows you to upload Python-based, custom-code models into the Salesforce cloud environment while still taking advantage of many inherent benefits of Einstein Discovery:
- Highly scalable, cloud-based platform for operationalization of the model
- Model deployment wizard
- REST API, APEX, and Lightning component integration in Salesforce to deliver insights in the flow of CRM
- Integration with Tableau.
Scenarios
At a high level, there are three typical scenarios where it may make sense to use BYOM – rather than Einstein Discovery native modeling.
Scenario #1: Your data science team has created a model that is sufficiently complex or unique such that it cannot be replicated (e.g. data pre-processing transformations that are not supported natively in Einstein Discovery).
Scenario #2: Your model requires a machine learning technique/algorithm that is not currently supported with native Einstein Discovery algorithms (e.g. Deep Learning/Neural Nets).
Scenario #3: Your model training data is stored in a location other than a CRM Analytics dataset and you are unable to move it (e.g. data in a cloud platform other than Salesforce/CRM Analytics).
Detailed Walk Through
In this blog article (part 1 of 2), we will step through an end-to-end example of creating a Python-based model (including data preparation), and uploading the model into Einstein Discovery. Full working code samples are provided to illustrate the concepts. In part two of this BYOM blog, we will walk through operationalizing the predictions into CRM.
The overall structure of this walk-through is contained in three phases – some of which have multiple stages and steps:
- Building a model in Python
- Preparing your files for upload
- Uploading the model files into Model Manager.
Phase One – Building and Preparing Your Model in Python
Building the model
This section walks you through building the model for our exercise. If you already have a model you’d prefer to use, you can proceed to the next section to see how to use it with BYOM in Einstein Discovery. Please note, however, that your model must adhere to all the requirements explained in this article.
The files for the exercises are included so you can follow along with all the processes covered in this article. This includes the training dataset, Python code, Jupyter notebook, validation.csv, and the BYOM .zip file. The files are located here.
The data for our model-building exercise is a public fuel efficiency dataset. The model we are using in this example is a neural network built with the TensorFlow framework (version 2.7.0) using Python 3.7.4. These are the specific versions of TensorFlow and Python that are supported for BYOM. Please see the current list of the supported libraries and version requirements here.
To get a feel for the sample data, below are the first ten rows of the dataset:
| MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model_Year | Origin |
|-----|-----------|--------------|------------|--------|--------------|------------|--------|
| 18 | 8 | 307 | 130 | 3504 | 12 | 70 | 1 |
| 15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 |
| 18 | 8 | 318 | 150 | 3436 | 11 | 70 | 1 |
| 16 | 8 | 304 | 150 | 3433 | 12 | 70 | 1 |
| 17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 |
| 15 | 8 | 429 | 198 | 4341 | 10 | 70 | 1 |
| 14 | 8 | 454 | 220 | 4354 | 9 | 70 | 1 |
| 14 | 8 | 440 | 215 | 4312 | 8.5 | 70 | 1 |
| 14 | 8 | 455 | 225 | 4425 | 10 | 70 | 1 |
| 15 | 8 | 390 | 190 | 3850 | 8.5 | 70 | 1 |
Each row corresponds to the features of a specific automobile. The first column is the fuel efficiency of the auto (in miles per gallon), and the remaining columns contain other relevant measures.
The information below is broken down into three stages, which you’ll later see correlate directly to how BYOM consumes your model. The stages are:
- Pre-processing (data cleaning)
- Model creation
- Post-processing
Stage 1: Data cleaning/pre-processing of the input data
Before loading the dataset for model training, we will review additional steps we can take to clean our data to produce a better model. We refer to this step of transforming raw input data to well-prepared data that we eventually input into our model as preprocessing.
Step 1
import pandas as pd
raw_dataset = pd.read_csv('fuel_efficiency_data.csv', na_values='?', skipinitialspace=True)
This code performs three key tasks:
- First, we load the Python pandas library
- It then reads the CSV file into a dataframe (*you will need to alter the file path to point at the CSV file on your system)
- The skipinitialspace option strips extra spaces after delimiters, and na_values='?' flags the ‘?’ entries as missing values, keeping the dataset clean
Step 2
In our dataset, some rows have missing values, represented by ‘?’. The code below removes them and then prints the dataset to the screen for your viewing pleasure.
digested_dataset = raw_dataset.dropna()
print(digested_dataset)
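If you’re curious how much data the dropna() call removes, a quick optional check like the following (using the dataframes defined above) shows the missing values per column and the row counts before and after:
# Count missing values per column (the '?' placeholders were read in as NaN via na_values='?')
print(raw_dataset.isna().sum())
# Compare row counts before and after dropping rows with missing values
print(len(raw_dataset), len(digested_dataset))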
Step 3
You will notice that the ‘Origin’ column in the dataset contains only three distinct values. These values correspond to each automobile’s respective country of origin. For this example, the values are mapped to countries as follows:
| Origin value | Origin place |
|--------------|--------------|
| 1 | USA |
| 2 | Europe |
| 3 | Japan |
In this data, the number 1 represents ‘USA’, and so forth. Origin, therefore, is not actually intended to be a numeric value; rather, it is a categorical value representing the respective country. An Origin value of ‘2’ doesn’t mean double the value of Origin ‘1’ – it represents a different country of origin.
This mapping of business logic to metadata is something we see in datasets often, and it’s important to account for it in our model-building process. If we don’t, the model will treat Origin ‘2’ as if it were twice Origin ‘1’, which makes no sense in the context of ‘Origin’ here.
To account for this in our model-building process, we will use a technique called ‘one-hot’ encoding for the Origin value. (You can learn more about the technique itself here). This allows us to transform Origin from a single column to three columns (USA, Europe, Japan based on the three values of origin) – but with only 1s and 0s.
For example, the table below shows how the left column is transformed to the right three columns.
| Origin | USA | Europe | Japan |
|--------|-----|--------|-------|
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
Below is the code to complete this transformation:
value_to_region_map = {1: 'USA', 2: 'Europe', 3: 'Japan'}
digested_dataset = digested_dataset.replace({'Origin': value_to_region_map})
digested_dataset = pd.get_dummies(digested_dataset, prefix='', prefix_sep='')
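To confirm the encoding did what we expect, you can print the resulting column names (a quick optional check on the digested_dataset from above):
# After get_dummies, the single Origin column is replaced by three indicator columns
print(sorted(digested_dataset.columns))
# Expect to see 'Europe', 'Japan', and 'USA' alongside the original numeric columns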
Step 4
Steps 4 and 5 may seem unusual to you if you regularly create models in Python, but they are important in this context. These two steps are helpful to ensure our model is well prepared for the BYOM preprocessor. The reasons will become evident later in the article.
The following code covers scenarios where one of the encoded values used in model training happens to not be present in your data. We therefore force the addition of a column for each category – even if that value does not appear in any row.
To cover this, let’s add the following code:
if 'USA' not in digested_dataset.columns:
    digested_dataset.loc[:, "USA"] = 0
if 'Europe' not in digested_dataset.columns:
    digested_dataset.loc[:, "Europe"] = 0
if 'Japan' not in digested_dataset.columns:
    digested_dataset.loc[:, "Japan"] = 0
We know that our dataset contains at least one entry for each country, so is the above code really necessary? Even though we don’t strictly need this step for our model training (because there is at least one ‘USA’, ‘Europe’, and ‘Japan’ row), it is necessary when defining the preprocessor for BYOM.
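To see the problem the guards solve, here is a tiny, hypothetical example (the single_row dataframe is made up purely for illustration) showing what get_dummies produces when only one Origin value is present:
# A one-row input with only Origin = 'USA' produces just a 'USA' dummy column;
# without the guards above, the 'Europe' and 'Japan' columns would be missing entirely.
single_row = pd.DataFrame({'Cylinders': [4], 'Origin': ['USA']})
print(pd.get_dummies(single_row, prefix='', prefix_sep='').columns.tolist())
# ['Cylinders', 'USA']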
Step 5
One other important step in preparing our model for use in BYOM is sorting the columns so they are always ordered in a consistent manner. As with Step 4 above, the reason for this approach will become clear when we talk about the BYOM preprocessor. The code for sorting the columns is below.
digested_dataset = digested_dataset.reindex(sorted(digested_dataset.columns), axis=1)
And there we have it; we have now prepared the data, and our digested_dataset is ready to be used for model training. Below, we have put all the steps from above into a single tidy Python function.
def read_dataset(file_path):
    raw_dataset = pd.read_csv(file_path, na_values='?', skipinitialspace=True)
    # Cleaning NAs
    digested_dataset = raw_dataset.dropna()
    # One hot encoding
    value_to_region_map = {1: 'USA', 2: 'Europe', 3: 'Japan'}
    digested_dataset = digested_dataset.replace({'Origin': value_to_region_map})
    digested_dataset = pd.get_dummies(digested_dataset, prefix='', prefix_sep='')
    # Fill back missing values during one-hot encoding
    # As this is based on the data, make sure column is created for all possible values.
    if 'USA' not in digested_dataset.columns:
        digested_dataset.loc[:, "USA"] = 0
    if 'Europe' not in digested_dataset.columns:
        digested_dataset.loc[:, "Europe"] = 0
    if 'Japan' not in digested_dataset.columns:
        digested_dataset.loc[:, "Japan"] = 0
    # Sorting columns to make sure the columns are always ordered the same way
    digested_dataset = digested_dataset.reindex(sorted(digested_dataset.columns), axis=1)
    return digested_dataset
Stage 2: Model Creation
Step 1 – Splitting the data
We are now ready to proceed with creating our Python model. The first thing we will do is create our prepared dataset by applying the preprocessing function we created above.
#you will need to enter the path to the original csv file on your system
#e.g. ('/Users/johndoe/Documents/fuel_efficiency.csv')
DATA_FILE = 'fuel_efficiency_data.csv'
#call the read_dataset function on the csv that we created at the end of Stage 1
dataset = read_dataset(DATA_FILE)
Next, in the code below, we split the dataset into training and test datasets.
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)
train_features = train_dataset.copy()
test_features = test_dataset.copy()
The code below removes the outcome column from both the training and test data. We are popping (removing) the ‘MPG’ column and saving it separately to be used as labels.
train_labels = train_features.pop('MPG')
test_labels = test_features.pop('MPG')
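An optional sanity check on the shapes (using the variables defined above) confirms the split and that the MPG column was removed from the feature frames:
# Features should have one fewer column than the full dataset (MPG removed),
# and each label series should line up row-for-row with its feature frame.
print(train_features.shape, train_labels.shape)
print(test_features.shape, test_labels.shape)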
Step 2 – Model Training
The model we will now create is a simple neural network with two layers – the first layer will be a normalization layer, and the second is a dense layer (standard neural network layer). Links are provided below if you’d like to further understand how these two layers work.
The Normalization layer is used to normalize continuous features in the model. You can read more about this on the TensorFlow website here.
In a neural net, a Dense layer is fully connected to its preceding layer. It changes the dimensions of its output through matrix-vector multiplication, which lets the model learn the relationship between the values in the data. You can read more about the Dense layer on the TensorFlow website here.
#First thing is to import the Tensorflow library. You may need to install it first.
import tensorflow as tf
import numpy as np
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing
# Create a new normalizer
normalizer = preprocessing.Normalization()
normalizer.adapt(np.array(train_features))
# Add the normalizer as a layer
model = tf.keras.Sequential([normalizer,layers.Dense(units=1)])
We will now compile the model with the code below. In this example, we chose the Adam optimizer, a gradient-descent-based optimization method, with MAE (mean absolute error) as the loss function. You can read more about the Adam optimizer here.
model.compile( optimizer=tf.optimizers.Adam(learning_rate=0.1), loss='mean_absolute_error')
The code below fits the model using the training data we have prepared in previous steps.
history = model.fit(
x=train_features.values, y=train_labels.values,
epochs=100,
validation_split=0.2)
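If you’d like to see how training progressed, the history object returned by fit() records the loss per epoch; a minimal optional sketch (no plotting library needed) is below:
# history.history is a dict with one list per metric;
# here we print the training and validation loss (MAE) for the last five epochs.
print(history.history['loss'][-5:])
print(history.history['val_loss'][-5:])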
Now that we have fit the model with our data, we will use the code below to see how our model performs against the test data.
print(model.evaluate(x=test_features.values, y=test_labels.values))
Your results will vary slightly each time you retrain the model, but your loss (MAE) should be approximately 2.5, as shown in the screenshot below.
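Because the loss is MAE, you can reproduce that number directly from the predictions if you want to see what it means in MPG terms (a small optional check using NumPy, which we imported earlier):
# Mean absolute error computed by hand: the average absolute difference (in MPG)
# between the model's predictions and the true test labels.
manual_mae = np.mean(np.abs(model.predict(test_features.values).flatten() - test_labels.values))
print(manual_mae)  # should be close to the value reported by model.evaluate()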
Lastly, we save our model with the code below. You will need to replace MODEL_PATH with the directory path you wish to save the model files to on your system. On a Mac that would look something like: model.save('/Users/johndoe/Documents')
#you will need to enter the path where you want to save the files on your system e.g. model.save('/Users/dshadravan/Downloads/test')
model.save(MODEL_PATH)
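After saving, it can be useful to confirm which files TensorFlow wrote, since these are exactly the artifacts BYOM will expect later in the zip file (a quick check, assuming MODEL_PATH is the directory you saved to):
import os
# A TensorFlow SavedModel directory contains saved_model.pb, keras_metadata.pb,
# plus the assets/ and variables/ folders.
print(sorted(os.listdir(MODEL_PATH)))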
Step 3 – Post-processing of the results
To recap, our model is going to help predict the fuel efficiency of the car based on the supplied variables. The predictive output for each row of data is a number that should come reasonably close to the expected output (in miles per gallon).
One possible scenario is that we might want the predicted fuel efficiency in kilometers per liter rather than miles per gallon (the unit used in the training data). We can easily do this by multiplying each predicted miles-per-gallon value by the conversion factor (~0.425) to get the prediction in kilometers per liter. Note that this value will differ from the raw model output because we multiplied the prediction by the conversion factor.
The snippet of code below demonstrates the conversion, transforming the raw predicted output (MPG) into the desired final prediction format for the BYOM model (km/L). Again, we refer to this step as post-processing.
# Store the predictions on the test dataset
raw_predictions = model.predict(test_features.values)
# post processing
final_predicted_output = []
for prediction in raw_predictions:  # raw_predictions is an array of single-element arrays
    final_predicted_output.append(prediction[0] * 0.425144)
# print your model output
print(final_predicted_output)
Now we have an end-to-end flow for creating the Python model – including pre-processing, model creation, and post-processing. Next, let’s see how we can use this model with Einstein Discovery BYOM.
Phase 2 – Getting started with BYOM
Requirements and Prerequisites
If you followed the model-building exercise in the previous section, you might have noticed that some extra files and folders were generated when you saved the model. Do not delete these files and folders; you will need them as we proceed with the tutorial.
BYOM requires a zip file containing the model and various other files that are necessary for establishing the prediction pipeline in Einstein Discovery. The following section walks through adding and configuring the files that need to reside in the zip, which you will ultimately upload to Einstein Discovery.
Einstein Discovery currently supports both TensorFlow and scikit-learn models written in Python. Because the files generated during model creation differ slightly between the two frameworks, the file requirements for each are somewhat different. For the latest information on supported frameworks, versions, and file requirements, refer to the following help article.
Because BYOM has a pipeline designed to manage multiple files with specific attributes, you will need to adhere exactly to the naming conventions outlined below to avoid errors. The requirements for BYOM (as of the Winter ’23 release) are outlined below.
TensorFlow
At the time of writing, BYOM supports TensorFlow 2.7.0 using Python 3.7.
Below are the zip file requirements for TensorFlow (you must use these exact names):
- data_processor.py
- validation.csv
- saved_model.pb – this is the model itself; you can use the save() function to serialize your model (as we did in the model creation exercise above)
- keras_metadata.pb
- assets/ – a folder that is automatically generated when saving the model
- variables/ – a folder that is automatically generated when saving the model
Scikit-learn
BYOM supports scikit-learn 1.0.2 using Python 3.7.
Below are the zip file requirements for scikit-learn. Again, you must use these exact names.
- data_processor.py
- validation.csv
- saved_model.pkl – this is the model itself; you can use the pickle.dump() function to serialize your model
Required Files
You will need to group the model files and the auxiliary files that the framework generates into one folder. The components required for the zip file are:
- Model output files:
  - saved_model.pb, keras_metadata.pb, assets/, and variables/ for TensorFlow
  - saved_model.pkl for scikit-learn
- data_processor.py
- validation.csv
Stage 1: Creating the data_processor.py file
The data_processor.py file needs to declare three key things, listed below:
- LABEL_COLUMN
- preprocessor(file_path)
- postprocessor(predictions)
Step 1 – Required variable declaration for LABEL_COLUMN
The value of LABEL_COLUMN is a string matching the name of the outcome column in your validation.csv. You will see where this fits in the data_processor.py file later in this phase. For the fuel efficiency example above, the necessary label column is shown below.
LABEL_COLUMN = "Fuel_Efficiency"
Step 2 – Pre-processing of the input data
We will now discuss the functions that need to be declared in data_processor.py, starting with the preprocessor:
- input: a string that represents the file path of the raw data
- output: your prepared dataset for the model
Here is a list of things that your preprocessor function will accomplish.
- The preprocessor reads the raw data from the file path. (*This bit is potentially confusing – you do not need to be concerned with what the actual file path is; it will automatically be directed to the correct path for the input data.)
- The preprocessor needs to transform this raw data based on your business logic
- The output of the preprocessor is an array of data points that is supplied to your model for predictions
- You need to ensure that the raw data is processed to match the data you intend to supply to the model. This is required for every potential input (refer to the section below for clarification).
With those ideas in mind, we will now move on to creating the preprocessor.
If you refer back to Stage 1 above, the read_dataset code block does most of the preprocessing. We also pointed out that Steps 4 and 5 are not typical steps in creating general Python models, but they are important in building the BYOM pipeline. We explain why below.
- The preprocessor script runs on any input data you send for predictions in the future
- It is absolutely crucial that the preprocessor can handle any input (raw) data that you might send to the system in the future. This is why Step 4 is important: even though our training dataset has at least one data point with USA (1), Europe (2), and Japan (3), that might not be true for new input data.
- If we hadn’t put Step 4 in place, our transformed data would have one fewer column whenever the input data for your prediction doesn’t contain, for example, USA (1) in the Origin column.
- Therefore, it is important to write the preprocessor such that, for every possible input, it outputs the data points in the format your model is trained to understand.
- Step 5 ensures that the columns are sorted in the same order during prediction as they were during model creation.
- Finally, we need to remove the output column from the output of the preprocessor.
Using the read_dataset function from above, our preprocessor for the exercise should look similar to the following.
def read_dataset(file_path):
    raw_dataset = pd.read_csv(file_path, na_values='?', skipinitialspace=True)
    # Cleaning NAs
    digested_dataset = raw_dataset.dropna()
    # One hot encoding
    value_to_region_map = {1: 'USA', 2: 'Europe', 3: 'Japan'}
    digested_dataset = digested_dataset.replace({'Origin': value_to_region_map})
    digested_dataset = pd.get_dummies(digested_dataset, prefix='', prefix_sep='')
    # Fill back missing values during one-hot encoding
    # As this is based on the data, make sure column is created for all possible values.
    if 'USA' not in digested_dataset.columns:
        digested_dataset.loc[:, "USA"] = 0
    if 'Europe' not in digested_dataset.columns:
        digested_dataset.loc[:, "Europe"] = 0
    if 'Japan' not in digested_dataset.columns:
        digested_dataset.loc[:, "Japan"] = 0
    # Sorting columns to make sure the columns are always ordered the same way
    digested_dataset = digested_dataset.reindex(sorted(digested_dataset.columns), axis=1)
    return digested_dataset

def preprocessor(file_path):
    """
    This is a mandatory function with a single argument for the file_path to the csv data.
    The CSV file with data is RFC4180 formatted.
    :param file_path: string
    :return: array of data points to your model
    """
    dataset = read_dataset(file_path)
    if LABEL_COLUMN in dataset.columns:
        dataset.pop(LABEL_COLUMN)
    return dataset.values
In the above code, we leverage the read_dataset function to transform our raw input data into more meaningful and rich input data. An extra step removes the LABEL_COLUMN if it happens to be present, ensuring the preprocessor accurately translates the raw input data into the format the model expects. Failure to do so will result in inaccurate predictions.
Step 3 – Post-processing (predictions)
The primary purpose of the postprocessor is to transform the model output (raw predictions) into the final format that you desire (e.g. converting miles to km). The following list describes the basic structure of the postprocessor.
- input: an array of predictions (the exact format differs based on your model’s output)
- output: an array of processed outputs
In this particular scenario, declaring our postprocessor is relatively trivial. The code for the postprocessor is shown below.
def postprocessor(predictions):
    """
    This is a mandatory function with a single input argument with array of predictions.
    :param predictions: array of array of string
    """
    final_results = []
    for prediction in predictions:
        # predictions is an array of array of strings
        final_results.append(prediction[0] * 0.425144)
    return final_results
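As a quick sanity check, you can call the postprocessor with a couple of hand-made raw predictions in the same shape the model produces (one single-element array per row):
# 18 MPG and 25 MPG converted to km/L (multiplied by 0.425144)
print(postprocessor([[18.0], [25.0]]))
# [7.652592, 10.6286]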
The finished data_processor.py should appear similar to the block of code below.
"""
This script is required as an input for Einstein Discovery - Bring your own model (BYOM)
This script is used to convert raw input into digested input that the model recognizes (preprocessor function),
and also convert raw predict to user readable output to be sent/stored by Einstein Discovery Predictions (postprocessor function).
The mandatory functions/variables in this file are marked by a comment.
"""
import pandas as pd
# This is a mandatory variable, and is required.
# This specifies the name of the field to be used as the label column / outcome column in validation.csv
LABEL_COLUMN = 'Fuel_Efficiency'
COLUMN_NAMES = ['Fuel_Efficiency', 'Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration', 'Model_Year', 'Origin']
def read_dataset(file_path):
    raw_dataset = pd.read_csv(file_path, na_values='?', skipinitialspace=True)
    # Cleaning NAs
    digested_dataset = raw_dataset.dropna()
    # One hot encoding
    value_to_region_map = {1: 'USA', 2: 'Europe', 3: 'Japan'}
    digested_dataset = digested_dataset.replace({'Origin': value_to_region_map})
    digested_dataset = pd.get_dummies(digested_dataset, prefix='', prefix_sep='')
    # Fill back missing values during one-hot encoding
    # As this is based on the data, make sure column is created for all possible values.
    if 'USA' not in digested_dataset.columns:
        digested_dataset.loc[:, "USA"] = 0
    if 'Europe' not in digested_dataset.columns:
        digested_dataset.loc[:, "Europe"] = 0
    if 'Japan' not in digested_dataset.columns:
        digested_dataset.loc[:, "Japan"] = 0
    # Sorting columns to make sure the columns are always ordered the same way
    digested_dataset = digested_dataset.reindex(sorted(digested_dataset.columns), axis=1)
    return digested_dataset

def preprocessor(file_path):
    """
    This is a mandatory function with a single argument for the file_path to the csv data.
    The CSV file with data is RFC4180 formatted.
    :param file_path: string
    :return: array of data points to your model
    """
    dataset = read_dataset(file_path)
    # Remove LABEL_COLUMN if it accidentally exists in the dataset
    if LABEL_COLUMN in dataset.columns:
        dataset.pop(LABEL_COLUMN)
    return dataset.values

def postprocessor(predictions):
    """
    This is a mandatory function with a single input argument with array of predictions.
    :param predictions: array of array of string
    """
    final_results = []
    for prediction in predictions:
        # predictions is an array of array of strings
        final_results.append(prediction[0] * 0.425144)
    return final_results
Stage 2: validation.csv
Before you start using the BYOM pipeline to generate and consume predictions, we need to ensure everything works end to end. This is where the validation.csv file comes into play. Please understand that validation.csv does not refer to a model validation dataset (a common term in data science) in any way. This is often a point of confusion, understandably so.
The validation.csv should contain raw data for all the input columns that are intended for use in the preprocessor function – plus the output column (the LABEL_COLUMN in data_processor.py). The output column values need to match the final desired output (including transformations performed by the postprocessor function).
Below are some important points to help you further understand the purpose and function of the validation.csv:
- We read the validation.csv to understand the raw input data that is provided to the preprocessor in data_processor.py
- It should represent the raw input, very similar to the raw training CSV (fuel_efficiency_data.csv) in our exercise (for the input columns)
- The output column, however, should be the same as the final output (based on the predictions of the model and the post-processing script)
- We are not training the model based on validation.csv! So we don’t need this validation.csv to be more than a handful of rows. The maximum row count for validation.csv is 99.
- BYOM performs the following steps:
  - Read the input data from validation.csv and run the preprocessor function on it
  - Take the output of the preprocessor and predict using the supplied model
  - Take the output of the model (the predictions) and run the postprocessor function on it
  - Compare the output of the postprocessor function (the final output) to the outcome column (LABEL_COLUMN) in the validation.csv
For our sample exercise, the validation.csv is below.
| Fuel_Efficiency | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model_Year | Origin |
|-----------------|-----------|--------------|------------|--------|--------------|------------|--------|
| 6.8023 | 8 | 307 | 130 | 3504 | 12 | 70 | 1 |
| 5.95202 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 |
| 6.8023 | 8 | 318 | 150 | 3436 | 11 | 70 | 1 |
| 6.33772 | 8 | 304 | 150 | 3433 | 12 | 70 | 1 |
| 6.8023 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 |
| 4.67658 | 8 | 429 | 198 | 4341 | 10 | 70 | 1 |
| 4.25144 | 8 | 454 | 220 | 4354 | 9 | 70 | 1 |
| 4.67658 | 8 | 440 | 215 | 4312 | 8.5 | 70 | 1 |
| 3.8263 | 8 | 455 | 225 | 4425 | 10 | 70 | 1 |
| 10.6286 | 4 | 113 | 95 | 2372 | 15 | 70 | 3 |
| 10.6286 | 4 | 97 | 46 | 1835 | 20.5 | 70 | 2 |
A couple of important things to keep in mind:
- The output column in validation.csv does not contain the values from the training data, but rather the output of the model (predictions) transformed by the post-processing script (in this case, Fuel_Efficiency is in km/L while our input dataset is in MPG)
- The order of the columns is not critical, as long as LABEL_COLUMN is defined in data_processor.py and the preprocessor script orders the columns in a fixed, consistent way
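Before moving on to the upload, you can optionally replicate the validation steps above on your own machine to catch problems early. The sketch below assumes data_processor.py and validation.csv sit in your working directory and that MODEL_DIR points to the folder where you saved the TensorFlow model (both paths are placeholders for your own setup):
import pandas as pd
import tensorflow as tf
from data_processor import LABEL_COLUMN, preprocessor, postprocessor

VALIDATION_FILE = 'validation.csv'   # path to your validation file
MODEL_DIR = 'saved_model_dir'        # placeholder: wherever you saved the model earlier

# 1. Run the preprocessor on the raw validation data (it pops the LABEL_COLUMN itself)
features = preprocessor(VALIDATION_FILE)

# 2. Load the saved model and generate raw predictions (still in MPG)
model = tf.keras.models.load_model(MODEL_DIR)
raw_predictions = model.predict(features)

# 3. Transform the raw predictions into the final format (km/L)
final_output = postprocessor(raw_predictions)

# 4. Compare against the outcome column in validation.csv
expected = pd.read_csv(VALIDATION_FILE, skipinitialspace=True)[LABEL_COLUMN]
for predicted, expected_value in zip(final_output, expected):
    print(f"predicted: {predicted:.4f}   expected: {expected_value:.4f}")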
Phase 3 – Uploading the BYOM .zip file
Now that we have all the required files (model files, data_processor.py, and validation.csv), all that remains is to put these files in a folder and zip it up. You will upload this zip file in Analytics Studio (specifically in Model Manager).
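There are many ways to create the zip; one minimal sketch using Python’s standard library is shown below, assuming you placed all of the required files in a folder named byom_model (the folder name is just an example):
import shutil

# Creates byom_model.zip with the contents of the byom_model/ folder at the archive root:
# data_processor.py, validation.csv, saved_model.pb, keras_metadata.pb, assets/, variables/
shutil.make_archive('byom_model', 'zip', root_dir='byom_model')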
In this section, we provide detailed instructions on where to upload the zip file you’ve created – and how to prepare for using the predictions pipeline. This will allow you to begin using the power of BYOM and the “predictions everywhere” approach with Einstein Discovery (covered in part two of this blog series).
Step 1:
On the left panel of your Analytics Home (as in the screenshot below), click on Model Manager. This navigates to the model management page, where all your deployed models are displayed. This is where you will upload the BYOM zip file.
Step 2:
On the Model Manager page, click the ‘Upload Model’ button.
Step 3:
After clicking the Upload Model button, you will see a modal pop up (as shown below) where you can fill in the details for your model. For the model we created in the exercise above, we would perform the following:
- Model Name: Choose a name for your model
- Description: Choose any description you want to associate with this particular model
- Model Runtime: Select the framework you used to build this model from the list of available frameworks. For the model we built in the previous exercise, we will choose Tensorflow: 2.7.0 Python: 3.7.
- Model Type: Choose the model type that matches the one you are uploading (for this exercise, we choose the Regression option)
Once you have made your selection, click Next.
Step 4:
Click on the Upload File button and select the .zip file you have prepared. Once you have selected the file, click the ‘Upload’ button (see the screenshots below).
Step 5:
Wait for the model to upload, and validation of your zip file to finish.
Recall that the validation step performs the following five steps:
- Take the input columns of your validation.csv
- Run a preprocessor script on it to get the parsed data for your model input
- Make predictions using your model with the cleansed data
- Run a postprocessor script to transform the predictions if necessary
- Match this final output to the output column in validation.csv
If the upload fails and returns an error message, it’s very likely due to a failure during one of these steps.
If your upload succeeds, you will find the uploaded model in the Uploaded section of Model Manager (see the screenshot below).
Congratulations! Your external Python model is now hosted in Einstein Discovery and ready to be operationalized for your users. In part two of this blog, we will walk through the full process of deploying your predictions in Salesforce CRM.