Text Clustering in Einstein Discovery
Supervised machine learning models are commonly built and deployed on tabular datasets with numerical, categorical, and temporal (date/time) variables. Often, though, there is additional value to be gained by augmenting the model with insights derived from unstructured data (text). Common examples of unstructured text in this context include:
- Salesforce notes – opportunities, service cases, activities, etc.
- Customer surveys and product reviews
- Meeting notes
- Product descriptions
- Emails and social media posts
- Live chat or chatbot data
Many online sources and industry experts claim that around 80% of business data is unstructured (non-tabular). Of course, that is just an estimate, but even if the real percentage is lower, it is still a significant proportion of all new and existing data. All that unstructured text is just waiting for us to unlock additional business value with machine learning!
In the past, Einstein Discovery was unable to take advantage of that unstructured data for predictive models unless external pre-processing and transformations were performed. But now, in the Summer ’22 release, we are proud to introduce a new feature: Text Clustering in Einstein Discovery. This article explores the new capability, explains at a high level how it works, and illustrates the functionality through a simple example.
As you can imagine, the value of a machine learning model trying to predict common business outcomes can be significantly improved with the extra information and context derived from the unstructured text. This text may help improve overall model metrics and create additional valuable insights for the user which can assist in driving appropriate actions or business process flow.
Algorithms
Before we take a look at the new text clustering feature in action, let’s briefly review the two primary machine learning algorithms that enable this new functionality in Einstein Discovery. The first algorithm is tf-idf, a common choice in many text processing applications. The tf-idf algorithm creates a statistical weight composed of two terms. The first term (tf) is the term frequency: the number of times a word appears in a document divided by the total number of words in that document. The word ‘document’ here refers to the chunk of text within each row of data.
The second term (idf) is the inverse document frequency. This is the logarithm of the ratio between the total number of ‘documents’ in the dataset and the number of documents in which that particular term is found. The final tf-idf score for each word is then calculated by taking the product of the tf and idf values.
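As a rough illustration of the two terms described above, here is a minimal Python sketch of tf-idf. It is not Einstein Discovery's implementation: the whitespace tokenizer, the `tf_idf` function name, and the three sample reviews are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document, where score = tf * idf."""
    # Assumption: a naive lowercase/whitespace tokenizer stands in for real text processing.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # tf = count / total words in the document; idf = log(N / document frequency)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical sample rows, each treated as one 'document'.
reviews = [
    "crisp citrus and green apple",
    "ripe cherry and soft tannins",
    "crisp apple and mineral finish",
]
scores = tf_idf(reviews)
```

In this toy corpus, "and" appears in every review, so its idf is log(3/3) = 0 and its score is zero. "citrus" appears in only one review, so it scores higher than "crisp", which appears in two.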
When tf-idf is applied to the data, all words are assigned a score between 0 and 1, with those closer to 1 being the most informative and relevant to that particular chunk of text. Essentially, the algorithm is looking for terms that appear often in specific documents but are relatively rare in the entire corpus. Fortunately, you don’t need to worry about any of the details of the math; Einstein Discovery handles all that for you behind the scenes.
Once the tf-idf scores have been calculated, Einstein Discovery uses a popular unsupervised machine learning algorithm called K-means to group the 75 most informative terms into ten clusters (K=10). Each cluster is then filtered down to its three highest-scoring words for use in the model. Because K-means groups related terms together, you can analyze your machine learning model in a new way by understanding which text might be relevant to the outcome you’re trying to predict. If you’d like more detail on the technical workings of these two algorithms, there are many great references available with a quick online search. We will now walk through an example of this feature in action.
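The clustering step can be sketched in plain Python as well. This is only an illustration under several assumptions, not a description of how Einstein Discovery works internally: each term is represented by its vector of tf-idf scores across documents, a term's overall score is its highest tf-idf value, and K-means is seeded with the first K terms so the result is repeatable. All function names and the sample reviews are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Map each term to its list of tf-idf scores, one entry per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = {}
    for term in sorted(doc_freq):
        idf = math.log(n_docs / doc_freq[term])
        # max(..., 1) avoids dividing by zero for an empty document.
        rows[term] = [tokens.count(term) / max(len(tokens), 1) * idf
                      for tokens in tokenized]
    return rows


def kmeans(points, k, iterations=20):
    """Basic K-means; returns a cluster label for each point."""
    # Assumption: seed centroids with the first k points for determinism.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def top_terms_per_cluster(documents, n_terms=75, k=10, per_cluster=3):
    """Cluster the highest-scoring terms and keep the top few per cluster."""
    rows = tfidf_rows(documents)
    # Assumption: rank terms by their best tf-idf score in any document.
    ranked = sorted(rows, key=lambda t: max(rows[t]), reverse=True)[:n_terms]
    k = min(k, len(ranked))
    labels = kmeans([rows[t] for t in ranked], k)
    clusters = {}
    for term, label in zip(ranked, labels):
        clusters.setdefault(label, []).append(term)
    # Terms were added in descending score order, so slicing keeps the best ones.
    return {label: terms[:per_cluster] for label, terms in clusters.items()}


# Hypothetical sample rows; small values of n_terms and k suit a toy corpus.
reviews = [
    "crisp citrus and green apple",
    "ripe cherry and soft tannins",
    "crisp apple and mineral finish",
]
clusters = top_terms_per_cluster(reviews, n_terms=6, k=3, per_cluster=3)
```

The defaults mirror the numbers mentioned above (75 terms, ten clusters, three words per cluster), while the usage line shrinks them to fit three short reviews. On this toy input, terms that occur in the same review end up grouped together.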
Real Example
Let’s now look at a functional example of how to use the new text clustering feature in Einstein Discovery.
For the purposes of this article, I am assuming you already have some data in CRM Analytics. This dataset could come from a single object or multiple sources (including data outside of Salesforce). In the dataset, you will need to have a column of unstructured text to use text clustering in your model. For this article, I’m using an open-source wine review dataset that can easily be found online.
In this fictional scenario, imagine we are a wine distributor attempting to use insights and predictions to maximize sales profits, create more targeted marketing plans, or pursue any other common business objective. You can see a few rows from our example dataset for this walk-through in image number one below. Notice the Customer Review column; that is our unstructured text.
Create Story
Step one begins with creating an Einstein Discovery story using our wine dataset. For the story (and the resulting model) the outcome goal is to maximize the average price of wine sold.
During story creation, we first choose “Insights & Predictions”, and then “Manual” so we can see which columns we are adding to the model.
Story Settings
After clicking Next, you will notice that the Customer Review text column initially appears unavailable in the story settings. This is because Einstein Discovery does not yet know what you’d like to do with this column. Before the text clustering transformation was introduced, this column could not have been used in the model at all.
While still in Story Settings, if you click directly on the Customer Review row, the Transform drop-down menu appears. This allows you to select the Text Clustering transformation, as shown in image number six below. Once you have selected Text Clustering, proceed to create the story to see the results of the transformation on your data.
Review the Story and Model
After the story is finished, you will be presented with the standard Einstein Discovery Insights page. But you will notice that the Customer Review unstructured text column has been analyzed, transformed, and added as a variable in the story! With a single easy transformation, Einstein Discovery applies the text clustering insights both when the model is trained and when predictions are generated.
If you click on your text variable name, you will get a visualization that shows you the ten text clusters with the three key terms in each one. This allows you to explore and understand how your unstructured text impacts the outcome variable in your model (as shown below in image number seven).
Not only do we see the text clusters in the story Insights, but the clusters themselves are also included in the predictive model. This means that if a particular cluster of words in your text has a strong influence on the score of a record in the dataset, it will be represented in the Model Predictions Examination (image eight below). Perhaps even more importantly, it will also be shown in the Einstein Discovery Lightning Web Component that you can deploy into Salesforce on each relevant record. For more on deployment options, please reference the excellent “Complete guide to Einstein Discovery model deployments” article.
Wrap Up
And just like that, we now have an Einstein Discovery story and predictive model which include the new text clustering transformation. The power of text analysis is now at your fingertips whenever you need to create a machine learning model for your Salesforce, Tableau, or other data that contains incredibly valuable unstructured text! This new Einstein Discovery feature is available now in all Summer ’22 customer orgs, so you can immediately begin creating new (or improving existing) models with the robust power of text clustering on your data. Best of luck!