Beyond Binary: Unlocking Business Insights with the Power of Multinomial Classifiers: “The Lab” – Part 2
This is a continuation of the previous blog post Beyond Binary: Unlocking Business Insights with the Power of Multinomial Classifiers. In this post, we’ll look at how you can prepare your data with a few clicks using Data Cloud’s Data Transform.
Part 2: Data Prep for an Ensemble of Binary Classifiers
Our goal is to prepare a training dataset that enables binary prediction models to identify specific churn reasons. By creating separate binary datasets for each churn category, we can train models to focus on one reason at a time, improving the precision of predictions.
To achieve this, we’ll perform the following steps:
- Organize the Data: The data for training is structured in a tabular format, with columns representing various metrics such as screen time, search success, and login frequency.
- Transform Multiclass Data to Binary: For each churn category—disengagement, content gap, prefer competition, boring content, and poor video quality—we’ll create a binary column. In each dataset, entries matching the target churn reason will be labeled with that reason, while other entries will be labeled as “Other.” This transformation will yield five separate binary prediction datasets.
- Prevent Data Leakage: We will drop the original “Churn Reason” column from each transformed dataset before model training, so that no model can use the answer as an input.
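For readers who prefer to see the logic as code, the three steps above can be sketched in pandas. This is only an illustration of the transformation, not what Data Transform runs internally. The sample rows and the “Screen Time” and “Login Frequency” column names are made up, and `make_binary_datasets` is a hypothetical helper.

```python
import pandas as pd

# The five churn categories named in this post.
CHURN_REASONS = [
    "disengagement",
    "content gap",
    "prefer competition",
    "boring content",
    "poor video quality",
]


def make_binary_datasets(df, label_col="Churn Reason", reasons=CHURN_REASONS):
    """Return one binary (one-vs-rest) dataset per churn reason."""
    datasets = {}
    for reason in reasons:
        bucket_col = f"{reason} bucket"
        out = df.copy()
        # Keep the target reason as-is; group every other value under "Other".
        out[bucket_col] = out[label_col].where(out[label_col] == reason, "Other")
        # Drop the original label column to prevent data leakage.
        datasets[reason] = out.drop(columns=[label_col])
    return datasets


# Hypothetical sample rows (assumption: not the real Streaming Media Data).
sample = pd.DataFrame({
    "Churn Reason": ["disengagement", "content gap", "poor video quality"],
    "Screen Time": [1.5, 4.0, 2.2],
    "Login Frequency": [2, 9, 5],
})

binary_sets = make_binary_datasets(sample)
print(binary_sets["disengagement"])
```

Each entry in `binary_sets` corresponds to one of the five datasets that the Data Transform built later in this post will produce.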
Assumption: The data is already loaded in Data Cloud and mapped to a DMO. In my case, I have mapped data to a custom DMO called ‘Streaming Media Data’.
Here’s a sample of the original data structure:
In the example transformation, column “G” represents the new binary label column for a specific churn reason, while the original “Churn Reason” column (column “A”) is dropped. This allows each model to focus on predicting one reason for churn at a time, enhancing model performance and actionable insights.
Let's now see how to create the buckets shown in column ‘G’ for each dataset using Data Transform.
Use Data Transform to create Churn Category Datasets
- On the Data Cloud home screen, click on the Data Transform tab, or if it is missing, click on “More” –> Data Transform.
- Click on “New” –> “Batch Data Transform”. As I have mapped my data to a Data Model Object (DMO), I have selected “Data Model Objects” to interface with the data. You can choose Data Lake Objects (DLO) instead if that fits your requirement better. Choose the “Data Space” where your DMO/DLO resides. Click on “Next”.
- In the Batch Data Transform screen click on “Add Input Data” –> Choose your data source and its fields –> Next.
- In the Batch Data Transform screen, you will be presented with your input data. Click on the plus sign and choose Transform node.
- Create buckets for the “Churn Reason” column. We’ll transform the multi-category data in the column named “Churn Reason” into a column with two (binary) categories. Starting with the “disengagement” category, the value “disengagement” will be retained as is, while all other values will be grouped into a category labeled “Other”. The following steps outline this process:
- Click on the “Churn Reason” column –> Click on the “bucket” icon.
- It is advisable to retain the original labels to ensure clarity and maintain consistency. Therefore, the values will be categorized as “disengagement” versus “Other.” Since the data is clean (as it was created specifically for this purpose), there is no need to employ the smart bucketing feature.
- The re-labeled data should be displayed in a new column, which will be named “disengagement bucket.” Additionally, the original column (“Churn Reason”) should be dropped. During training, the machine learning model will be trained to predict the labels in the “disengagement bucket” column. To prevent the model from using “Churn Reason” as an input variable, which would result in data leakage, the “Churn Reason” column must be removed from the training dataset.
- Click on “Apply”.
- Once the transform node is successfully created, click on the “+” icon on the node and add an “Output” node.
- Proceed to fill in the details for how the dataset should be output. Note that you must choose a “Primary Key” and an “FQK” (Fully Qualified Key). Give the output a consistent, easy-to-remember name and leave all other settings unchanged. Finally, click on “Apply.”
And voila! You have created a dataset with a binary classification column called “disengagement bucket”. Repeat steps 4 to 7 (from adding the Transform node through configuring the Output node) for each remaining category, such as “Poor Video Quality” and so on. At the end of this exercise, you should have a Data Transform graph that resembles the image below:
Click on “Save” and assign a unique, memorable name to this transformation.
Then, navigate back to the Data Transform tab. Your new transformation should appear in the list. It may take a few minutes for it to be visible. To execute it, click on the drop-down arrow at the far right end of the listing and select “Run Now.” The process will take some time to complete. You can refresh your browser to see the updated status. Once completed, the new datasets (all five of them) will be ready for use in training a new model.
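Before training, it can be worth sanity-checking each output dataset. The sketch below assumes you have pulled a dataset into a pandas DataFrame by whatever means you normally use; `check_binary_dataset` is a hypothetical helper, and the sample rows are made up. It checks only the two properties this post relies on: the bucket column holds exactly the target reason and “Other”, and the “Churn Reason” column is gone.

```python
import pandas as pd


def check_binary_dataset(df, bucket_col, target, leak_col="Churn Reason"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    # The original label column must not survive into the training data.
    if leak_col in df.columns:
        problems.append(f"'{leak_col}' is still present (leakage risk)")
    if bucket_col not in df.columns:
        problems.append(f"'{bucket_col}' is missing")
    else:
        # The bucket column should contain exactly two labels.
        labels = set(df[bucket_col].dropna().unique())
        if labels != {target, "Other"}:
            problems.append(f"unexpected labels: {sorted(labels)}")
    return problems


# Hypothetical rows standing in for one exported dataset.
good = pd.DataFrame({
    "Screen Time": [1.5, 4.0],
    "disengagement bucket": ["disengagement", "Other"],
})
print(check_binary_dataset(good, "disengagement bucket", "disengagement"))
```

Running the same check once per churn category covers all five datasets.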
With this step, you have created 5 different datasets that will be consumed in the next part of this series.