Data Cloud Segmentation Best Practices
“Is there a right way to create segments in Data Cloud?” “I think my data quality is pretty good, and the volume of data I’m working with is decent (nothing compared to what Data Cloud can handle), so why is my segment still failing?” “I see that Resource Exhausted error frequently. How do I make sure it doesn’t happen so often?” “Can I get some guidelines for creating optimized segments in Data Cloud?” If you’re familiar with creating segments in Data Cloud but have always had questions like these, this is the blog for you.
A couple of months ago, I joined Data Cloud’s Segmentation team as a Product Manager with all of the above questions. I started looking for answers and soon realised that while the help documents are pretty good, we still needed to jot down the best practices you can follow to avoid some of the common issues I see our customers encounter today. And here we are.
Note: This blog conveys my and my colleagues’ point of view on how you can improve the performance of a segment today. Data Cloud is an evolving product, and these techniques may change over time. However, I’ll try to keep this post as up-to-date as possible.
Let us jump into the Best Practices.
1. Choose the right “Segment On” Entity
When creating a segment in Salesforce Data Cloud, you choose a “Segment On” Data Model Object (DMO). Simply put, distinct members of this entity, after further refinement, form your target audience. For example, say you run a car insurance company and want to run a campaign encouraging existing customers with a good insurance score to renew their policies with some additional benefits. In this case, your Segment On entity can potentially be the “Unified Individual” DMO. But there’s also an “Individual” DMO. So which one should you use? More often than not, you need to choose between a unified object and its constituent object.
“Individual” DMO vs “Unified Individual” DMO:
If your data stems from various sources, I recommend using the Unified Individual DMO as the “Segment On” entity, because Identity Resolution rules (defined and customized in Data Cloud) merge “similar” Individual records into one. This means the “Unified Individual” DMO most likely contains fewer records than the “Individual” DMO, and that your segment audience won’t contain duplicate individuals.
As a general rule, if your data is coming in from different systems, use unified objects rather than their constituent objects. This ensures higher performance and gives you more accurate results.
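Conceptually, the difference looks like this. This is only a sketch: the DMO API names below follow the standard ssot__ naming convention but are illustrative.

```sql
-- Identity Resolution merges "similar" Individual records into one unified
-- record, so the unified DMO typically has fewer rows and no duplicate people.
select count(*) from "ssot__Individual__dlm";         -- one row per source-system record
select count(*) from "ssot__UnifiedIndividual__dlm";  -- one row per resolved person; usually smaller
```

Fewer rows at the Segment On entity means less work for every downstream join and filter.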
2. Use the Correct DMO Type
Data Cloud has three categories of data: Profile, Engagement, and Other. You can learn more about each here. In this section, I focus on the “Engagement” DMO type. Engagement DMOs usually store large quantities of data and have a higher ingestion rate than the rest. As such, we partition this data by the “Event Time Field” to make read operations faster. Now, it’s possible that engagement data ingested into Data Cloud wasn’t mapped to an “Engagement” type DMO. For example, you ingested email interaction data but mapped it to a DMO of type “Other”. This operation doesn’t result in an error, but we don’t partition the data, which leaves read operations unoptimized. So, always ensure that the DMO you’re using in your segment, especially while working with engagement data, is mapped to the correct DMO type.
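To illustrate why the mapping matters, here’s a hedged sketch; the engagement DMO and field names are illustrative, modeled on the standard ssot__ naming.

```sql
-- Mapped to an Engagement-type DMO, the data is partitioned by the Event
-- Time Field, so this range filter prunes partitions and scans only the
-- recent slices of the table.
select count(*)
from "ssot__EmailEngagement__dlm"
where "ssot__EngagementDateTm__c" >= timestamp '2023-09-23 00:00:00.000 -07:00';

-- Mapped to an "Other"-type DMO instead, the same filter still returns the
-- same rows, but with no partitioning the engine must scan every record
-- to evaluate it.
```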
3. Container and Attribute Paths – Choose the shortest paths and avoid cyclic paths.
To identify your target audience for a specific business purpose such as running a campaign, you can apply filters on “Direct” and “Related” attributes. What’s the difference between these? You can find out all about them here, but I’ll quickly summarize:
- Direct Attributes—Attributes of the “Segment On” entity, or of Data Model Objects (DMOs) that have a 1:1 or N:1 relationship with the Segment On object (on some path).
- Related Attributes—Attributes of DMOs that have a 1:N relationship with the Segment On object (on some path).
When multiple paths exist between the “Segment On” entity and the attribute’s DMO, the segment builder lets you select the needed path manually. And this is where you need to be cautious. Every time you select a direct or related attribute, the segmentation engine uses the “DMO Relationships” defined at the Data Model level to determine how to reach the chosen attribute from the Segment On entity. Simply put, it’s a mechanism for joining DMOs together. Each relationship (join) lets you move between related DMOs, and this string of joins forms the “paths” in segmentation. The segmentation engine uses these paths to move from the “Segment On” entity to direct and related attributes (which are also DMOs). When only one path exists between two DMOs, the engine uses it. When more than one path exists, the engine defaults to the shortest one. In the image below, note the various paths that exist between the “Case” and “Unified Account main” DMOs.
As a best practice, always select the shortest path between two DMOs when possible. The longer the path, the more joins the segmentation engine has to perform, which creates more work for it.
Cyclic Paths:
Also, avoid using cyclic paths when possible. In a cyclic path, you start from a DMO and, following the join relationships, end up back at the same DMO: a → b → c → a, or a → b → c → b. So far, we haven’t encountered any scenario where a cyclic path was unavoidable. When you identify a cyclic path, look for an alternative way to reach the same result. For example, in the case of a → b → c → a, look for a direct attribute on a itself. In a → b → c → b, look for the attribute on b rather than fetching it from c. One example is sales_order_key → line_item_order_key → sales_order_key, a classic cyclic path that you can easily avoid.
Cyclic Paths lead to increased processing time and can result in query failures.
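Here’s a sketch of the sales_order example above, using hypothetical table and column names. The cyclic path joins back to the table it started from; the acyclic rewrite avoids that.

```sql
-- Cyclic path sales_order → line_item → sales_order: the second hop joins
-- back to the starting table on the same key, adding a redundant join.
select distinct so1.customer_id
from sales_order so1
join line_item li on li.order_key = so1.order_key
join sales_order so2 on so2.order_key = li.order_key
where so2.status = 'SHIPPED';

-- Acyclic rewrite: assuming order_key is unique in sales_order, the status
-- attribute already lives on the first sales_order hop, so the cycle (and
-- one join) disappears.
select distinct so.customer_id
from sales_order so
join line_item li on li.order_key = so.order_key
where so.status = 'SHIPPED';
```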
4. Limit the Amount of Data to Process
As we decrease the amount of data that the segmentation engine has to process, the overall performance of the segment improves. How do we do this though?
You can use data spaces, and/or you can define explicit filters in the segment. Let’s discuss each approach in detail.
4.1 Use Data Spaces
Data spaces are a logical partitioning of your data. If data spaces are enabled in your org, you can control access to DMOs, Calculated Insights, Segments, and Activations through permission sets. See the diagram for a helpful overview:
DMOs are data space-specific. So, using a data space doesn’t just govern access to a segment and its output; it also ensures the data space-specific filter (if any) is applied at the segment level. Think of a data space filter as a pushdown predicate that reduces the amount of data the segmentation engine has to process, thus reducing processing time.
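As a sketch of that pushdown behavior: the data space filter column below is hypothetical, while the Case DMO names mirror the generated SQL later in this post.

```sql
-- The data space filter acts like a pushdown predicate: it narrows the
-- scanned data before the segment's own criteria are evaluated.
select count(*)
from "ssot__Case__dlm"
where "DataSpaceRegion__c" = 'EMEA'                      -- hypothetical data space filter, applied first
  and lower("ssot__CaseStatusId__c") = lower('Closed');  -- segment criterion, evaluated on the reduced set
```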
4.2 Use Explicit Filters on the Engagement Data
Use an explicit filter on Event Date Time columns of Engagement Data.
Yes, we apply a default filter of two years for standard segments and seven days for rapid segments on the Event Date Time column of engagement data. This filter is also called the “lookback period”. But whenever possible, you can further reduce the data to process by creating your own filters on the Event Date Time columns.
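As a sketch using the Case DMO names from the generated SQL later in this post, an explicit filter in the segment tightens the default lookback:

```sql
-- Default: standard segments apply a two-year lookback on the engagement
-- DMO's Event Date Time column (here, the case's CreatedDate).
select count(*)
from "ssot__Case__dlm"
where "ssot__CreatedDate__c" >= timestamp '2021-09-21 00:00:00.000 -07:00';

-- With an explicit "last 90 days" filter added in the segment, the engine
-- reads only the most recent partitions of the engagement data.
select count(*)
from "ssot__Case__dlm"
where "ssot__CreatedDate__c" >= timestamp '2023-07-25 00:00:00.000 -07:00';
```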
5. Merge Containers When Possible
When two containers with the same container path are joined by “Or” logic, merge them.
When possible, merge your containers while working with related attributes. All containers that use the exact same path and have an “Or” relationship between them can most likely be merged. If you understand this statement, jump to the next section. If not, keep reading. In the next paragraph, I explain how containers work and why I encourage you to merge them when possible.
Let’s say that I want to identify all those accounts that have at least one case in closed status or one low-priority case. Let me create two scenarios for this.
1. Create two separate containers and join by OR:
This is what my segment rule looks like:
The segmentation engine identifies all the accounts that have cases in “Closed Status”, and then identifies all the accounts that have cases with “Low” priority. After this, it performs a union. This is what the resultant sample SQL looks like:
select distinct count(*)
from (
(
select distinct "ssot__Account__dlm__0"."ssot__Id__c", "ssot__Account__dlm__0"."KQ_Id__c"
from "ssot__Account__dlm" as "ssot__Account__dlm__0"
where exists (
select *
from "ssot__Case__dlm" as "ssot__Case__dlm__0"
where (
"ssot__Account__dlm__0"."ssot__Id__c" = "ssot__Case__dlm__0"."ssot__AccountId__c"
and trim(cast("ssot__Account__dlm__0"."ssot__Id__c" as varchar)) <> ''
and "ssot__Account__dlm__0"."KQ_Id__c" is not distinct from "ssot__Case__dlm__0"."KQ_AccountId__c"
and lower("ssot__Case__dlm__0"."ssot__CaseStatusId__c") = lower('Closed')
and "ssot__Case__dlm__0"."ssot__CreatedDate__c" >= timestamp '2021-09-21 06:56:13.320 -07:00'
and "ssot__Case__dlm__0"."ssot__CreatedDate__c" >= timestamp '2021-09-21 06:56:13.320 -07:00'
and "ssot__Case__dlm__0"."ssot__Id__c" is not null
)
)
)
union (
select distinct "ssot__Account__dlm__0"."ssot__Id__c", "ssot__Account__dlm__0"."KQ_Id__c"
from "ssot__Account__dlm" as "ssot__Account__dlm__0"
where exists (
select *
from "ssot__Case__dlm" as "ssot__Case__dlm__1"
where (
"ssot__Account__dlm__0"."ssot__Id__c" = "ssot__Case__dlm__1"."ssot__AccountId__c"
and trim(cast("ssot__Account__dlm__0"."ssot__Id__c" as varchar)) <> ''
and "ssot__Account__dlm__0"."KQ_Id__c" is not distinct from "ssot__Case__dlm__1"."KQ_AccountId__c"
and lower("ssot__Case__dlm__1"."ssot__CasePriorityId__c") = lower('Low')
and "ssot__Case__dlm__1"."ssot__CreatedDate__c" >= timestamp '2021-09-21 06:56:13.320 -07:00'
and "ssot__Case__dlm__1"."ssot__CreatedDate__c" >= timestamp '2021-09-21 06:56:13.320 -07:00'
and "ssot__Case__dlm__1"."ssot__Id__c" is not null
)
)
)
) as "alias_108427578"
As you can see, the query engine performs an extra subquery to identify accounts that have cases with low priority. Let’s look at the second approach to get a better idea of what I am referring to.
2. Create one container and join filter conditions by OR:
This is what my segment rule looks like:
The segmentation engine identifies all the accounts that have at least one case in either “Low” priority or “Closed” status.
Here’s what the final query looks like:
select distinct count(*)
from (
select distinct "ssot__Account__dlm__0"."ssot__Id__c", "ssot__Account__dlm__0"."KQ_Id__c"
from "ssot__Account__dlm" as "ssot__Account__dlm__0"
where exists (
select *
from "ssot__Case__dlm" as "ssot__Case__dlm__0"
where (
"ssot__Account__dlm__0"."ssot__Id__c" = "ssot__Case__dlm__0"."ssot__AccountId__c"
and trim(cast("ssot__Account__dlm__0"."ssot__Id__c" as varchar)) <> ''
and "ssot__Account__dlm__0"."KQ_Id__c" is not distinct from "ssot__Case__dlm__0"."KQ_AccountId__c"
and (
lower("ssot__Case__dlm__0"."ssot__CaseStatusId__c") = lower('Closed')
or lower("ssot__Case__dlm__0"."ssot__CasePriorityId__c") = lower('Low')
)
and "ssot__Case__dlm__0"."ssot__CreatedDate__c" >= timestamp '2021-09-21 07:07:54.591 -07:00'
and "ssot__Case__dlm__0"."ssot__CreatedDate__c" >= timestamp '2021-09-21 07:07:54.591 -07:00'
and "ssot__Case__dlm__0"."ssot__CreatedDate__c" >= timestamp '2021-09-21 07:07:54.591 -07:00'
and "ssot__Case__dlm__0"."ssot__Id__c" is not null
)
)
) as "alias_725833"
As you can see, the segmentation query engine didn’t have to run an additional sub-query to identify needed accounts. So, when possible, merge your containers.
Note: You can choose to merge your containers where they’re joined by an “AND” condition, too. But use caution. Follow this help doc to understand how the AND behavior between containers works.
6. Use Nested Operators
Use nested operators to address complex segmentation requirements in one container.
In the previous section, we discussed the impact that adding a container has on the segment query. Reducing the number of containers helps improve segment performance. Using nested operators, you can address many complex scenarios where you may want to have a mix of operators (for example A AND (B OR C)).
Let’s return to our previous example. I want to identify all the accounts that have at least one low priority case or at least one case in closed status. Here, let’s say I want to look at only those low priority cases that were created this month.
Basically, this is what I’m trying to achieve:
(Case Status = ‘Closed’ OR (Case Priority = ‘Low’ AND CreatedDate Is This Month))
To address such scenarios, instead of using two separate containers, you can use nested operators and achieve the same results. Check out this video to understand how I created nested operators.
We support up to five levels of nesting for operators. So, you can address even more complex scenarios in your segment through one container.
In case you want to take a look at the SQL, here it is:
select distinct count(*)
from (
select distinct "ssot__Account__dlm__0"."ssot__Id__c", "ssot__Account__dlm__0"."KQ_Id__c"
from "ssot__Account__dlm" as "ssot__Account__dlm__0"
where exists (
select *
from "ssot__Case__dlm" as "ssot__Case__dlm__0"
where (
"ssot__Account__dlm__0"."ssot__Id__c" = "ssot__Case__dlm__0"."ssot__AccountId__c"
and trim(cast("ssot__Account__dlm__0"."ssot__Id__c" as varchar)) <> ''
and "ssot__Account__dlm__0"."KQ_Id__c" is not distinct from "ssot__Case__dlm__0"."KQ_AccountId__c"
and (
lower("ssot__Case__dlm__0"."ssot__CaseStatusId__c") = lower('Closed')
or (
lower("ssot__Case__dlm__0"."ssot__CasePriorityId__c") = lower('Low')
and "ssot__Case__dlm__0"."ssot__CreatedDate__c" between timestamp '2023-08-19 15:20:53.212 -07:00' and timestamp '2023-10-23 15:20:53.212 -07:00'
and try(cast(date_diff(
'month',
date_trunc(
'month',
at_timezone(
cast(current_timestamp as timestamp),
'America/Los_Angeles'
)
),
date_trunc(
'month',
at_timezone(
cast("ssot__Case__dlm__0"."ssot__CreatedDate__c" as timestamp),
'America/Los_Angeles'
)
)
) as decimal(38, 18))) between cast('0' as decimal(38, 18)) and cast('0' as decimal(38, 18))
)
)
and "ssot__Case__dlm__0"."ssot__CreatedDate__c" >= timestamp '2021-09-21 08:20:53.212 -07:00'
and "ssot__Case__dlm__0"."ssot__CreatedDate__c" >= timestamp '2021-09-21 08:20:53.212 -07:00'
and "ssot__Case__dlm__0"."ssot__CreatedDate__c" >= timestamp '2021-09-21 08:20:53.212 -07:00'
and "ssot__Case__dlm__0"."ssot__CreatedDate__c" >= timestamp '2021-09-21 08:20:53.212 -07:00'
and "ssot__Case__dlm__0"."ssot__Id__c" is not null
)
)
) as "alias_56795836"
7. Use Nested Segments
You can nest one segment inside another to either include or exclude it from the outer segment. When possible, use segment membership mode to do so, as opposed to definition mode. Membership mode uses the output of the nested segment’s last run, whereas definition mode re-applies the filter criteria defined in the nested segment. When you nest a segment in membership mode, the segmentation engine doesn’t need to re-run the nested segment’s filter criteria, which improves segment performance.
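The difference between the two modes can be sketched like this; the member-list table name is hypothetical, and the DMO names mirror the earlier examples.

```sql
-- Definition mode (sketch): the nested segment's filter criteria are
-- re-evaluated inline every time the outer segment runs.
select distinct a."ssot__Id__c"
from "ssot__Account__dlm" a
where exists (
  select *
  from "ssot__Case__dlm" c
  where c."ssot__AccountId__c" = a."ssot__Id__c"
    and lower(c."ssot__CaseStatusId__c") = lower('Closed')
);

-- Membership mode (sketch): the engine reuses the nested segment's last
-- published run, represented here as a hypothetical materialized member
-- list, so the filter criteria above never run again.
select distinct a."ssot__Id__c"
from "ssot__Account__dlm" a
join "closed_cases_segment_members" m
  on m."ssot__Id__c" = a."ssot__Id__c";
```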
Note: You can read more about nested segments here.
8. Use Calculated Insights or Data Transform for Complex Operations
Calculated Insights (CIs) and Data Transforms are powerful tools that help reduce the segmentation engine’s computation burden. Let’s return to our previous example: I want to identify all accounts that have at least one low-priority case created this month or at least one case in closed status. This time, I want to focus only on accounts that have revenue >= 500K and >= 5,000 employees.
Instead of adding those two filters in the segment, I can create and use a CI that identifies these accounts for me. This is a simple example, but in real-world scenarios, I have seen customers use CIs extensively in their segments.
Using a CI, you can flag accounts that meet the above-mentioned criteria, and then use that CI in the segment instead. This eases the overall load on the segmentation engine.
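Here’s a simplified sketch of such a CI. Data Cloud CIs are written in SQL, but the revenue and employee field names below are assumptions, and a real CI definition also needs at least one measure and dimension defined per the product’s requirements.

```sql
-- Hypothetical CI: pre-compute which accounts are "large" once, so the
-- segment only has to filter on the resulting flag.
select
  acct."ssot__Id__c"                               as account_id__c,
  case
    when acct."ssot__AnnualRevenue__c" >= 500000   -- assumed field name
     and acct."ssot__NumberOfEmployees__c" >= 5000 -- assumed field name
    then 'Large'
    else 'Other'
  end                                              as account_tier__c
from "ssot__Account__dlm" acct
```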
Note: You can learn more about using Calculated Insights in segments here. Also, please be wary of how your segment and CI schedules line up.
9. Avoid Using Skewed Engagement Data
Avoid using skewed engagement data in segments, as it affects segment performance when data volumes are high. As discussed earlier, we partition engagement data by the “Event Time Field”. When the data is skewed, one partition can contain a disproportionate amount of data, which increases operation time.
Special Thanks: This blog is here today because I’ve had inputs from Nishant, Piyush, Priyank, Sameer, Samyak, SD, Vartik, Vivek, and Naheeda.
I hope you found this blog helpful. Our product is constantly evolving, and I’ll try to keep this blog up-to-date with the latest recommendations. If you have any questions, please reach out to me in the comments section.