Getting the most out of Recipe data preview samples
One of the best features in Data Prep is the ability to preview transformations as you’re building a recipe. You can click on any node in the graph to view a preview of the transforms that you created up to that point. Without previews, you’d have to run the recipe (and wait while it runs) to check the work that you’ve done so far. Data previews are much faster, and allow you to stay in the editor as you build your recipe and make adjustments.
This post walks you through the ins-and-outs of previews, how to handle challenges you may run into, and describes new sampling features that were introduced in the Summer ‘22 release.
How data previews work
How are data previews so fast? We generate the preview using a sample from the data source instead of the full dataset. We take the first 2,000 rows from each input node (configurable up to 10,000 rows), and then perform transformations on the sampled data for display in the Preview tab.
This means that the preview won’t always display 2,000 records for every node. It’s possible that there are situations when the data preview is completely empty, like:
- Creating a filter that excludes everything from the sample
- Joining two samples together with no matching key values
- Aggregating data on a grouping field that doesn’t have many values in the sample.
One way to improve the data preview in these situations is to try increasing the sample size of each input node, in hopes of including more records to be transformed. It’s hard to tell if the additional records would have any effect on the data preview, though. The additional records would still follow the natural ordering from the data source.
No more empty data previews
Summer ‘22 introduces two new ways to customize data preview sampling. There are now three sample modes that you can use:
- Top N rows: Selects the first N records from the input source (2,000 by default, configurable up to 10,000 rows)
- Filtered rows (new): Selects records based on filters that you create
- All column values (new): Selects records based on a column that you select, including at least one of each column value in the sample.
When using the Filtered rows sampling mode, you create filters to define which records you want in the sample. These filters only affect the sample and what you see in the data preview. They have no effect on the output of the recipe.
How to handle empty data previews?
You already know different ways to sample data previews. Let’s now see how you can use those ways to handle empty data previews.
Filtered rows sampling
When working on a recipe, you may run into an empty data preview that could be improved by using a different sampling mode. In that case, you’ll see a suggestion to adjust sampling in the upstream input nodes. You can then go to each input node and create sample filters for more targeted samples.
In an input node, change the sampling mode to Filtered rows, and create filters producing a sample with records that can be displayed in downstream transform previews. For example, if there’s a filter node that selects records that were only created this year, create a similar sample filter in the input node. This ensures that there are records in the sample that satisfy the filter condition.
If there’s a join node, create sample filters in the input nodes that select records with column values that match the join condition. For example, if there’s a join between Sales and User, create sample filters in Sales and User that filter for the same join condition key values. This way, there will be matching records from each sample for the join.
If you run into a derived field, go to each of the input nodes that contribute to the field and adjust the sample mode in each node individually. Some scenarios where you may need to backtrack to find multiple input nodes are:
- A formula that operates on multiple input columns
- Multiple columns are concatenated together
- A column that is a result of an aggregation.
As you create new filters, joins, and aggregates in a recipe, you may find yourself going to the input nodes to adjust their sample filters accordingly to keep the preview populated with data. This way, you can keep your data samples finely-tuned as you build your recipe.
Note: We don’t automatically generate sample filters like this for you. This is a future enhancement that we’d love to work on, though!
Debugging a recipe – Filter Row sampling to the rescue
When building a recipe, sometimes you are interested in seeing a specific set of records in the data preview. You may be building a formula and want to see how it works with certain input values. Or you may want to see how your transforms react to null values. It can be hard to scroll the data preview, looking for specific records of interest, especially if you need to navigate between nodes and are looking for the same set of records.
With the Filtered rows sampling mode, it’s easy to define a set of records that you want to see in the data preview and follow that set throughout the entire recipe to see how they are processed at each node in the recipe. This is a handy way to find records for testing and verifying transforms as you are building them.
All column values sampling – See all column values when you need them
What if managing sample filters sounds like too much work? There’s another new sampling mode that you can use called All column values. You select a column, and we populate the sample for you based on that column. All column values will be represented in the sample, up to the sample size.
When building a Bucket transform, this is useful because you have access to all the column values when creating buckets. The column values you see when creating buckets are from the data preview, and this could be limiting if all the column values don’t appear in the first 2,000 records. Now, all you have to do is change the sample mode to All column values for the column you are bucketing, and you’ll be able to access all of those column values.
The All column values sample mode can also be useful to ensure that there are matching join column values for a join node. If there’s a join between Sales and User, change the sample mode for Sales and User for all column values on the join columns. This way, the preview of the join will show as many different matching join values as possible.
Supported input sources
The new sampling modes Filtered rows and All column values are available now in the Summer ‘22 release. They are available for use in datasets and direct datasets. Support for connected objects is coming in a future release.
Further reading
See the new sampling modes in action in the CRM Analytics Summer ’22 Recipe Feature Videos on Vidyard.
To learn more about the new sampling modes, check out the Summer ‘22 Release Notes. Please send me a message with questions or comments. We’ve been working hard to improve recipe data previews, and I’d love to hear feedback as you spend more time with it. Thanks for reading!
GARY!!! GARY!!!! GARY!!!! WOOOOO
Great work Gary and team. Filtered sample is a great enhancement.