Data Prep Scheduling – Data Orchestration Part 6
This blog is part of the Data Orchestration blog series. This part dives into the considerations to keep in mind when scheduling your dataflows, as well as some obstacles you may encounter.
Obstacles with Time-Based Scheduling for Data Prep
Now that part 5 has covered the two ways of scheduling dataflows and recipes, it is time to address a few challenges you may encounter when scheduling a dataflow or recipe using the time-based approach.
Data Sync Duration
As mentioned previously in this blog series, we should schedule the dataflow or recipe to run once the data sync has completed. But a major question arises: how do we know how long the data sync takes to complete? As of today, we estimate the data sync duration based on previous runs. Hence you would go to the data monitor screen in the data manager and look at the average duration of past data sync runs.
Do note that if the data sync for some reason takes longer than usual, the dataflow will still run as scheduled. In other words, the dataflow runs at its scheduled time irrespective of whether the data sync has completed.
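To make this concrete, here is a minimal Python sketch of the kind of estimate involved. The run durations and the 30-minute safety buffer are illustrative assumptions; in practice you would read the actual durations from the data monitor in the data manager.

```python
from datetime import datetime, timedelta

# Hypothetical durations (in minutes) of recent data sync runs,
# read manually from the data monitor in the data manager.
recent_sync_durations_min = [42, 38, 51, 45, 40]

sync_start = datetime.strptime("08:00", "%H:%M")  # scheduled sync start
avg_duration = sum(recent_sync_durations_min) / len(recent_sync_durations_min)

# Add a safety buffer, because the dataflow runs at its scheduled
# time regardless of whether the data sync has actually finished.
buffer_min = 30
dataflow_time = sync_start + timedelta(minutes=avg_duration + buffer_min)

print(f"Average sync duration: {avg_duration:.0f} min")
print(f"Suggested dataflow schedule: {dataflow_time:%H:%M}")
```

With the sample durations above, this suggests scheduling the dataflow at 09:13, comfortably after an average 43-minute sync that starts at 08:00.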
Impact of Data Sync
Running a dataflow before the data sync has completed means the dataset will not have the latest data from the source. Hence it becomes extremely important to schedule the dataflow at a time when we know the data sync will have completed.
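If you want to remove the guesswork entirely, one option is to trigger the dataflow programmatically once the data sync reports success, rather than relying on a fixed schedule. The sketch below uses the Analytics REST API's dataflow jobs resource; the instance URL, token, and job and dataflow IDs are hypothetical placeholders, and the exact status values and payload shape should be verified against the Analytics REST API reference.

```python
import time

import requests

# Hypothetical placeholders: a valid access token, the org's
# instance URL, and the IDs of the sync job and dataflow involved.
INSTANCE_URL = "https://yourInstance.salesforce.com"
HEADERS = {"Authorization": "Bearer <access_token>"}
SYNC_JOB_ID = "0ePxx0000000001"   # hypothetical data sync job ID
DATAFLOW_ID = "02Kxx0000000001"   # hypothetical dataflow ID

def job_status(job_id: str) -> str:
    """Read a job's status from the Analytics REST API."""
    url = f"{INSTANCE_URL}/services/data/v56.0/wave/dataflowjobs/{job_id}"
    return requests.get(url, headers=HEADERS).json()["status"]

# Poll until the data sync job reaches a terminal status.
while job_status(SYNC_JOB_ID) in ("Queued", "Running"):
    time.sleep(60)  # check once a minute

# Start the dataflow only once the sync has actually succeeded,
# so the dataset is built from the latest source data.
if job_status(SYNC_JOB_ID) == "Success":
    requests.post(
        f"{INSTANCE_URL}/services/data/v56.0/wave/dataflowjobs",
        headers=HEADERS,
        json={"dataflowId": DATAFLOW_ID, "command": "start"},
    )
```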
Notifications
To make it easier to keep track of data sync jobs and dataflow jobs, we have notifications. Notifications are a way for users to receive email alerts about the status of a scheduled job.
Data Sync Notifications
We can set up notifications for each connection to be alerted about its scheduled data sync runs. Setting this up means we are notified by email of data sync warnings, failures, or both.
The image below shows how we can set a notification alert for the SFDC_Local connection.
Once we have selected the notification option, we can choose which events trigger a notification, as seen in the image below.
Note: Check the Salesforce help documentation on how to set up notifications for data sync.
Dataflow and Recipe Notifications
We also have notifications for dataflow and recipe runs. Setting them up means we are notified by email of dataflow or recipe run warnings, failures, or both.
The image below shows how we can set a notification for a dataflow.
Once we have selected the notification option, we can choose which events trigger a notification, as seen in the image below.
Note: Check the Salesforce help documentation on how to set up notifications for dataflows and recipes.
Concurrent Dataflow and Recipe Runs
Let’s now consider a situation where we have scheduled three dataflows or recipes to run at the same time. Technically, we can schedule all of them to run at the same time. However, at any given time only two dataflows can run concurrently; the other dataflows or recipes are queued and only start once one of the running jobs has completed.
Note: Check the Salesforce help documentation for more information on concurrent data syncs and dataflow limits.
Let’s have a look at an example to understand this in more detail.
A Tale of Three Dataflows
Let’s imagine we have three dataflows, named Astro, Codey, and Einstein. We will schedule all three to run at the same time, for instance 9:00 AM. As mentioned earlier, we need to ensure that the data sync for the related objects is complete before the dataflows run, or else we will end up with stale data.
The above image shows that we have scheduled Astro, Codey, and Einstein to run at 9:00 AM. Astro and Codey have started to run. Einstein, however, has to wait for one of them to finish before he can start, as only two dataflows can run at the same time. Note that although Astro and Codey start at the same time, their individual instructions determine how long each takes to complete and reach the finish line.
Once the dataflow runs have completed, each has created or updated its respective dataset. In this scenario, Astro creates Dataset 1, Codey creates Dataset 2, and Einstein creates Dataset 3.
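The queueing behavior is easy to mimic with a thread pool capped at two workers. The durations below are made up; as noted above, each dataflow's own instructions determine how long it really takes.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative run times (in seconds) standing in for real dataflows.
dataflows = {"Astro": 3, "Codey": 2, "Einstein": 1}

def run_dataflow(name: str, duration: int) -> None:
    print(f"{time.strftime('%X')} {name} started")
    time.sleep(duration)
    print(f"{time.strftime('%X')} {name} finished -> dataset updated")

# max_workers=2 mirrors the limit of two concurrent dataflow runs:
# all three are submitted at the same moment, but Einstein stays
# queued until Astro or Codey finishes.
with ThreadPoolExecutor(max_workers=2) as pool:
    for name, duration in dataflows.items():
        pool.submit(run_dataflow, name, duration)
```

Running this, Astro and Codey start immediately, while Einstein only starts once Codey (the faster of the two) has finished.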
An analytics developer, therefore, needs to keep concurrent job limits in mind.
Note: The maximum number of concurrent dataflow runs is 2 for production orgs with the Tableau CRM Plus platform license, and 1 for production orgs with the Tableau CRM Growth platform license or for sandbox orgs.
Priority Scheduling of Dataflows and Recipes
In cases where the sequence of the dataflows or recipes is not a concern, we can use the built-in analytics settings to automatically manage the dataflow and recipe run queue: priority scheduling.
Priority scheduling for recipes and dataflows automatically manages your run queue. It prioritizes smaller and faster runs while ensuring that larger and longer runs are completed on time. Priority is automatically calculated based on factors such as historic runtime, dataset input size, and CSV file size. Priority scheduling is most helpful for smoothing out occasional queue-time spikes. If you never see long queue times, or see them frequently, then priority scheduling isn't as helpful.
You should activate the feature in advance to manage your queue, not during a problem when your queue is already overloaded. But do note that this feature doesn’t increase your maximum number of concurrent runs.
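To build intuition for what the queue manager is doing, here is a toy illustration of priority ordering. Salesforce does not publish its actual formula, so the score below is entirely made up; it only mimics the idea that smaller, faster runs jump ahead in the queue.

```python
# Hypothetical queued runs with the kinds of factors the feature
# considers (historic runtime, dataset input size).
queued_runs = [
    {"name": "Astro",    "historic_runtime_min": 45, "input_rows": 5_000_000},
    {"name": "Codey",    "historic_runtime_min": 5,  "input_rows": 100_000},
    {"name": "Einstein", "historic_runtime_min": 15, "input_rows": 900_000},
]

def priority_score(run: dict) -> float:
    # Invented weighting: lower score runs sooner, so cheap jobs
    # jump the queue while big jobs still get through.
    return run["historic_runtime_min"] + run["input_rows"] / 1_000_000

for run in sorted(queued_runs, key=priority_score):
    print(f"{run['name']}: score {priority_score(run):.1f}")
```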
Note: Check the Salesforce help documentation for more information on Priority Scheduling.
In the next part of this blog series, we will take a look at how you can work with dependent dataflows and recipes. Or head back to review the other blogs in the Data Orchestration blog series.