Best Practices for using Zero Copy Data Federation in Data Cloud

In today’s data-driven world, enterprises are continually seeking ways to maximize the value of their data assets. The ability to leverage customer data stored across data silos is crucial for effective data utilization and for deriving meaningful insights that drive business outcomes. However, many enterprises struggle to fully utilize their data because it is trapped in silos, leaving it fragmented and difficult to access. A recent report from the market research firm IDC highlighted that companies lose up to 30% in revenue annually due to inefficiencies caused by data silos. Properly harnessed, this siloed data can be combined to create a comprehensive 360-degree view of the customer.

Salesforce Data Cloud connects, harmonizes, and unifies data in real time across multiple sources. It resolves identities, standardizes records, and creates a single customer view. With Zero Copy integration, it queries external data without duplication, reducing costs and ensuring compliance. This enables businesses to activate trusted, real-time insights across their ecosystem.

The intent of this blog is to help you optimize your Zero Copy Data Federation use cases and ensure that you take advantage of all the benefits the feature has to offer. Let’s begin with an understanding of Zero Copy Data Federation and what it offers.

Understanding Zero Copy Data Federation

Traditional data extraction processes typically involve copying data from the source into a platform like Data Cloud for further processing. This approach presents two major challenges. First, it creates multiple copies of the data, making it difficult to maintain a single source of truth. Second, the data used for analysis is only as fresh as the last update, which can lead to outdated insights. Additionally, the cost of copying and transferring large volumes of data adds to operational expenses, leading to suboptimal resource utilization. A recent Gartner study revealed that relying on outdated data costs SMEs up to $15 million annually in lost business opportunities.

Modern customers, however, prioritize a single source of truth, along with security, compliance, and ease of use, all without the need to copy data into a centralized cloud. Zero Copy Data Federation allows enterprises to securely access and query data in place, reducing the overhead associated with data duplication and movement. It also eliminates the complexity of traditional ETL processes, enabling real-time access to data stored across platforms like Snowflake, Databricks, Google BigQuery, and Redshift, while maintaining control and data integrity. This approach also simplifies compliance with data regulations by minimizing duplication and streamlining governance.

Salesforce Zero Copy Data Federation offers bidirectional capabilities, enabling enterprises to access customer data kept in their own lakehouses directly from Data Cloud and vice versa, without moving the data. To fully benefit from this feature, businesses must carefully plan and follow best practices, ensuring an optimized, secure, and efficient data federation process.

You can learn more about Zero Copy Data Federation by perusing this blog.

Zero Copy Data Federation – Best Practices and Optimization Techniques

When using Zero Copy in Data Cloud, you can gain immediate advantages by seamlessly accessing data from your existing lakehouse setup. However, to fully unlock its potential, it’s advisable to apply some optimization techniques that enhance performance, maintain data efficiency, and control costs. Here are some best practices to help you make the most of Zero Copy for your enterprise workloads:

Understand your use-case(s)

One of the most common patterns observed among Zero Copy users is the adoption of the feature without fully considering their specific use cases. Understanding the use cases is crucial for ensuring optimal performance. This involves identifying the necessary data sets, defining the goal(s) of the initiative, and assessing the resources required for effective optimization.

For instance, imagine an organization aiming to create a customer segment of individuals who shopped in the last 60 days, accessing data from a table containing over 10 billion rows. Scanning the entire table to retrieve a small subset would be inefficient. Instead, creating a view that includes only the relevant data would serve as a more effective source, significantly enhancing performance and reducing the cost incurred.
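
As a rough sketch of this pattern, the snippet below assumes a Snowflake lakehouse accessed through the snowflake-connector-python package; the connection details, the TRANSACTIONS table, and its column names are illustrative, not part of Data Cloud. It creates a view limited to the last 60 days of transactions, which the Data Cloud data stream can then target instead of the full table.

```python
# A minimal sketch, assuming a Snowflake lakehouse and the
# snowflake-connector-python package; table and column names are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",      # hypothetical connection details
    user="your_user",
    password="your_password",
    warehouse="COMPUTE_WH",
    database="SALES",
    schema="PUBLIC",
)

# Expose only the last 60 days of transactions instead of the full
# multi-billion-row table; point the Data Cloud data stream at this view.
create_view_sql = """
CREATE OR REPLACE VIEW RECENT_SHOPPERS AS
SELECT SHOPPER_ID, TRANSACTION_ID, TRANSACTION_DATE, AMOUNT
FROM TRANSACTIONS
WHERE TRANSACTION_DATE >= DATEADD(day, -60, CURRENT_DATE())
"""

cur = conn.cursor()
try:
    cur.execute(create_view_sql)
finally:
    cur.close()
    conn.close()
```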

Additionally, network latency and bandwidth play critical roles in cloud performance. Data frequently traverses multiple networks or crosses regional boundaries, which can introduce delays due to these constraints. To mitigate these effects, it’s advisable to keep both instances (Data Cloud and your lakehouse) in the same region. Keeping your data systems in close regional proximity promotes faster, more efficient data access and query processing, leading to smoother and quicker interactions across workloads.

Establish a robust Data Model

When implementing Data Cloud, establishing the data model from the outset is essential for achieving success. This foundational step ensures that your data is consistently structured, regardless of how or where it will be used by downstream applications. A thorough understanding of the data sets being mapped to Data Cloud is critical during this process. For instance, if a business use case requires data manipulation such as concatenation, it’s more efficient to perform this operation within the data lakehouse before the data is accessed in Data Cloud. Cleaner data leads to better performance. It’s also important to use the correct data types when defining your Data Model Objects (DMOs). For example, mapping a field with the “Date” data type in your database to the “Date” type in Data Cloud (rather than “Datetime”) helps prevent unnecessary error messages later in the process.
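
As a hedged illustration of doing that preparation upstream, the sketch below again assumes a Snowflake lakehouse, with an illustrative CUSTOMERS table and column names. It builds a view that concatenates the name fields and casts a timestamp column to DATE, so the result can be mapped directly to a “Date” field on the DMO.

```python
# A minimal sketch, assuming a Snowflake lakehouse; the connection details
# and the CUSTOMERS table/column names are illustrative, not part of Data Cloud.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="COMPUTE_WH", database="CRM", schema="PUBLIC",
)

# Do string manipulation and type alignment in the lakehouse, so Data Cloud
# maps clean, correctly typed columns.
prep_view_sql = """
CREATE OR REPLACE VIEW CUSTOMER_PREPARED AS
SELECT
    CUSTOMER_ID,
    FIRST_NAME || ' ' || LAST_NAME AS FULL_NAME,   -- concatenate upstream
    CAST(SIGNUP_TS AS DATE)        AS SIGNUP_DATE  -- maps to a "Date" DMO field
FROM CUSTOMERS
"""

cur = conn.cursor()
try:
    cur.execute(prep_view_sql)
finally:
    cur.close()
    conn.close()
```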

Understanding the distinction between Profile and Engagement data when defining your data models is equally important, as it helps businesses optimize insights and query performance. Slow-moving data, such as person profile or account data, is ideal for caching and should be tagged as “Profile” to ensure high performance. In contrast, data that changes frequently, such as transaction or web engagement data (customer interactions or behaviors), should be tagged as “Engagement” and accessed via live queries for efficient handling. Use “Other” for any data that doesn’t fit neatly into either of these categories.

Selecting the right Primary Key is essential for maintaining data integrity and ensuring efficient query performance during ingestion and harmonisation. Misconfigured primary keys can result in inefficient full table scans and duplicate records. Ensure that the primary key used in the source system aligns with Data Cloud’s architecture. For example, when integrating a billing system, using the billing_account_id as the primary key helps connect billing data with account profiles. If multiple identifiers are needed to ensure uniqueness, consider creating a composite primary key, such as combining customer_id and billing_id, to avoid duplicate records during data unification.
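
Below is a minimal sketch of the composite-key approach, again assuming a Snowflake lakehouse with an illustrative BILLING table. The view derives a single unique key from customer_id and billing_id that the mapped object can use as its primary key.

```python
# A minimal sketch, assuming a Snowflake lakehouse; table and column names
# are illustrative.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="COMPUTE_WH", database="BILLING_DB", schema="PUBLIC",
)

# Derive a single, unique key from customer_id and billing_id so that
# records stay unique during unification in Data Cloud.
composite_key_sql = """
CREATE OR REPLACE VIEW BILLING_WITH_KEY AS
SELECT
    TO_VARCHAR(CUSTOMER_ID) || '-' || TO_VARCHAR(BILLING_ID) AS BILLING_RECORD_KEY,
    CUSTOMER_ID,
    BILLING_ID,
    BILLING_ACCOUNT_ID,
    AMOUNT,
    BILLING_DATE
FROM BILLING
"""

cur = conn.cursor()
try:
    cur.execute(composite_key_sql)
finally:
    cur.close()
    conn.close()
```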

Sorting data is another powerful technique to enhance query performance by reducing the amount of scanned data and improving retrieval times. For instance, if you frequently query transactions from the last 30 days, sorting data by transaction date allows for faster results. To implement sorting effectively, analyze your most common query patterns and sort the data accordingly to optimize performance and reduce latency.
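
The sketch below shows one way to express this, assuming a Snowflake lakehouse where physical sort order is defined through a clustering key (Redshift would use a SORTKEY, and other warehouses have their own equivalents); the TRANSACTIONS table name is illustrative.

```python
# A minimal sketch, assuming a Snowflake lakehouse; the table name is
# illustrative. Snowflake expresses sort order as a clustering key.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="COMPUTE_WH", database="SALES", schema="PUBLIC",
)

# Cluster on the column most queries filter by (transaction date here),
# so date-range queries from Data Cloud scan far less data.
cluster_sql = "ALTER TABLE TRANSACTIONS CLUSTER BY (TRANSACTION_DATE)"

cur = conn.cursor()
try:
    cur.execute(cluster_sql)
finally:
    cur.close()
    conn.close()
```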

Choose between Live Query and Caching

Data Cloud provides two key modes for Query Federation: Live Query and Caching. With Live Query, data is accessed directly from your lakehouse via a JDBC driver, ensuring that the latest data is always retrieved. In contrast, Caching allows data to be cached in Data Cloud with a configurable refresh frequency. Based on the scenario or use case, customers can choose either mode to get the best performance for their Data Cloud setup.

Consider a scenario where your lakehouse data isn’t refreshed frequently (e.g., past purchase data or historical health records of patients who underwent trials), but needs to be accessed in Data Cloud frequently. Using a Live Query to fetch the entire data set every time would mostly retrieve the same data. Data Cloud offers Caching to improve performance in such scenarios. By enabling Caching, Data Cloud uses the data kept in its cache and refreshes it based on business need (anywhere from 15 minutes to 7 days). This reduces cost, since compute is not consumed on every request, while also reducing network congestion and improving overall latency. It is important to understand that in cases where data in your lakehouse changes frequently (e.g., web activity or cart abandonment data), using Live Query will be more optimal. The caching feature in Data Cloud is fully configurable, allowing you to control how and when data is cached. You can seamlessly switch between caching and live-query modes without impacting downstream processes, offering flexibility in data management.

Optimize your Data Lakehouse

While adhering to the practices above helps achieve the best performance, your network topology and data lakehouse configuration are equally important and play a critical role in getting the best results. Below are some best practices with respect to network topology that help achieve optimal performance:

  • Latency & bandwidth: Ensure a low-latency, high-bandwidth network is available to avoid delays in retrieving large datasets. Regional proximity of Data Cloud and the lakehouse also helps to reduce latency and enhance query performance.
  • Data transfer costs: Optimize the network configuration to minimize data transfer costs, especially for large-scale Live Query operations.
  • Security: Use private endpoints, VPNs, or other secure communication channels to protect data in transit during both Live Query and Cache refresh operations.

Your data lakehouse setup can also significantly influence the efficiency of Zero Copy Data Federation:

  • Concurrency limit: With live queries, the lakehouse must handle multiple simultaneous queries without bottlenecks. Use cases such as CRMA dashboards send many queries at once (10-12 queries to fetch visuals), so it’s essential that the data lakehouse has sufficient concurrency capacity. An appropriate concurrency limit helps manage simultaneous queries efficiently and avoids slowdowns when multiple queries are executed at the same time.
  • Compute capacity: The lakehouse should be flexible enough to handle high workloads and ensure optimum performance. Under-provisioned resources can lead to slower queries or failures during high demand.
  • Monitoring query workload: Use query tagging to track and analyze query workloads effectively. Performance monitoring tools provide insights into execution times, scan sizes, and bottlenecks, which helps optimize resource-intensive queries and improve overall efficiency.
  • Optimization techniques: To optimize filtering, sorting, and aggregation, use indexing on frequently queried columns and distribution keys to minimize data movement. Leverage partitioning to reduce scan sizes and avoid costly wildcard searches. For aggregations, precompute results using summary tables or materialized views to enhance query performance (see the sketch after this list).
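
As a hedged sketch of the last two points, assuming a Snowflake lakehouse with illustrative table names and tag values: a session-level query tag labels the federation workload so it is easy to isolate in query history, and a materialized view precomputes a common aggregation so federated queries read the summary rather than scanning the raw table.

```python
# A minimal sketch, assuming a Snowflake lakehouse; the table names and tag
# value are illustrative. QUERY_TAG and materialized views are Snowflake
# features; other warehouses offer similar mechanisms.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="COMPUTE_WH", database="SALES", schema="PUBLIC",
)

cur = conn.cursor()
try:
    # Tag the session so these queries are easy to isolate in query history.
    cur.execute("ALTER SESSION SET QUERY_TAG = 'datacloud_zero_copy'")

    # Precompute a daily revenue summary; point federation at the summary
    # instead of aggregating the raw transaction table on every query.
    cur.execute("""
        CREATE OR REPLACE MATERIALIZED VIEW DAILY_REVENUE AS
        SELECT TRANSACTION_DATE, SUM(AMOUNT) AS TOTAL_REVENUE
        FROM TRANSACTIONS
        GROUP BY TRANSACTION_DATE
    """)
finally:
    cur.close()
    conn.close()
```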

Conclusion

While Zero Copy offers powerful features and the capability to provide deep customer insights by utilizing data spread across different lakehouses, its true potential lies in how effectively it is implemented. When used optimally and in line with best practices, organizations can unlock significant advantages such as streamlined data access, improved processing efficiency, and reduced costs. Properly leveraging Zero Copy ensures seamless integration without the need for data duplication, which not only saves storage but also enhances performance and ensures you always work with the latest data. By embracing these best practices, enterprises can fully harness the value of their data while maintaining scalability and flexibility across their data architecture. To learn more, head over to the official webpage.

Check out these articles on specific Zero Copy integrations between Data Cloud and Snowflake, Databricks, Google BigQuery, AWS Redshift (federation), and AWS Redshift (sharing) to get more product perspective.
