Exporting CDS data to Azure Data Lake is Generally Available
We are super excited to announce the general availability of Export to data lake (code name: Athena) for our Common Data Service customers. The Export to data lake service continuously replicates Common Data Service entity data to Azure Data Lake Storage Gen2, where it can be used for analytics such as Power BI reporting, machine learning, data warehousing, and other downstream integrations.
Export to data lake simplifies the technical and administrative complexity of operationalizing entities for analytics and managing schema and data. Within a few clicks, customers can link their Common Data Service environment to a data lake in their Azure subscription, select standard or custom entities, and export them to the data lake. All data and metadata changes (initial and incremental) in the Common Data Service are automatically pushed to Azure Data Lake Storage Gen2 without any additional actions.
Our vision is to empower our customers to gain comprehensive insights and drive business actions based on their data in the Common Data Service (CDS). To enable this, we are building a new service called Export to data lake, a pipeline that continuously exports data from the Common Data Service to Azure Data Lake Storage Gen2. It is designed for enterprise big data analytics: cost-effective, scalable, highly available with disaster recovery capabilities, and built for best-in-class analytics performance. Data is stored in the Common Data Model (CDM) format, which provides semantic consistency across apps and deployments. The standardized metadata and self-describing data in Azure Data Lake Storage Gen2 facilitate metadata discovery and interoperability between data producers and consumers such as Power BI, Azure Data Factory, Azure Databricks, and the Azure Machine Learning service.
Prerequisites for using the Export to Data Lake service
Before you can export Common Data Service data to a data lake, you must create and configure an Azure Data Lake Storage Gen2 account:
- Follow the steps in the Create an Azure Data Lake Storage Gen2 storage account article.
- Set the account kind to StorageV2 (general purpose v2).
- The storage account must have the hierarchical namespace feature enabled.
- You must be granted an Owner role on the storage account.
- The storage account must be created in the same Azure AD tenant as your PowerApps tenant.
- It is recommended that the storage account be created in the same region as the PowerApps environment you plan to use it in.
- It is recommended to set the replication setting to Read-access geo-redundant storage (RA-GRS).
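The prerequisites above can be sketched as a simple validation check. The `StorageAccount` fields and function below are illustrative only; a real check would read these values from the Azure management API rather than a hand-built object:

```python
# Sketch: validate that a storage account meets the Export to data lake
# prerequisites. Field names are illustrative, not actual Azure SDK types.
from dataclasses import dataclass

@dataclass
class StorageAccount:
    kind: str                     # must be "StorageV2" (general purpose v2)
    hierarchical_namespace: bool  # the Data Lake Storage Gen2 feature
    region: str
    replication: str              # "RA-GRS" is recommended

def meets_prerequisites(account: StorageAccount, environment_region: str) -> list:
    """Return a list of problems; an empty list means the account is usable."""
    problems = []
    if account.kind != "StorageV2":
        problems.append("storage account kind must be StorageV2")
    if not account.hierarchical_namespace:
        problems.append("hierarchical namespace must be enabled")
    if account.region != environment_region:
        problems.append("warning: account is not in the environment's region")
    if account.replication != "RA-GRS":
        problems.append("warning: RA-GRS replication is recommended")
    return problems

account = StorageAccount("StorageV2", True, "westus", "RA-GRS")
print(meets_prerequisites(account, "westus"))  # → []
```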
Key capabilities of the Export to Data Lake service
- A simple and intuitive interface to enable and administer replicated entities.
- Ability to link/unlink the Common Data Service environment to a data lake in customer’s Azure subscription.
- Continuous replication of entities to Azure data lake.
- Support for initial and incremental writes for data and metadata.
- Support for replicating standard and custom entities.
- Support for replicating create, update and delete operations.
- Continuous snapshot updates for large analytics scenarios.
Step-by-step to export CDS entity data to Azure data lake gen2
If you already have a Common Data Service environment and an Azure Data Lake Storage account with the appropriate permissions mentioned above, here are some quick steps to start exporting entity data to the data lake.
From the PowerApps maker portal, select Export to data lake service in the left-hand pane and launch the New link to data lake wizard.
At the Select Storage Account step, pick your Azure subscription and resource group and then select the storage account that you want to link to the Common Data Service environment.
At the Add entities step, select the Common Data Service entities whose data you want to push to the lake.
After you hit Save, your Common Data Service environment is linked to the Azure Data Lake Storage account you provided in the earlier step, and we create the file system in the storage account with a folder for each entity you chose to replicate to the data lake. (Go to https://portal.azure.com, select your storage account, and you will see the file system with the replicated entities in their corresponding folders.)
Under the Linked data lake you just created, you can view a dashboard showing the status (initial sync status, count of records replicated and last synchronized time stamp) for each of the entities.
That’s it – you just linked your Common Data Service environment to the Azure Data Lake Storage account in your subscription, and you are all set to continuously export data to the data lake.
A few things to note:
- You need to be a CDS administrator to link the CDS environment to Azure Data Lake Storage Gen2.
- As part of linking the Common Data Service environment to a data lake, you are granting the Export to data lake service access to your storage account. Please ensure that you have followed the prerequisites mentioned above to create and assign appropriate permissions to the Azure Data Lake Storage account. Additionally, you are also granting the Power Platform Dataflows service access to your storage account. For more information, please refer to the Dataflows documentation.
- Please note that only entities enabled for change tracking will be visible in the Add entities list.
Linked data lake management and administration
Under the Linked data lake you just created, you can view the status (initial sync status, count of records replicated and last synchronized time stamp) for each of the entities.
You can use the Link to data lake wizard to link additional data lakes to this environment, and Unlink data lake to unlink your environment. When unlinking the environment, you also get the option to delete the data in the lake in case you need to start over.
For ongoing administration, use the Manage entities wizard to add/remove entities.
Viewing your data in Azure data lake gen 2
The replicated data is stored in the Azure data lake in the Common Data Model format. You can view it in Azure Data Lake Storage by signing in to https://portal.azure.com. After you sign in, select the storage account; under Storage Explorer\File System you will see a container with your environment name, under which there is a folder for each of the entities you chose to replicate to the lake, along with the model.json file. The metadata file (model.json) in a Common Data Model folder describes the data in the folder: its metadata and location.
Here is an example of the Lead entity replicated to the lake along with the model.json file.
The model.json file, along with a name and version, provides the list of entities that have been pushed to the lake and their attributes. We also annotate the model.json file with the initial sync status and completion time. When we write data to the lake, we partition it so that the files are faster and more efficient to consume. The model.json also shows the partitioning strategy (yearly for now; in the future, partitions will be sized based on the amount of data) along with the location of the snapshot files.
Here is an example of the model.json file showing the Lead entity and its attributes, along with the annotations and the location of the partitioned snapshot files. For more information, please refer to the Common Data Model documentation.
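As a sketch of what consuming the metadata file looks like, the snippet below parses a minimal, hypothetical model.json: the attribute names, partition name, and placeholder URL are invented for illustration, and the real schema (described in the Common Data Model documentation) carries more fields than shown here.

```python
import json

# A trimmed, hypothetical model.json following the Common Data Model shape:
# entities with attributes plus partition file locations.
model_json = """
{
  "name": "cdm",
  "version": "1.0",
  "entities": [
    {
      "name": "Lead",
      "attributes": [
        {"name": "leadid", "dataType": "guid"},
        {"name": "subject", "dataType": "string"},
        {"name": "createdon", "dataType": "dateTime"}
      ],
      "partitions": [
        {"name": "2019",
         "location": "https://<account>.dfs.core.windows.net/<container>/Lead/Snapshot/2019.csv"}
      ]
    }
  ]
}
"""

model = json.loads(model_json)
for entity in model["entities"]:
    attrs = [a["name"] for a in entity["attributes"]]
    files = [p["location"] for p in entity["partitions"]]
    print(entity["name"], attrs, files)
```

A consumer such as a Spark or Power BI connector walks this same structure: read model.json first, then fetch only the partition files listed for the entities it needs.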
Continuous snapshot updates for large analytics scenarios
While we continuously read changes from the source, our users don’t expect the data at the destination to be constantly refreshed. Think, for example, of a user reading a Power BI report: if the underlying data is refreshed constantly, the user is never given a reliable snapshot of the data, which is counterproductive. To solve this problem, we are introducing a new feature called Snapshots, a read-only copy of the data that is updated at a regular interval (currently one hour; this will be user configurable in the future). This ensures that at any given point, a user can reliably consume data in the lake.
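The snapshot idea can be illustrated with a small sketch: writers keep updating a live copy, while readers only ever see the last published, read-only snapshot. The class and names below are illustrative, and the publish interval is simply whatever schedule calls `publish_snapshot` (hourly, per the service description).

```python
import copy

class SnapshotStore:
    """Sketch of the snapshot pattern: the live copy absorbs changes
    continuously; readers consume only the last published snapshot."""
    def __init__(self):
        self.live = {}      # continuously updated by the replication pipeline
        self.snapshot = {}  # stable copy that readers consume

    def apply_change(self, key, value):
        self.live[key] = value  # changes arrive at any time

    def publish_snapshot(self):
        # Runs on a regular interval (currently hourly per the service).
        self.snapshot = copy.deepcopy(self.live)

store = SnapshotStore()
store.apply_change("lead-1", {"subject": "New lead"})
print(store.snapshot)       # {} — readers still see the previous snapshot
store.publish_snapshot()
print(store.snapshot)       # now includes lead-1
```

Between publishes, a report can safely re-read the snapshot any number of times and see a consistent view, regardless of how many changes land on the live copy.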
Support for initial and incremental writes for data and metadata
The Export to data lake service supports initial and incremental writes for both data and metadata. Any data or metadata change in the Common Data Service is automatically pushed to the lake without any additional actions. This is a push-based (rather than pull-based) model: changes are pushed to the destination without the need to set up refresh intervals.
We support replicating both standard and custom entities. It is important to note that we use the change tracking feature in the Common Data Service to keep the data synchronized in an efficient manner, by detecting what data has changed since it was initially extracted or last synchronized.
Please ensure that your entities have been enabled for change tracking. See the change tracking documentation for more details.
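Conceptually, change tracking lets a consumer ask only for rows changed since a version token it holds from the previous sync. The sketch below is a simplified illustration of that idea, not the actual CDS change tracking API:

```python
class ChangeTrackedEntity:
    """Sketch of version-token change tracking: every write bumps a global
    version, and a sync asks only for rows changed after its last token."""
    def __init__(self):
        self.version = 0
        self.rows = {}  # row id -> (version, data)

    def write(self, row_id, data):
        self.version += 1
        self.rows[row_id] = (self.version, data)

    def changes_since(self, token):
        # Return only rows written after `token`, plus the new token.
        changed = {i: d for i, (v, d) in self.rows.items() if v > token}
        return changed, self.version

source = ChangeTrackedEntity()
source.write("a", {"name": "Contoso"})
delta, token = source.changes_since(0)      # initial sync: everything so far
source.write("b", {"name": "Fabrikam"})
delta, token = source.changes_since(token)  # incremental sync: only "b"
print(delta)  # → {'b': {'name': 'Fabrikam'}}
```

This is why the incremental writes are cheap: the pipeline never rescans unchanged rows, it only picks up what moved past the token.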
Support for replicating CUD operations
We support replicating all CUD (create, update, and delete) operations from the Common Data Service to the data lake. For example, if you delete a record in the Account entity in the Common Data Service, the deletion is replicated to the destination in the data lake.
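A minimal sketch of applying a replicated CUD stream to a destination copy (the operation tuples and names here are invented for illustration; the service performs the equivalent against files in the lake):

```python
def apply_operations(replica, operations):
    """Apply a stream of (op, row_id, data) changes to a destination dict,
    mirroring create/update/delete replication."""
    for op, row_id, data in operations:
        if op in ("create", "update"):
            replica[row_id] = data       # creates and updates upsert the row
        elif op == "delete":
            replica.pop(row_id, None)    # deletes propagate to the destination
    return replica

ops = [
    ("create", "acct-1", {"name": "Contoso"}),
    ("update", "acct-1", {"name": "Contoso Ltd"}),
    ("delete", "acct-1", None),
]
print(apply_operations({}, ops))  # → {} — the delete removed the record
```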
Step by step documentation
For details on how to export Common Data Service entity data to Azure Data Lake Storage Gen2, please refer to our official step-by-step documentation.
As always, thank you for being our customer, and we sincerely appreciate your time in testing this feature and providing us feedback – please drop a line at AthenaPreview@service.microsoft.com.