
Exploring the Lineage of Data with Cloud Data Fusion


1 hour 30 minutes 7 Credits

GSP812


Overview

This lab shows how to use Cloud Data Fusion to explore data lineage: the data's origins and its movement over time.

Cloud Data Fusion data lineage helps you:

  • Detect the root cause of bad data events
  • Perform an impact analysis prior to making data changes

Cloud Data Fusion provides lineage at the dataset level and field level, and is time-bound to show lineage over time.

  • Dataset level lineage shows the relationship between datasets and pipelines in a selected time interval.

  • Field level lineage shows the operations that were performed on a set of fields in the source dataset to produce a different set of fields in the target dataset.

For the purposes of this lab, you will use two pipelines that demonstrate a typical scenario in which raw data is cleaned and then sent for downstream processing. You can explore this data trail, from the raw data to the cleaned shipment data to the analytic output, using the Cloud Data Fusion lineage feature.

Note: Currently, the Cloud Data Fusion Lineage feature is only available with the Cloud Data Fusion Enterprise Edition.

Objectives

  • Run sample pipelines to produce lineage

  • Explore dataset and field level lineage

  • Learn how to pass handshaking information from the upstream pipeline to the downstream pipeline

Setup

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Sign in to Qwiklabs using an incognito window.

  2. Note the lab's access time (for example, 02:00:00), and make sure you can finish within that time. There is no pause feature. You can restart if needed, but you have to start at the beginning.

  3. When ready, click Start lab.

  4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.

  5. Click Open Google Console.

  6. Click Use another account and copy/paste the credentials for this lab into the prompts. If you use other credentials, you'll receive errors or incur charges.

  7. Accept the terms and skip the recovery resource page.

Log in to Google Cloud Console

  1. Using the browser tab or window you are using for this lab session, copy the Username from the Connection Details panel and click the Open Google Console button.

Note: If you are asked to choose an account, click Use another account.

  2. Paste in the Username, and then the Password as prompted.
  3. Click Next.
  4. Accept the terms and conditions.

Since this is a temporary account, which will last only as long as this lab:

  • Do not add recovery options

  • Do not sign up for free trials

  5. Once the console opens, view the list of services by clicking the Navigation menu (Navigation menu icon) at the top-left.

Navigation menu

Activate Cloud Shell

Cloud Shell is a virtual machine that contains development tools. It offers a persistent 5-GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources. gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab completion.

  1. Click the Activate Cloud Shell button (Activate Cloud Shell icon) at the top right of the console.

  2. Click Continue. It takes a few moments to provision and connect to the environment. When you are connected, you are also authenticated, and the project is set to your PROJECT_ID.

Sample commands

  • List the active account name:

gcloud auth list

(Output)

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

(Example output)

Credentialed accounts:
 - google1623327_student@qwiklabs.net
  • List the project ID:

gcloud config list project

(Output)

[core] project = <project_ID>

(Example output)

[core] project = qwiklabs-gcp-44776a13dea667a6

Note: Full documentation of gcloud is available in the gcloud CLI overview guide.

Check project permissions

Before you begin working on Google Cloud, you must ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu (Navigation menu icon), click IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud overview.

Default compute service account

If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.

  1. In the Google Cloud console, on the Navigation menu, click Cloud overview.

  2. From the Project info card, copy the Project number.

  3. On the Navigation menu, click IAM & Admin > IAM.

  4. At the top of the IAM page, click Add.

  5. For New principals, type:

{project-number}-compute@developer.gserviceaccount.com

Replace {project-number} with your project number.

  6. For Select a role, select Basic (or Project) > Editor.

  7. Click Save.
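The same grant can also be sketched from Cloud Shell with gcloud. This is an optional sketch, not a lab step: the placeholder values below stand in for your real project ID and number, and the binding command is echoed as a dry run so you can review it before removing the leading `echo` to execute it.

```shell
# Sketch: grant the Editor role to the default compute service account via gcloud.
# Placeholder values are used here; in Cloud Shell you would fetch the real ones with:
#   PROJECT_ID=$(gcloud config get-value project)
#   PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')
PROJECT_ID="${PROJECT_ID:-qwiklabs-gcp-example}"
PROJECT_NUMBER="${PROJECT_NUMBER:-123456789012}"
MEMBER="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"

# Echoed as a dry run; drop the leading 'echo' to actually apply the binding.
echo gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="$MEMBER" \
  --role="roles/editor"
```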

Prerequisites

In this lab, you will work with two pipelines:

  • The Shipment Data Cleansing pipeline reads raw shipment data from a small sample dataset and applies transformations to clean the data.
  • The Delayed Shipments USA pipeline reads the cleansed shipment data, analyzes it, and finds shipments within the USA that were delayed by more than a threshold.

Use the following links to download these sample datasets to your local machine:

Add necessary permissions for your Cloud Data Fusion instance

Next, you will grant permissions to the service account associated with the instance, using the following steps.

  1. In the Cloud Console, from the Navigation menu select Data Fusion > Instances. You should see a Cloud Data Fusion instance already set up and ready for use.

  2. Click on the instance name. On the Instance details page, copy the Service Account to your clipboard.


  3. In the Cloud Console, navigate to IAM & admin > IAM.

  4. On the IAM Permissions page, click Add.

  5. In the New principals field, paste the service account.

  6. Click into the Select a role field and start typing Cloud Data Fusion API Service Agent, then select it.


  7. Click Save.

Click Check my progress to verify the objective. Add Cloud Data Fusion API Service Agent role to service account
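For reference, this grant can also be sketched with gcloud. The service-account address below is a placeholder for the one you copied from the Instance details page, and `roles/datafusion.serviceAgent` is assumed to be the role ID behind the "Cloud Data Fusion API Service Agent" name in the console. The command is echoed as a dry run.

```shell
# Sketch: grant the Cloud Data Fusion API Service Agent role to the instance's
# service account. DATAFUSION_SA below is a placeholder; use the address you
# copied from the Instance details page.
PROJECT_ID="${PROJECT_ID:-qwiklabs-gcp-example}"
DATAFUSION_SA="${DATAFUSION_SA:-cloud-datafusion-management-sa@example.iam.gserviceaccount.com}"
MEMBER="serviceAccount:${DATAFUSION_SA}"

# roles/datafusion.serviceAgent is assumed to be the role ID for the
# "Cloud Data Fusion API Service Agent" role shown in the console.
# Echoed as a dry run; drop the leading 'echo' to execute.
echo gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="$MEMBER" \
  --role="roles/datafusion.serviceAgent"
```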

Grant service account user permission

  1. In the console, on the Navigation menu, click IAM & admin > IAM.

  2. Select the Include Google-provided role grants checkbox.

  3. Scroll down the list to find the Google-managed Cloud Data Fusion service account that looks like service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com and then copy the service account name to your clipboard.

Google-managed Cloud Data Fusion service account listing

  4. Next, navigate to IAM & admin > Service Accounts.

  5. Click the default compute engine account that looks like {project-number}-compute@developer.gserviceaccount.com, and select the Permissions tab in the top navigation.

  6. Click the Grant Access button.

  7. In the New principals field, paste the service account you copied earlier.

  8. In the Role dropdown menu, select Service Account User.

  9. Click Save.
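The console steps above can be sketched with gcloud as well: the Google-managed Data Fusion service account is granted the Service Account User role on the default compute service account. The project number is a placeholder, and the command is echoed as a dry run.

```shell
# Sketch: allow the Google-managed Data Fusion service account to act as the
# default compute service account (Service Account User role). The project
# number below is a placeholder.
PROJECT_NUMBER="${PROJECT_NUMBER:-123456789012}"
COMPUTE_SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"
DATAFUSION_SA="service-${PROJECT_NUMBER}@gcp-sa-datafusion.iam.gserviceaccount.com"

# Echoed as a dry run; drop the leading 'echo' to execute.
echo gcloud iam service-accounts add-iam-policy-binding "$COMPUTE_SA" \
  --member="serviceAccount:${DATAFUSION_SA}" \
  --role="roles/iam.serviceAccountUser"
```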

Open the Cloud Data Fusion UI

  1. In the Console, return to Navigation menu > Data Fusion, then click the View Instance link next to your Data Fusion instance. Select your lab credentials to sign in. If prompted to take a tour of the service, click No, Thanks. You should now be in the Cloud Data Fusion UI.

  2. Click Studio from the left navigation panel to open the Cloud Data Fusion Studio page.


Import, Deploy, and Run the Shipment Data Cleansing pipeline

  1. Import the raw Shipping Data. Click Import in the top-right of the Studio page, then select and import the Shipment Data Cleansing pipeline that you downloaded earlier.
If a pop-up asks you to upgrade pipeline plugins, click Fix All to upgrade the plugins to the latest versions.


  2. Now deploy the pipeline. Click Deploy in the top-right of the Studio page. After deployment, the Pipeline page opens.

  3. Click Run in the top-center of the Pipeline page to run the pipeline.

Click Check my progress to verify the objective. Import, Deploy and Run Shipment Data Cleansing pipeline

Import, Deploy, and Run the Delayed Shipments data pipeline

After the status of the Shipment Data Cleansing pipeline shows Succeeded, proceed to import and deploy the Delayed Shipments USA data pipeline that you downloaded earlier.

  1. Click Studio from the left navigation panel to return to the Cloud Data Fusion Studio page.

  2. Click Import in the top-right of the Studio page, then select and import the Delayed Shipments USA data pipeline that you downloaded earlier.

If a pop-up asks you to upgrade pipeline plugins, click Fix All to upgrade the plugins to the latest versions.
  3. Deploy the pipeline by clicking Deploy in the top-right of the Studio page. After deployment, the Pipeline page opens.

  4. Click Run in the top-center of the Pipeline page to run the pipeline.

After this second pipeline successfully completes, you can continue to perform the remaining steps below.

Click Check my progress to verify the objective. Import, Deploy, and Run the Delayed Shipments data pipeline

Discover datasets

You must discover a dataset before exploring its lineage.

  1. Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page.
  2. Since the Shipment Data Cleansing dataset specified "Cleaned-Shipments" as the reference dataset, enter shipment in the Search box. The search results include this dataset.


Using tags to discover datasets

A Metadata search discovers datasets that have been consumed, processed, or generated by Cloud Data Fusion pipelines. Pipelines execute on a structured framework that generates and collects technical and operational metadata. The technical metadata includes dataset name, type, schema, fields, creation time, and processing information. This technical information is used by the Cloud Data Fusion metadata search and lineage features.

Although the Reference Name of sources and sinks is a unique dataset identifier and an excellent search term, you can use other technical metadata as search criteria, such as a dataset description, schema, field name, or metadata prefix.

Cloud Data Fusion also supports the annotation of datasets with business metadata, such as tags and key-value properties, which can be used as search criteria. For example, to add and search for a business tag annotation on the Raw Shipping Data dataset:

  1. Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page.

  2. Enter Raw shipping data in the Search box on the metadata Search page.

  3. Click on Raw_Shipping_Data.

  4. Under Business tags, click + then insert a tag name (alphanumeric and underscore characters are allowed) and press Enter.


You can perform a search on a tag by clicking the tag name or by entering tags: tag_name in the search box on the Metadata search page.
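Tags can also be managed programmatically through the CDAP Metadata REST API that backs the Cloud Data Fusion UI. The sketch below is illustrative, not a lab step: the instance endpoint, dataset name, and tag are placeholders, and the curl command is only echoed so you can review it before running it against your real instance.

```shell
# Sketch: add a business tag via the CDAP Metadata REST API.
# INSTANCE_ENDPOINT is a placeholder for the instance's apiEndpoint, which you
# could look up with, e.g.:
#   gcloud beta data-fusion instances describe INSTANCE --location=REGION \
#     --format='value(apiEndpoint)'
INSTANCE_ENDPOINT="${INSTANCE_ENDPOINT:-https://example-datafusion.googleusercontent.com/api}"
DATASET="Raw_Shipping_Data"
URL="${INSTANCE_ENDPOINT}/v3/namespaces/default/datasets/${DATASET}/metadata/tags"

TOKEN='$(gcloud auth print-access-token)'  # expanded when you run the real command
# Echoed as a dry run; drop the leading 'echo' to execute.
echo curl -X POST "$URL" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '["my_business_tag"]'
```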

Explore lineage

Dataset level lineage

  1. Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page, and enter shipment in the Search box.

  2. Click the Cleaned-Shipments dataset name listed on the Search page.

  3. Then click the Lineage tab. The lineage graph shows that this dataset was generated by the Shipments-Data-Cleansing pipeline, which had consumed the Raw_Shipping_Data dataset.

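Because lineage is time-bound, the same dataset-level lineage graph can be fetched for a chosen interval from the CDAP Lineage REST API. This is an illustrative sketch with placeholder endpoint and timestamps, echoed as a dry run rather than executed.

```shell
# Sketch: fetch dataset-level lineage for a time interval via the CDAP Lineage
# REST API. Endpoint and timestamps are placeholders.
INSTANCE_ENDPOINT="${INSTANCE_ENDPOINT:-https://example-datafusion.googleusercontent.com/api}"
DATASET="Cleaned-Shipments"
START=1609459200   # interval start (epoch seconds), placeholder
END=1735689600     # interval end (epoch seconds), placeholder
URL="${INSTANCE_ENDPOINT}/v3/namespaces/default/datasets/${DATASET}/lineage?start=${START}&end=${END}&levels=1"

TOKEN='$(gcloud auth print-access-token)'  # expanded when you run the real command
# Echoed as a dry run; drop the leading 'echo' to execute.
echo curl -s "$URL" -H "Authorization: Bearer $TOKEN"
```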

Field level lineage

Cloud Data Fusion field level lineage shows the relationship between the fields of a dataset and the transformations that were performed on a set of fields to produce a different set of fields. Like dataset level lineage, field level lineage is time-bound, and its results change with time.

  1. Continuing from the Dataset level lineage step, click the Field Level Lineage button in the top right of the Cleaned Shipments dataset-level lineage graph to display its field level lineage graph.


  2. The field level lineage graph shows connections between fields. You can select a field to view its lineage. Select View, then select Pin field to view that field's lineage only.


  3. Locate the hazardous_goods field under the Cleaned-Shipments dataset, select View, then select View cause to perform a cause analysis.


The field level lineage shows how this field has transformed over time. In the raw data format, the hazardous_goods field was a Y/N field. The transformation steps that were applied did a find and replace, where a value of Y was replaced with true and a value of N was replaced with false. Another transformation was then applied to change this to a boolean type column before writing it into a BigQuery table.

The lineage exposes the history of changes a particular field has gone through. Other examples include concatenating a few fields to compose a new field (like first name and last name combined to produce name), or computations done on a field (like converting a number to a percentage of the total count).

The cause and impact links show the transformations performed on both sides of a field in a human-readable ledger format.

Congratulations

You have learned how to explore the lineage of your data. This information can be essential for reporting and governance, and can help different audiences understand how data came to be in its current state.


Continue Your Quest

This self-paced lab is part of the Qwiklabs Building Advanced Codeless Pipelines on Cloud Data Fusion Quest. A Quest is a series of related labs that form a learning path. Completing this Quest earns you the badge above, to recognize your achievement. You can make your badge (or badges) public and link to them in your online resume or social media account. Enroll in this Quest and get immediate completion credit if you've taken this lab. See other available Qwiklabs Quests.

Explore other Quests

Check out these suggestions:

Manual Last Updated October 22, 2021
Lab Last Tested October 22, 2021

©2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.