
Exploring the Lineage of Data with Cloud Data Fusion


Lab · 1 hour 30 minutes · 7 credits · Advanced
Note: This lab may provide AI tools to support your learning.

GSP812

Google Cloud Self-Paced Labs logo

Overview

This lab will show you how to use Cloud Data Fusion to explore data lineage: the data's origins and its movement over time.

Cloud Data Fusion data lineage helps you:

  • Detect the root cause of bad data events.
  • Perform an impact analysis prior to making data changes.

Cloud Data Fusion provides lineage at both the dataset level and the field level, and lineage is time-bound, so you can view how it changes over time.

  • Dataset level lineage shows the relationship between datasets and pipelines in a selected time interval.
  • Field level lineage shows the operations that were performed on a set of fields in the source dataset to produce a different set of fields in the target dataset.

For the purposes of this lab, you will use two pipelines that demonstrate a typical scenario in which raw data is cleaned and then sent for downstream processing. This data trail, from raw data to cleaned shipment data to analytic output, can be explored using the Cloud Data Fusion lineage feature.

Note: Currently, the Cloud Data Fusion Lineage feature is only available with the Cloud Data Fusion Enterprise Edition.

Objectives

In this lab, you will explore how to:

  • Run sample pipelines to produce lineage.
  • Explore dataset and field level lineage.
  • Learn how to pass handshaking information from the upstream pipeline to the downstream pipeline.

Setup and requirements

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Sign in to Google Cloud Skills Boost using an incognito window.

  2. Note the lab's access time (for example, 02:00:00), and make sure you can finish within that time.
    There is no pause feature. You can restart if needed, but you have to start at the beginning.

  3. When ready, click Start lab.

    Note: Once you click Start lab, it will take about 15 - 20 minutes for the lab to provision necessary resources and create a Data Fusion instance. During that time, you can read through the steps below to get familiar with the goals of the lab.

    When you see lab credentials (Username and Password) in the left panel, the instance is created and you can continue logging into the console.
  4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud console.

  5. Click Open Google console.

  6. Click Use another account and copy/paste credentials for this lab into the prompts.
    If you use other credentials, you'll receive errors or incur charges.

  7. Accept the terms and skip the recovery resource page.

Note: Do not click End lab unless you have finished the lab or want to restart it. This clears your work and removes the project.

Log in to Google Cloud Console

  1. In the browser tab or window you are using for this lab session, copy the Username from the Connection Details panel and click the Open Google Console button.
Note: If you are asked to choose an account, click Use another account.
  2. Paste in the Username, and then the Password as prompted.
  3. Click Next.
  4. Accept the terms and conditions.

Since this is a temporary account, which will last only as long as this lab:

  • Do not add recovery options
  • Do not sign up for free trials

  5. Once the console opens, view the list of services by clicking the Navigation menu (Navigation menu icon) at the top-left.

Navigation menu

Activate Cloud Shell

Cloud Shell is a virtual machine that contains development tools. It offers a persistent 5-GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources. gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab completion.

  1. Click the Activate Cloud Shell button (Activate Cloud Shell icon) at the top right of the console.

  2. Click Continue.
    It takes a few moments to provision and connect to the environment. When you are connected, you are also authenticated, and the project is set to your PROJECT_ID.

Sample commands

  • List the active account name:
gcloud auth list

(Output)

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

(Example output)

Credentialed accounts:
 - google1623327_student@qwiklabs.net

  • List the project ID:
gcloud config list project

(Output)

[core]
project = <project_ID>

(Example output)

[core]
project = qwiklabs-gcp-44776a13dea667a6

Note: Full documentation of gcloud is available in the gcloud CLI overview guide.

Check project permissions

Before you begin working on Google Cloud, you must ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu (Navigation menu icon), click IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud overview.

Default compute service account

If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.

  1. In the Google Cloud console, on the Navigation menu, click Cloud overview.

  2. From the Project info card, copy the Project number.

  3. On the Navigation menu, click IAM & Admin > IAM.

  4. At the top of the IAM page, click Add.

  5. For New principals, type:

{project-number}-compute@developer.gserviceaccount.com

Replace {project-number} with your project number.

  6. For Select a role, select Basic (or Project) > Editor.

  7. Click Save.
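
If you prefer to work in Cloud Shell, the same check and grant can be done with gcloud. The commands below are a minimal sketch: they assume your Cloud Shell session is already set to the lab project and use the default compute service account naming shown above.

# Look up the project ID and project number.
PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format="value(projectNumber)")

# List the roles currently held by the default compute service account.
gcloud projects get-iam-policy "$PROJECT_ID" \
  --flatten="bindings[].members" \
  --filter="bindings.members:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --format="value(bindings.role)"

# Grant the Editor role if it is missing.
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --role="roles/editor"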

Prerequisites

In this lab, you will work with two pipelines:

  • The Shipment Data Cleansing pipeline, which reads raw shipment data from a small sample dataset and applies transformations to clean the data.
  • The Delayed Shipments USA pipeline, which reads the cleansed shipment data, analyzes it, and finds shipments within the USA that were delayed by more than a specified threshold.

Use the Shipment Data Cleansing and Delayed Shipments USA links to download these sample pipelines to your local machine.

Task 1. Add the necessary permissions for your Cloud Data Fusion instance

  1. In the Google Cloud console, from the Navigation menu select Data Fusion > Instances.
Note: Creation of the instance takes around 20 minutes. Please wait for it to be ready.

Next, you will grant permissions to the service account associated with the instance, using the following steps.

  1. From the Google Cloud console, navigate to IAM & Admin > IAM.

  2. Confirm that the Compute Engine default service account {project-number}-compute@developer.gserviceaccount.com is present, and copy it to your clipboard.

  3. On the IAM Permissions page, click +Grant Access.

  4. In the New principals field paste the service account.

  5. Click into the Select a role field and start typing Cloud Data Fusion API Service Agent, then select it.

  6. Click ADD ANOTHER ROLE.

  7. Add the Dataproc Administrator role.

  8. Click Save.
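
You can also add these two roles from Cloud Shell. This is a sketch that assumes PROJECT_ID and PROJECT_NUMBER are set as in the earlier example; roles/datafusion.serviceAgent and roles/dataproc.admin are the role IDs that correspond to the Cloud Data Fusion API Service Agent and Dataproc Administrator roles in the console.

# Grant the Cloud Data Fusion API Service Agent role to the default compute service account.
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --role="roles/datafusion.serviceAgent"

# Grant the Dataproc Administrator role to the same account.
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --role="roles/dataproc.admin"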

Click Check my progress to verify the objective. Add Cloud Data Fusion API Service Agent role to service account

Grant service account user permission

  1. In the console, on the Navigation menu, click IAM & admin > IAM.

  2. Select the Include Google-provided role grants checkbox.

  3. Scroll down the list to find the Google-managed Cloud Data Fusion service account that looks like service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com and then copy the service account name to your clipboard.

Google-managed Cloud Data Fusion service account listing

  4. Next, navigate to IAM & Admin > Service Accounts.

  5. Click on the default Compute Engine service account that looks like {project-number}-compute@developer.gserviceaccount.com, and select the Principals with access tab on the top navigation.

  6. Click the Grant Access button.

  7. In the New principals field, paste the service account you copied earlier.

  8. In the Role dropdown menu, select Service Account User.

  9. Click Save.
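
The same binding can be created from Cloud Shell. This sketch assumes PROJECT_ID and PROJECT_NUMBER are set as in the earlier examples; note that the role is granted on the compute service account itself, not on the project.

# Let the Google-managed Data Fusion service account act as the default compute service account.
gcloud iam service-accounts add-iam-policy-binding \
  "${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --member="serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-datafusion.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"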

Task 2. Open the Cloud Data Fusion UI

  1. In the Console, return to Navigation menu > Data Fusion, then click the View Instance link next to your Data Fusion instance. Select your lab credentials to sign in. If prompted to take a tour of the service, click No, Thanks. You should now be in the Cloud Data Fusion UI.

  2. Click Studio from the left navigation panel to open the Cloud Data Fusion Studio page.

Cloud Fusion Studio UI

Task 3. Import, deploy, and run the Shipment Data Cleansing pipeline

  1. First, import the pipeline that processes the raw shipping data. Click Import in the top-right of the Studio page, then select and import the Shipment Data Cleansing pipeline that you downloaded earlier.
Note: If a pop-up asks you to upgrade pipeline plugins, click Fix All to upgrade the plugins to the latest versions.

Shipment Data Cleansing pipeline

  2. Now deploy the pipeline. Click Deploy in the top-right of the Studio page. After deployment, the Pipeline page opens.

  3. Click Run at the top center of the Pipeline page to run the pipeline.

Note: If the pipeline fails, please re-run it.

Click Check my progress to verify the objective. Import, Deploy and Run Shipment Data Cleansing pipeline
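
If you want to script a run instead of clicking Run, a deployed pipeline can also be started through the CDAP REST API that backs the Data Fusion UI. The sketch below is illustrative only: the instance name, region, and pipeline (app) name are placeholders you must replace with your own values, and it assumes a batch pipeline, whose program CDAP typically exposes as DataPipelineWorkflow.

# Resolve the instance's CDAP API endpoint (instance name and region are placeholders).
CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe my-instance \
  --location=us-central1 --format="value(apiEndpoint)")

# Start the deployed pipeline (the app name must match the deployed pipeline's name).
curl -X POST -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "${CDAP_ENDPOINT}/v3/namespaces/default/apps/Shipment-Data-Cleansing/workflows/DataPipelineWorkflow/start"

# Check recent runs and their status.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "${CDAP_ENDPOINT}/v3/namespaces/default/apps/Shipment-Data-Cleansing/workflows/DataPipelineWorkflow/runs"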

Task 4. Import, deploy, and run the Delayed Shipments data pipeline

After the status of the Shipment Data Cleansing pipeline shows Succeeded, proceed to import and deploy the Delayed Shipments USA data pipeline that you downloaded earlier.

  1. Click Studio from the left navigation panel to return to the Cloud Data Fusion Studio page.

  2. Click Import in the top right of the Studio page, then select and import the Delayed Shipments USA data pipeline that you downloaded earlier.

Note: If a pop-up asks you to upgrade pipeline plugins, click Fix All to upgrade the plugins to the latest versions.
  3. Deploy the pipeline by clicking Deploy in the top right of the Studio page. After deployment, the Pipeline page opens.

  4. Click Run in the top center of the Pipeline page to run the pipeline.

Note: If the pipeline fails, please re-run it.

After this second pipeline successfully completes, you can continue to perform the remaining steps below.

Click Check my progress to verify the objective. Import, Deploy, and Run the Delayed Shipments data pipeline

Task 5. Discover some datasets

You must discover a dataset before exploring its lineage.

  1. Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page.
  2. Since the Shipment Data Cleansing pipeline specified Cleaned-Shipments as the reference name of its output dataset, enter shipment in the Search box. The search results include this dataset.

Cleaned shipments metadata search results

Task 6. Use tags to discover datasets

A Metadata search discovers datasets that have been consumed, processed, or generated by Cloud Data Fusion pipelines. Pipelines execute on a structured framework that generates and collects technical and operational metadata. The technical metadata includes dataset name, type, schema, fields, creation time, and processing information. This technical information is used by the Cloud Data Fusion metadata search and lineage features.

Although the Reference Name of sources and sinks is a unique dataset identifier and an excellent search term, you can use other technical metadata as search criteria, such as a dataset description, schema, field name, or metadata prefix.

Cloud Data Fusion also supports the annotation of datasets with business metadata, such as tags and key-value properties, which can be used as search criteria. For example, to add and search for a business tag annotation on the Raw Shipping Data dataset:

  1. Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page.

  2. Enter Raw shipping data in the Search box.

  3. Click on Raw_Shipping_Data.

  4. Under Business tags, click + then insert a tag name (alphanumeric and underscore characters are allowed) and press Enter.

Business tags name field

You can search on a tag by clicking the tag name or by entering tags:tag_name in the search box on the Metadata search page.
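
Metadata searches can also be issued against the CDAP metadata API rather than the UI. The exact endpoint shape can vary between CDAP versions, so treat the following as an assumption-laden sketch: CDAP_ENDPOINT is obtained as in the pipeline-run sketch in Task 3, and my_tag is a placeholder for the tag you created.

# Search datasets by tag (sketch; assumes the CDAP 6.x metadata search API).
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "${CDAP_ENDPOINT}/v3/namespaces/default/metadata/search?query=tags:my_tag&target=dataset"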

Task 7. Explore data lineage

Dataset level lineage

  1. Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page, and enter shipment in the Search box.

  2. Click on the Cleaned-Shipments dataset name listed on the Search page.

  3. Then click the Lineage tab. The lineage graph shows that this dataset was generated by the Shipments-Data-Cleansing pipeline, which had consumed the Raw_Shipping_Data dataset.

Cloud Data Fusion Lineage tab
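
Dataset level lineage is also exposed programmatically by the CDAP lineage API, which is what the UI graph is built from. As with the earlier REST sketches, this is illustrative and hedged: CDAP_ENDPOINT is obtained as shown in Task 3, the dataset name must match the reference name (Cleaned-Shipments), and the start and end parameters accept either epoch seconds or relative expressions such as now-1h.

# Fetch dataset level lineage for Cleaned-Shipments over the last hour (sketch).
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  "${CDAP_ENDPOINT}/v3/namespaces/default/datasets/Cleaned-Shipments/lineage?start=now-1h&end=now&levels=2"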

Field level lineage

Cloud Data Fusion field level lineage shows the relationship between the fields of a dataset and the transformations that were performed on a set of fields to produce a different set of fields. Like dataset level lineage, field level lineage is time-bound, and its results change with time.

  1. Continuing from the Dataset level lineage step, click the Field Level Lineage button in the top right of the Cleaned Shipments dataset-level lineage graph to display its field level lineage graph.

Cloud Data Fusion field level lineage

  2. The field level lineage graph shows connections between fields. You can select a field to view its lineage. Select View, then select Pin field to view that field's lineage only.

Data Fusion pin field lineage selection

  3. Locate the time_to_ship field under the Cleaned-Shipments dataset. Select View, then select View impact to perform an impact analysis.

view impact

The field level lineage shows how this field has transformed over time. Notice the transformations for the time_to_ship field: (i) converting it to a float type column, and (ii) determining whether the value is redirected into the next node or down the error path.

The lineage exposes the history of changes a particular field has gone through. Other examples include concatenating a few fields to compose a new field (like first name and last name combined to produce name), or computations done on a field (like converting a number to a percentage of the total count).

The cause and impact links show the transformations performed on both sides of a field in a human-readable ledger format.

Congratulations!

In this lab, you have learned how to explore the lineage of your data. This information can be essential for reporting and governance, and it can help different audiences understand how data came to be in its current state.

Manual Last Updated November 14, 2022

Lab Last Tested August 08, 2023

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
