
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
- Add Cloud Data Fusion API Service Agent role to service account (30 points)
- Import, Deploy, and Run the Shipment Data Cleansing pipeline (35 points)
- Import, Deploy, and Run the Delayed Shipments data pipeline (35 points)
This lab will show you how to use Cloud Data Fusion to explore data lineage: the data's origins and its movement over time.
Cloud Data Fusion provides lineage at both the dataset level and the field level. Lineage is also time-bound, so you can see how it changes over time.
For the purposes of this lab, you will use two pipelines that demonstrate a typical scenario in which raw data is cleansed and then sent for downstream processing. This data trail, from raw data to cleaned shipment data to analytic output, can be explored using the Cloud Data Fusion lineage feature.
In this lab, you will explore how to search metadata and how to view dataset level and field level lineage in Cloud Data Fusion.
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Google Cloud Skills Boost using an incognito window.
Note the lab's access time (for example, 02:00:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud console.
Click Open Google console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
Since this is a temporary account, which will last only as long as this lab, do not add recovery options or two-factor authentication.
Cloud Shell is a virtual machine that contains development tools. It offers a persistent 5-GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources. gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab completion.
Click the Activate Cloud Shell button at the top right of the console.
Click Continue.
It takes a few moments to provision and connect to the environment. When you are connected, you are also authenticated, and the project is set to your PROJECT_ID.
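If you want to confirm the active account and project from Cloud Shell, you can run the following gcloud commands (an optional check; the exact output depends on your lab project):

# List the credentialed (active) account
gcloud auth list

# Show the project ID that gcloud is configured to use
gcloud config list project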
Before you begin working on Google Cloud, you must ensure that your project has the correct permissions within Identity and Access Management (IAM).
In the Google Cloud console, on the Navigation menu (), click IAM & Admin > IAM.
Confirm that the default compute service account {project-number}-compute@developer.gserviceaccount.com is present and has the Editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud overview.
If the account is not present in IAM or does not have the Editor role, follow the steps below to assign the required role.
In the Google Cloud console, on the Navigation menu, click Cloud overview.
From the Project info card, copy the Project number.
On the Navigation menu, click IAM & Admin > IAM.
At the top of the IAM page, click Add.
For New principals, type {project-number}-compute@developer.gserviceaccount.com, replacing {project-number} with your project number.
For Select a role, select Basic (or Project) > Editor.
Click Save.
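If you prefer the command line, the same grant can be made from Cloud Shell. This is a minimal sketch, assuming the PROJECT_ID variable holds your lab project ID:

# Look up the project number for the current project
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")

# Grant the Editor role to the default Compute Engine service account
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
    --role="roles/editor"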
In this lab, you will work with two pipelines: Shipment Data Cleansing and Delayed Shipments USA. Use the Shipment Data Cleansing and Delayed Shipments USA links to download these sample pipelines to your local machine.
Next, you will grant permissions to the service account associated with the instance, using the following steps.
From the Google Cloud console, navigate to IAM & Admin > IAM.
Confirm that the Compute Engine default service account {project-number}-compute@developer.gserviceaccount.com is present, and copy the service account name to your clipboard.
On the IAM Permissions page, click +Grant Access.
In the New principals field paste the service account.
Click into the Select a role field and start typing Cloud Data Fusion API Service Agent, then select it.
Click ADD ANOTHER ROLE.
Add the Dataproc Administrator role.
Click Save.
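As an alternative to the console steps above, a Cloud Shell sketch follows. It assumes PROJECT_ID and PROJECT_NUMBER are already set, and that roles/datafusion.serviceAgent and roles/dataproc.admin are the role IDs behind the Cloud Data Fusion API Service Agent and Dataproc Administrator names:

# Grant the Cloud Data Fusion API Service Agent role to the default compute service account
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
    --role="roles/datafusion.serviceAgent"

# Grant the Dataproc Administrator role to the same service account
gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
    --role="roles/dataproc.admin"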
Click Check my progress to verify the objective.
In the console, on the Navigation menu, click IAM & admin > IAM.
Select the Include Google-provided role grants checkbox.
Scroll down the list to find the Google-managed Cloud Data Fusion service account that looks like service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com, and then copy the service account name to your clipboard.
Next, navigate to IAM & admin > Service Accounts.
Click the default Compute Engine service account that looks like {project-number}-compute@developer.gserviceaccount.com, and select the Principals with access tab in the top navigation.
Click on the Grant Access button.
In the New Principals field, paste the service account you copied earlier.
In the Role dropdown menu, select Service Account User.
Click Save.
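The same access can be granted from Cloud Shell. The sketch below assumes PROJECT_NUMBER is set and that the Google-managed account follows the service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com pattern shown above:

# Let the Data Fusion service agent act as the default compute service account
gcloud iam service-accounts add-iam-policy-binding \
    ${PROJECT_NUMBER}-compute@developer.gserviceaccount.com \
    --member="serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-datafusion.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountUser"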
In the Console, return to Navigation menu > Data Fusion, then click the View Instance link next to your Data Fusion instance. Select your lab credentials to sign in. If prompted to take a tour of the service, click on No, Thanks. You should now be in the Cloud Data Fusion UI.
Click Studio from the left navigation panel to open the Cloud Data Fusion Studio page.
Click Import in the top right of the Studio page, then select and import the Shipment Data Cleansing pipeline that you downloaded earlier.
Now deploy the pipeline: click Deploy in the top right of the Studio page. After deployment, the Pipeline page opens.
Click Run at the top center of the Pipeline page to run the pipeline.
Click Check my progress to verify the objective.
After the status of the Shipment Data Cleansing pipeline shows Succeeded, you will proceed to import and deploy the Delayed Shipments USA data pipeline that you downloaded earlier.
Click Studio from the left navigation panel to return to the Cloud Data Fusion Studio page.
Click Import in the top right of the Studio page, then select and import the Delayed Shipments USA data pipeline that you downloaded earlier.
Deploy the pipeline by clicking Deploy in the top right of the Studio page. After deployment, the Pipeline page opens.
Click Run in the top center of the Pipeline page to run the pipeline.
After this second pipeline successfully completes, you can continue to perform the remaining steps below.
Click Check my progress to verify the objective.
You must discover a dataset before exploring its lineage.
Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page, then enter shipment in the Search box. The search results include the shipment datasets that your pipelines consumed and generated.
A metadata search discovers datasets that have been consumed, processed, or generated by Cloud Data Fusion pipelines. Pipelines execute on a structured framework that generates and collects technical and operational metadata. The technical metadata includes the dataset name, type, schema, fields, creation time, and processing information. This technical information is used by the Cloud Data Fusion metadata search and lineage features.
Although the Reference Name of sources and sinks is a unique dataset identifier and an excellent search term, you can use other technical metadata as search criteria, such as a dataset description, schema, field name, or metadata prefix.
Cloud Data Fusion also supports the annotation of datasets with business metadata, such as tags and key-value properties, which can be used as search criteria. For example, to add and search for a business tag annotation on the Raw Shipping Data dataset:
Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page.
Enter Raw shipping data in the Search box on the Metadata search page.
Click on Raw_Shipping_Data.
Under Business tags, click + then insert a tag name (alphanumeric and underscore characters are allowed) and press Enter.
You can perform a search on a tag by clicking the tag name or by entering tags: tag_name in the search box on the Metadata search page.
Select Metadata from the Cloud Data Fusion UI left navigation panel to open the metadata Search page, and enter shipment in the Search box.
Click on the Cleaned-Shipments dataset name listed on the Search page.
Then click the Lineage tab. The lineage graph shows that this dataset was generated by the Shipments-Data-Cleansing pipeline, which had consumed the Raw_Shipping_Data dataset.
Cloud Data Fusion field level lineage shows the relationship between the fields of a dataset and the transformations that were performed on a set of fields to produce a different set of fields. Like dataset level lineage, field level lineage is time-bound, and its results change with time.
The field level lineage shows how this field has transformed over time. Notice the transformations for the time_to_ship field: (i) converting it to a float type column, and (ii) determining whether the value is redirected to the next node or down the error path.
The lineage exposes the history of changes a particular field has gone through. Other examples include concatenating a few fields to compose a new field (like first name and last name combined to produce name), or computations done on a field (like converting a number to a percentage of the total count).
The cause and impact links show the transformations performed on both sides of a field in a human-readable ledger format.
In this lab, you have learned how to explore the lineage of your data. This information can be essential for reporting and governance, and it can help different audiences understand how data came to be in its current state.
Manual Last Updated November 14, 2022
Lab Last Tested August 08, 2023
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.