
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
This tutorial shows you how to use the Wrangler and Data Pipeline features in Cloud Data Fusion to clean, transform, and process taxi trip data for further analysis.
In this lab, you will:
- Connect Cloud Data Fusion to one or more data sources
- Apply basic transformations with Wrangler
- Join two data sources
- Write data to a sink and run the pipeline
Often, data needs to go through a number of pre-processing steps before analysts can leverage it to glean insights. For example, data types may need to be adjusted, anomalies removed, and vague identifiers converted to more meaningful entries. Cloud Data Fusion is a service for efficiently building ETL/ELT data pipelines. Cloud Data Fusion uses a Cloud Dataproc cluster to perform all the transforms in the pipeline.
This tutorial demonstrates Cloud Data Fusion using a subset of the NYC TLC Taxi Trips dataset on BigQuery.
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
Google Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on Google Cloud.
Google Cloud Shell provides command-line access to your Google Cloud resources.
In Cloud console, on the top right toolbar, click the Open Cloud Shell button.
Click Continue.
It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. For example:
gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.
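The commands and their sample output are not reproduced in this extract. A minimal sketch of the usual checks, assuming you only want to confirm the active account and project from Cloud Shell:
# List the active (authorized) account
gcloud auth list
# List the project ID that Cloud Shell is configured to use
gcloud config list project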
Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).
In the Google Cloud console, on the Navigation menu, select IAM & Admin > IAM.
Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud Overview > Dashboard.
If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role:
Copy the project number (for example, 729328892908).
On the Navigation menu, select IAM & Admin > IAM.
At the top of the roles table, click Grant Access.
For New principals, enter {project-number}-compute@developer.gserviceaccount.com, replacing {project-number} with your project number.
For Role, select Project (or Basic) > Editor, then click Save.
Thorough directions for creating a Cloud Data Fusion instance can be found in the Creating a Cloud Data Fusion instance Guide. The essential steps are as follows:
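The instruction and command block that precede the output message below are not included in this extract. Assuming the lab first disables the Cloud Data Fusion API from Cloud Shell before re-enabling it, the step would look roughly like this:
# Disable the Cloud Data Fusion API in the current project (sketch)
gcloud services disable datafusion.googleapis.com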
Your output says that the operation finished successfully.
Next, restart the connection to the Cloud Data Fusion API.
In the Google Cloud Console, enter Cloud Data Fusion API in the top search bar. Click on the result for Cloud Data Fusion API.
On the page that loads, click Enable.
When the API has been enabled again, the page will refresh and show the option to disable the API along with other details on the API usage and performance.
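If you prefer the command line, the API can also be re-enabled from Cloud Shell; a minimal sketch equivalent to clicking Enable in the console:
# Re-enable the Cloud Data Fusion API
gcloud services enable datafusion.googleapis.com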
On the Navigation menu, select Data Fusion.
To create a Cloud Data Fusion instance, click Create an Instance.
Enter a name for your instance.
Select Basic for the Edition type.
Under the Authorization section, click Grant Permission.
Leave all other fields as their defaults and click Create.
Once the instance is created, you need one additional step to grant the service account associated with the instance permissions on your project. Navigate to the instance details page by clicking the instance name.
Copy the service account to your clipboard.
In the GCP Console, navigate to IAM & Admin > IAM.
On the IAM Permissions page, click +Grant Access, add the service account you copied earlier as a new principal, and grant it the Cloud Data Fusion API Service Agent role.
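Alternatively, the same binding can be added from Cloud Shell. A sketch, where PROJECT_ID and SERVICE_ACCOUNT_EMAIL are placeholders for your own values, and roles/datafusion.serviceAgent is assumed to be the role ID behind "Cloud Data Fusion API Service Agent":
# Grant the Data Fusion service account the Service Agent role on the project (sketch)
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SERVICE_ACCOUNT_EMAIL" \
  --role="roles/datafusion.serviceAgent"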
Once the Cloud Data Fusion instance is up and running, you can start using Cloud Data Fusion. However, before Cloud Data Fusion can start ingesting data, you have to take some preliminary steps.
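Those preliminary data-loading steps are not reproduced in this extract. A sketch of what they might look like in Cloud Shell, assuming a bucket named after your project plus a "-temp" staging bucket (the names referenced later in the pipeline configuration), with the source location of the sample CSV left as a placeholder:
# Create a bucket named after the project and a "-temp" staging bucket (sketch)
export BUCKET=$GOOGLE_CLOUD_PROJECT
gsutil mb gs://$BUCKET
gsutil mb gs://$BUCKET-temp
# Copy the sample taxi data into the bucket; <source-bucket> is a placeholder
gsutil cp gs://<source-bucket>/ny-taxi-2018-sample.csv gs://$BUCKET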
Wrangler is an interactive, visual tool that lets you see the effects of transformations on a small subset of your data before dispatching large, parallel-processing jobs on the entire dataset. On the Cloud Data Fusion UI, choose Wrangler. On the left side, there is a panel with the pre-configured connections to your data, including the Cloud Storage connection.
Under GCS, select Cloud Storage Default.
Click on the bucket corresponding to your project name.
Select ny-taxi-2018-sample.csv. The data is loaded into the Wrangler screen in row/column form.
In the Parsing Options window, set Use First Row as Header to True. The data splits into multiple columns.
Click Confirm.
Now, you will perform some transformations to parse and clean the taxi data.
Click the Down arrow next to the trip_distance column, select Change data type and then click on Float. Repeat for the total_amount column.
Click the Down arrow next to the pickup_location_id column, select Change data type and then click on String.
If you look at the data closely, you may find some anomalies, such as negative trip distances. You can avoid those negative values by filtering them out in Wrangler. Click the Down arrow next to the trip_distance column and select Filter. Click if Custom condition and input >0.0.
Basic data cleansing is now complete and you've run transformations on a subset of your data. You can now create a batch pipeline to run transformations on all your data.
Cloud Data Fusion translates your visually built pipeline into an Apache Spark or MapReduce program that executes transformations on an ephemeral Cloud Dataproc cluster in parallel. This enables you to easily execute complex transformations over vast quantities of data in a scalable, reliable manner, without having to wrestle with infrastructure and technology.
On the upper-right side of the Cloud Data Fusion UI, click Create a Pipeline.
In the dialog that appears, select Batch pipeline.
In the Data Pipelines UI, you will see a GCSFile source node connected to a Wrangler node. The Wrangler node contains all the transformations you applied in the Wrangler view captured as directive grammar. Hover over the Wrangler node and select Properties.
Remove the extra column by pressing the red trashcan icon beside its name. Click Validate on the top right corner to check for any errors. To close the Wrangler tool, click the X button in the top right corner.
The taxi data contains several cryptic columns, such as pickup_location_id, that aren't immediately transparent to an analyst. You are going to add a data source to the pipeline that maps the pickup_location_id column to a relevant location name. The mapping information will be stored in a BigQuery table.
In a separate tab, open the BigQuery UI in the Cloud Console. Click Done on the 'Welcome to BigQuery in the Cloud Console' launch page.
In the Explorer section of the BigQuery UI, click the three dots beside your GCP Project ID (it will start with qwiklabs).
On the menu that appears, click on Create dataset.
In the Dataset ID field, type in trips.
Click on Create dataset.
To create the desired table in the newly created dataset, navigate to More > Query Settings. This process will ensure you can access your table from Cloud Data Fusion.
Select the item for Set a destination table for query results. For Dataset, input trips and select it from the dropdown. For Table Id, input zone_id_mapping. Click Save.
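The query the lab runs (with the destination table set as above) is not reproduced in this extract; it selects each pickup zone's ID, name, and borough. A sketch, assuming BigQuery's public NYC taxi zone table is the source:
-- Populate trips.zone_id_mapping with zone IDs, names, and boroughs (sketch)
SELECT
  zone_id,
  zone_name,
  borough
FROM
  `bigquery-public-data.new_york_taxi_trips.taxi_zone_geom`;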
You can see that this table contains the mapping from zone_id to its name and borough.
Now, you will add a source to your pipeline to access this BigQuery table. Return to the tab where you have Cloud Data Fusion open. From the Plugin palette on the left, select BigQuery from the Source section. A BigQuery source node appears on the canvas with the two other nodes.
Hover over the new BigQuery source node and click Properties.
To configure the Reference Name, enter zone_mapping, which is used to identify this data source for lineage purposes.
For the Dataset and Table, enter trips and zone_id_mapping respectively. For Temporary Bucket Name, input the name of your project followed by "-temp", which corresponds to the temporary bucket you created earlier.
To populate the schema of this table from BigQuery, click Get Schema. The fields will appear on the right side of the wizard.
Click Validate on the top right corner to check for any errors. To close the BigQuery Properties window click the X button in the top right corner.
Now you can join the two data sources—taxi trip data and zone names—to generate more meaningful output.
Under the Analytics section in the Plugin Palette, choose Joiner. A Joiner node appears on the canvas.
To connect the Wrangler node and the BigQuery node to the Joiner node: Drag a connection arrow > on the right edge of the source node and drop on the destination node.
Click Properties of Joiner.
Leave the label as Joiner.
Change the Join Type to Inner.
Set the Join Condition to join the pickup_location_id column in the Wrangler node to the zone_id column in the BigQuery node.
To generate the schema of the resultant join, click Get Schema.
In the Output Schema table on the right, remove the zone_id and pickup_location_id fields by hitting the red garbage can icon.
Click Validate on the top right corner to check for any errors. Close the window by clicking the X button in the top right corner.
You will store the result of the pipeline into a BigQuery table. Where you store your data is called a sink.
In the Sink section of the Plugin Palette, choose BigQuery.
Connect the Joiner node to the BigQuery node. Drag a connection arrow > on the right edge of the source node and drop on the destination node.
Open the BigQuery2 node by hovering over it and then clicking Properties. You will use a configuration similar to the existing BigQuery source: provide bq_insert for the Reference Name field, use trips for the Dataset, and the name of your project followed by "-temp" as the Temporary Bucket Name. You will write to a new table that is created for this pipeline execution. In the Table field, enter trips_pickup_name.
Click Validate on the top right corner to check for any errors. Close the window by clicking the X button in the top right corner.
At this point, you have created your first pipeline and can deploy and run it.
When you run a pipeline, Cloud Data Fusion provisions an ephemeral Cloud Dataproc cluster, runs the pipeline, and then tears down the cluster. This could take a few minutes. You can observe the status of the pipeline transition from Provisioning to Starting and from Starting to Running to Succeeded during this time.
To view the results after the pipeline runs:
Return to the tab where you have BigQuery open. Run the query below to see the values in the trips_pickup_name table:
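The lab's exact query isn't reproduced in this extract; a simple check of the output table might look like this (a sketch, run from the BigQuery editor in your lab project):
-- Inspect a sample of rows written by the pipeline (sketch)
SELECT
  *
FROM
  `trips.trips_pickup_name`
LIMIT 10;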
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
- 1 star = Very dissatisfied
- 2 stars = Dissatisfied
- 3 stars = Neutral
- 4 stars = Satisfied
- 5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.