
Building Transformations and Preparing Data with Wrangler in Cloud Data Fusion

Lab: 1 hour 30 minutes, 5 Credits, Introductory

Overview

Data integration is all about your data. When working with data, it’s handy to be able to see what the raw data looks like so that you can use it as a starting point for your transformation. With Wrangler, you can take a data-first approach to your data integration workflow.

The most common source of data for ETL (Extract-Transform-Load) applications is text files in comma-separated values (CSV) format, as many database systems export and import data this way. For the purposes of this lab you're using a CSV file, but the same techniques can be applied to database sources and any other data source that you have available in Cloud Data Fusion.

Objectives

In this lab you learn how to perform the following tasks:

  • Create a pipeline to ingest from a CSV file.
  • Use Wrangler to apply transformations by using point-and-click and the CLI interfaces.

For most of this lab, you're working with Wrangler Transformation Steps, which are used by the Wrangler plugin, so that your transformations are encapsulated in one place and you can group transformation tasks into manageable blocks. This data-first approach lets you quickly visualize your transformations.

Setup

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Sign in to Google Cloud Skills Boost using an incognito window.

  2. Note the lab's access time (for example, 02:00:00), and make sure you can finish within that time.
    There is no pause feature. You can restart if needed, but you have to start at the beginning.

  3. When ready, click Start lab.

    Note: Once you click Start lab, it will take about 15 - 20 minutes for the lab to provision necessary resources and create a Data Fusion instance. During that time, you can read through the steps below to get familiar with the goals of the lab.

    When you see lab credentials (Username and Password) in the left panel, the instance is created and you can continue logging into the console.
  4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud console.

  5. Click Open Google console.

  6. Click Use another account and copy/paste credentials for this lab into the prompts.
    If you use other credentials, you'll receive errors or incur charges.

  7. Accept the terms and skip the recovery resource page.

Note: Do not click End lab unless you have finished the lab or want to restart it. This clears your work and removes the project.

Log in to Google Cloud Console

  1. In the browser tab or window you are using for this lab session, copy the Username from the Connection Details panel, and then click the Open Google Console button.
Note: If you are asked to choose an account, click Use another account.
  2. Paste in the Username, and then the Password as prompted.
  3. Click Next.
  4. Accept the terms and conditions.

Since this is a temporary account, which will last only as long as this lab:

  • Do not add recovery options
  • Do not sign up for free trials
  5. Once the console opens, view the list of services by clicking the Navigation menu (Navigation menu icon) at the top-left.

Navigation menu

Activate Cloud Shell

Cloud Shell is a virtual machine that contains development tools. It offers a persistent 5-GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources. gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab completion.

  1. Click the Activate Cloud Shell button (Activate Cloud Shell icon) at the top right of the console.

  2. Click Continue.
    It takes a few moments to provision and connect to the environment. When you are connected, you are also authenticated, and the project is set to your PROJECT_ID.

Sample commands

  • List the active account name:
gcloud auth list

(Output)

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

(Example output)

Credentialed accounts:
 - google1623327_student@qwiklabs.net
  • List the project ID:
gcloud config list project

(Output)

[core]
project = <project_ID>

(Example output)

[core]
project = qwiklabs-gcp-44776a13dea667a6

Note: Full documentation of gcloud is available in the gcloud CLI overview guide.
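
If the project is ever not set in your Cloud Shell session, you can set it manually (a minimal example; substitute the Project ID from your Connection Details panel for <project_ID>):

gcloud config set project <project_ID>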

Check project permissions

Before you begin working on Google Cloud, you must ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu (Navigation menu icon), click IAM & Admin > IAM.

  2. Confirm that the default compute service account {project-number}-compute@developer.gserviceaccount.com is present and has the Editor role assigned. The account prefix is the project number, which you can find on the Navigation menu > Cloud overview.

Default compute service account

If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.

  1. In the Google Cloud console, on the Navigation menu, click Cloud overview.

  2. From the Project info card, copy the Project number.

  3. On the Navigation menu, click IAM & Admin > IAM.

  4. At the top of the IAM page, click Add.

  5. For New principals, type:

{project-number}-compute@developer.gserviceaccount.com

Replace {project-number} with your project number.

  6. For Select a role, select Basic (or Project) > Editor.

  7. Click Save.
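
If you prefer to work from Cloud Shell, the following is a minimal sketch of the same check and grant using gcloud (this assumes GOOGLE_CLOUD_PROJECT is set, as it is by default in Cloud Shell; replace {project-number} with your project number):

# Check which roles are currently bound to the default compute service account
gcloud projects get-iam-policy $GOOGLE_CLOUD_PROJECT \
  --flatten="bindings[].members" \
  --filter="bindings.members:compute@developer.gserviceaccount.com" \
  --format="table(bindings.role)"

# Grant the Editor role if it is missing
gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
  --member="serviceAccount:{project-number}-compute@developer.gserviceaccount.com" \
  --role="roles/editor"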

Task 1. Add necessary permissions for your Cloud Data Fusion instance

  1. In the Google Cloud console, from the Navigation menu select Data Fusion > Instances.
Note: Creation of the instance takes around 10 minutes. Please wait for it to be ready.

Next, you will grant permissions to the service account associated with the instance, using the following steps.

  2. From the Google Cloud console, navigate to IAM & Admin > IAM.

  3. Confirm that the Compute Engine Default Service Account {project-number}-compute@developer.gserviceaccount.com is present, and copy the service account name to your clipboard.

  4. On the IAM Permissions page, click +Grant Access.

  5. In the New principals field, paste the service account.

  6. Click into the Select a role field and start typing "Cloud Data Fusion API Service Agent", then select it.

  7. Click Save.
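
The same grant can also be made from Cloud Shell. A minimal sketch, assuming roles/datafusion.serviceAgent is the role ID behind "Cloud Data Fusion API Service Agent" (replace {project-number} with your project number):

gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
  --member="serviceAccount:{project-number}-compute@developer.gserviceaccount.com" \
  --role="roles/datafusion.serviceAgent"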

Click Check my progress to verify the objective. Add Cloud Data Fusion API Service Agent role to service account

Grant service account user permission

  1. In the console, on the Navigation menu, click IAM & admin > IAM.

  2. Select the Include Google-provided role grants checkbox.

  3. Scroll down the list to find the Google-managed Cloud Data Fusion service account that looks like service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com and then copy the service account name to your clipboard.

Google-managed Cloud Data Fusion service account listing

  4. Next, navigate to IAM & Admin > Service Accounts.

  5. Click the default Compute Engine service account that looks like {project-number}-compute@developer.gserviceaccount.com, and select the Permissions tab on the top navigation.

  6. Click the Grant Access button.

  7. In the New Principals field, paste the service account you copied earlier.

  8. In the Role dropdown menu, select Service Account User.

  9. Click Save.
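
Equivalently, you can make this grant from Cloud Shell with a single command (a minimal sketch; replace {project-number} with your project number in both the resource and the member):

gcloud iam service-accounts add-iam-policy-binding \
  {project-number}-compute@developer.gserviceaccount.com \
  --member="serviceAccount:service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com" \
  --role="roles/iam.serviceAccountUser"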

Task 2. Load the data

Next you will create a Cloud Storage bucket in your project so that you can load some sample data for wrangling. Cloud Data Fusion will later read data out of this storage bucket.

  1. In Cloud Shell, execute the following commands to create a new bucket:
export BUCKET=$GOOGLE_CLOUD_PROJECT
gcloud storage buckets create gs://$BUCKET

The created bucket name is your Project ID.

  2. Run the command to copy the data file (a CSV file) into your bucket:
gcloud storage cp gs://cloud-training/OCBL163/titanic.csv gs://$BUCKET
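
Optionally, verify that the file is now in your bucket before continuing (a quick sanity check, not required by the lab):

gcloud storage ls gs://$BUCKET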

Click Check my progress to verify the objective. Load the data

Now you're ready to proceed further.

Task 3. Navigate the Cloud Data Fusion UI

In the Cloud Data Fusion UI you can use the various pages, such as Pipeline Studio or Wrangler, to access Cloud Data Fusion features.

To navigate the Cloud Data Fusion UI, follow these steps:

  1. In the console, return to Navigation menu > Data Fusion.
  2. Then click the View Instance link next to your Data Fusion instance.
  3. Select your lab credentials to sign in.
Note: If you see a 500 error message, close the tab and retry steps 2-3.

The Cloud Data Fusion web UI comes with its own navigation panel (on the left) to navigate to the page you need.

  4. In the Cloud Data Fusion UI, click the Navigation menu on the top left to expose the navigation panel.
  5. Then choose Wrangler.

Task 4. Working with Wrangler

Wrangler is an interactive, visual tool that lets you see the effects of transformations on a small subset of your data before dispatching large, parallel-processing jobs on the entire dataset.

  1. When Wrangler loads, on the left side is a panel with the pre-configured connections to your data, including the Cloud Storage connection.

  2. Under GCS, select Cloud Storage Default.

  3. Click the bucket corresponding to your Project ID.

  4. Click titanic.csv.

  5. In the parsing options, select the text format from the drop-down.

Parsing options with text format selected

  6. Click Confirm. The data is loaded into Wrangler.

Raw data loaded into Wrangler

  7. The first operation is to parse the raw CSV data into a tabular representation split into rows and columns. To do this, click the drop-down icon on the first column heading, select the Parse menu item, and then select CSV from the submenu.

Expanded dropdown menu

  8. The first row of the raw data consists of column headings, so in the Parse as CSV dialog box select the option to Set first row as header, and then click Apply.

  9. At this stage, the raw data is parsed and you can see the columns generated by this operation to the right of the body column.

  10. You no longer need the body column, so remove it by clicking the drop-down icon next to the body column heading and selecting the Delete column menu item.

Note: To apply transformations, you can also use the command line interface (CLI). The CLI is the black bar at the bottom of the screen (with the green $ prompt). As you start typing commands, the autofill feature kicks in and presents you with matching options. For example, to drop the body column, you could have alternatively used the directive: drop body.

drop directive

  11. Click the Transformation steps tab on the far right of your Wrangler UI. You will see the two transformations you have applied so far.
Note: Both the menu selections and the CLI create directives that are visible on the Transformation steps tab on the right of the screen. Directives are individual transformations that are collectively referred to as a recipe.

In a later part of the lab, you will add more transformation steps using the CLI.

As you apply Transformation Steps to your dataset, the transformations affect the sampled data and provide visual cues that can be explored through the Insights browser.

  12. Click the Insights tab in the top middle area to see how the data is distributed across the various columns.

Insights tabbed page

  13. Explore the interface to discover new ways of analyzing your data. Drag and drop the Age field onto the x encoding to see how your data perspectives change.

Insights tabbed page

  14. Click the Create Pipeline button to transition to pipeline development mode, where you can check the directives that you created within the Wrangler plugin.

Create Pipeline button

  15. When presented with the next dialog, select Batch pipeline to continue.

Pipeline type dialog with Batch pipeline selected

  16. Once the Pipeline Studio opens, point to the Wrangler node and click Properties.

  17. Under Directives, review the recipe of directives that you added earlier. In the next section, you will add more transformation steps using the CLI.

 Directives section

Task 5. Working with Transformation Steps

In this section, you will continue working in the Wrangler UI to explore the CSV dataset and apply transformations through the CLI.

  1. Click the Wrangle button under the Directives section of your Wrangler node's Properties box. You will be back in the Wrangler UI.

  2. Click the Transformation steps tab on the far right of your Wrangler UI to expose the directives. Verify that you currently have two transformation steps.

You will now add more transformation steps using the CLI and see how they modify the data. The CLI is the black bar at the bottom of the screen (with the green $ prompt).

  3. Copy the following directives and paste them into your CLI at the $ prompt. You will see the Transformation Steps on the right of your screen get updated.
fill-null-or-empty :Cabin 'none'
send-to-error empty(Age)
parse-as-csv :Name ',' false
drop Name
fill-null-or-empty :Name_2 'none'
rename Name_1 Last_Name
rename Name_2 First_Name
set-type :PassengerId integer
parse-as-csv :First_Name '.' false
drop First_Name
drop First_Name_3
rename First_Name_1 Salutation
fill-null-or-empty :First_Name_2 'none'
rename First_Name_2 First_Name
send-to-error !dq:isNumber(Age) || !dq:isInteger(Age) || (Age == 0 || Age > 125)
set-type :Age integer
set-type :Fare double
set-column Today_Fare (Fare * 23.4058)+1
generate-uuid id
mask-shuffle First_Name

The following is an explanation of what the directives do to your data. Do NOT enter them again in the CLI, as you have already done so.

a. fill-null-or-empty :Cabin 'none' fills null or empty values in the Cabin column with 'none' so that the column is 100% complete.

b. send-to-error empty(Age) sends records with an empty Age value to the error collector, so there are no empty cells left in the Age column.

c. parse-as-csv :Name ',' false splits the Name column into two separate columns containing the last name and the first name.

d. rename Name_1 Last_Name and rename Name_2 First_Name rename the newly created columns, Name_1 and Name_2, to Last_Name and First_Name.

e. drop Name removes the Name column as it's no longer needed.

f. set-type :PassengerId integer converts the PassengerId column into an integer.

g. The following directives extract the salutation from the First_Name column, delete the redundant columns, and rename the newly created columns accordingly:

parse-as-csv :First_Name '.' false
drop First_Name
drop First_Name_3
rename First_Name_1 Salutation
fill-null-or-empty :First_Name_2 'none'
rename First_Name_2 First_Name

h. The send-to-error !dq:isNumber(Age) || !dq:isInteger(Age) || (Age == 0 || Age > 125) directive performs data quality checks on the Age column, while set-type :Age integer sets it as an integer column.

i. set-type :Fare double converts the Fare column to a double so you can perform some arithmetic with the column values.

j. set-column Today_Fare (Fare * 23.4058)+1 multiplies the Fare column by the inflation rate of the dollar since 1912 to get the adjusted dollar value (see the worked example after this list).

k. generate-uuid id creates an identity column to uniquely identify each record.

l. mask-shuffle First_Name masks the First_Name column to de-identify the person, i.e. PII.
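
As a quick sanity check of the Today_Fare arithmetic, take a hypothetical Fare value of 7.25: the directive computes 7.25 * 23.4058 + 1 ≈ 170.69, so that row's Today_Fare would be roughly 170.69.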

  4. Click the More link on the top right above your Transformation steps, and then click View Schema to examine the schema that the transformations generated. Click the download icon to download it to your computer.

    Schema list generated by the transformation

  5. Click X to close the Schema page.

  6. You can click the download icon under Transformation steps to download the directives recipe to your computer to keep a copy of the transformation steps for future use.

    Highlighted Download icon on the Transformation steps page

  7. Click the Apply button on the top right to ensure all the newly entered transformation steps get added to the Wrangler node's configuration. You will then be redirected back to the properties box of the Wrangler node.

  8. Click X to close it. You're back in the Pipeline Studio.

Task 6. Ingestion into BigQuery

In order to ingest the data into BigQuery, create a dataset.

  1. In a new tab, open BigQuery in the Google Cloud console, or right-click the Google Cloud console tab and select Duplicate, then use the Navigation menu to select BigQuery. If prompted, click Done.

  2. In the Explorer pane, click the View actions icon next to your Project ID (it will start with qwiklabs) and then select Create dataset.

    Explorer pane with the View Action icon and Create dataset option highlighted

a. Dataset ID: demo_cdf

b. Click Create dataset. Record the name to use later in the lab.
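
If you prefer the command line, you can create the same dataset from Cloud Shell instead (a minimal sketch using the bq tool that is preinstalled in Cloud Shell):

bq mk --dataset $GOOGLE_CLOUD_PROJECT:demo_cdf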

  3. Navigate back to the Cloud Data Fusion UI tab.

a. To add the BigQuery sink to the pipeline, navigate to the Sink section in the left panel and click the BigQuery icon to place it on the canvas.

b. Once the BigQuery sink has been placed on the canvas, connect the Wrangler node with the BigQuery node. Do this by dragging the arrow from the Wrangler node to connect to the BigQuery node as illustrated.

GCSFile, Wrangler, and BigQuery nodes on the canvas

c. Point your mouse over your BigQuery node, click Properties, and enter the following configuration settings:

Field | Value
Reference Name | DemoSink
Dataset Project ID | Your Project ID
Dataset | demo_cdf (the dataset you created in the previous step)
Table | Enter an appropriate name (like titanic)

The table will be automatically created.

d. Click the Validate button to check if everything is set up correctly.

e. Click X to close it. You're back in the Pipeline Studio.

  4. Now you're ready to execute your pipeline.

a. Give your pipeline a name (like DemoBQ).

b. Click Save and then click Deploy on the top right to deploy the pipeline.

c. Click Run to start the pipeline execution. You may click the Summary icon to look at some statistics.

After execution completes, the status changes to Succeeded. Navigate back to your BigQuery Console to query your results.

Status: Succeeded
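
From Cloud Shell, a quick query confirms the ingested rows (a minimal sketch; this assumes you named the sink table titanic as suggested above):

bq query --use_legacy_sql=false \
  'SELECT Salutation, COUNT(*) AS passengers
   FROM demo_cdf.titanic
   GROUP BY Salutation
   ORDER BY passengers DESC'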

Click Check my progress to verify the objective. Ingestion into BigQuery

Congratulations!

In this lab, you explored the Wrangler UI. You learned how to add transformation steps (directives) through the menu as well as using the CLI. Wrangler allows you to apply many powerful transformations to your data iteratively and you can use the Wrangler UI to view how it affects the schema of your data before you deploy and run your pipeline.

End your lab

When you have completed your lab, click End Lab. Qwiklabs removes the resources you’ve used and cleans the account for you.

You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.

The number of stars indicates the following:

  • 1 star = Very dissatisfied
  • 2 stars = Dissatisfied
  • 3 stars = Neutral
  • 4 stars = Satisfied
  • 5 stars = Very satisfied

You can close the dialog box if you don't want to provide feedback.

For feedback, suggestions, or corrections, please use the Support tab.

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.