arrow_back

Building Batch Pipelines in Cloud Data Fusion

Join Sign in
Test and share your knowledge with our community!
done
Get access to over 700 hands-on labs, skill badges, and courses

Building Batch Pipelines in Cloud Data Fusion

Lab 1 hour 30 minutes universal_currency_alt 5 Credits show_chart Introductory
Test and share your knowledge with our community!
done
Get access to over 700 hands-on labs, skill badges, and courses

GSP807

Google Cloud Self-Paced Labs logo

Overview

ETL stands for Extract Transform and Load. There are various other mutations of this concept including EL, ELT, and ELTL.

This lab will teach you how to use the Pipeline Studio in Cloud Data Fusion to build an ETL pipeline. Pipeline Studio exposes the building blocks and built-in plugins for you to build your batch pipeline, one node at a time. You will also use the Wrangler plugin to build and apply transformations to your data that goes through the pipeline.

The most common source of data for ETL applications is typically data stored in comma separated value (CSV) format text files, as many database systems export and import data in this fashion. For the purposes of this lab you’ll be using a CSV file, but the same techniques can be applied to DB sources as well as any other data source that you have available.

The output will be written to a BigQuery table and you’ll use standard SQL to perform data analytics against this target data set.

Objectives

In this lab, you will learn how to:

  • Create a batch pipeline with Pipeline Studio in Cloud Data Fusion.
  • Use Wrangler to interactively transform data.
  • Write output into BigQuery.

Setup and requirements

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Sign in to Google Cloud Skills Boost using an incognito window.

  2. Note the lab's access time (for example, 02:00:00), and make sure you can finish within that time.
    There is no pause feature. You can restart if needed, but you have to start at the beginning.

  3. When ready, click Start lab.

    Note: Once you click Start lab, it will take about 15 - 20 minutes for the lab to provision necessary resources and create a Data Fusion instance. During that time, you can read through the steps below to get familiar with the goals of the lab.

    When you see lab credentials (Username and Password) in the left panel, the instance is created and you can continue logging into the console.
  4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud console.

  5. Click Open Google console.

  6. Click Use another account and copy/paste credentials for this lab into the prompts.
    If you use other credentials, you'll receive errors or incur charges.

  7. Accept the terms and skip the recovery resource page.

Note: Do not click End lab unless you have finished the lab or want to restart it. This clears your work and removes the project.

Log in to Google Cloud Console

  1. Using the browser tab or window you are using for this lab session, copy the Username from the Connection Details panel and click the Open Google Console button.
Note: If you are asked to choose an account, click Use another account.
  1. Paste in the Username, and then the Password as prompted.
  2. Click Next.
  3. Accept the terms and conditions.

Since this is a temporary account, which will last only as long as this lab:

  • Do not add recovery options
  • Do not sign up for free trials
  1. Once the console opens, view the list of services by clicking the Navigation menu (Navigation menu icon) at the top-left.

Navigation menu

Activate Cloud Shell

Cloud Shell is a virtual machine that contains development tools. It offers a persistent 5-GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources. gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab completion.

  1. Click the Activate Cloud Shell button (Activate Cloud Shell icon) at the top right of the console.

  2. Click Continue.
    It takes a few moments to provision and connect to the environment. When you are connected, you are also authenticated, and the project is set to your PROJECT_ID.

Sample commands

  • List the active account name:
gcloud auth list

(Output)

Credentialed accounts: - <myaccount>@<mydomain>.com (active)

(Example output)

Credentialed accounts: - google1623327_student@qwiklabs.net
  • List the project ID:
gcloud config list project

(Output)

[core] project = <project_ID>

(Example output)

[core] project = qwiklabs-gcp-44776a13dea667a6 Note: Full documentation of gcloud is available in the gcloud CLI overview guide.

Check project permissions

Before you begin working on Google Cloud, you must ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu (Navigation menu icon), click IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud overview.

Default compute service account

If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.

  1. In the Google Cloud console, on the Navigation menu, click Cloud overview.

  2. From the Project info card, copy the Project number.

  3. On the Navigation menu, click IAM & Admin > IAM.

  4. At the top of the IAM page, click Add.

  5. For New principals, type:

{project-number}-compute@developer.gserviceaccount.com

Replace {project-number} with your project number.

  1. For Select a role, select Basic (or Project) > Editor.

  2. Click Save.

Task 1. Load the data

Next, you will create a GCS bucket in your project and stage the CSF file. Cloud Data Fusion will later read data out of this storage bucket.

  1. In Cloud Shell, execute the following commands to create a new bucket and copy the relevant data into it:
export BUCKET=$GOOGLE_CLOUD_PROJECT gcloud storage buckets create gs://$BUCKET

The created bucket name is your Project ID.

  1. Run the command below to copy the data files (a CSV and an XML file) into your bucket:
gsutil cp gs://cloud-training/OCBL163/titanic.csv gs://$BUCKET

Click Check my progress to verify the objective. Load the data

Task 2. Add the necessary permissions for your Cloud Data Fusion instance

Next, you will grant permissions to the service account associated with the instance, using the following steps.

  1. In the Cloud Console, from the Navigation menu select Data Fusion > Instances. You should see a Cloud Data Fusion instance already setup and ready for use.
Note: Creation of the instance takes around 10 minutes. Please wait for it to be ready.
  1. From the Google Cloud console, navigate to the IAM & Admin > IAM.

  2. Confirm that the Compute Engine Default Service Account {project-number}-compute@developer.gserviceaccount.com is present, copy the Service Account to your clipboard.

  3. On the IAM Permissions page, click +Grant Access.

  4. In the New principals field paste the service account.

  5. Click into the Select a role field and start typing "Cloud Data Fusion API Service Agent", then select it.

  6. Click Save.

Click Check my progress to verify the objective. Add Cloud Data Fusion API Service Agent role to service account

Grant service account user permission

  1. In the console, on the Navigation menu, click IAM & admin > IAM.

  2. Select the Include Google-provided role grants checkbox.

  3. Scroll down the list to find the Google-managed Cloud Data Fusion service account that looks like service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com and then copy the service account name to your clipboard.

Google-managed Cloud Data Fusion service account listing

  1. Next, navigate to the IAM & admin > Service Accounts.

  2. Click on the default compute engine account that looks like {project-number}-compute@developer.gserviceaccount.com, and select the Permissions tab on the top navigation.

  3. Click on the Grant Access button.

  4. In the New Principals field, paste the service account you copied earlier.

  5. In the Role dropdown menu, select Service Account User.

  6. Click Save.

Task 3. Build a batch pipeline

When working with data it’s always handy to be able to see what the raw data looks like so that we can use it as a starting point for our transformation. For this purpose you’ll be using Data Fusion’s Wrangler component for preparing and cleaning data. This data-first approach will allow us to quickly visualize our transformations and the real-time feedback ensures us that you’re on the right track.

  1. In the Console return to Navigation menu > Data Fusion > Instances, then click the View Instance link next to your Data Fusion instance. Select your lab credentials to sign in. If prompted to take a tour of the service click on No, Thanks. You should now be in the Cloud Data Fusion UI.

Data Fusion view instance menu selection

  1. The Cloud Data Fusion web UI comes with its own navigation panel (on the left) to navigate to the page you need. In the Cloud Data UI, click on the Navigation menu on the top left to expose the navigation panel, then choose Wrangler.

  2. On the left side is a panel with the pre-configured connections to your data, including the Cloud Storage connection.Under (GCS) Google Cloud Storage, select Cloud Storage Default.

  3. Click on the bucket corresponding to your project name.

  4. Click on titanic.csv.

    Cloud Data Fusion cloud storage bucket titanic1

  5. In the Parsing Options dialog box that comes up, select text under the Format drop-down list.

  6. Then click CONFIRM.

Parsing options dialog

The data is loaded into the Wrangler screen. You can now start applying the data transformations iteratively.

  1. The first operation is to parse the raw CSV data into a tabular representation that is split into rows and columns. To do this, select the dropdown icon from the first column heading (body), and select the Parse menu item, and CSV from the submenu.

    Google Cloud storage parse > CSV menu selection

  2. In the raw data, you can see that the first row consists of column headings, so you’ll select the option to Set first row as header in the dialog box for Parse as CSV that is presented to us.

  3. Next, click Apply.

Parse as CSV dialog

Note: You may ignore the deprecation warning next to the Set first row as header checkbox.
  1. At this stage, the raw data is parsed and you can see the columns generated by this operation (columns to the right of the body column). On the far right, you see the list of all column names.

Cloud data fusion Wrangler UI

  1. Now it’s time for some cleanup. You no longer need the body column that represents the raw CSV data, so remove it using the menu option; select the dropdown icon from the first column heading body, and select the Delete column menu item.

Delete column menu selection

Note: To apply transformations, you can also use the command line interface (CLI). The CLI is the black bar at the bottom of the screen (with the green $ prompt). As you start typing commands the autofill feature kicks in and presents you with a matching option. For example, to drop the body column, you could have alternatively used the directive: drop :body.

CLI drop body column syntax

  1. Click on the Transformation steps tab on the far right of your Wrangler UI. You will see the two transformations you have applied so far.

Titanic.csv drop :body column

Note: Both the menu selections and the CLI create directives that are visible on the Transformation steps tab on the right of the screen. Directives are individual transformations that are collectively referred to as a recipe.

For the purpose of this lab, the two transformations, or recipe, are sufficient to create the ETL pipeline. The next step is to bring this recipe into a pipeline building step where the recipe represents the T in ETL.

  1. Click on the Create a Pipeline button to jump into the next section to create a pipeline, where you’ll see how the ETL pipeline comes together.

    Create a pipeline button

  2. When presented with the next dialog, select Batch pipeline to continue.

Batch pipeline selection

Note: A batch pipeline can be run interactively or scheduled to run as frequently as every 5 minutes or as little as once a year.

Task 4. Take in the pipeline studio

The rest of the pipeline building tasks will take in the pipeline studio, the UI that lets you compose data pipelines visually. You should now see the primary building blocks of your ETL pipeline in the studio.

At this point, you will see two nodes in your pipeline: the GCS File plugin that will read the CSV file from Google Cloud Storage and the Wrangler plugin that contains the recipe with the transformations.

Note: A node in a pipeline is an object that is connected in a sequence to produce a Directed Acyclic Graph. E.g. Source, Sink, Transform, Action, etc.

These two plugins (nodes) represent the E and the T in your ETL pipeline. To complete this pipeline add the BigQuery sink, the L portion of our ETL.

Pipeline architecture diagram

  1. To add the BigQuery sink to the pipeline, navigate to the Sink section on the left panel and click on the BigQuery icon to place it on the canvas.

BigQuery sink section

  1. Once the BigQuery sink has been placed on the canvas, connect the Wrangler node with the BigQuery node. Do this by dragging the arrow from the Wrangler node to connect to the BigQuery node as illustrated below. All that’s left to do now is to specify some configuration options so that you can write the data to the dataset you want.

    Wrangler node connection with BigQuery node

Task 5. Configure the pipeline

Now it’s time to configure the pipeline. You can do this by opening each of the nodes' properties to verify its settings and/or make any additional changes.

  1. Hover your mouse over the GCS node and a Properties button will be displayed. Click on this button to open up the configuration settings.

GCS node properties dialog

Each plugin has a few mandatory fields that must be present and they are marked with an asterisk ( * ). Depending on the plugin you are using, you may see an Input Schema, on the left, a Configuration section in the middle, and an Output Schema on the right.

You will notice that Sink plugins don’t have an output schema and Source plugins don’t have an input schema. Both Sink and Source plugins will have a mandatory Reference Name field as well to identify the data source/sink for lineage.

Each plugin will have a Label field. This is the label of the node you see on the canvas where your pipeline is displayed.

  1. Click the X on the top right of the Properties box to close it.

  2. Next, hover over your Wrangler node and click on Properties.

Wrangler node properties dialog

Note: Plugins like Wrangler that contain an input schema. These are the fields that are being passed in to the plugin to be processed. Once they are processed by the plugin outgoing data may be sent out in the output schema to the next node in the pipeline, or in the case of a sink written to a dataset.
  1. Click on X on the top right of the Properties box to close it.

  2. Hover your mouse over your BigQuery node, click on Properties and enter the following configuration settings:

    • For Reference Name, enter Titanic_BQ.

    • For Dataset, enter demo.

    • For Table, enter titanic.

  3. Click on X on the top right of the Properties box to close it.

BigQuery properties dialog

Task 6. Test the pipeline

All that’s left to do now is to test your pipeline to see that it works as expected, but before you do that make sure to name and save your draft so you won’t lose any of your work.

  1. Now click Save from the upper-right corner menu. You will be prompted to give a Name and add a description to the pipeline.

    • Enter ETL-batch-pipeline as the name of your pipeline.
    • Enter ETL pipeline to parse CSV, transform and write output to BigQuery for the description.
  2. Click on Save.

  3. To test your pipeline, click on the Preview icon. The button bar will now show a run icon that you can click to run the pipeline in preview mode.

  4. Click on the Run icon. While the pipeline is running in preview mode, no data is actually written to the BigQuery table, but you will be able to confirm that data is being read properly and that it will be written as expected once the pipeline is deployed. The preview button is a toggle so be sure to click it again to get out of preview mode when done.

    Pipeline being prepared to run

  5. When your pipeline completes running, hover your mouse over your Wrangler node and click on Properties.Then click on Preview tab. If all went well you should see the raw data that came in from the input, the node to the left, and the parsed records that will be emitted as output, to the node on the right.Click on X on the top right of the Properties box to close it.

Wrangler node output

Note: Each node that operates on data should show you similar output. This is a good way to prove your work to make sure you are on the right track before deploying your pipeline. If you encounter any errors you can easily fix it while in draft mode.
  1. Click on the Preview icon again, this time to toggle out of Preview mode.

  2. If everything looks good so far, you can proceed to deploy the pipeline. Click on the Deploy icon on the top right Deploy icon to deploy the pipeline.

You will see a confirmation dialog that your pipeline is being deployed:

Deploying pipeline confirmation

  1. Once your pipeline has successfully deployed, you’re now ready to run your ETL pipeline and load some data into BigQuery.

  2. Click on the Run icon to execute the ETL job.

  3. When done, you should see pipeline status changes to Succeeded indicating that the pipeline ran successfully.

    Cloud data fusion pipeline success

  4. As data is processed by the pipeline, you will see metrics being emitted by each node in the pipeline indicating how many records have been processed. In the parse operation it displays 892 records whereas in the source there were 893 records, so what happened? The parse operation took the first row and consumed it to set the column headings, so the remaining 892 records are what was left to process.

Pipeline parse CSV diagram

Click Check my progress to verify the objective Deploy and execute batch pipeline

Task 7. View the results

The pipeline writes output into a BigQuery table. You can verify that using the following steps.

  1. In a new tab, open the BigQuery UI in the Cloud Console or right-click on the Console tab and select Duplicate, then use the Navigation menu to select BigQuery. When prompted, click Done.

  2. In the left pane, in the Explorer section, click your Project ID (it will start with qwiklabs).

  3. Under the demo dataset in your project, click the titanic table and click Compose New Query, then run a simple query, such as:

SELECT * FROM `demo.titanic` LIMIT 10

Query results

Click Check my progress to verify the objective View the results

Congratulations!

You have now learned to use the building blocks available in Cloud Data Fusion's Pipeline Studio to build a batch pipeline. You also learned to use Wrangler to create transformation steps for your data.

Continue your quest

This self-paced lab is part of the Cloud Data Fusion quest. A quest is a series of related labs that form a learning path. Completing this quest earns you a badge to recognize your achievement. You can make your badge or badges public and link to them in your online resume or social media account. Enroll in this quest and get immediate completion credit. Refer to the Google Cloud Skills Boost catalog for all available quests.

Take your next lab

Continue your quest with Building Transformations and Preparing Data with Wrangler in Cloud Data Fusion.

Manual Last Updated February 03, 2023

Lab Last Tested July 20, 2023

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.