Redacting Confidential Data within your Pipelines in Cloud Data Fusion

1 hour 30 minutes 7 Credits

GSP811

Overview

In this lab you will learn how to use the Cloud Data Fusion plugin for Cloud DLP to redact sensitive data.

Consider the following scenario, in which some sensitive customer information needs to be redacted:

Your support team documents the details of each support case they handle in a support ticket. All of the information in the support tickets is pulled into a CSV file. The support technicians are not supposed to document any customer information that's considered sensitive, but sometimes they mistakenly do so. You notice that in the CSV file some customers' phone numbers appear.

You want to go through the CSV file and hide all phone numbers. You create a Cloud Data Fusion pipeline that redacts the sensitive customer data by using the Cloud DLP plugin.

You will create a pipeline that does the following:

  • Redacts customer phone numbers and emails by masking them with the # character.

  • Stores the masked sensitive data and the non-sensitive data in Cloud Storage.

Objectives

  • Connect Cloud Data Fusion to a Cloud Storage source.

  • Deploy the Cloud DLP plugin.

  • Create a custom Cloud DLP template.

  • Use the Redact transform plugin to mask sensitive customer data.

  • Write the output data to Cloud Storage.

Setup

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Sign in to Qwiklabs using an incognito window.

  2. Note the lab's access time (for example, 02:00:00), and make sure you can finish within that time. There is no pause feature. You can restart if needed, but you have to start at the beginning.

  3. When ready, click Start lab.

  4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.

  5. Click Open Google Console.

  6. Click Use another account and copy/paste credentials for this lab into the prompts. If you use other credentials, you'll receive errors or incur charges.

  7. Accept the terms and skip the recovery resource page.

Log in to Google Cloud Console

  1. Using the browser tab or window you are using for this lab session, copy the Username from the Connection Details panel and click the Open Google Console button.

Note: If you are asked to choose an account, click Use another account.

  2. Paste in the Username, and then the Password as prompted.

  3. Click Next.

  4. Accept the terms and conditions.

Since this is a temporary account, which will last only as long as this lab:

  • Do not add recovery options

  • Do not sign up for free trials

  5. Once the console opens, view the list of services by clicking the Navigation menu at the top-left.

Activate Cloud Shell

Cloud Shell is a virtual machine that contains development tools. It offers a persistent 5-GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources. gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab completion.

  1. Click the Activate Cloud Shell button at the top right of the console.

  2. Click Continue. It takes a few moments to provision and connect to the environment. When you are connected, you are also authenticated, and the project is set to your PROJECT_ID.

Sample commands

  • List the active account name:

gcloud auth list

(Output)

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

(Example output)

Credentialed accounts:
 - google1623327_student@qwiklabs.net

  • List the project ID:

gcloud config list project

(Output)

[core]
project = <project_ID>

(Example output)

[core]
project = qwiklabs-gcp-44776a13dea667a6

Note: Full documentation of gcloud is available in the gcloud CLI overview guide.

Check project permissions

Before you begin working on Google Cloud, you must ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu, click IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud overview.

If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.

  1. In the Google Cloud console, on the Navigation menu, click Cloud overview.

  2. From the Project info card, copy the Project number.

  3. On the Navigation menu, click IAM & Admin > IAM.

  4. At the top of the IAM page, click Add.

  5. For New principals, type:

{project-number}-compute@developer.gserviceaccount.com

Replace {project-number} with your project number.

  6. For Select a role, select Basic (or Project) > Editor.

  7. Click Save.
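
The default compute service account name is the project number followed by a fixed suffix. As a minimal sketch, the Python helper below builds that name so it can be pasted into the New principals field. The function name and the sample project number are hypothetical and are not part of any Google library.

```python
def compute_service_account(project_number: str) -> str:
    """Build the default compute service account name for a project number."""
    # Project numbers are numeric; reject anything else early.
    if not project_number.isdigit():
        raise ValueError("project number must contain digits only")
    return f"{project_number}-compute@developer.gserviceaccount.com"


if __name__ == "__main__":
    # "123456789012" is a made-up project number used for illustration.
    print(compute_service_account("123456789012"))
```

Replace the sample value with the Project number shown on the Project info card.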

Setup Cloud Storage bucket

Next you will create a Cloud Storage bucket in your project so your pipeline can store output data.

  1. In Cloud Shell, execute the following commands to create a new bucket:

export BUCKET=$GOOGLE_CLOUD_PROJECT
gsutil mb gs://$BUCKET

The new bucket has the same name as your Project ID.

Click Check my progress to verify the objective. Setup Cloud Storage bucket

Add necessary permissions for your Cloud Data Fusion instance

Next, you will grant permissions to the service account associated with the instance, using the following steps.

  1. In the Cloud Console, from the Navigation menu select Data Fusion > Instances. You should see a Cloud Data Fusion instance already set up and ready for use.

  2. Click on the instance name. On the Instance details page copy the Service Account to your clipboard.

  3. In the Console, navigate to IAM & Admin > IAM.

  4. On the IAM Permissions page, click Add.

  5. In the New principals field, paste the service account.

  6. Click into the Select a role field and start typing Cloud Data Fusion API Service Agent, then select it.

  7. Click Save.

Click Check my progress to verify the objective. Add Cloud Data Fusion API Service Agent role to service account

Grant service account user permission

  1. In the console, on the Navigation menu, click IAM & admin > IAM.

  2. Select the Include Google-provided role grants checkbox.

  3. Scroll down the list to find the Google-managed Cloud Data Fusion service account that looks like service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com and then copy the service account name to your clipboard.

  4. Next, navigate to IAM & Admin > Service Accounts.

  5. Click the default Compute Engine service account that looks like {project-number}-compute@developer.gserviceaccount.com, and select the Permissions tab in the top navigation.

  6. Click the Grant Access button.

  7. In the New Principals field, paste the service account you copied earlier.

  8. In the Role dropdown menu, select Service Account User.

  9. Click Save.

Get Cloud DLP permissions

  1. In the Cloud Console, go to Navigation menu > IAM & Admin > IAM.

  2. At the top right of the Permissions table, select the Include Google-provided role grants checkbox.

  3. In the permissions table, in the Principal column, find the service account that matches the format service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com.

  4. Click the Edit button to the right of the service account.

  5. Click Add Another Role.

  6. Click the dropdown that appears.

  7. Use the search bar to search for and then select DLP Administrator.

  8. Click Save.

  9. Check that DLP Administrator appears in the Role column.

Click Check my progress to verify the objective. Get Cloud DLP permissions
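
Behind the console, a role grant like the one above is stored as a binding in the project's IAM policy. The following Python sketch models that with a plain dictionary, with no API calls. It assumes that the DLP Administrator display name corresponds to the role ID roles/dlp.admin, and it uses a made-up project number. The add_binding helper is hypothetical and only illustrates the shape of the data.

```python
def add_binding(policy: dict, role: str, member: str) -> dict:
    """Add member to role in an IAM-style policy dict, avoiding duplicates."""
    bindings = policy.setdefault("bindings", [])
    for binding in bindings:
        if binding["role"] == role:
            if member not in binding["members"]:
                binding["members"].append(member)
            return policy
    # No binding exists for this role yet, so create one.
    bindings.append({"role": role, "members": [member]})
    return policy


# Hypothetical project number, for illustration only.
project_number = "123456789012"
datafusion_sa = (
    f"serviceAccount:service-{project_number}"
    "@gcp-sa-datafusion.iam.gserviceaccount.com"
)

project_policy = {"bindings": []}
# Assumption: roles/dlp.admin is the role ID shown as "DLP Administrator".
add_binding(project_policy, "roles/dlp.admin", datafusion_sa)
print(project_policy)
```

Calling add_binding a second time with the same arguments leaves the policy unchanged, which mirrors how granting an already-held role has no further effect.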

Navigate to the Cloud Data Fusion UI

  1. In the Console return to Navigation menu > Data Fusion, then click the View Instance link next to your Data Fusion instance. Select your lab credentials to sign in, if required. If prompted to take a tour of the service click on No, Thanks. You should now be in the Cloud Data Fusion UI.

  2. In the Cloud Data Fusion UI, click the Navigation menu on the top left and navigate to the Studio page.

Next, you will create a pipeline.

Create the pipeline

The pipeline you build does the following:

  • Reads the input data using the Cloud Storage source plugin.

  • Deploys the Cloud DLP plugin from the Hub and applies the Redact transform plugin.

  • Writes the output data using a Cloud Storage sink plugin.

  1. In the left panel of your Studio page, under the Source menu, click the GCS plugin.

  2. Hold the pointer over the GCS node that appears and click Properties.

  3. Under Reference Name, enter a reference name.

  4. This lab uses the input dataset SampleRecords.csv, provided in a publicly available Cloud Storage bucket. Under Path, enter gs://cloud-training/OCBL167/SampleRecords.csv.

  5. Under Format, select csv.

  6. Under Output Schema, remove any existing fields. Then, under Name, click the + button to add each of the following fields:

  • Date

  • Bank

  • State

  • Zip

  • Notes

  7. Make sure every field is of type string. To change the type, click Type and select String from the dropdown.

  8. Select the checkbox for each field. This ensures that the pipeline doesn't fail when it encounters a null (empty) value.

  9. Click Validate to ensure that there are no errors.

  10. Click the X button in the upper-right corner of the dialog box.
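
The output schema configured above amounts to five nullable string fields. As a rough sketch, the Python snippet below builds an equivalent schema as an Avro-style JSON record. It assumes that Cloud Data Fusion represents schemas in this style, and the record name etlSchemaBody is also an assumption, so treat the exact layout as illustrative rather than as the plugin's exported format.

```python
import json

# Field names taken from the lab's Output Schema step.
FIELD_NAMES = ["Date", "Bank", "State", "Zip", "Notes"]


def build_schema(field_names, record_name="etlSchemaBody"):
    """Build an Avro-style record schema with nullable string fields."""
    return {
        "type": "record",
        "name": record_name,  # assumed record name, for illustration
        "fields": [
            # A union with "null" marks the field as nullable.
            {"name": name, "type": ["string", "null"]}
            for name in field_names
        ],
    }


schema = build_schema(FIELD_NAMES)
print(json.dumps(schema, indent=2))
```

Each field's type is a union of string and null, which is what the checkbox in the schema editor expresses.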

Redact sensitive data

The Redact transform plugin identifies sensitive records in your input stream of data and applies transformations that you define to those records. A record of data is considered sensitive if it matches pre-defined Cloud DLP filters you choose or a custom template you define.

For this lab, you want to redact customer phone numbers that some support technicians on your team accidentally recorded. They entered the sensitive information in the Notes section of the support tickets, which appears as the Notes column in the CSV file. You create a custom Cloud DLP template, and then provide the template ID in the properties menu of the Redact transform plugin.
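
To make the idea of masking concrete, here is a small local Python sketch that replaces phone numbers in a note with the # character. It is not Cloud DLP and does not call the Redact plugin. It uses a deliberately simple regular expression that only recognizes numbers written like 555-123-4567, whereas the PHONE_NUMBER infoType detects many more formats. The sample note is made up, and the plugin's actual output may differ in detail.

```python
import re

# Simplified pattern: three digits, three digits, four digits, separated by
# a dash, dot, or space. This is an assumption for illustration only.
PHONE_PATTERN = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def mask_phone_numbers(text: str, mask_char: str = "#") -> str:
    """Replace every character of each matched phone number with mask_char."""
    return PHONE_PATTERN.sub(lambda m: mask_char * len(m.group()), text)


if __name__ == "__main__":
    # Hypothetical support note, not taken from SampleRecords.csv.
    note = "Customer called from 555-123-4567 about a card."
    print(mask_phone_numbers(note))
```

The rest of the note is left untouched, which is the behavior you want for the Notes column: only the sensitive span is hidden.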

Deploy the Cloud DLP plugin

  1. In the Cloud Data Fusion UI, click Hub in the upper right.

  2. Click the Data Loss Prevention plugin.

  3. Click Deploy.

  4. Click Finish.

  5. Click the X button in the upper-right corner of the Data Loss Prevention | Deploy dialog box.

  6. Click the X button to exit the Hub.

Create a custom template

  1. In the Cloud Console, open Cloud DLP. Go to Navigation Menu > Security > Data Loss Prevention.

  2. Click the Configuration tab, and then click Create Template.

  3. Under Define template, in the Template ID field, enter an ID for your template. You will need the template ID later in this lab.

  4. Click Continue.

  5. Under Configure detection, click Manage infotypes.

  6. In the Built-in tab, use the filter to search for "phone number".

  7. Select PHONE_NUMBER.

  8. Click Done.

  9. Click Create.

Click Check my progress to verify the objective. Create a custom template
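
The template created in the console is, in essence, an inspect configuration that lists the infoTypes to detect. The Python sketch below assembles a request body of the kind the DLP API's inspectTemplates.create method accepts, to the best of my understanding. It only builds and prints JSON and sends nothing. The template ID phone-template is a hypothetical value, so substitute the ID you entered.

```python
import json


def build_inspect_template_request(template_id, info_types):
    """Build a JSON-serializable body describing an inspect template."""
    return {
        "templateId": template_id,
        "inspectTemplate": {
            "inspectConfig": {
                # Each infoType is referenced by its name.
                "infoTypes": [{"name": name} for name in info_types],
            },
        },
    }


# "phone-template" is a made-up template ID for illustration.
request_body = build_inspect_template_request("phone-template", ["PHONE_NUMBER"])
print(json.dumps(request_body, indent=2))
```

Adding another infoType later, such as EMAIL_ADDRESS, would mean extending the infoTypes list in the same structure.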

Apply the Redact transform

  1. Back in the Cloud Data Fusion UI, on the Studio page, click to expand the Transform menu.

  2. Click the Redact transform plugin.

  3. Drag a connection arrow from the GCS node to the Redact node.

  4. Hold the pointer over the Redact node and click Properties.

  • Set Custom Template to Yes.

  • Under Template ID, enter the template ID of the custom template you created.

  • Under Matching, apply Masking on Custom template within Notes.

Note: In addition to masking, there are other Cloud DLP transformations available with the Cloud Data Fusion Redact plugin. To learn more, see the Documentation tab in the properties menu of the Redact plugin.
  • Under Masking Character, enter "#".

  • Click Validate to ensure that there are no errors.

  • Click the X button in the upper-right corner of the dialog box.

Store the output data

Store the results of your pipeline in a Cloud Storage file.

  1. In the Cloud Data Fusion UI, on the Studio page, click to expand the Sink menu.

  2. Click GCS.

  3. Drag a connection arrow from the Redact node to the GCS2 node.

  4. Hold the pointer over the GCS2 node and click Properties.

  • Under Reference Name, enter a reference name.

  • Under Path, enter the path of the Cloud Storage bucket you created at the beginning of this lab (gs:// followed by your Project ID).

  • Under Format, select CSV.

  • Click Validate to ensure that there are no errors.

  • Click the X button in the upper-right corner of the dialog box.
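
Because the bucket created earlier is named after the Project ID, the sink path can be derived from it. A minimal Python sketch, assuming the GOOGLE_CLOUD_PROJECT environment variable is set as it is in Cloud Shell. The fallback value my-project-id is a placeholder and the helper name is hypothetical.

```python
import os


def sink_path(project_id: str) -> str:
    """Return the Cloud Storage path of the bucket named after the project."""
    return f"gs://{project_id}"


if __name__ == "__main__":
    # Falls back to a placeholder when the variable is not set.
    project_id = os.environ.get("GOOGLE_CLOUD_PROJECT", "my-project-id")
    print(sink_path(project_id))
```

The printed value is what goes into the Path field of the GCS sink.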

Run the pipeline in preview mode

Next, run the pipeline in preview mode before deploying it.

  1. Click Preview, and then click Run.

The Run button displays the pipeline status, which starts with Starting, then turns to Stop, and then to Run.

  2. When the preview run completes, on the Redact node, click Preview Data to see a side-by-side comparison of the input and output data. Confirm that phone numbers have been masked with the # character.

  3. Click the X button to close Preview Data.

Note: If you are not able to see the phone numbers in the Notes column, then hover over the entries to verify the result.

Redact another data type

While examining the preview run results, you notice that other sensitive information appears in the Notes column: email addresses. Go back and edit the Cloud DLP template to redact email addresses as well.

  1. In the Cloud Console, navigate to Cloud DLP. Go to Navigation Menu > Security > Data Loss Prevention.

  2. In the Configuration tab, select your template.

  3. Click Edit.

  4. Click Manage infotypes.

  5. In the Built-in tab, use the filter to search for "phone number" OR "email address".

  6. Select all and click Done.

  7. Click Save.

  8. Once again, run your pipeline in preview mode. Cloud Data Fusion will automatically use the updated Cloud DLP template.

  9. Confirm that both phone numbers and email addresses have been masked with the # character.

Note: If you are not able to see the phone numbers and email addresses in the Notes column, then hover over the entries to verify the result.

Click Check my progress to verify the objective. Redact another data type

Deploy and run the pipeline

  1. Make sure Preview mode is unchecked.

  2. Click Save. Clicking Save prompts you to name your pipeline. Give your pipeline a name, then click Save.

  3. Click Deploy.

  4. When deployment completes, click Run. Running your pipeline can take a few minutes. While you wait, you can observe the Status of the pipeline transition from Provisioning to Starting to Running to Succeeded.

Click Check my progress to verify the objective. Deploy and run the pipeline

View the results

  1. In the Cloud Console, go to Cloud Storage.

  2. In the Storage browser, navigate to the Cloud Storage bucket you specified in the sink Cloud Storage plugin properties.

  3. In Authenticated URL, copy the link and paste it into a new browser tab to download the CSV file with the results. Confirm that phone numbers and email addresses have been masked with the # character.
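
Checking a large CSV file by eye is error-prone, so a quick script can help with the confirmation step. The Python sketch below scans the Notes column of CSV text and reports any rows that still appear to contain a phone number or an email address. It assumes the column order Date, Bank, State, Zip, Notes, and it uses simple regular expressions that are far less thorough than Cloud DLP's detectors. The sample rows are invented for illustration.

```python
import csv
import io
import re

# Simplified patterns, assumed for illustration only.
PHONE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def find_unmasked_rows(csv_text: str, column_index: int = 4) -> list:
    """Return 1-based numbers of rows whose Notes column still looks sensitive."""
    leaks = []
    reader = csv.reader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        if len(row) <= column_index:
            continue  # skip short or empty rows
        notes = row[column_index]
        if PHONE.search(notes) or EMAIL.search(notes):
            leaks.append(row_number)
    return leaks


# Made-up rows: the first is masked, the second still shows an email address.
sample = (
    "2021-01-05,Example Bank,CA,94000,Customer at ############ asked about fees\n"
    "2021-01-06,Example Bank,NY,10000,Reach me at jane@example.com\n"
)
print(find_unmasked_rows(sample))
```

An empty list suggests the simple patterns found nothing left unmasked. It is not proof that no sensitive data remains.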

Congratulations

In this lab, you learned how to use Cloud DLP to mask parts of the data running through a Cloud Data Fusion pipeline. This helps when you need to remove or mask PII embedded in your data before sharing it with your audience.

Learn more about creating Cloud DLP templates.

Continue Your Quest

This self-paced lab is part of the Qwiklabs Building Advanced Codeless Pipelines on Cloud Data Fusion Quest. A Quest is a series of related labs that form a learning path. Completing this Quest earns you a badge to recognize your achievement. You can make your badge (or badges) public and link to them in your online resume or social media account. Enroll in this Quest and get immediate completion credit if you've taken this lab. See other available Qwiklabs Quests.

Take Your Next Lab

Continue your Quest with Exploring the Lineage of Data with Cloud Data Fusion

Manual Last Updated October 23, 2021
Lab Last Tested October 23, 2021

©2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.