arrow_back

Working with Cloud Dataprep on Google Cloud

Join Sign in

Working with Cloud Dataprep on Google Cloud

1 hour 1 Credit

This lab was developed with our partner, Trifacta. Your personal information may be shared with Trifacta, the lab sponsor, if you have opted-in to receive product updates, announcements, and offers in your Account Profile.

GSP050

Google Cloud self-paced labs logo

Overview

Cloud Dataprep is Google's self-service data preparation tool built in collaboration with Trifacta. In this lab you will learn how to clean and enrich multiple datasets using Cloud Dataprep. The lab exercises are based on a mock use case scenario.

Use case scenario:

You work for a technical services company that sells three monthly subscription products:

  • Silver (price: $9.99/month)
  • Gold (price: $14.99/month)
  • Platinum (price: $29.99/month)

The company occasionally offers promotional discounts, so some product prices may be slightly lower than those listed above. Your overall goal is to provide an analysis of sales activity by zip code over the course of three years.

To do this you'll need to join your customer contact datasource (where the zip code information resides) with sales data from your purchases datasource. Once you've joined the data, you'll aggregate the results.

What you'll learn

In this lab, you will learn how to perform the following tasks:

  • Cleaning and profiling data with Cloud Dataprep

  • Combining multiple datasets using Cloud Dataprep

  • Computing the results of formulas in Cloud Dataprep

Setup and requirements

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
Note: Use an Incognito or private browser window to run this lab. This prevents any conflicts between your personal account and the Student account, which may cause extra charges incurred to your personal account.
  • Time to complete the lab---remember, once you start, you cannot pause a lab.
Note: If you already have your own personal Google Cloud account or project, do not use it for this lab to avoid extra charges to your account.

How to start your lab and sign in to the Google Cloud Console

  1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab Details panel with the following:

    • The Open Google Console button
    • Time remaining
    • The temporary credentials that you must use for this lab
    • Other information, if needed, to step through this lab
  2. Click Open Google Console. The lab spins up resources, and then opens another tab that shows the Sign in page.

    Tip: Arrange the tabs in separate windows, side-by-side.

    Note: If you see the Choose an account dialog, click Use Another Account.
  3. If necessary, copy the Username from the Lab Details panel and paste it into the Sign in dialog. Click Next.

  4. Copy the Password from the Lab Details panel and paste it into the Welcome dialog. Click Next.

    Important: You must use the credentials from the left panel. Do not use your Google Cloud Skills Boost credentials. Note: Using your own Google Cloud account for this lab may incur extra charges.
  5. Click through the subsequent pages:

    • Accept the terms and conditions.
    • Do not add recovery options or two-factor authentication (because this is a temporary account).
    • Do not sign up for free trials.

After a few moments, the Cloud Console opens in this tab.

Note: You can view the menu with a list of Google Cloud Products and Services by clicking the Navigation menu at the top-left. Navigation menu icon

Activate Cloud Shell

Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on the Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources.

  1. Click Activate Cloud Shell Activate Cloud Shell icon at the top of the Google Cloud console.

  2. Click Continue.

It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. The output contains a line that declares the PROJECT_ID for this session:

Your Cloud Platform project in this session is set to YOUR_PROJECT_ID

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

  1. (Optional) You can list the active account name with this command:

gcloud auth list

Output:

ACTIVE: * ACCOUNT: student-01-xxxxxxxxxxxx@qwiklabs.net To set the active account, run: $ gcloud config set account `ACCOUNT`
  1. (Optional) You can list the project ID with this command:

gcloud config list project

Output:

[core] project = <project_ID>

Example output:

[core] project = qwiklabs-gcp-44776a13dea667a6 Note: For full documentation of gcloud, in Google Cloud, refer to the gcloud CLI overview guide.

Task 1. Open Google Cloud Dataprep

  1. In the Cloud Console go to the Navigation menu, and under ANALYTICS select Dataprep.

  2. To get into Cloud Dataprep, check that you agree to Google Dataprep Terms of Service, and then click Accept.

  3. Click the checkbox and then click Agree and Continue when prompted to share account information with Trifacta.

  4. Click Allow to give Trifacta access to your project.

  5. Select your lab credentials to sign in and click Allow.

  6. Check the box and click Accept to agree to Trifacta Terms of Service.

  7. If prompted to use the default location for the storage bucket, click Continue.

  8. For new users, a tutorial will launch, asking you to select datasets. Quit out of this screen by clicking Cancel or exiting out.

  9. Click on the Dataprep icon on the top left corner to go to the home screen.

Task 2. Retrieve dataset files

In this section you will add the sales activity files to a storage bucket that Dataprep created for you.

  1. Go back to the Cloud Console.
Note: If you have closed the Cloud Console you can open it by clicking the Google Cloud icon in the bottom left corner.
  1. Get your bucket name. From the Navigation menu, select Cloud Storage > Browser.

  2. Note the Dataprep bucket name to use in the next step.

  3. In the Cloud Shell command line, execute the following command, substituting [YOUR-BUCKET-NAME] with the Dataprep bucket name:

gsutil cp -r gs://spls/gsp050 gs://[YOUR-BUCKET-NAME]

You should receive a similar output:

Copying gs://spls/gsp050/lab_customers.csv [Content-Type=text/csv]... \ [4 files][ 8.5 MiB/ 8.5 MiB] Operation completed over 4 objects/8.5 MiB.

Click Check my progress to verify the objective. Retrieve dataset files

Task 3. Create a Flow

Go back to the Cloud Dataprep tab. To wrangle your data, you need to create a Flow. A Flow is a set of related datasets and the connections between them.

  1. Click Create Flow in the upper right-hand corner:
  1. Name your Flow Qwiklab1, leave the flow description blank, then click Ok.

At this point your Flow is created. To guide you, Dataprep creates some placeholders in the flow to help you get started. The first step is to import and add data to Dataprep and the Flow.

  1. Click on the + under Dataset to add a new data source and click the Import Datasets link.

  2. In the left navigation menu, click Cloud Storage > dataprep-staging-xxx... > gsp050 to access the sample data you stored in the previous section.

  3. Click the + next to each file listed. When you click on a file, it will move to the right side of the screen. Click Import & Add to Flow to add the datasets to your Flow:

Cloud Dataprep brings you back to the Flow View page, which now contains the datasets you added. Additional placeholders for the recipe and output are created.

The Flow View page containing the added datasets

Task 4. Clean customer data

Now that you have the data, the next step is to design a data preparation recipe to clean the customers dataset. From the template, there already is a recipe added using the lab_2013_transactions.csv data. Let’s skip this recipe for now and create our own.

  1. To create a new recipe:
  • Click the plus icon (+) next to the lab_customers.csv.
  • Then click Add new Recipe.

2 datasets, lab_2015_transactions.csv and lab_customers.csv, with the expanded menu displayed for lab_customers.csv

  1. Right-click on this new Recipe node.
  • From the drop-down menu, choose Edit name and description.

  • Change the name to lab_customers and click OK.

  1. A new Recipe node is created and a panel will open on the right side, displaying information about the recipe, including the data and any existing transformation steps.

  2. Hit the blue Edit Recipe button. (Alternatively, you can double click the recipe node itself.)

Cloud Dataprep opens the "Transformer Grid". This is a worksheet-like interface where you can design the steps in your data preparation recipe. The Transformer page is where you build your transformation recipe and see the results applied to the sample. When you are satisfied with what you see, execute the job against your dataset.

Each of the columns have a Name and an icon that specify the inferred data type. Possible data types are shown when you click the icon to the left of the column name:

The expanded menu for the column_id column, with the More types option highlighted, and the associated submenu options displayed

When you click on a column option, a Details panel opens on the right.

This Details panel is dynamic and contains information about whatever you have selected, including column information or suggested transformations. Click X in the top right of the Details panel to close the Details panel for now.

In the following steps, you explore data in the grid view and apply transformation steps to your recipe.

When you open the Transformer Grid, Cloud Dataprep automatically profiles the contents of your dataset and generates column-level histograms and data quality indicators. This profile information can be used to guide your data preparation process.

Apply a filter

  1. Scroll all the way to the right to the start_date column. Examine the horizontal bar at the top column:

A tricolored horizontal bar above the start date column. Colors include blue, red, and grey.

Note: You can move ahead if the red bar does not appear in the `start_date` column.

This is the data quality bar. The green part represents valid values, the gray represents missing or null values. A red bar indicates data that does not match the data type. Clicking on the sections of the data quality bar will generate suggestions that contain data quality conditionals. These conditionals test whether each record is valid, empty, or invalid, depending on the section of the bar that you clicked.

Using start_date and end_date as a filter, you will add a transformation to remove contacts where the start_date column is empty.

  1. Click on the grey part of the data quality bar for the start_date column.

Cloud Dataprep generates a list of suggested transformations on the right, based on your selection. You can hover your mouse over any of the suggestion cards and Dataprep will show you a preview of how your data will look if the suggestion is applied. If you select a card, Cloud Dataprep updates the grid to show you a preview of this transformation.

  1. Click Add on the "Delete rows with missing values in start_date" suggestion card on the right.

The rows that were highlighted in red have been removed from your dataset.

Fill in missing values

Look at the end_date column. Based on the data quality bar, there are a large number of rows with missing values. To easily work with this column, you'll insert an empty value–January 01, 2050–in those empty rows.

  1. Click the gray section of the data quality bar for the end_date column.

This will generate another set of suggested transformations. There should be a suggestion to Set missing values to NULL(). In this case, Dataprep does not know what exact value you may want to fill with, so it creates a template for you to modify.

  1. On any suggestion card, you then click Edit.

This opens the Add Step builder. Cloud Dataprep's suggested transformation has already been populated, but you can make adjustments to the code.

  1. In the Formula box, replace the NULL() with '2050/1/1' (with the quotes). The full formula should look like:

IFMISSING($col, '2050/01/01')

The Edit with formula box displaying the full formula in the Formula text box

  1. Click Add.

Now the data quality problems in the lab_customers dataset have been addressed and the gray part of the data quality bar is gone.

Task 5. Union multiple transactions datasets

Now switch gears and work on the transactions datasets.

  • Click on the QWIKLAB1 flow name at the top of the screen:

Flow title with Qwiklab1 highlighted

This brings you back to the Flow view.

Create a single dataset that unions the transactions datasets from 2013, 2014, and 2015.

  1. Click on the lab_2013_transactions dataset.

  2. Click the plus sign (+), and then click Add new Recipe.

Cloud Dataprep creates a new recipe and wrangled dataset named Untitled recipe.

  1. Right-click on this new wrangled dataset. From the drop-down menu, choose Edit name and description.

  2. Change the name to Combined Transactions and click OK.

Flow view for 2 datasets: lab_2013_transactions.csv and lab_2014_transactions.csv, as well as the new Combined Transactions recipe

  1. Double-click Combined Transactions to edit the recipe. This opens the recipe in the Transformer Grid. Notice that the data in the grid is the structured data from the lab_2013_transactions.csvdataset.

Combine multiple datasets with the same schema using a Union transform.

  1. Before you make any transformations, look at the bottom left of the Transformer Grid. Here you can see the metadata display:

Metadata display showing 5 columns, 26,476 rows, and 3 data types

The metadata shows a quick summary of what data is loaded in the grid. Remember that the data loaded is a sample of the entire dataset, up to 10MB.

  1. Look at the top of Transformer, near the recipe name. Here you can see the sample that is currently visible.
Note: Depending on the version, this may display differently.

Recipe name section that shows initial data is visible

OR

Recipe name section that shows full data is visible

The sample shown is the initial data from the source. For small datasets (under 10MB), Dataprep loads the entire dataset into the initial data sample.

  1. Click the Recipe icon at the top.

Recipe icon

  1. Click Add New Step.

  2. Type in "Union" in the search field then click on the result to get to the Union tool.

The Union Output field displays the output schema for your dataset. Each box represents a column. Cloud Dataprep bases the output schema on the schema of the dataset from which you initiated the union transform. In this case, the columns in the "Combined Transactions" dataset determine the columns that will appear in the combined output.

  1. Click Add Data.

  2. Check the lab_2014_transactions. In the drop down at the bottom left, select Align By Name and click Apply.

  3. Click Add to Recipe to combine the datasets. After adding the union to your script, look at the transaction_date column.

This dataset now includes records from January 2013 through December 2014.

  1. You have now unioned 2 of 3 datasets together. Check the metadata, you should also see additional rows.

Metadata display showing 5 columns, 68,597 rows, and 3 data types

Task 6. Modifying recipe steps

Over the course of working with data, you may often find yourself tweaking or removing certain transformations. Dataprep makes it very easy to edit your work. In this case, you have unioned 2 of the 3 datasets, but instead of adding a brand new step to union the remaining dataset, you can simply edit what you've already done.

  1. Click the Undo icon to rewind the last action, in this case, the Union step.

Notice that your recipe is now empty. The grid and metadata are also updated to reflect the original state.

  1. Click the Redo icon to bring back the Union step.

  2. In the Recipe panel, right click the Union step and choose Edit.

  3. This will open up the Union tool again. This time, click Add data again and check the lab_2015_transactions. In the drop down at the bottom left, select Align By Name and click Apply.

  4. Examine the column-to-column mappings. Click Add to Recipe to combine all three datasets.

  5. After adding the union to your script, look at the transaction_date column.

This dataset now includes records from January 2013 through December 2015. How many rows now show in the metadata?

  1. Click on the QWIKLAB1 flow name to return to the Flow View.

The flow visualization is updated to show how the three transactions datasets combine to form the Combined Transactions dataset:

Flow view of the Combined Transactions recipe for the three datasets

Task 7. Join transactions data to customers data

Now that the datasets are combined, you will enrich the transactions data with information about where each purchase was made. To do this, join the customer data to the transactions data. When performing a join, treat the larger dataset as the master dataset, or the "left side" of the join. The smaller dataset should be the detail dataset, or the "right side" of the join. In Cloud Dataprep, the dataset from which you initiate a join automatically becomes the master dataset.

  1. Double-click Combined Transactions to edit the recipe again.

  2. Click on the Join icon in the Transformer toolbar to open the Join tool.

Join icon

Note: You can also activate the Join tool like you did previously, using the Recipe panel and New Step and searching for ‘Join’. There are many ways of creating transformations in Dataprep.
  1. Click on the lab_customers dataset to bring in the other dataset, then click Accept.

  2. On the next screen, edit the Join keys and conditions. On the left is a preview of the join key matches. On the right are options to edit the join type, join keys and the preview statistics of the join. Dataprep will try to automatically infer the correct join keys based on common values between the datasets.

Optionally, you can edit the join keys. Hover in the Join keys section, then click on the pencil (Edit icon) to modify the join key or Add to add additional join keys.

For these datasets, Cloud Dataprep chose an inside join on column customer_id. This means the output dataset will be those records that have the same customer_id.

  1. Click Next.

  2. On the next screen, you can choose which columns to keep or drop after the join. In the Output Columns panel, put a check next to the following fields to add those columns to the Join:

  • customer_id (current)
  • transaction_date
  • ticket_price
  • product
  • address_state
  • address_zip
  • region
  • start_date
  • end_date

All unchecked columns will be dropped. Your results will look like this:

The Output columns panel displaying various checked and unchecked columns titles

  1. Click Review to preview the result of your join in the Transformer Grid.

  2. Click Add to Recipe.

Task 8. Create new columns and rename

As a final step, you will want to do some additional cleanup of the data for your report. You will need to create some columns with the necessary values that you want to visualize by.

Let’s see another way of authoring Transformations in Dataprep - from the column menus.

  1. Click the drop-down arrow next to transaction_date > Extract > Datetime > Year (YYYY).

A new formula builder opens, pre-populated with the selected actions. A preview is also generated in the grid.

  1. Click Add.

Notice that a new column is created, called year_transaction_date. In the previous step, you could have named the new column while editing the transformation. If you leave out the name, Dataprep will generate a new column based on the transformation step you took, or as column# if no source columns were selected.

  1. You will manually rename this column. Click the drop-down arrow next to year_transaction_date > Rename.

  2. Enter activity_year in the field. Click Add to accept the changes.

Note that you can rename multiple columns with this transformation by hitting Add to add more mappings.

Task 9. Publish the results to BigQuery

You just finished preparing your data and you're ready to produce a results file in Cloud Storage. Cloud Dataprep executes your data transformation recipe to produce your output file using the BigQuery engine.

  1. Click Run in the top right of the Transformer Grid.

  2. In the Run Job dialog, you can configure your job execution settings and output destination. By default, Cloud Dataprep will create a CSV file on Cloud Storage.

  3. Hover your mouse over the existing Publishing Action and hit Edit on the right.

  4. Click on the BigQuery tab on the left.

  5. Select the Dataprep database, and click the Create a new table button on the right.

  6. Enter transactions_by_customer as the new table name and select Append to this table every run as the write option.

  7. Click Update on the bottom to update the output settings.

  8. Now click Run to kick off your BigQuery job. This will take a few minutes. You can see the job processing on the Dataprep "Jobs" page. When it completes, you'll see a success message that resembles the following, and your data will be loaded in the new BigQuery table.

The Status displaying as Completed for the Combined Transactions recipe on the Flow jobs page

  1. Hover over the finished job and click Profile to see your data organized. It should resemble the following:

The Profile tabbed page displaying sections for All data, and Results profile by column

  1. Your results are visible by querying BigQuery directly. From the Google Cloud Console, navigate to ANALYTICS > BigQuery. Click on the Dataprep dataset.

  2. Enter select * from Dataprep.transactions_by_customer; into the Query Editor. Click Run to see the data that was published.

Cloud Dataprep is that simple! It's easy to cleanse and enrich multiple data sources using an intuitive, visual interface.

Click Check my progress to verify the objective. Publish the results to BigQuery

Congratulations!

This concludes the Working With Google Cloud Dataprep lab. In this lab, you started off by creating new flows and transforming data. You then learned how to use the Dataprep UI to filter messy data, union and join multiple files, and create and rename columns. Lastly, you produced a results file and exported it to Google Cloud Storage.

Finish your quest

This self-paced lab is part of the Transform and Clean your Data with Dataprep by Trifacta on Google Cloud quest. A quest is a series of related labs that form a learning path. Completing this quest earns you a badge to recognize your achievement. You can make your badge or badges public and link to them in your online resume or social media account. Enroll in this quest and get immediate completion credit. Refer to the Google Cloud Skills Boost catalog for all available quests.

Next steps / learn more

You can continue exploring Dataprep Professional Edition with a free 30-day trial available here. Please make sure to sign out from your temporary lab account and re-sign with your Google Cloud valid email. Advanced features, such as additional connectivity, pipeline orchestration, and adaptive data quality are also available in the Premium edition that you can explore in the Google Cloud Marketplace.

Read the how-to guides to learn how to discover, cleanse, and enhance data with Google Dataprep.

Google Cloud training and certification

...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.

Manual Last Updated September 6, 2022

Lab Last Tested May 2, 2022

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.