
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
In this lab, you learn how to load data into BigQuery and run complex queries. Next, you will execute a Dataflow pipeline that can carry out Map and Reduce operations, use side inputs, and stream into BigQuery.
In this lab, you learn how to use BigQuery as a data source into Dataflow, and how to use the results of a pipeline as a side input to another pipeline.
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).
In the Google Cloud console, on the Navigation menu, select IAM & Admin > IAM.
Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud Overview > Dashboard (for example, 729328892908).
If the account does not have the editor role, follow the steps below to assign the required role, replacing {project-number} with your project number.
If the account does not have the Dataflow Developer role, follow the steps below to assign the required role.
On the Navigation menu, click IAM & Admin > IAM.
Select the default compute Service Account {project-number}-compute@developer.gserviceaccount.com.
Select the Edit option (the pencil on the far right).
Click Add Another Role.
Click inside the box for Select a Role. In the Type to filter selector, type and choose Dataflow Developer.
Click Save.
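For reference, the same role can also be granted from the command line. This is only a sketch with placeholder values (my-project-id and {project-number} stand in for your own project ID and number), not a step the lab requires:

```
# Grant the Dataflow Developer role to the default compute service account.
# Replace my-project-id and {project-number} with your own values.
gcloud projects add-iam-policy-binding my-project-id \
  --member="serviceAccount:{project-number}-compute@developer.gserviceaccount.com" \
  --role="roles/dataflow.developer"
```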
On the Google Cloud Console title bar, click Activate Cloud Shell. If prompted, click Continue.
Run the following commands to ensure that the Dataflow API is enabled cleanly in your project. If prompted, click Authorize:
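The commands for this step are along these lines (a sketch: disabling and then re-enabling the service resets it cleanly; verify against the command block shown in your lab if it differs):

```
# Reset the Dataflow API so it starts from a clean state.
gcloud services disable dataflow.googleapis.com --force
gcloud services enable dataflow.googleapis.com
```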
You will be running all code from a curated training VM.
In the Console, on the Navigation menu, click Compute Engine > VM instances.
Locate the line with the instance called training-vm.
On the far right, under Connect, click on SSH to open a terminal window. If prompted, click Authorize.
In this lab, you will enter CLI commands on the training-vm.
Follow these instructions to create a bucket.
Property | Value (type value or select option as specified)
---|---
Name |
Location type | Region
Click Create.
If you get the Public access will be prevented prompt, select Enforce public access prevention on this bucket and click Confirm.
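Alternatively, for reference only, a bucket with equivalent settings could be created from the command line. A sketch, assuming you substitute your own bucket name and lab region:

```
# Create a regional bucket; bucket names must be globally unique.
gsutil mb -l REGION gs://BUCKET_NAME
# Enforce public access prevention, matching the console prompt above.
gsutil pap set enforced gs://BUCKET_NAME
```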
In the training-vm SSH terminal, enter the following to create three environment variables: one named BUCKET, another named PROJECT, and the last named REGION. Verify that each exists with the echo command:
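A sketch of what these commands look like; the values in angle brackets are placeholders for your own bucket name, project ID, and region:

```
BUCKET="<your unique bucket name>"
echo $BUCKET
PROJECT="<your project ID>"
echo $PROJECT
REGION="<your lab region>"
echo $REGION
```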
What is being returned?
The BigQuery table cloud-training-demos.github_repos.contents_java contains the content (and some metadata) of all the Java files present on GitHub in 2016.
How many files are there in this dataset?
Is this a dataset you want to process locally or on the cloud?
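One way to explore these questions is with the bq CLI from the training-vm terminal. This is a hedged example (the lab may instead have you browse the table in the BigQuery console), and it assumes the public table is readable from your lab project:

```
# Count the Java files in the public GitHub contents table.
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS num_files
   FROM `cloud-training-demos.github_repos.contents_java`'
```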
In the training-vm SSH terminal, navigate to the directory /training-data-analyst/courses/data_analysis/lab2/python and view the file JavaProjectsThatNeedHelp.py.
View the file with Nano. Do not make any changes to the code. Press Ctrl+X to exit Nano.
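A sketch of the navigation commands, assuming the training-data-analyst repository is cloned in your home directory on the training-vm:

```
cd ~/training-data-analyst/courses/data_analysis/lab2/python
nano JavaProjectsThatNeedHelp.py   # view only; press Ctrl+X to exit
```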
Refer to this diagram as you read the code. The pipeline looks like this:
The program requires BUCKET, PROJECT, and REGION values, and a flag indicating whether you want to run the pipeline locally using --DirectRunner or on the cloud using --DataFlowRunner.
Execute the pipeline locally by typing the following into the training-vm SSH terminal:
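The command typically combines the environment variables you set earlier with the local runner flag. This is a sketch; check the argparse section of JavaProjectsThatNeedHelp.py if the arguments differ:

```
python3 JavaProjectsThatNeedHelp.py \
  --bucket $BUCKET --project $PROJECT --region $REGION \
  --DirectRunner
```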
Once the pipeline has finished executing, on the Navigation menu, click Cloud Storage > Buckets and click on your bucket. You will find the results in the javahelp folder. Click on the Result object to examine the output.
Execute the pipeline on the cloud by typing the following into the training-vm SSH terminal:
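A sketch of the cloud submission, identical to the local run except for the runner flag:

```
python3 JavaProjectsThatNeedHelp.py \
  --bucket $BUCKET --project $PROJECT --region $REGION \
  --DataFlowRunner
```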
Return to the browser tab for Console. On the Navigation menu, click View All Products, and select Dataflow from the Analytics section.
Click on your job to monitor progress.
Once the pipeline has finished executing, on the Navigation menu, click Cloud Storage > Buckets and click on your bucket. You will find the results in the javahelp folder. Click on the Result object to examine the output. The file name will be the same, but you will notice that the file creation time is more recent.
Click Check my progress to verify the objective.
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.