Build and Optimize Data Warehouses with BigQuery: Challenge Lab
In a challenge lab you’re given a scenario and a set of tasks. Instead of following step-by-step instructions, you will use the skills learned from the labs in the quest to figure out how to complete the tasks on your own! An automated scoring system (shown on this page) will provide feedback on whether you have completed your tasks correctly.
When you take a challenge lab, you will not be taught new Google Cloud concepts. You are expected to extend your learned skills, like changing default values and reading and researching error messages to fix your own mistakes.
To score 100% you must successfully complete all tasks within the time period!
This lab is recommended for students enrolled in the Build and Optimize Data Warehouses with BigQuery Quest.
- Use BigQuery to access public COVID and other demographic datasets.
- Create a new BigQuery dataset which will store your tables.
- Add a new date partitioned table to your dataset.
- Add new columns to this table with appropriate data types.
- Run a series of JOINS to populate these new columns with data drawn from other tables.
Before you click the Start Lab button
Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.
This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.
To complete this lab, you need:
- Access to a standard internet browser (Chrome browser recommended).
- Time to complete the lab---remember, once you start, you cannot pause a lab.
You are part of an international public health organization which is tasked with developing a machine learning model to predict the daily case count for countries during the Covid-19 pandemic. As a junior member of the Data Science team you've been assigned to use your data warehousing skills to develop a table containing the features for the machine learning model.
You are expected to have the skills and knowledge for these tasks, so don't expect step-by-step guides to be provided.
Your first step is to create a new dataset and table. The starting point for the machine learning model will be the `oxford_policy_tracker` table in the COVID 19 Government Response public dataset, which contains details of different actions taken by governments to curb the spread of Covid-19 in their jurisdictions.
Given that there will be models based on a range of time periods, you are instructed to create a new dataset and then create a date partitioned version of the `oxford_policy_tracker` table in your newly created dataset, with an expiry time set to 720 days.
You have also been instructed to exclude the United Kingdom (`alpha_3_code='GBR'`), Brazil (`alpha_3_code='BRA'`), Canada (`alpha_3_code='CAN'`) and the United States of America (`alpha_3_code='USA'`), as these will be subject to more in-depth nation and state specific analysis.
Then, in terms of additional information that is required, you have been told to add columns for population, country_area and a record column (named mobility) that will take six input fields representing average mobility data from the last six columns of the `mobility_report` table in the Google COVID 19 Mobility public dataset.
A colleague working on an ancillary task has provided you with the SQL they used for updating the daily new case data in a similar date partitioned table, through a JOIN with the `covid_19_geographic_distribution_worldwide` table from the European Center for Disease Control COVID 19 public dataset.
This is a useful table that contains a range of data, including recent national population data, that you should use to populate the population column in your table.
Their template updates a daily new case column, so you must modify it before you can use it to populate the population data from the European Center for Disease Control COVID 19 public dataset, but the final query will be very similar.
In addition to population data you must also add country area data to your table. The data for geographic country areas can be found in the `country_names_area` table from the Census Bureau International public dataset.
The last data ingestion task requires you to extract average values for the six component fields that comprise the mobility record data from the `mobility_report` table in the Google COVID 19 Mobility public dataset.
You need to be aware that the mobility information may be broken down by sub-regions for some countries, so there can be more than one daily record for each country. However, the machine learning model you are working on will only operate at a country level, so you must extract a daily average for these mobility fields that aggregates all daily records for each country into a single average for each mobility record element.
In order to ensure you are aligned with the rest of the team the following column names and data types have been specified that you must use when updating the schema for your table:
Your coworker has also given you a SQL snippet that is currently being used to analyze daily mobility patterns in the Google Mobility data. You should be able to use this as part of the query that will add the daily country data for the mobility record in your table.
When performing the JOINs between these various tables, you will need to use either the `alpha_3_code` column, which is the three-letter country code, or the `country_name` column in your table, which contains the full official country name. The corresponding column names in the secondary data tables may be different.
For your final task you must identify data issues that will need to be resolved by another member of your team. Once you have your columns populated, please run a query that returns a combined list of the DISTINCT countries that do not have any population data and countries that do not have country area information, ordered by country name. If a country has neither population nor country area it should appear twice. This will give you an idea of problematic countries.
Task 1. Create a table partitioned by date
Create a new dataset and create a table in that dataset partitioned by date, with an expiry of 720 days. The table should initially use the schema defined for the `oxford_policy_tracker` table in the COVID 19 Government Response public dataset.
You must also populate the table with the data from the source table for all countries except the United Kingdom (`GBR`), Brazil (`BRA`), Canada (`CAN`) and the United States (`USA`).
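One possible shape for this step is sketched below. The dataset and table names (`covid` and `oxford_policy_tracker_worldwide`) are illustrative choices of your own, and the public dataset path reflects the BigQuery public datasets at the time this lab was written — verify both before running.

```sql
-- Sketch: create a date-partitioned copy of the source table, excluding the
-- four countries that will be analyzed separately. The 720-day expiry is
-- applied per partition via partition_expiration_days.
CREATE OR REPLACE TABLE covid.oxford_policy_tracker_worldwide
PARTITION BY date
OPTIONS (partition_expiration_days = 720) AS
SELECT *
FROM `bigquery-public-data.covid19_govt_response.oxford_policy_tracker`
WHERE alpha_3_code NOT IN ('GBR', 'BRA', 'CAN', 'USA');
```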
Task 2. Add new columns to your table
- Update your table to add new columns with the appropriate data types, ensuring alignment with the specification provided to you.
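The schema update can be done in a single statement along the lines of the sketch below. The table name and the STRUCT child field names here are placeholders — substitute the exact names and data types from the specification you were given.

```sql
-- Sketch: add the three new columns in one ALTER TABLE statement.
ALTER TABLE covid.oxford_policy_tracker_worldwide
  ADD COLUMN population INT64,
  ADD COLUMN country_area FLOAT64,
  -- Replace these child field names with the ones in your specification.
  ADD COLUMN mobility STRUCT<
    avg_retail FLOAT64,
    avg_grocery_pharmacy FLOAT64,
    avg_parks FLOAT64,
    avg_transit_stations FLOAT64,
    avg_workplace FLOAT64,
    avg_residential FLOAT64
  >;
```

Alternatively, the columns and record elements can be added through the console, or with the `bq` command-line utility and a JSON schema file.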
Task 3. Add country population data to the population column
Add the country population data to the population column in your table with data from the `covid_19_geographic_distribution_worldwide` table in the European Center for Disease Control COVID 19 public dataset.
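An `UPDATE ... FROM` join is one way to do this, sketched below. The country-code and population column names in the ECDC table are assumptions — check them against the live schema, and note that the ECDC table may hold many daily rows per country, hence the `DISTINCT` subquery.

```sql
-- Sketch: copy national population figures into the new population column,
-- joining on the three-letter country code.
UPDATE covid.oxford_policy_tracker_worldwide t0
SET t0.population = t1.population
FROM (
  SELECT DISTINCT country_territory_code, population
  FROM `bigquery-public-data.covid19_ecdc.covid_19_geographic_distribution_worldwide`
) t1
WHERE t0.alpha_3_code = t1.country_territory_code;
```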
Check that the population column has been correctly populated.
Task 4. Add country area data to the country_area column
Add the country area data to the country_area column in your table with data from the `country_names_area` table in the Census Bureau International public dataset.
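A sketch of this update, assuming the same illustrative table name as before; since the census table has no three-letter country code, the join is on the full country name instead.

```sql
-- Sketch: populate country_area by joining on the full country name.
UPDATE covid.oxford_policy_tracker_worldwide t0
SET t0.country_area = t1.country_area
FROM `bigquery-public-data.census_bureau_international.country_names_area` t1
WHERE t0.country_name = t1.country_name;
```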
Check that the country_area column has been correctly populated.
Task 5. Populate the mobility record data
Populate the mobility record in your table with data from the Google COVID 19 Mobility public dataset.
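The sketch below shows one way to collapse the per-sub-region rows into a single daily country average before writing the STRUCT. The mobility field names must match the ones in your table's specification, and the working table name is illustrative.

```sql
-- Sketch: average the six mobility columns per country and day, then write
-- one STRUCT value into each matching row of the working table.
UPDATE covid.oxford_policy_tracker_worldwide t0
SET t0.mobility = STRUCT(
      t1.avg_retail AS avg_retail,
      t1.avg_grocery_pharmacy AS avg_grocery_pharmacy,
      t1.avg_parks AS avg_parks,
      t1.avg_transit_stations AS avg_transit_stations,
      t1.avg_workplace AS avg_workplace,
      t1.avg_residential AS avg_residential)
FROM (
  -- Collapse sub-region rows into one daily average per country.
  SELECT country_region, date,
         AVG(retail_and_recreation_percent_change_from_baseline) AS avg_retail,
         AVG(grocery_and_pharmacy_percent_change_from_baseline)  AS avg_grocery_pharmacy,
         AVG(parks_percent_change_from_baseline)                 AS avg_parks,
         AVG(transit_stations_percent_change_from_baseline)      AS avg_transit_stations,
         AVG(workplaces_percent_change_from_baseline)            AS avg_workplace,
         AVG(residential_percent_change_from_baseline)           AS avg_residential
  FROM `bigquery-public-data.covid19_google_mobility.mobility_report`
  GROUP BY country_region, date
) t1
WHERE t0.country_name = t1.country_region
  AND t0.date = t1.date;
```

Grouping by the same country-name-and-date combination used in the join keeps the mapping between averaged mobility rows and working-table rows one-to-one.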
Check that the mobility record has been correctly populated.
Task 6. Query missing data in population & country_area columns
- Run a query to find the missing countries in the population and country_area data. The query should list countries that do not have any population data and countries that do not have country area information, ordered by country name. If a country has neither population nor country area it must appear twice.
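The requirement that a country missing both values appears twice points at `UNION ALL`, which keeps duplicates across the two result sets. A sketch, again using the illustrative working-table name:

```sql
-- Sketch: list countries missing population data, then countries missing
-- country_area data; UNION ALL preserves a country that appears in both.
SELECT DISTINCT country_name
FROM covid.oxford_policy_tracker_worldwide
WHERE population IS NULL
UNION ALL
SELECT DISTINCT country_name
FROM covid.oxford_policy_tracker_worldwide
WHERE country_area IS NULL
ORDER BY country_name;
```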
Check query for missing data was correctly run.
Tips and tricks
Tip 1. Remember that you must exclude the United Kingdom (`GBR`), Brazil (`BRA`), Canada (`CAN`) and the United States (`USA`) data from your initial table.
Tip 2. When updating the schema for a BigQuery table you can use the console to add the columns and record elements, or you can use the command-line `bq` utility to update the schema by providing a JSON file with all of the field definitions, as explained in the BigQuery documentation.
Tip 3. The `covid19_ecdc` table in the European Center for Disease Control COVID 19 public dataset contains a `population` column that you can use to populate the `population` column in your table.
Tip 4. The `country_names_area` table from the Census Bureau International public dataset does not contain a three-letter country code column, but you can join it to your table using the full text `country_name` column that exists in both tables.
Tip 5. When updating the mobility record remember that you must select (and average) a number of records for each country and date combination so that you get a single average of each child column in the mobility record. You must join the resulting data to your working table using the same combination of country name and date that you used to group the source mobility records to ensure there is a unique mapping between the averaged source mobility table results and the records in your table that have a single entry for each country and date combination.
Tip 6. The UNION ALL operator combines the results of two queries without collapsing duplicate rows that arise from the union, so an entry that appears in both result sets is listed twice.
Earn your next skill badge quest
This self-paced lab is part of the Build and Optimize Data Warehouses with BigQuery skill badge quest. Completing this skill badge quest earns you the badge above, to recognize your achievement. Share your badge on your resume and social platforms, and announce your accomplishment using #GoogleCloudBadge.
This skill badge is part of Google Cloud’s Data Analyst learning path. If you have already completed the other skill badge quests in this learning path, search the catalog for 20+ other skill badge quests in which you can enroll.
Google Cloud training and certification
...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.
Manual Last Updated October 4, 2022
Lab Last Tested October 4, 2022
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.