BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage or needing a database administrator. BigQuery uses SQL and can take advantage of the pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.
This lab is an in-depth walkthrough of working with semi-structured data (ingesting JSON, array data types) inside of BigQuery. Denormalizing your schema into a single table with nested and repeated fields can yield performance improvements, but the SQL syntax for working with array data can be tricky. You will practice loading, querying, troubleshooting, and unnesting various semi-structured datasets.
Objectives
In this lab you learn how to:
Ingest JSON datasets.
Create ARRAYs and STRUCTs.
Unnest semi-structured data for insights.
Setup and requirements
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
Open BigQuery Console
In the Google Cloud Console, select Navigation menu > BigQuery.
The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.
Click Done.
Task 1. Create a new dataset
To create a dataset, click on the View actions icon next to your Project ID and then select Create dataset.
Name the new dataset fruit_store. Leave the other options at their default values (Data location, Default table expiration).
Click CREATE DATASET.
Task 2. Practice working with arrays in SQL
Normally in SQL you will have a single value for each row like this list of fruits below:
Row
Fruit
1
raspberry
2
blackberry
3
strawberry
4
cherry
What if you wanted a list of fruit items for each person at the store? It could look something like this:
Row
Fruit
Person
1
raspberry
sally
2
blackberry
sally
3
strawberry
sally
4
cherry
sally
5
orange
frederick
6
apple
frederick
In traditional relational database SQL, you would look at the repetition of names and immediately think to split the above table into two separate tables: Fruit Items and People.
In BigQuery, you're going to take a faster approach which involves the use of the array data type to achieve the below:
Row
Fruit (array)
Person
1
raspberry
sally
blackberry
strawberry
cherry
2
orange
frederick
apple
Look at the differences in this table compared to other tables you've seen before.
It's only two rows.
There are multiple field values for Fruit in a single row.
The people are associated with all of the field values.
An easier way to interpret the Fruit array:
Row
Fruit (array)
Person
1
[raspberry, blackberry, strawberry, cherry]
sally
2
[orange, apple]
frederick
Both of these tables are exactly the same. There are two key learnings here:
An array is simply a list of items in brackets [ ].
BigQuery (in Standard SQL mode) displays arrays as flattened. It simply lists the value in the array vertically (note that all of those values still belong to a single row).
Try it yourself. Enter the following in the BigQuery Query Editor:
#standardSQL
SELECT
['raspberry', 'blackberry', 'strawberry', 'cherry'] AS fruit_array
Click Run.
Now try executing this one:
#standardSQL
SELECT
['raspberry', 'blackberry', 'strawberry', 'cherry', 33] AS fruit_array
You should get an error that looks like the following:
Error: Array elements of types {INT64, STRING} do not have a common supertype at [3:1]
Arrays can only share one data type (all strings, all numbers). You might ask at this point, can you have an array of arrays? Yes, they can! This will be covered later.
Here's the final table to query against:
#standardSQL
SELECT person, fruit_array, total_cost FROM `data-to-insights.advanced.fruit_store`;
Click Run.
After viewing the results, click the JSON tab to view the nested structure of the results.
Uploading JSON files
What if you had a JSON file that you needed to ingest into BigQuery? You'll try this now.
Create a new table in the fruit_store dataset.
To create a table, click on the View actions icon next to the fruit_store dataset and select Open.
Then click Create table from the right panel.
Note: You may have to widen your browser window to see the Create table option.
Add the following details for the table:
Source: Choose Google Cloud Storage in the Create table from dropdown.
Select file from GCS bucket:cloud-training/data-insights-course/labs/optimizing-for-performance/shopping_cart.json.
File format: JSONL (Newline delimited JSON).
Call the new Table namefruit_details.
Under Schema, click on the checkbox of Auto detect.
Click Create table.
Click on the table fruit_details.
In the schema, note that fruit_array is marked as REPEATED, which means it is an array.
Task 3. Store data about an entity that has different data types
As you saw earlier, arrays can only have one data type. There is, however, a data type which supports having both multiple field names and types – the STRUCT data type.
The next dataset will be lap times of runners around the track. Each lap will be called a "split".
With this query, try out the STRUCT syntax and note the different field types within the struct container:
#standardSQL
SELECT STRUCT("Rudisha" as name, 23.4 as split) as runner
Row
runner.name
runner.split
1
Rudisha
23.4
What do you notice about the field aliases? Since there are fields nested within the struct (name and split are a subset of runner) you end up with a dot notation.
What if the runner has multiple splits in one record? How could you have multiple split times within a single record? Hint: The splits all have the same numeric datatype.
Answer: With an array of course!
Run the below query to confirm:
#standardSQL
SELECT STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits) AS runner
Row
runner.name
runner.split
1
Rudisha
23.4 26.3 26.4 26.1
To recap:
Structs are containers that can have multiple field names and data types nested inside.
Arrays can be one of the field types inside of a Struct (as shown above with the splits field).
Load in race results from other racers into a new table
Create a new dataset titled racing.
Create a new table titled race_results.
Ingest this Google Cloud Storage JSON file:
cloud-training/data-insights-course/labs/optimizing-for-performance/race_results.json.
Source: select Google Cloud Storage under Create table from dropdown.
Select file from GCS bucket:cloud-training/data-insights-course/labs/optimizing-for-performance/race_results.json.
File format: JSONL (Newline delimited JSON) and set Table name as race_results.
Move the Edit as text slider and add the following:
After the load job is successful, preview the schema for the newly created table.
Which field is the STRUCT? How do you know?
Answer: The participants field is the STRUCT because it is of type RECORD.
Which field is the ARRAY?
Answer: The participants.splits field is an array of floats inside the parent participants struct. It has a REPEATED Mode which indicates an array. Values of that array are called nested values since they are multiple values inside of a single field.
Practice querying nested and repeated fields
Let's see all of our racers for the 800 meter race:
#standardSQL
SELECT * FROM racing.race_results
What if you wanted to list the name of each runner and the type of race?
Run the below schema and see what happens:
#standardSQL
SELECT race, participants.name
FROM racing.race_results
Error: Cannot access field name on a value with type ARRAY<STRUCT<name STRING, splits ARRAY<FLOAT64>>>> at [2:27]
Much like forgetting to GROUP BY when you use aggregation functions, here there are two different levels of granularity. One row for the race and three rows for the participants names. So how do you change this:
Row
race
participants.name
1
800M
Rudisha
2
???
Makhloufi
3
???
Murphy
...to this:
Row
race
participants.name
1
800M
Rudisha
2
800M
Makhloufi
3
800M
Murphy
In traditional relational SQL, if you had a races table and a participants table what would you do to get information from both tables? You would JOIN them together. Here the participant STRUCT (which is conceptually very similar to a table) is already part of your races table but is not yet correlated correctly with your non-STRUCT field "race".
Can you think of what two word SQL command you would use to correlate the 800M race with each of the racers in the first table?
Answer: CROSS JOIN
Great!
Now try running this:
#standardSQL
SELECT race, participants.name
FROM racing.race_results
CROSS JOIN
participants # this is the STRUCT (it's like a table within a table)
Error: Table name "participants" missing dataset while no default dataset is set in the request.
Even though the participants STRUCT is like a table, it is still technically a field in the racing.race_results table.
Add the dataset name to the query:
#standardSQL
SELECT race, participants.name
FROM racing.race_results
CROSS JOIN
race_results.participants # full STRUCT name
And Run.
Wow! You've successfully listed all of the racers for each race!
Row
race
name
1
800M
Rudisha
2
800M
Makhloufi
3
800M
Murphy
4
800M
Bosse
5
800M
Rotich
6
800M
Lewandowski
7
800M
Kipketer
8
800M
Berian
You can simplify the last query by:
Adding an alias for the original table
Replacing the words "CROSS JOIN" with a comma (a comma implicitly cross joins)
This will give you the same query result:
#standardSQL
SELECT race, participants.name
FROM racing.race_results AS r, r.participants
If you have more than one race type (800M, 100M, 200M), wouldn't a CROSS JOIN just associate every racer name with every possible race like a cartesian product?
Answer: No. This is a correlated cross join which only unpacks the elements associated with a single row. For a greater discussion, refer to the working with arrays and structs reference documentation.
Recap of STRUCTs:
A SQL STRUCT is simply a container of other data fields which can be of different data types. The word struct means data structure. Recall the example from earlier: STRUCT("Rudisha" as name, [23.4, 26.3, 26.4, 26.1] as splits) AS runner.
STRUCTs are given an alias (like runner above) and can conceptually be thought of as a table inside of your main table.
STRUCTs (and arrays) must be unpacked before you can operate over their elements. Wrap an UNNEST() around the name of the struct itself or the struct field that is an array in order to unpack and flatten it.
Lab question: STRUCT()
Answer the below questions using the racing.race_results table you created previously.
Task: Write a query to COUNT how many racers were there in total.
To start, use the below partially written query:
#standardSQL
SELECT COUNT(participants.name) AS racer_count
FROM racing.race_results
Hint: Remember you will need to cross join in your struct name as an additional data source after the FROM.
Possible solution:
#standardSQL
SELECT COUNT(p.name) AS racer_count
FROM racing.race_results AS r, UNNEST(r.participants) AS p
Row
racer_count
1
8
Answer: There were 8 racers who ran the race.
Task 4. Unpack arrays with UNNEST( )
Now that you are familiar working with structs, it's time to apply that same knowledge of unpacking arrays to some traditional arrays.
Recall that the UNNEST operator takes an array and returns a table, with one row for each element in the array.
This will allow you to perform normal SQL operations like:
Aggregating values within an array.
Filtering arrays for particular values.
Ordering and sorting arrays.
As a reminder, an array is an ordered list of elements that share a data type.
You can create arrays in BigQuery by adding brackets [ ] and comma separating values.
Try the below query and be sure to note how many rows are output. Will it be 8 rows?
#standardSQL
SELECT
['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian'] AS normal_array
Answer: It is a single row with 8 array elements.
Row
normal_array
1
Rudisha
Makhloufi
Murphy
Bosse
Rotich
Lewandowski
Kipketer
Berian
Tip: If you already have a field that isn't in an array format you can aggregate those values into an array by using ARRAY_AGG().
In order to find the racers whose names begin with the letter M, you need to unpack the above array into individual rows so you can use a WHERE clause.
Unpacking the array is done by wrapping the array (or the name of the array) with UNNEST() as shown below.
Run the below query and note how many rows are returned:
#standardSQL
SELECT * FROM
UNNEST(['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian']) AS unnested_array_of_names
And you should see:
Row
unnested_array_of_names
1
Rudisha
2
Makhloufi
3
Murphy
4
Bosse
5
Rotich
6
Lewandowski
7
Kipketer
8
Berian
You have successfully unnested the array. This is also called flattening the array.
Now add a normal WHERE clause to filter these rows, and run the query:
#standardSQL
SELECT * FROM
UNNEST(['Rudisha','Makhloufi','Murphy','Bosse','Rotich','Lewandowski','Kipketer','Berian']) AS unnested_array_of_names
WHERE unnested_array_of_names LIKE 'M%'
Row
unnested_array_of_names
1
Makhloufi
2
Murphy
Lab question: Unpacking arrays with UNNEST( )
Write a query that will list the total race time for racers whose names begin with R. Order the results with the fastest total time first. Use the UNNEST() operator and start with the partially written query below.
Complete the query:
#standardSQL
SELECT
p.name,
SUM(split_times) as total_race_time
FROM racing.race_results AS r
, r.participants AS p
, p.splits AS split_times
WHERE
GROUP BY
ORDER BY
;
Hint:
You will need to unpack both the struct and the array within the struct as data sources after your FROM clause
Be sure to use aliases where appropriate
Possible solution:
#standardSQL
SELECT
p.name,
SUM(split_times) as total_race_time
FROM racing.race_results AS r
, UNNEST(r.participants) AS p
, UNNEST(p.splits) AS split_times
WHERE p.name LIKE 'R%'
GROUP BY p.name
ORDER BY total_race_time ASC;
Row
name
total_race_time
1
Rudisha
102.19999999999999
2
Rotich
103.6
Task 5. Filter within array values
You happened to see that the fastest lap time recorded for the 800 M race was 23.2 seconds, but you did not see which runner ran that particular lap. Create a query that returns that result.
Complete the partially written query:
#standardSQL
SELECT
p.name,
split_time
FROM racing.race_results AS r
, r.participants AS p
, p.splits AS split_time
WHERE split_time = ;
Possible solution:
#standardSQL
SELECT
p.name,
split_time
FROM racing.race_results AS r
, UNNEST(r.participants) AS p
, UNNEST(p.splits) AS split_time
WHERE split_time = 23.2;
Row
name
split_time
1
Kipketer
23.2
Congratulations!
You've successfully ingested JSON datasets, created arrays and structs, and unnested semi-structured data for insights.
End your lab
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
1 star = Very dissatisfied
2 stars = Dissatisfied
3 stars = Neutral
4 stars = Satisfied
5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
Les ateliers créent un projet Google Cloud et des ressources pour une durée déterminée.
Les ateliers doivent être effectués dans le délai imparti et ne peuvent pas être mis en pause. Si vous quittez l'atelier, vous devrez le recommencer depuis le début.
En haut à gauche de l'écran, cliquez sur Démarrer l'atelier pour commencer.
Utilisez la navigation privée
Copiez le nom d'utilisateur et le mot de passe fournis pour l'atelier
Cliquez sur Ouvrir la console en navigation privée
Connectez-vous à la console
Connectez-vous à l'aide des identifiants qui vous ont été attribués pour l'atelier. L'utilisation d'autres identifiants peut entraîner des erreurs ou des frais.
Acceptez les conditions d'utilisation et ignorez la page concernant les ressources de récupération des données.
Ne cliquez pas sur Terminer l'atelier, à moins que vous n'ayez terminé l'atelier ou que vous ne vouliez le recommencer, car cela effacera votre travail et supprimera le projet.
Ce contenu n'est pas disponible pour le moment
Nous vous préviendrons par e-mail lorsqu'il sera disponible
Parfait !
Nous vous contacterons par e-mail s'il devient disponible
Un atelier à la fois
Confirmez pour mettre fin à tous les ateliers existants et démarrer celui-ci
Utilisez la navigation privée pour effectuer l'atelier
Ouvrez une fenêtre de navigateur en mode navigation privée pour effectuer cet atelier. Vous éviterez ainsi les conflits entre votre compte personnel et le compte temporaire de participant, qui pourraient entraîner des frais supplémentaires facturés sur votre compte personnel.
In this lab you will work with semi-structured data (ingesting JSON, Array data types) inside of BigQuery. You will practice loading, querying, troubleshooting, and unnesting various semi-structured datasets.
Durée :
0 min de configuration
·
Accessible pendant 75 min
·
Terminé après 60 min