
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
Points are awarded for two tasks: Create and use an external table (60 points) and Import data via the console (40 points).
As a cloud data analyst, you'll likely encounter three main data architectures: data lakes, data warehouses, and lakehouses.
A data lake is a storage repository that holds large amounts of raw data in its original format until it’s needed. A guiding principle is to collect data first and decide what is useful later.
A data warehouse is a database that consolidates data from multiple source systems for data consistency, accuracy, and efficient access. A guiding principle is to select and organize data that is known to be useful and can be trusted.
A lakehouse is a hybrid data architecture that combines the features of a data lake with those of a data warehouse.
Cloud Storage is a great tool for a data lake. BigQuery can be configured as a data warehouse, as a data lake, or as a lakehouse.
In this exercise, you'll combine data stored in a Cloud Storage data lake with data stored in a BigQuery data warehouse to learn more about how to integrate data from different sources and store it in a way that is optimized for analysis. Then, you'll explore this hybrid architecture to learn more about how lakehouses work.
You are a cloud data analyst at TheLook eCommerce, a global clothing company that specializes in innovative fashion and ethical and sustainable sourcing.
Meredith, the head merchandiser, has asked you to help them complete a series of tasks to understand which products are selling and which are being returned, and to ensure that the cost of each product is up to date.
To get the job done, you’ll need to combine data from a data lake and a data warehouse. By doing that, you’ll use BigQuery as a lakehouse. You can use this hybrid architecture to find the data to address Meredith’s business concerns.
After receiving your task assignment from Meredith, you ask Artem, the data architect, for some guidance.
Artem recommends an external table for the returns analysis: because an external table only points to data stored elsewhere, such as the Cloud Storage data lake, you can query the return data in BigQuery and join it with the distribution center data without moving it out of the lake.
Meredith has also asked you to import data from a CSV file to update the product costs, so you ask Artem if you should use an external table for that task too.
Artem explains that while an external table only points to data that is stored someplace else, like a data lake, in this case you probably want the data stored in a standard BigQuery table, since it will be easier to manage and update.
You thank Artem and feel ready to take on Meredith’s tasks.
Here's how you'll do this task. First, you’ll create an external table that points to data stored in a data lake. Then, you’ll join the external table with a standard BigQuery table and explore the combined data. Finally, you’ll import data from a CSV file into a standard BigQuery table to better understand the difference between external and standard tables in BigQuery.
Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.
This practical lab lets you do the activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.
To complete this lab, you need:
Click the Start Lab button. On the left is the Lab Details panel with the following:
Click Open Google Cloud console (or right-click and select Open Link in Incognito Window) if you are running the Chrome browser. The Sign in page opens in a new browser tab.
Tip: You can arrange the tabs in separate, side-by-side windows to easily switch between them.
If necessary, copy the Google Cloud username below and paste it into the Sign in dialog. Click Next.
You can also find the Google Cloud username in the Lab Details panel.
Similarly, copy the Google Cloud password and paste it into the password prompt, then click Next. You can also find the Google Cloud password in the Lab Details panel.
After a few moments, the Console opens in this tab.
After some preliminary data analysis, Meredith is concerned that too many pairs of jeans are being returned. To better understand the situation and identify which distribution centers have the most returns, Meredith has asked you to create a report that details how many jeans have been returned at each distribution center. The return information is stored in a data lake on Cloud Storage. The distribution center information is stored in the data warehouse as a BigQuery standard table.
In this task, you'll create an external table to point to the data stored in Cloud Storage and then join that table with the BigQuery standard table. This will combine data from the data lake and data warehouse. Then, you’ll query the combined data to get Meredith the results they need.
In the following query, note the file type. Parquet is a popular format for storing data in cloud data analytics and is commonly used to create external tables in BigQuery. Because Parquet files are compressed, they take up less storage space. They are also easy to manage, since the data schema is stored within the file itself.
Copy and paste the following command into the Untitled query tab:
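As a sketch, the statement follows the same pattern as the external table example shown near the end of this lab (the exact URI in your lab instructions may differ):

```sql
-- Creates an external table that points to Parquet files in Cloud Storage.
-- The URI below is taken from the example table at the end of this lab;
-- use the exact path given in your lab instructions.
CREATE EXTERNAL TABLE thelook_gcda.product_returns
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://sureskills-lab-dev/DAC2M2L4/returns/returns_*.parquet']
);
```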
When you create an external table in BigQuery, you’re creating a table that points to data stored in an external location, such as Cloud Storage. The data types of the columns in the external table will be inferred from the data in the external location.
Now, check the table properties for the table you just created.
In the Explorer pane, expand the list of datasets by clicking the drop-down arrow next to the project ID.
Click the drop-down arrow next to the thelook_gcda dataset.
Select the product_returns table. Refresh your page if it’s not listed yet.
Click on the Details tab and study the details in the External Data Configuration section.
Check the Source URI(s) field and notice that the value starts with gs://. In the BigQuery UI, this field shows the location of the data source for a table, and the gs:// prefix indicates that the data is stored in Cloud Storage. When you create an external table, the data remains at the source location in Cloud Storage but can be queried just like a standard BigQuery table.
Copy the following query into the Query Editor:
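A minimal sketch of the kind of row-count query used here (the lab provides the exact query):

```sql
-- Counts the rows in the external table; the underlying data stays in Cloud Storage.
SELECT COUNT(*) AS row_count
FROM thelook_gcda.product_returns;
```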
This query returns a row count of 20,000.
Next, check the properties for the distribution_centers table.
In the Explorer pane, in the thelook_gcda dataset, select the distribution_centers table.
Click on the Details tab and study the details in the Table info and Storage info sections. This is not an external table as no external files are referenced.
Now, check the data in the product_returns table you created in the previous steps.
Copy the following query into the Query Editor:
This query returns the 10 most recent product returns based on the return status date.
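A sketch of such a query, assuming the return status date column is named returned_at (your lab's query names the exact column):

```sql
-- Shows the 10 most recent product returns.
-- The column name `returned_at` is an assumption; use the column named in the lab's query.
SELECT *
FROM thelook_gcda.product_returns
ORDER BY returned_at DESC
LIMIT 10;
```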
Click Run.
Copy the following query into the Query Editor:
This query joins data from a Cloud Storage bucket (the product_returns external table) with data stored in BigQuery (the standard distribution_centers table), and returns the distribution center name (distribution_centers.name) along with all the columns from the returns data in the Parquet file.
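A sketch of this kind of join, assuming the tables are related through a distribution center ID column (your lab's query uses the exact join condition):

```sql
-- Joins the external table (Parquet data in Cloud Storage) with the standard BigQuery table.
-- The join columns `distribution_center_id` and `id` are assumptions; the lab provides the exact query.
SELECT
  distribution_centers.name,
  product_returns.*
FROM thelook_gcda.product_returns AS product_returns
INNER JOIN thelook_gcda.distribution_centers AS distribution_centers
  ON product_returns.distribution_center_id = distribution_centers.id;
```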
Click Run.
Copy the following query into the Query Editor:
This query generates the information Meredith requested: it displays each distribution center name (distribution_centers.name) and the number of returned products in the "Jeans" category (product.category).
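A sketch of such an aggregation, with assumed names for the category and join columns (the lab provides the exact query):

```sql
-- Counts how many "Jeans" products were returned at each distribution center.
-- The column names `product_category` and `distribution_center_id` are assumptions.
SELECT
  distribution_centers.name AS distribution_center,
  COUNT(*) AS jeans_returned
FROM thelook_gcda.product_returns AS product_returns
INNER JOIN thelook_gcda.distribution_centers AS distribution_centers
  ON product_returns.distribution_center_id = distribution_centers.id
WHERE product_returns.product_category = 'Jeans'
GROUP BY distribution_center
ORDER BY jeans_returned DESC;
```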
Click Check my progress to verify that you have completed this task correctly.
A list of product names and updated product costs was sent by a key manufacturer. Meredith needs this data loaded so that their reports reflect the new costs and contain the most up-to-date information.
The manufacturer sent the list in CSV format, and tech support has copied it to a Cloud Storage bucket. Tech support also sent you a screenshot of the file so you have an idea of what the data should look like.
The file is located in the Cloud Storage bucket “sureskills-lab-dev” in the price_update folder. In this task, you'll import the data into the thelook_gcda dataset.
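In this task you perform the import through the console's table-creation workflow. For context, a roughly equivalent SQL sketch is shown below; the destination table name and the CSV file name are assumptions, and the console wizard in this task achieves the same result.

```sql
-- Loads the manufacturer's CSV from Cloud Storage into a standard BigQuery table.
-- The table name `product_costs` and the wildcard file name are assumptions;
-- follow the console steps in the lab for the actual import.
LOAD DATA INTO thelook_gcda.product_costs
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://sureskills-lab-dev/price_update/*.csv']
);
```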
Click Check my progress to verify that you have completed this task correctly.
Great work!
By helping Meredith learn more about product sales and returns, as well as updated product costs, you have given them the information they need to complete their reports and better understand TheLook eCommerce's product line.
You also gained practical experience in combining data stored in a Cloud Storage data lake with data stored in a BigQuery data warehouse. This helped you learn more about how to integrate data from different sources and store it in a way that is optimized for analysis. You also explored how this hybrid architecture works.
Finally, you imported data from a CSV file into a standard BigQuery table. This helped you learn about the different ways you can work and store data from multiple sources in BigQuery.
You’re well on your way to understanding how to use a data lakehouse and combine data in Cloud Storage with data in BigQuery.
Examples of differences between Standard Tables and External Tables:

| | Standard Table | External Table |
|---|---|---|
| Where the data lives | "Inside" BigQuery. | "Outside" BigQuery, in Cloud Storage. |
| CREATE TABLE (DDL) statement: external files | No reference to external files. | Must include a reference to external files. |
| CREATE TABLE statement: file format | No reference to file formats like CSV or Parquet. | Must specify the type of external file referenced. |
| Data storage format | An internal, BigQuery-specific format called Capacitor. | Several supported formats, including CSV, Parquet, AVRO, and Iceberg. |
| Performance | High. | Can be high, depending on factors including data size, file format, and partitioning structure. |
| CREATE TABLE statement: example | `CREATE TABLE thelook_gcda.products (Product_id INT, Product_name STRING, Active BOOLEAN);` | `CREATE EXTERNAL TABLE thelook_gcda.product_returns OPTIONS (format="PARQUET", uris = ['gs://sureskills-lab-dev/DAC2M2L4/returns/returns_*.parquet']);` |
There are additional technical differences, capabilities, and limitations between these two table types that are outside the scope of this lab. For more information, search for "BigQuery table types" or check the Introduction to tables documentation.
Before you end the lab, make sure you’re satisfied that you’ve completed all the tasks. When you're ready, click End Lab and then click Submit.
Ending the lab will remove your access to the lab environment, and you won’t be able to access the work you've completed in it again.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.