
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage or needing a database administrator. BigQuery uses SQL and can take advantage of the pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.
Joining data tables can provide meaningful insight into your dataset. However, when you join your data, there are common pitfalls that could corrupt your results. This lab focuses on avoiding those pitfalls. Types of joins:
For more information about joins, see Join Page.
The dataset you'll use is an ecommerce dataset that has millions of Google Analytics records for the Google Merchandise Store loaded into BigQuery. You have a copy of that dataset for this lab and will explore the available fields and row for insights.
For syntax information to help you follow and update the queries, see Standard SQL Query Syntax.
In this lab, you perform these tasks:
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00
), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.
In your BigQuery project, create a new dataset titled ecommerce
.
ecommerce
. Leave the other options at their default values, and click Create dataset.In the left pane, you see an ecommerce
dataset listed under your project.
Scenario: Your team provides you with a new dataset on the inventory stock levels for each of your products for sale on your ecommerce website. You want to become familiar with the products on the website and the fields you could use to potentially join on to other datasets.
The project with the new dataset is data-to-insights.
BigQuery public datasets are not displayed by default in the BigQuery web UI. Since data-to-insights is a public dataset project, you have to pin it to your Resource Tree:
In the Explorer panel, then click on + Add data.
Select Star a project by name.
For Project name, enter data-to-insights
.
Click Star.
In the explorer pane, you will see the data-to-insights project starred.
You want to become familiar with the products on the website and the fields you could use to create queries to analyze the dataset.
In the left pane in the Explorer section, navigate to data-to-insights
> ecommerce
> all_sessions_raw
.
On the right, click the Schema tab to see the Fields and information about each field.
Examine the products and fields further. You want to become familiar with the products on the website and the fields you could use to potentially join on to other datasets.
In this section, you find how many product names and product SKUs are on your website and whether either one of those fields is unique.
Click RUN.
Look at the pagination results in the web UI for the total number of records returned, which in this case is 2,273 products and SKUs.
But...do the results mean that there are 2,273 unique product SKUs?
DISTINCT
:The number of distinct SKUs are 1,909.
Hmmm...you have 1,909 distinct SKUs which is less than the 2,273 number for total number of products on the website. The first results probably contain products with duplicate SKUs.
Take an even closer look at the records. Determine which products have more than one SKU and which SKUs have more than one product.
Copy and paste the below query to determine if some product names have more than one SKU. Notice ou use the STRING_AGG() function to aggregate all the product SKUs that are associated with one product name.
Results:
Do some product names have more than one SKU? Look at the query results to confirm.
Answer: Yes
It may also be true that one product name be associated with more than one SKU. This can happen due to variation. For example, one product name (e.g. T-Shirt) can have multiple product variants like color, size, etc. You would expect one product to have many SKUs.
Question: When you look at the query results, are there single SKU values with more than one product name associated? What do you notice about those product names?
Answer: Yes, it looks like there are quite a few SKUs that have more than one product name. Several of the product names appear to be closely related with a few misspellings (e.g. Waterproof Gear Bag vs Waterproof Gear Bag).
You see why this could be an issue in the next section.
A SKU is designed to uniquely identify one product and will be the basis of your join condition when you join against other tables. Having a non-unique key can cause serious data issues.
'GGOEGPJC019099'
.Possible solution:
v2ProductName | productSKU |
---|---|
7" Dog Frisbee | GGOEGPJC019099 |
7" Dog Frisbee | GGOEGPJC019099 |
Google 7-inch Dog Flying Disc Blue | GGOEGPJC019099 |
From the query results, it looks like there are three different names for the same product. In this example, there is a special character in one name and a slightly different name for another:
See the impact of joining on a dataset with multiple products for a single SKU. First, explore the product inventory dataset (the products
table) to see if this SKU is unique there.
Next, join the inventory dataset against your website product names and SKUs so you can have the inventory stock level associated with each product for sale on the website.
What happens when you join the website table and the product inventory table on SKU? Do you now have inventory stock levels for the product?
Answer: Yes but the stockLevel is showing three times (one for each record)!
Next, run a query that shows the total stock level for each item in inventory.
What are the options to solve your triple counting dilemma? First, you need to only select distinct SKUs from the website before joining on other datasets.
data-to-insights.ecommerce.all_sessions_raw
.Possible solution:
Answer: 1,909 distinct SKUs from the website dataset
Now you're ready to join against your product inventory dataset again.
How many records were returned? All 1,909 distinct SKUs?
Answer: No, just 1,090 records
You lost 819 SKUs after joining the datasets, investigate by adding more specificity in your fields.
It appears the SKUs are present in both of those datasets after the join.
The default JOIN type is an INNER JOIN which returns records only if there is a match on both the left and the right tables that are joined.
Possible solution:
You have successfully used a LEFT JOIN to return all of the original 1,909 website SKUs in your results.
How many SKUs are missing from your product inventory set?
Possible solution:
Why might the product inventory dataset be missing SKUs?
Answer: Unfortunately, there is no easy answer. It is most likely a business-related question:
Are there any products are in the product inventory dataset but missing from the website?
Possible solution:
Answer: Yes. There are two product SKUs missing from the website dataset
Next, add more fields from the product inventory dataset for more details.
Why would the below products be missing from the ecommerce website dataset?
Possible answers:
Why would the new product not show up on your website dataset?
What if you wanted one query that listed all products missing from either the website or inventory?
Possible solution:
You have your 819 + 2 = 821 product SKUs
LEFT JOIN + RIGHT JOIN = FULL JOIN which returns all records from both tables regardless of matching join keys. You then filter out where you have mismatches on either side
Not knowing the relationship between data table keys (1:1, 1:N, N:N) can return unexpected results and also significantly reduce query performance.
The last join type is the CROSS JOIN.
Create a new table with a site-wide discount percent that you want to apply across products in the Clearance category.
Replacing the table named qwiklabs-***.
In the left pane, site_wide_promotion is now listed in the Resource section under qwiklabs-gcp-xxx
> ecommerce
.
Let's see the impact of unintentionally adding more than one record in the discount table.
Next let's view the data values in the promotion table.
What happens when you apply the discount again across all 82 clearance products?
How many products are returned?
Answer: Instead of 82, you now have 246 returned which is more records than your original table started with.
Let's investigate the underlying cause by examining one product SKU.
What was the impact of the CROSS JOIN?
Answer:
Since there are 3 discount codes to cross join on, you are multiplying the original dataset by 3.
The solution is to know your data relationships before you join and don't assume keys are unique.
At the start of the lab you wrote a query that showed multiple product names for a single SKU. Deduplicating records like this is a common skill for data analysts. Examine one way you can select only one product per SKU.
First, start with the query to show all product names per SKU.
Since most of the product names are extremely similar (and you want to map a single SKU to a single product), write a query to only choose one of the product_names. You will be using this StackOverflow post by Felipe Hoffa as inspiration.
You have successfully deduplicated the product names for each SKU. Experiment with the above query and your own datasets to deduplicate your fields before joining against other datasets.
You've concluded this lab and worked through some serious SQL join pitfalls by identifying duplicate records and knowing when to use each type of JOIN. Nice work!
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
This content is not currently available
We will notify you via email when it becomes available
Great!
We will contact you via email if it becomes available
One lab at a time
Confirm to end all existing labs and start this one