
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
Determine batch versus streaming sources
/ 100
Timely access to data is critical for organizations to respond quickly to market changes, customer needs, and operational issues.
Batch processing is a method of collecting large volumes of data over a period of time, then processing it all at once. It is best for processing large amounts of data and tasks that do not require near real-time processing.
Streaming processing is a method of processing data as it is received. It is best for processing data continuously in real time. It is important for data analysts and data scientists to understand the difference between the two since both processes have their own advantages and disadvantages.
As a data analyst, knowing when and how to apply batch processing or streaming processing helps you optimize the performance of data processing tasks, minimize time delay in data processing, and provide more accurate and timely data insights.
In this lab, you’ll process and collect data for a specific purpose, and observe the data loading in BigQuery tables using both computer-assisted batch processing and stream processing methods.
You’ve been asked to help Meredith, the lead merchandiser at TheLook eCommerce, monitor the results of their promotions and price changes.
Merchandisers set prices and make sure they sell the inventory they buy. For example, if t-shirts for a sports team are overstocked in a championship year, a retailer might decide to lower the price so the merchandise sells faster. That way, the store does not end up with unsold stock.
For this task, Meredith needs to view the number of items added to shopping carts two ways. Near real-time monitoring will allow Meredith to view the number of items added to shopping carts as they are available.
Minute by minute monitoring will allow Meredith to view the number of items added to each shopping cart in increments of one minute.
This data will help Meredith track the effectiveness of their promotions and price changes over time. They can then use this information to help improve the shopping experience for their customers and increase sales.
Artem, the data architect, points out that shopping cart activity is streamed into one of the tables in BigQuery. You need to help Meredith understand how to find the data she needs to monitor their merchandise.
Here’s how you'll do this task: First, you’ll search for your dataset and associated table. Next, you’ll run a query to display the time each product was added to the shopping cart. Then, you’ll rerun the query. Finally, you’ll examine the properties of the shopping_cart and order tables.
Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.
This practical lab lets you do the activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.
To complete this lab, you need:
Click the Start Lab button. On the left is the Lab Details panel with the following:
Click Open Google Cloud console (or right-click and select Open Link in Incognito Window) if you are running the Chrome browser. The Sign in page opens in a new browser tab.
Tip: You can arrange the tabs in separate, side-by-side windows to easily switch between them.
If necessary, copy the Google Cloud username below and paste it into the Sign in dialog. Click Next.
You can also find the Google Cloud username in the Lab Details panel.
You can also find the Google Cloud password in the Lab Details panel.
After a few moments, the Console opens in this tab.
In this task, you’ll help Meredith create a table that stores data about the number of orders and the time each order was added to a customer's cart.
If you get a pop up, click Ok.
Find a dataset named thelook_gcda, and click the drop-down arrow next to it.
Select the table named shopping_cart.
Click the Query button to open the Query Editor, and select In a new tab. A pre populated Untitled tab opens.
Replace the pre populated data by copying and pasting the following query in the Query Editor:
This query displays the first 10 rows of the shopping_cart
table.
The query results should display in table format below the Query Editor.
Data is continuously added to this table, so queries can be run on the latest data as soon as it is available. This changes the output of the query.
shopping_cart
table in the Explorer area, then click on the Details tab in the query area. Examine the shopping_cart
table properties in BigQuery. The section Streaming buffer statistics is an indication that data is being streamed into the table.orders
table. Click on the orders
table in the Explorer area, then click on the Details tab in the query area.This query displays the number of products from the 'Jeans' product category that were added to the shopping_cart
table each minute within the last hour.
This query is an example of a data source for a dashboard. Because the query is based on the shopping_cart
table, which is constantly updated through a streaming source, the dashboard will get fresh data every time the query runs.
Click Check my progress to verify that you have completed this task correctly.
Now, let’s learn more about this query and its main components.
The query joins the thelook_gcda.shopping_cart and thelook_gcda.products tables using the product_id column. This ensures that only rows that match the product id on both tables are included in the results.
An INNER JOIN
is used in this query. A shopping cart row with a value in the product_id that does not exist in the products table will not be included in the results, and therefore will not be counted.
Because Meredith asked for minute-by-minute information, the query formats the created_at
column as HH:MM, for example 10:15 for a quarter past ten in the morning.
SELECT
portion of the query:Formatting dates and times is a very common task for a cloud data analyst. You can find more information and examples on how to use the built-in FORMAT_TIMESTAMP
function in BigQuery’s Timestamp functions documentation.
WHERE
statement in this query.The WHERE
statement in this query filters shopping cart items by Product Category and by the time it was created:
Since Meredith is only interested in Jeans, the following part of the query removes products not in that category with the first part of the WHERE
clause:
The shopping cart table is very active and can be quite large, so Meredith also asked to see data only for the last hour. To accomplish this, the WHERE
clause filters rows using the created_at
column and uses two built in functions, one to get the date and time when the query runs (current_timestamp
) and one to calculate the time for one hour ago (timestamp_add
). The query will only return rows that were created before now minus one hour. Now is determined by the current_timestamp
function. Minus one hour is done by subtracting 1 hour from the current time.
For more information, refer to the Timestamp functions documentation.
The following line groups the results by minute and adds up the quantities added for each minute because we used the SUM()
function in our SELECT
statement. Since you need to list the Product Category in the results, you must also group by Product Category.
Notice that you are using the alias of the added_at_minute
column that was defined in the SELECT
portion of the query.
ORDER BY
clause in the query:The ORDER BY
clause populates the results with the most recent available data, by the minute, at the top.
Great work!
You were able to locate the information Meredith needed to stay on top of customer buying trends, which will help them make timely business decisions about pricing and stock.
You did this by first determining if the table was using batch or streaming processing, then exploring and running a query to populate a table with data about the number of jeans that were added to shopping carts in the last hour, by minute.
You also learned about the importance of timely data access and the difference between batch processing and streaming processing.
By observing data loading in BigQuery tables using both methods, you developed the skills to choose the appropriate approach for data processing.
Before you end the lab, make sure you’re satisfied that you’ve completed all the tasks. When you're ready, click End Lab and then click Submit.
Ending the lab will remove your access to the lab environment, and you won’t be able to access the work you've completed in it again.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
This content is not currently available
We will notify you via email when it becomes available
Great!
We will contact you via email if it becomes available
One lab at a time
Confirm to end all existing labs and start this one