arrow_back

Using BigQuery to do Analysis

Sign in Join
Get access to 700+ labs and courses

Using BigQuery to do Analysis

Lab 45 minutes universal_currency_alt 5 Credits show_chart Introductory
info This lab may incorporate AI tools to support your learning.
Get access to 700+ labs and courses

Overview

In this lab you analyze 2 different public datasets, run queries on them, separately and then combined, to derive interesting insights.

What you'll learn

In this lab, you will:

  • Carry out interactive queries on the BigQuery console.
  • Combine and run analytics on multiple datasets.

Prerequisites

This is a fundamental level lab and assumes some experience with BigQuery and SQL.

Introduction

This lab uses two public datasets in BigQuery: weather data from the US National Oceanic and Atmospheric Administration (NOAA), and bicycle rental data from New York City.

You will encounter, for the first time, several aspects of Google Cloud Platform that are of great benefit to scientists:

  1. Serverless. No need to download data to your machine in order to work with it - the dataset will remain on the cloud.
  2. Ease of use. Run ad-hoc SQL queries on your dataset without having to prepare the data, like indexes, beforehand. This is invaluable for data exploration.
  3. Scale. Carry out data exploration on extremely large datasets interactively. You don't need to sample the data in order to work with it in a timely manner.
  4. Shareability. You will be able to run queries on data from different datasets without any issues. BigQuery is a convenient way to share datasets. Of course, you can also keep your data private, or share them only with specific persons -- not all data need to be public.

The end-result is that you will find if there are lesser bike rentals on rainy days.

Setup and requirements

Lab setup

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab Details panel with the following:

    • The Open Google Cloud console button
    • Time remaining
    • The temporary credentials that you must use for this lab
    • Other information, if needed, to step through this lab
  2. Click Open Google Cloud console (or right-click and select Open Link in Incognito Window if you are running the Chrome browser).

    The lab spins up resources, and then opens another tab that shows the Sign in page.

    Tip: Arrange the tabs in separate windows, side-by-side.

    Note: If you see the Choose an account dialog, click Use Another Account.
  3. If necessary, copy the Username below and paste it into the Sign in dialog.

    {{{user_0.username | "Username"}}}

    You can also find the Username in the Lab Details panel.

  4. Click Next.

  5. Copy the Password below and paste it into the Welcome dialog.

    {{{user_0.password | "Password"}}}

    You can also find the Password in the Lab Details panel.

  6. Click Next.

    Important: You must use the credentials the lab provides you. Do not use your Google Cloud account credentials. Note: Using your own Google Cloud account for this lab may incur extra charges.
  7. Click through the subsequent pages:

    • Accept the terms and conditions.
    • Do not add recovery options or two-factor authentication (because this is a temporary account).
    • Do not sign up for free trials.

After a few moments, the Google Cloud console opens in this tab.

Note: To view a menu with a list of Google Cloud products and services, click the Navigation menu at the top-left, or type the service or product name in the Search field.

Task 1. Explore bicycle rental data

In this task, you explore the public bigquery-public-data dataset, specifically the new_york_citibike dataset, by examining the schema of the citibike_trips table and running queries to analyze trip duration and calculate the total distance traveled by each bicycle.

  1. In the Google Cloud console, in the Navigation menu () click BigQuery.

The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and lists UI updates.

  1. Click Done.

  2. In the left pane, click + Add , then click Star a project by name, next in the pop-up window type bigquery-public-data, finally click Star.

  1. In the BigQuery console, you see two projects in the left pane, one named your Qwiklabs project ID, and one named bigquery-public-data.

  2. In the left pane of the BigQuery console, select bigquery-public-data > new_york_citibike > citibike_trips table.

  3. In the Table (citibike_trips) window, click the Schema tab.

  4. Examine the column names and the datatypes.

  5. Click the Blue + button to compose a new query.

Enter the following query:

SELECT MIN(start_station_name) AS start_station_name, MIN(end_station_name) AS end_station_name, APPROX_QUANTILES(tripduration, 10)[OFFSET (5)] AS typical_duration, COUNT(tripduration) AS num_trips FROM `bigquery-public-data.new_york_citibike.citibike_trips` WHERE start_station_id != end_station_id GROUP BY start_station_id, end_station_id ORDER BY num_trips DESC LIMIT 10
  1. Click Run. Look at the result and try to determine what this query does ?

    Hint: Typical duration for the 10 most common one-way rentals
  2. Next, run the query below to find another interesting fact: total distance traveled by each bicycle in the dataset. Note that the query limits the results to only top 5.

WITH trip_distance AS ( SELECT bikeid, ST_Distance(ST_GeogPoint(s.longitude, s.latitude), ST_GeogPoint(e.longitude, e.latitude)) AS distance FROM `bigquery-public-data.new_york_citibike.citibike_trips`, `bigquery-public-data.new_york_citibike.citibike_stations` as s, `bigquery-public-data.new_york_citibike.citibike_stations` as e WHERE start_station_name = s.name AND end_station_name = e.name) SELECT bikeid, SUM(distance)/1000 AS total_distance FROM trip_distance GROUP BY bikeid ORDER BY total_distance DESC LIMIT 5 Note: For this query, we also used the other table in the dataset called citibike_stations to get bicycle station information.

Task 2. Explore the weather dataset

In this task, you explore the ghcn_d dataset, by previewing the ghcnd_2015 table and running a query to retrieve rainfall data from a New York weather station.

  1. In the left pane of the BigQuery Console, select the newly added bigquery-public-data project and select ghcn_d > ghcnd_2015.

  2. Then click on the Preview tab. Your console should resemble the following:

Examine the columns and some of the data values.

  1. Click the Blue + button to compose a new query and enter the following:
SELECT wx.date, wx.value/10.0 AS prcp FROM `bigquery-public-data.ghcn_d.ghcnd_2015` AS wx WHERE id = 'USW00094728' AND qflag IS NULL AND element = 'PRCP' ORDER BY wx.date
  1. Click Run.

This query will return rainfall (in mm) for all days in 2015 from a weather station in New York whose id is provided in the query (the station corresponds to NEW YORK CNTRL PK TWR ).

Task 3. Find correlation between rain and bicycle rentals

In this task, you analyze the correlation between rain and bicycle rentals by joining the new_york_citibike.citibike_trips dataset with the ghcn_d.ghcnd_2015 weather dataset, determining the average number of bicycle trips on rainy versus non-rainy days.

How about joining the bicycle rentals data against weather data to learn whether there are fewer bicycle rentals on rainy days?

  1. Click the Blue + button to compose a new query and enter the following:
WITH bicycle_rentals AS ( SELECT COUNT(starttime) as num_trips, EXTRACT(DATE from starttime) as trip_date FROM `bigquery-public-data.new_york_citibike.citibike_trips` GROUP BY trip_date ), rainy_days AS ( SELECT date, (MAX(prcp) > 5) AS rainy FROM ( SELECT wx.date AS date, IF (wx.element = 'PRCP', wx.value/10, NULL) AS prcp FROM `bigquery-public-data.ghcn_d.ghcnd_2015` AS wx WHERE wx.id = 'USW00094728' ) GROUP BY date ) SELECT ROUND(AVG(bk.num_trips)) AS num_trips, wx.rainy FROM bicycle_rentals AS bk JOIN rainy_days AS wx ON wx.date = bk.trip_date GROUP BY wx.rainy
  1. Click Run.

Now you can see the results of joining the bicycle rental dataset with a weather dataset that comes from a completely different source:

Running the query yields that, yes, New Yorkers ride the bicycle 47% fewer times when it rains.

Summary

In this lab you did ad-hoc queries on two datasets. You were able to query the data without setting up any clusters, creating any indexes, etc. You were also able to mash up the two datasets and get some interesting insights. All without ever leaving your browser!

Congratulations!

You learned how to run some very interesting queries on BigQuery!

End your lab

When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.

You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.

The number of stars indicates the following:

  • 1 star = Very dissatisfied
  • 2 stars = Dissatisfied
  • 3 stars = Neutral
  • 4 stars = Satisfied
  • 5 stars = Very satisfied

You can close the dialog box if you don't want to provide feedback.

For feedback, suggestions, or corrections, please use the Support tab.

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.

Before you begin

  1. Labs create a Google Cloud project and resources for a fixed time
  2. Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
  3. On the top left of your screen, click Start lab to begin

This content is not currently available

We will notify you via email when it becomes available

Great!

We will contact you via email if it becomes available

One lab at a time

Confirm to end all existing labs and start this one

Use private browsing to run the lab

Use an Incognito or private browser window to run this lab. This prevents any conflicts between your personal account and the Student account, which may cause extra charges incurred to your personal account.