Interactive Data Exploration with Vertex AI Workbench

1 hour 30 minutes 5 Credits

GSP203

Overview

In this lab you will learn the process of analyzing a dataset stored in BigQuery using Vertex AI Workbench to perform queries and present the data using various statistical plotting techniques. The analysis will help you discover patterns in the data that will allow you to predict probable arrival time delays given the initial flight details and actual departure time.

Vertex AI Workbench is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models on Google Cloud. It runs on Compute Engine and connects easily to multiple cloud services so you can focus on your data science tasks.

BigQuery is a RESTful web service that enables interactive analysis of massive datasets, working in conjunction with Cloud Storage.

The dataset used in this lab provides historical information about internal flights in the United States, retrieved from the US Bureau of Transportation Statistics website. This dataset can be used to demonstrate a wide range of data science concepts and techniques.

Objectives

  • Create a Notebook instance

  • Use Pandas and Seaborn to visualize BigQuery data

  • Iterate through the analysis process to refine the relationship between departure delays and arrival delays

Setup and requirements

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
Note: Use an Incognito or private browser window to run this lab. This prevents conflicts between your personal account and the Student account, which might otherwise cause extra charges to be incurred on your personal account.
  • Time to complete the lab---remember, once you start, you cannot pause a lab.
Note: If you already have your own personal Google Cloud account or project, do not use it for this lab to avoid extra charges to your account.

How to start your lab and sign in to the Google Cloud Console

  1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab Details panel with the following:

    • The Open Google Console button
    • Time remaining
    • The temporary credentials that you must use for this lab
    • Other information, if needed, to step through this lab
  2. Click Open Google Console. The lab spins up resources, and then opens another tab that shows the Sign in page.

    Tip: Arrange the tabs in separate windows, side-by-side.

    Note: If you see the Choose an account dialog, click Use Another Account.
  3. If necessary, copy the Username from the Lab Details panel and paste it into the Sign in dialog. Click Next.

  4. Copy the Password from the Lab Details panel and paste it into the Welcome dialog. Click Next.

    Important: You must use the credentials from the left panel. Do not use your Google Cloud Skills Boost credentials. Note: Using your own Google Cloud account for this lab may incur extra charges.
  5. Click through the subsequent pages:

    • Accept the terms and conditions.
    • Do not add recovery options or two-factor authentication (because this is a temporary account).
    • Do not sign up for free trials.

After a few moments, the Cloud Console opens in this tab.

Note: You can view the menu with a list of Google Cloud Products and Services by clicking the Navigation menu at the top-left.

Task 1. Exploratory data analysis in Vertex AI Workbench

Launch Vertex AI Workbench Notebook

To create and launch a Vertex AI Workbench notebook:

  1. In the Navigation menu, click Vertex AI > Workbench.

  2. On the Workbench page, click New Notebook.

  3. In the Customize instance menu, select TensorFlow Enterprise and choose the latest version of TensorFlow Enterprise 2.x (with LTS) > Without GPUs.

  4. Name the notebook.

  5. Set Region to us-east1 and Zone to any zone within the designated region.

  6. In the Notebook properties, click the pencil icon to edit the instance properties.

  7. Scroll down to Machine configuration and select e2-standard-2 for Machine type.

  8. Leave the remaining fields at their default and click Create.

After a few minutes, the Workbench page lists your instance, followed by Open JupyterLab.

  9. Click Open JupyterLab to open JupyterLab in a new tab.

Click Check my progress below to check your lab progress. Create the AI Platform notebook instance

  1. In JupyterLab, start a new notebook by clicking on Notebook > Python 3.

  2. Insert the following code into a cell to import the necessary modules and initialize a BigQuery client. The BigQuery client is used to send and receive messages from the BigQuery API:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from google.cloud import bigquery

bq = bigquery.Client()
  3. Press Shift + Enter to run the cell.

Jupyter magic

Jupyter magics provide a mechanism for running a wide variety of languages within a notebook, and you can add more. The BigQuery Python package adds a few magics that make interacting with Google Cloud convenient.

For example, you can run a query on your BigQuery table using the %%bigquery magic environment that comes with Vertex AI Workbench.

  1. Enter the following into the new cell of the notebook:

%%bigquery
SELECT COUNTIF(arr_delay >= 15)/COUNT(arr_delay) AS frac_delayed
FROM dsongcp.flights_tzcorr
  2. Press Shift + Enter to run the cell.

Calls to BigQuery

You can also query BigQuery directly with the Python library.

  1. Enter the following code into the next empty cell:

sql = """ SELECT COUNTIF(arr_delay >= 15)/COUNT(arr_delay) AS frac_delayed FROM dsongcp.flights_tzcorr """ bq.query(sql).to_dataframe()
  2. Press Shift + Enter to run the cell.

Basic statistics

  1. For a high-level view, draw a probability density function (PDF) of the different arrival delays. In a notebook, you can assign the output of a cell magic query to a variable, in this case df:

%%bigquery df
SELECT ARR_DELAY, DEP_DELAY
FROM dsongcp.flights_tzcorr
WHERE DEP_DELAY >= 10
  2. Press Shift + Enter to run the cell, then enter the following into a new cell:

type(df)
df.describe()
  3. Press Shift + Enter to run the cell.

Results are similar to the following.

The output for the DataFrame describe function, displaying the ARR_DELAY and DEP_DELAY information

The DataFrame.describe() function reports the core statistics for the DataFrame returned by the query.
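If you only need a few of those statistics, you can also compute them directly on the dataframe columns. A minimal sketch (not a lab step), assuming the df dataframe populated by the %%bigquery df cell above:

# A sketch, not a lab step: individual statistics on the same dataframe,
# assuming df was populated by the %%bigquery df cell above.
print(df['ARR_DELAY'].mean())                       # average arrival delay
print(df['ARR_DELAY'].std())                        # its standard deviation
print(df['ARR_DELAY'].quantile([0.25, 0.5, 0.75]))  # quartiles, as in describe()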

Plotting distributions

Beyond just the statistical capabilities of Pandas, you can also pass Pandas dataframes and underlying numpy arrays to plotting libraries like seaborn.

  1. Click the + Insert cell icon to create a new code cell.

  2. Enter the following Python code in the new cell to display a Seaborn violin plot. This shows the distribution of arrival delays for flights that depart more than 10 minutes late:

sns.set_style("whitegrid")
ax = sns.violinplot(data=df, x='ARR_DELAY', inner='box', orient='h')
ax.axes.set_xlim(-50, 300);
  3. Click Run or Shift+Enter to run the cell.

The Seaborn violin plot displaying the arrival delay for flights that depart more than 10 minutes late

A violin plot is a kernel density plot, which is an estimate of the PDF. Notice that even though the distribution peaks around 10 minutes, deviations around this peak are skewed toward larger delays rather than smaller ones. Also, there is only one peak: the distribution is unimodal.
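If you want to see that density estimate on its own, Seaborn can draw it directly. A minimal sketch (not a lab step), assuming the same df and a recent Seaborn version:

# A sketch, not a lab step: the kernel density estimate by itself --
# the same curve the violin plot mirrors about its axis.
ax = sns.kdeplot(data=df, x='ARR_DELAY')
ax.set_xlim(-50, 300)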

Run a query comparing the violin plot for flights that depart more than 10 minutes late with the violin plot for flights that depart less than 10 minutes late. Zoom in on the x-axis close to your 15-minute threshold.

  1. Enter the following Python code in the new cell:

%%bigquery df
SELECT ARR_DELAY, DEP_DELAY
FROM dsongcp.flights_tzcorr
  2. Create a new column in the Pandas dataframe that is either True or False, depending on whether the flight departed less than 10 minutes late:

df['ontime'] = df['DEP_DELAY'] < 10
df[df['ARR_DELAY'] > 0].head()
  3. Click Run or Shift+Enter to run the cell.

  4. Use Seaborn to graph this new Pandas dataframe:

sns.set_style("whitegrid")
ax = sns.violinplot(data=df, x='ARR_DELAY', y='ontime', inner='box', orient='h')
ax.set_xlim(-50, 200);
  5. Click Run or Shift+Enter to run the cell.

This violin plot includes the ontime column. It illustrates how flights that depart 10 or more minutes late differ from flights that depart less than 10 minutes late.

The violin plot with True (bottom) and False (top) values for 'ontime' on the Y-axis, and arrival delay plotted on the x-axis

The angular peak of the top violin plot indicates that the seaborn default smoothing was too coarse. Fix this by passing in a gridsize parameter.

  1. Enter the following Python code into the next empty cell to pass in a gridsize parameter:

ax = sns.violinplot(data=df, x='ARR_DELAY', y='ontime', inner='box',
                    orient='h', gridsize=1000)
ax.set_xlim(-50, 200)
  2. Click Run or Shift+Enter to run the cell.

Notice the computation takes longer.

The violin plot with True (bottom) and False (top) values for 'ontime' on the Y-axis, and arrival delay plotted on the x-axis

This violin plot shows that the 10-minute departure delay threshold divides the dataset into two quite different distributions, as shown by the box plots in the center of the two violins. Flights that depart 10 or more minutes late are distributed asymmetrically toward longer delay times. Flights that depart less than 10 minutes late are centered around the peak arrival time.

The long tail is a red flag. Further analyze the data to better understand what is really happening. In the next task, you use BigQuery to take a closer look at the range of delay times.

Click Check my progress below to check your lab progress. Run Queries in Jupyter notebook

Task 2. Quality control

Although you could continue writing queries in the notebook, the BigQuery console provides immediate feedback on syntax and logic errors.

  1. Switch back to the Cloud Console browser tab.

  2. In the Navigation menu, click BigQuery.

  3. Paste the following query into the Query Editor box, and then click RUN. This query computes the average arrival delay for every unique departure delay and returns more than 1,000 different values. Departure delays are recorded as integers (minutes), so this is a surprisingly large number of distinct values:

SELECT AVG(ARR_DELAY) AS arrival_delay
FROM `dsongcp.flights_tzcorr`
GROUP BY DEP_DELAY
ORDER BY DEP_DELAY

Are the delays that span up to 20 hours significant?

  4. To refine the query and add some more data, paste the following query into the Query Editor box and click RUN:

#standardsql
SELECT
  DEP_DELAY,
  AVG(ARR_DELAY) AS arrival_delay,
  COUNT(ARR_DELAY) AS numflights
FROM `dsongcp.flights_tzcorr`
GROUP BY DEP_DELAY
ORDER BY DEP_DELAY

The resulting table explains what’s going on. The first few rows have only a few flights each:

Table showing departure delay, arrival delay and numflights

  5. If you page ahead in the data, you see that departure delay values of a few minutes each have hundreds of thousands of flights:

Table showing departure delay, arrival delay and numflights

However, an average computed from only a handful of flights is statistically insignificant. There is no reason to complicate the statistical modeling with insignificant values.

To eliminate these values from the dataset, filter the dataset so that you view only delay values within a particular range. For a dataset of a few hundred thousand examples, use a standard deviation range to focus on the values you're interested in.

  6. To apply a filter based on values within three times the standard deviation of the average value for departure delays, paste the following query into the Query Editor box and click RUN:

SELECT
  AVG(DEP_DELAY) - 3*STDDEV(DEP_DELAY) AS filtermin,
  AVG(DEP_DELAY) + 3*STDDEV(DEP_DELAY) AS filtermax
FROM `dsongcp.flights_tzcorr`

This yields a range of -102 to +121. The query is still making some assumptions about the data values themselves and extreme outlier values can distort these calculations, even with hundreds of thousands of examples.
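The same bounds can be computed client-side with Pandas. A minimal sketch (not a lab step), assuming the notebook's df dataframe contains the full DEP_DELAY column as queried earlier:

# A sketch, not a lab step: client-side equivalent of the 3-sigma bounds,
# assuming df contains the full DEP_DELAY column.
mean, std = df['DEP_DELAY'].mean(), df['DEP_DELAY'].std()
print(mean - 3 * std, mean + 3 * std)  # roughly -102 and +121 on this dataset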

  7. To avoid such an assumption, query using percentiles instead, omitting the top and bottom 5% of values. Paste the following query into the Query Editor box and click RUN:

SELECT APPROX_QUANTILES(DEP_DELAY, 20)
FROM `dsongcp.flights_tzcorr`

This leads you to retain values in the range [−9, 66]. Regardless of how we find the range, though, the range is based on an assumption that unusually high and low values are outliers.
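Note that APPROX_QUANTILES(DEP_DELAY, 20) returns 21 values (the 0th, 5th, ..., 100th percentiles), so dropping the extremes means keeping the second and second-to-last elements: the 5th and 95th percentiles. You can sanity-check those bounds client-side; a minimal sketch (not a lab step), again assuming df holds the DEP_DELAY column and the numpy import from earlier:

# A sketch, not a lab step: the 5th and 95th percentiles computed with
# numpy, which should roughly match the second and second-to-last
# elements of the APPROX_QUANTILES array.
print(np.percentile(df['DEP_DELAY'].dropna(), [5, 95]))  # approx. [-9, 66]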

On datasets that number in the hundreds of thousands to millions of examples, thresholding your input data based on value is dangerous, because you may be throwing out valuable nuance. If there are sufficient examples of a delay of 150 minutes, it is worth modeling such a value regardless of how far off the mean it is.

To address this, use a filter that ensures that if there is a statistically significant number of flights with a specific delay, it's included, regardless of the actual delay value. For datasets of this size, you can use a filter that selects only values that have 370 or more examples. 370 is the sample size required to cover three standard deviations in a dataset with a normal distribution, so a filter that selects only data points with at least 370 examples ensures that you have enough samples at each data point to satisfy the three-sigma rule. This gives you enough data to make reasonably robust predictions about each individual delay value.
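The 370 comes from the three-sigma rule: about 99.73% of a normal distribution lies within three standard deviations of the mean, so roughly 1 sample in 370 falls outside. A quick check (not a lab step), assuming scipy is available in the notebook environment:

# A sketch, not a lab step: where the 370 comes from.
from scipy import stats
frac_outside = 2 * stats.norm.sf(3)  # two-tailed probability beyond 3 sigma
print(frac_outside)                  # ~0.0027, i.e. 1 - 0.9973
print(1 / frac_outside)              # ~370 samples per expected outlier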

  8. Paste the following query into the Query Editor box and click RUN:

#standardsql
SELECT
  DEP_DELAY,
  AVG(ARR_DELAY) AS arrival_delay,
  STDDEV(ARR_DELAY) AS stddev_arrival_delay,
  COUNT(ARR_DELAY) AS numflights
FROM `dsongcp.flights_tzcorr`
GROUP BY DEP_DELAY
HAVING numflights > 370
ORDER BY DEP_DELAY

Click Check my progress below to check your lab progress. Run Queries in BigQuery

Filtering data on occurrence frequency

How different would the results be if you chose a different threshold? Examine the number of flights removed by different quality-control thresholds, and how each threshold changes the slope of a linear model relating arrival delay to departure delay.

Return to Vertex AI Workbench to visualize this thresholded data.

  1. Return to the untitled notebook in the Vertex AI Workbench browser tab.

  2. Click the + Insert cell icon to create a new code cell, then enter the following:

%%bigquery df
DECLARE total_flights INT64;
SET total_flights = (
  SELECT COUNT(*) FROM dsongcp.flights_tzcorr
);

CREATE TEMPORARY FUNCTION linear_fit(NUM_TOTAL INT64, THRESH INT64)
RETURNS STRUCT<thresh INT64, num_removed INT64, lm FLOAT64>
AS ((
  SELECT AS STRUCT
    THRESH,
    (NUM_TOTAL - SUM(numflights)) AS num_removed,
    ROUND(AVG(arrival_delay * numflights) / AVG(dep_delay * numflights), 2) AS lm
  FROM (
    SELECT
      DEP_DELAY,
      AVG(ARR_DELAY) AS arrival_delay,
      STDDEV(ARR_DELAY) AS stddev_arrival_delay,
      COUNT(ARR_DELAY) AS numflights
    FROM dsongcp.flights_tzcorr
    GROUP BY DEP_DELAY
  )
  WHERE numflights > THRESH
));

SELECT linear_fit(total_flights, 1000) stats
UNION ALL SELECT linear_fit(total_flights, 500)
UNION ALL SELECT linear_fit(total_flights, 370)
UNION ALL SELECT linear_fit(total_flights, 300)
UNION ALL SELECT linear_fit(total_flights, 200)
UNION ALL SELECT linear_fit(total_flights, 100)
UNION ALL SELECT linear_fit(total_flights, 22)
UNION ALL SELECT linear_fit(total_flights, 10)
UNION ALL SELECT linear_fit(total_flights, 5)
ORDER BY stats.thresh DESC
  3. Enter the following into the new cell of the notebook:

df['stats'].map(lambda x: (x['thresh'], x['num_removed'], x['lm']))
  4. Click Run or Shift+Enter to run the cell.

As you remove fewer and fewer flights by decreasing the threshold, the slope changes slowly. Thus, the differences between the models created with thresholds of 300, 370, or 500 are quite minor. However, those models are quite different from the ones created with a threshold of 5 or 10. The order of magnitude of the threshold matters, but perhaps not its exact value.

Task 3. Distribution of arrival delays

  1. Enter the following Python code into a new notebook cell to run the query with the threshold of 370 flights and store the result in a Pandas dataframe named depdelay:

%%bigquery depdelay
SELECT
  DEP_DELAY,
  AVG(ARR_DELAY) AS arrival_delay,
  STDDEV(ARR_DELAY) AS stddev_arrival_delay,
  COUNT(ARR_DELAY) AS numflights
FROM dsongcp.flights_tzcorr
GROUP BY DEP_DELAY
HAVING numflights > 370
ORDER BY DEP_DELAY
  2. Click Run or Shift+Enter to run the cell.

  3. Enter the following into the new notebook cell:

depdelay[:5]

This provides the first five rows of the Pandas dataframe.

Table with first five rows of the Pandas dataframe below the column headings: DEP_DELAY, arrival_delay, stddev_arrival_delay, and numflights

Click Check my progress below to check your lab progress. Execute query in Jupyter notebook to visualize thresholded data

Plot this data to see what insight you can gain. Even though you've been using seaborn, Pandas has built-in plotting functions.

  1. Enter the following Python code to use Pandas itself to plot the average arrival delay and its standard deviation for each departure delay:

ax = depdelay.plot(kind='line', x='DEP_DELAY', y='arrival_delay', yerr='stddev_arrival_delay');
  2. Click Run or Shift+Enter to run the cell.

A line graph showing the linear relationship between departure delay and arrival delay

It looks like the relationship between departure delay and arrival delay is linear. The width of the standard deviation of the arrival delay is also constant, on the order of 10 minutes.
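If you want a number for that slope, a weighted least-squares fit is one way to get it. A minimal sketch (not a lab step) using numpy; the choice of np.polyfit and sqrt-of-counts weighting is illustrative, and not the lab's own lm calculation:

# A sketch, not a lab step: fit arrival_delay ~ DEP_DELAY, weighting each
# point by its flight count. np.polyfit applies weights to residuals
# before squaring, so use sqrt(numflights) to weight by counts.
slope, intercept = np.polyfit(depdelay['DEP_DELAY'],
                              depdelay['arrival_delay'],
                              deg=1,
                              w=np.sqrt(depdelay['numflights']))
print(slope, intercept)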

Task 4. Applying a probabilistic decision threshold

The purpose of all of this analysis has been to develop a model that allows you to predict whether a flight that has departed will arrive late. Specifically, a model that predicts whether a given flight has more than a 30% probability of arriving 15 or more minutes late.

Assume, in this case, the arrival delay distributions are normal. Based on that assumption you can use a complementary cumulative distribution table to establish the Z value at which the probability is 30%. This turns out to be 0.52.

The complementary cumulative distribution table
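You can verify that Z value without a lookup table. A quick check (not a lab step) using scipy's inverse survival function, assuming scipy is available in the notebook environment:

# A sketch, not a lab step: the Z at which the complementary cumulative
# probability P(Z' > z) equals 30%.
from scipy import stats
print(stats.norm.isf(0.30))  # ~0.52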

  1. Enter the following Python code to use Pandas to plot the intersection of the 15 minute delay line with the 30% arrival delay probability line:

Z_30 = 0.52
depdelay['arr_delay_30'] = (Z_30 * depdelay['stddev_arrival_delay']) \
    + depdelay['arrival_delay']
ax = plt.axes()
depdelay.plot(kind='line', x='DEP_DELAY', y='arr_delay_30',
              ax=ax, ylim=(0,30), xlim=(0,30), legend=False)
ax.set_xlabel('Departure Delay (minutes)')
ax.set_ylabel('> 30% prob of this Arrival Delay (minutes)');
x = np.arange(0, 30)
y = np.ones_like(x) * 15
ax.plot(x, y, color='r');
y = np.arange(0, 30)
x = np.ones_like(y) * 13
ax.plot(x, y, 'g.');
  2. Click Run or Shift+Enter to run the cell.
Note: For this lab, ignore the deprecation warnings.

A line graph showing the departure delay threshold, with departure delay in minutes plotted on the x-axis and the arrival delay (in minutes) that has a greater than 30 percent probability plotted on the y-axis

Your decision criterion translates to a departure delay of 13 minutes. If the departure delay is 13 minutes or more, the aircraft is more than 30% likely to be delayed by 15 minutes or more.
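That criterion is simple enough to express as a function. A hypothetical helper (the function name and hard-coded threshold are illustrative, not part of the lab):

# A hypothetical helper, not a lab step: encode the decision rule
# derived above (13-minute departure-delay threshold).
def likely_to_arrive_late(dep_delay_minutes: float) -> bool:
    """True if the flight has >30% probability of arriving 15+ minutes late."""
    return dep_delay_minutes >= 13

print(likely_to_arrive_late(10))  # False
print(likely_to_arrive_late(15))  # True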

Task 5. Deriving the probability distribution function from the data itself

The analysis in the previous section used the number 0.52, which assumes that the distribution of flights at each departure delay is normal. What if you drop that assumption? You then must empirically determine the 30% likelihood at each departure delay.

Happily, we do have at least 370 flights at each departure delay (the joys of working with large datasets!), so we can compute the 30th percentile for each departure delay.

  1. Copy and paste the following into a cell:

%%bigquery depdelay
SELECT
  DEP_DELAY,
  APPROX_QUANTILES(ARR_DELAY, 101)[OFFSET(70)] AS arrival_delay,
  COUNT(ARR_DELAY) AS numflights
FROM dsongcp.flights_tzcorr
GROUP BY DEP_DELAY
HAVING numflights > 370
ORDER BY DEP_DELAY
  2. Click Run or Shift+Enter to run the cell.

  3. Copy and paste the following into a new cell:

ax = plt.axes()
depdelay.plot(kind='line', x='DEP_DELAY', y='arrival_delay',
              ax=ax, ylim=(0,30), xlim=(0,30), legend=False)
ax.set_xlabel('Departure Delay (minutes)')
ax.set_ylabel('> 30% prob of this Arrival Delay (minutes)');
x = np.arange(0, 30)
y = np.ones_like(x) * 15
ax.plot(x, y, color='r');
y = np.arange(0, 30)
x = np.ones_like(y) * 16
ax.plot(x, y, 'g.');

A line graph showing the decision threshold, with departure delay in minutes plotted on the x-axis and the arrival delay (in minutes) that has a greater than 30 percent probability plotted on the y-axis

From the chart, your decision threshold, without the assumption of normal distribution, is 16 minutes. If a flight is delayed by more than 16 minutes, there is a greater than 30% likelihood that the flight will arrive more than 15 minutes late.

Click Check my progress below to check your lab progress. Execute query in Jupyter notebook to derive the probability distribution function from the data itself

Test your knowledge

Test your knowledge about Google Cloud by taking our quiz.

Congratulations!

Now you know how to use Vertex AI Workbench to query and visualize data from BigQuery.

Finish your quest

This self-paced lab is part of the Data Science on Google Cloud quest. A quest is a series of related labs that form a learning path. Completing this quest earns you a badge to recognize your achievement. You can make your badge or badges public and link to them in your online resume or social media account. Enroll in this quest and get immediate completion credit. Refer to the Google Cloud Skills Boost catalog for all available quests.

Take your next lab

Finish your quest with Evaluating a Data Model, or if you haven't taken the labs in order, check out the other labs in this quest.

Next steps / learn more

Google Cloud training and certification

...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.

Manual Last Updated September 21, 2022

Lab Last Tested March 15, 2022

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.