
Machine Learning with BigQuery ML


Machine Learning with BigQuery ML

1 hour 30 minutes 9 Credits

GSP202


Overview

BigQuery is a serverless, highly scalable data warehouse. It is also an excellent machine learning platform. This combination is significant for organizations that handle privacy-sensitive or confidential data, because you can do machine learning without extracting the data from the warehouse. Not having extracts of data floating around in people's projects is important for security. The auditability that BigQuery provides out of the box also means you know exactly who created a model and which data was used to train it.

This lab uses a set of code samples and scripts developed for Data Science on the Google Cloud Platform, 2nd Edition from O'Reilly Media, Inc.

Objectives

  • Deploy a Vertex AI Workbench instance

  • Build a logistic regression model with BigQuery ML

  • Evaluate the model

  • Make predictions from the model

  • Create an AI model and then use the model to make predictions

  • Add additional airport information to improve the model

Setup and requirements

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
Note: Use an Incognito or private browser window to run this lab. This prevents any conflicts between your personal account and the Student account, which may cause extra charges to be incurred on your personal account.
  • Time to complete the lab. Remember, once you start, you cannot pause a lab.
Note: If you already have your own personal Google Cloud account or project, do not use it for this lab to avoid extra charges to your account.

How to start your lab and sign in to the Google Cloud Console

  1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab Details panel with the following:

    • The Open Google Console button
    • Time remaining
    • The temporary credentials that you must use for this lab
    • Other information, if needed, to step through this lab
  2. Click Open Google Console. The lab spins up resources, and then opens another tab that shows the Sign in page.

    Tip: Arrange the tabs in separate windows, side-by-side.

    Note: If you see the Choose an account dialog, click Use Another Account.
  3. If necessary, copy the Username from the Lab Details panel and paste it into the Sign in dialog. Click Next.

  4. Copy the Password from the Lab Details panel and paste it into the Welcome dialog. Click Next.

    Important: You must use the credentials from the left panel. Do not use your Google Cloud Skills Boost credentials. Note: Using your own Google Cloud account for this lab may incur extra charges.
  5. Click through the subsequent pages:

    • Accept the terms and conditions.
    • Do not add recovery options or two-factor authentication (because this is a temporary account).
    • Do not sign up for free trials.

After a few moments, the Cloud Console opens in this tab.

Note: You can view the menu with a list of Google Cloud Products and Services by clicking the Navigation menu at the top-left.

Task 1. Deploy a Vertex AI Workbench instance

  1. On the Navigation Menu, click More products, and then Vertex AI > Workbench.

  2. On the Notebook instances page, click New Notebook > TensorFlow Enterprise > TensorFlow Enterprise 2.6 (with LTS) > Without GPUs.

  3. In the New notebook dialog, name the Notebook bqml-notebook.

  4. For Region, select us-west1.

  5. For Zone, select a zone within the selected region.

  6. Leave all other fields with their default options, and click Create.

As the Notebook spins up, the bqml-notebook notebook is listed in the Workbench Notebook list. When it completes, Open JupyterLab appears inline with the Notebook name.

  7. Click Open JupyterLab.

When the Vertex AI Notebook instance starts, you will see a suggestion to start a build to add tensorflow_model_analysis and xai_tabular_widget.

  8. Click Build.

Your notebook is now set up.

  9. In the Launcher's Notebook section, click Python 3 to open a new notebook.

The Launcher page with the Python 3 tile highlighted in the Notebook section

To use a notebook, you enter commands into a cell. Run the commands in a cell by pressing Shift + Enter, or by clicking the triangle in the notebook top menu (Run the selected cells and advance).

Note: After pasting commands into a Jupyter notebook cell, always run the cell to execute the command and advance to the next cell.
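
Before you start, you can confirm that the notebook can reach BigQuery by running a trivial query. This is a minimal sanity check, assuming the google-cloud-bigquery client library (which provides the %%bigquery cell magic used throughout this lab) is preinstalled on the Workbench image:

%%bigquery
SELECT 1 AS sanity_check

If this returns a one-row result, the notebook is authenticated and ready for the BigQuery ML queries that follow.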

Task 2. Build a logistic regression model

Create the training dataset

The first step in BigQuery ML is to create the training dataset. We want the three features (departure delay, taxi-out time, and distance) and the label (ontime), so let's use SQL to craft the dataset just the way we want it.

  • Enter the following code into a new cell and then run the cell:
%%bigquery
SELECT
  IF(arr_delay < 15, 'ontime', 'late') AS ontime,
  dep_delay, taxi_out, distance,
  IF(is_train_day = 'True', False, True) AS is_eval_day
FROM dsongcp.flights_tzcorr f
JOIN dsongcp.trainday t
ON f.FL_DATE = t.FL_DATE
WHERE f.CANCELLED = False
AND f.DIVERTED = False
LIMIT 5
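
If you would like to explore the sample in pandas, the %%bigquery magic also accepts a variable name and saves the result set to a DataFrame. A small sketch (features_df is just an illustrative name):

%%bigquery features_df
SELECT
  IF(arr_delay < 15, 'ontime', 'late') AS ontime,
  dep_delay, taxi_out, distance
FROM dsongcp.flights_tzcorr f
JOIN dsongcp.trainday t
ON f.FL_DATE = t.FL_DATE
WHERE f.CANCELLED = False
AND f.DIVERTED = False
LIMIT 1000

You can then call features_df.describe() in a regular Python cell to see summary statistics for the features.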

Create the model

In this section, you create a logistic regression model where the label column is called ontime and the remaining columns are used as input features to the model. Here, you do not want to randomly split the flight data; you need to avoid having correlated flights from the same day split between the training and evaluation datasets.

That’s why when you created the training data in the previous section, you pre-split the data and created a table that specifies which days should be used for training and which days for evaluation. The resulting model parameters are stored in a BigQuery model object called arr_delay_lm in the dataset dsongcp.

You can explicitly tell BigQuery ML to use a column in the training dataset to split the data. Add that column to your SELECT statement as a Boolean value by joining the flight data against the table of prespecified training days.

  1. Create a logistic regression model in BigQuery ML by running the following query in a new cell:
%%bigquery
CREATE OR REPLACE MODEL dsongcp.arr_delay_lm
OPTIONS(input_label_cols=['ontime'],
        model_type='logistic_reg',
        data_split_method='custom',
        data_split_col='is_eval_day')
AS
SELECT
  IF(arr_delay < 15, 'ontime', 'late') AS ontime,
  dep_delay, taxi_out, distance,
  IF(is_train_day = 'True', False, True) AS is_eval_day
FROM dsongcp.flights_tzcorr f
JOIN dsongcp.trainday t
ON f.FL_DATE = t.FL_DATE
WHERE f.CANCELLED = False
AND f.DIVERTED = False

Note: It will take about 10 minutes to complete.

In the above query, BigQuery trains a logistic regression model and puts the weights into the model arr_delay_lm. Obtain the training error (called the loss) by querying the result of the special function ML.TRAINING_INFO:

  2. Enter the following into a new cell and run it to obtain the training error:
%%bigquery
SELECT * FROM ML.TRAINING_INFO(MODEL dsongcp.arr_delay_lm)
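
To focus on how the loss evolved during training, you can select just a few columns and order by iteration. This is a small variation on the query above; the column names (iteration, loss, eval_loss, duration_ms) follow the BigQuery ML documentation for ML.TRAINING_INFO:

%%bigquery
SELECT iteration, loss, eval_loss, duration_ms
FROM ML.TRAINING_INFO(MODEL dsongcp.arr_delay_lm)
ORDER BY iteration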

Task 3. Evaluate the model

Evaluate the model by calling ML.EVALUATE in SQL. BigQuery then evaluates the model on the withheld data (where is_train_day is 'False'), using a threshold of 0.7.

  • Run the query below in a new cell to evaluate the model:
%%bigquery
SELECT * FROM ML.EVALUATE(MODEL dsongcp.arr_delay_lm, (
  SELECT
    IF(arr_delay < 15, 'ontime', 'late') AS ontime,
    dep_delay, taxi_out, distance
  FROM dsongcp.flights_tzcorr f
  JOIN dsongcp.trainday t
  ON f.FL_DATE = t.FL_DATE
  WHERE f.CANCELLED = False
  AND f.DIVERTED = False
  AND is_train_day = 'False'
), STRUCT(0.7 AS threshold))

The resulting evaluation statistics are:

precision   recall      accuracy    f1_score    log_loss    roc_auc
0.964337    0.956535    0.935174    0.96042     0.167233    0.956276

The precision is how often the model is right when it reports a flight as being on time. The recall is the fraction of on-time flights that are correctly classified. The area under the Receiver Operating Characteristic curve (roc_auc) is a threshold-independent measure of classifier performance.
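
If you want to see the raw counts behind these ratios, you can also request a confusion matrix at the same threshold. A sketch using BigQuery ML's ML.CONFUSION_MATRIX function with the same evaluation subquery:

%%bigquery
SELECT * FROM ML.CONFUSION_MATRIX(MODEL dsongcp.arr_delay_lm, (
  SELECT
    IF(arr_delay < 15, 'ontime', 'late') AS ontime,
    dep_delay, taxi_out, distance
  FROM dsongcp.flights_tzcorr f
  JOIN dsongcp.trainday t
  ON f.FL_DATE = t.FL_DATE
  WHERE f.CANCELLED = False
  AND f.DIVERTED = False
  AND is_train_day = 'False'
), STRUCT(0.7 AS threshold))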

Task 4. Make predictions from the model

The ML.WEIGHTS function lets you see the underlying weights used by a model during prediction. It applies to linear and logistic regression models.

  1. Run the following query in a new cell to obtain the weights by calling ML.WEIGHTS:
%%bigquery
SELECT * FROM ML.WEIGHTS(MODEL dsongcp.arr_delay_lm)

The output is similar to the following:

A table displaying data for the processed_input, weight, and category_weights columns

However, there is usually no point in getting just the weights. Instead, what you want is the predicted value for some set of inputs.

  2. Run the query below in a new cell to carry out a prediction:
%%bigquery
SELECT * FROM ML.PREDICT(MODEL dsongcp.arr_delay_lm, (
  SELECT 12.0 AS dep_delay, 14.0 AS taxi_out, 1231 AS distance
))

Output:

predicted_ontime  predicted_ontime_probs                               dep_delay  taxi_out  distance
ontime            [{'label': 'ontime', 'prob': 0.850350772376717...}]  12.0       14.0      1231
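
You can also score several hypothetical flights in a single call by stacking rows with UNION ALL. The values below are illustrative:

%%bigquery
SELECT * FROM ML.PREDICT(MODEL dsongcp.arr_delay_lm, (
  SELECT 12.0 AS dep_delay, 14.0 AS taxi_out, 1231 AS distance
  UNION ALL
  SELECT 60.0, 20.0, 500
))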

While you can use ML.PREDICT to carry out predictions, each call is subject to the typical BigQuery latency of a second or so. Therefore, ML.PREDICT is typically used for batch predictions over large datasets. For online prediction (that is, exposing the prediction service as a microservice using REST), you can extract the model as a TensorFlow model and deploy it into Vertex AI.
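
As a sketch of that export step, BigQuery ML's EXPORT MODEL statement writes the trained model to Cloud Storage (for logistic regression, in TensorFlow SavedModel format). The bucket path below is a placeholder you would replace with your own:

%%bigquery
EXPORT MODEL dsongcp.arr_delay_lm
OPTIONS(URI = 'gs://your-bucket/arr_delay_lm')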

You can also run ML.PREDICT yourself when you want to compute a metric that ML.EVALUATE does not report, as in the next query.

  3. Run the query below in a new cell to compute the Root Mean Squared Error (RMSE):
%%bigquery
WITH predictions AS (
  SELECT * FROM ML.PREDICT(MODEL dsongcp.arr_delay_lm, (
    SELECT
      IF(arr_delay < 15, 'ontime', 'late') AS ontime,
      dep_delay, taxi_out, distance
    FROM dsongcp.flights_tzcorr f
    JOIN dsongcp.trainday t
    ON f.FL_DATE = t.FL_DATE
    WHERE f.CANCELLED = False
    AND f.DIVERTED = False
    AND t.is_train_day = 'False'
  ), STRUCT(0.7 AS threshold))
),
stats AS (
  SELECT
    COUNTIF(ontime != 'ontime' AND ontime = predicted_ontime) AS correct_cancel,
    COUNTIF(predicted_ontime = 'ontime') AS total_noncancel,
    COUNTIF(ontime = 'ontime' AND ontime = predicted_ontime) AS correct_noncancel,
    COUNTIF(ontime != 'ontime') AS total_cancel,
    SQRT(SUM((IF(ontime = 'ontime', 1, 0) - p.prob) *
             (IF(ontime = 'ontime', 1, 0) - p.prob)) / COUNT(*)) AS rmse
  FROM predictions, UNNEST(predicted_ontime_probs) p
  WHERE p.label = 'ontime'
)
SELECT
  correct_cancel / total_cancel AS correct_cancel,
  total_noncancel,
  correct_noncancel / total_noncancel AS correct_noncancel,
  total_cancel,
  rmse
FROM stats

The output is similar to the following:

correct_cancel  total_noncancel  correct_noncancel  total_cancel  rmse
0.836363        1301948          0.964337                         0.213091

The query above pulls the probability field out of the predictions (it is an array with one entry per category, hence the UNNEST) and uses it to compute the RMSE. The resulting RMSE is 0.2131.
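
In formula form, with y_i equal to 1 when flight i was actually on time (0 otherwise), p_i the model's predicted probability of 'ontime', and N the number of evaluated flights, the query computes:

RMSE = sqrt( (1/N) * SUM over i of (y_i - p_i)^2 )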

Task 5. Create, evaluate, and predict with a model that includes additional airport information

Create the BigQuery ML logistic regression model arr_delay_airports_lm by adding airport information to the model (note the two additional columns: origin and dest). This seemingly simple change adds two categorical variables that, when one-hot encoded, add 600+ new columns to the model.
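
To see where that column count comes from, you can count the distinct airports in the same table. A quick check; each distinct origin and dest value becomes its own one-hot column:

%%bigquery
SELECT COUNT(DISTINCT origin) AS num_origins,
       COUNT(DISTINCT dest) AS num_dests
FROM dsongcp.flights_tzcorr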

  1. To showcase the scalability of BigQuery, add two fields, the origin and destination airports:
%%bigquery
CREATE OR REPLACE MODEL dsongcp.arr_delay_airports_lm
OPTIONS(input_label_cols=['ontime'],
        model_type='logistic_reg',
        data_split_method='custom',
        data_split_col='is_eval_day')
AS
SELECT
  IF(arr_delay < 15, 'ontime', 'late') AS ontime,
  dep_delay, taxi_out, distance,
  origin, dest,
  IF(is_train_day = 'True', False, True) AS is_eval_day
FROM dsongcp.flights_tzcorr f
JOIN dsongcp.trainday t
ON f.FL_DATE = t.FL_DATE
WHERE f.CANCELLED = False
AND f.DIVERTED = False

Note: It will take about 10 minutes to complete.

  2. Evaluate the model arr_delay_airports_lm:
%%bigquery
SELECT * FROM ML.EVALUATE(MODEL dsongcp.arr_delay_airports_lm, (
  SELECT
    IF(arr_delay < 15, 'ontime', 'late') AS ontime,
    dep_delay, taxi_out, distance,
    origin, dest,
    IF(is_train_day = 'True', False, True) AS is_eval_day
  FROM dsongcp.flights_tzcorr f
  JOIN dsongcp.trainday t
  ON f.FL_DATE = t.FL_DATE
  WHERE f.CANCELLED = False
  AND f.DIVERTED = False
  AND t.is_train_day = 'False'
), STRUCT(0.7 AS threshold))

Output:

precision   recall      accuracy    f1_score    log_loss    roc_auc
0.967151    0.957706    0.938477    0.962405    0.165557    0.960855

  3. Make predictions from the model arr_delay_airports_lm:
%%bigquery
WITH predictions AS (
  SELECT * FROM ML.PREDICT(MODEL dsongcp.arr_delay_airports_lm, (
    SELECT
      IF(arr_delay < 15, 'ontime', 'late') AS ontime,
      dep_delay, taxi_out, distance,
      origin, dest,
      IF(is_train_day = 'True', False, True) AS is_eval_day
    FROM dsongcp.flights_tzcorr f
    JOIN dsongcp.trainday t
    ON f.FL_DATE = t.FL_DATE
    WHERE f.CANCELLED = False
    AND f.DIVERTED = False
    AND t.is_train_day = 'False'
  ), STRUCT(0.7 AS threshold))
),
stats AS (
  SELECT
    COUNTIF(ontime != 'ontime' AND ontime = predicted_ontime) AS correct_cancel,
    COUNTIF(predicted_ontime = 'ontime') AS total_noncancel,
    COUNTIF(ontime = 'ontime' AND ontime = predicted_ontime) AS correct_noncancel,
    COUNTIF(ontime != 'ontime') AS total_cancel,
    SQRT(SUM((IF(ontime = 'ontime', 1, 0) - p.prob) *
             (IF(ontime = 'ontime', 1, 0) - p.prob)) / COUNT(*)) AS rmse
  FROM predictions, UNNEST(predicted_ontime_probs) p
  WHERE p.label = 'ontime'
)
SELECT
  correct_cancel / total_cancel AS correct_cancel,
  total_noncancel,
  correct_noncancel / total_noncancel AS correct_noncancel,
  total_cancel,
  rmse
FROM stats

The resulting prediction statistics are:

correct_cancel  total_noncancel  correct_noncancel  total_cancel  rmse
0.84953         1299749          0.967151                         0.209839

The model that includes the airport information has an RMSE of 0.2098, which is an improvement over the original model created above.

Congratulations!

BigQuery ML provides a simple and powerful SQL interface for machine learning. You created a classifier model for predicting flight delays using BigQuery ML. Now you know how to build models, evaluate them, and make batch predictions, all with just SQL.

Finish your quest

This self-paced lab is part of the Data Science on Google Cloud: Machine Learning quest. A quest is a series of related labs that form a learning path. Completing this quest earns a badge to recognize your achievement. You can make your badges public and link to them in your online resume or social media account. Enroll in this quest and get immediate completion credit. See the Google Cloud Skills Boost catalog to see all available quests.

Take your next lab

Continue your quest with Machine Learning with TensorFlow in Vertex AI, or MLOps with Vertex AI.

Next steps / learn more

The source of this lab: Data Science on the Google Cloud Platform, 2nd Edition from O'Reilly Media, Inc.

Explore other notebooks. Repeat the steps in this lab for the following notebooks (in order).

Google Cloud training and certification

...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.

Manual Last Updated September 13, 2022

Lab Last Tested July 8, 2022

Copyright 2023 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.