Machine Learning with BigQuery ML
GSP202
Overview
BigQuery is a serverless, highly scalable data warehouse. It is also an excellent machine learning platform. This combination is significant for organizations that handle privacy-sensitive or confidential data as you can do machine learning without having to extract data out of the data warehouse. Not having extracts of data floating around in people’s projects is important for security. Also, the auditability that BigQuery provides out of the box means that you know exactly who created the model, and which data was used in which model.
This lab uses a set of code samples and scripts developed for Data Science on the Google Cloud Platform, 2nd Edition from O'Reilly Media, Inc.
Objectives
- Deploy a Vertex AI Workbench instance
- Build a logistic regression model with BigQuery ML
- Evaluate the model
- Make predictions from the model
- Create an AI model and then use the model to make predictions
- Add additional information to evaluate the model
Setup and requirements
Before you click the Start Lab button
Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.
This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.
To complete this lab, you need:
- Access to a standard internet browser (Chrome browser recommended).
- Time to complete the lab---remember, once you start, you cannot pause a lab.
How to start your lab and sign in to the Google Cloud Console
1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab Details panel with the following:
- The Open Google Console button
- Time remaining
- The temporary credentials that you must use for this lab
- Other information, if needed, to step through this lab
2. Click Open Google Console. The lab spins up resources, and then opens another tab that shows the Sign in page.
Tip: Arrange the tabs in separate windows, side-by-side.
Note: If you see the Choose an account dialog, click Use Another Account.
3. If necessary, copy the Username from the Lab Details panel and paste it into the Sign in dialog. Click Next.
4. Copy the Password from the Lab Details panel and paste it into the Welcome dialog. Click Next.
Important: You must use the credentials from the left panel. Do not use your Google Cloud Skills Boost credentials.
Note: Using your own Google Cloud account for this lab may incur extra charges.
5. Click through the subsequent pages:
- Accept the terms and conditions.
- Do not add recovery options or two-factor authentication (because this is a temporary account).
- Do not sign up for free trials.

After a few moments, the Cloud Console opens in this tab.
Task 1. Deploy a Vertex AI Workbench instance
1. On the Navigation Menu, click More products, and then click Vertex AI > Workbench.
2. On the Notebook instances page, click New Notebook > TensorFlow Enterprise > TensorFlow Enterprise 2.6 (with LTS) > Without GPUs.
3. In the New notebook dialog, name the notebook bqml-notebook.
4. For Region, select us-west1.
5. For Zone, select a zone within the selected region.
6. Leave all other fields at their default options, and click Create.

As the notebook spins up, bqml-notebook is listed in the Workbench notebook list. When provisioning completes, Open JupyterLab appears inline with the notebook name.

7. Click Open JupyterLab.

When the Vertex AI Notebook instance starts, you will see a suggestion to start a build to add tensorflow_model_analysis and xai_tabular_widget.

8. Click Build.

Your notebook is now set up.

9. In the Notebook launcher section, click Python 3 to open a new notebook.

To use a notebook, you enter commands into a cell. Be sure to run the commands in the cell by either pressing Shift + Enter, or clicking the triangle on the notebook top menu to Run selected cells and advance.
Task 2. Build logistic regression model
Create the training dataset
The first step in BigQuery ML is to create the training dataset. We want the three features (departure delay, taxi-out time, and distance) and the label (ontime), so let's use SQL to craft the dataset just the way we want it.
- Enter the following code into a new cell and then run the cell:
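The lab's SQL cells are not reproduced in this extract, but a query along the following lines builds the training table. The source table (dsongcp.flights_tzcorr), the trainday table and join key, the 15-minute on-time threshold, and the label values are assumptions based on the dataset described above; the lab's actual query may differ:

```sql
%%bigquery
CREATE OR REPLACE TABLE dsongcp.flights_train_data AS
SELECT
  IF(arr_delay < 15, 'ontime', 'late') AS ontime,  -- label: arrival delay under 15 min (assumed threshold)
  dep_delay,    -- departure delay in minutes
  taxi_out,     -- taxi-out time in minutes
  distance,     -- flight distance in miles
  is_train_day  -- Boolean from the table of prespecified training days
FROM dsongcp.flights_tzcorr
JOIN dsongcp.trainday USING (fl_date)
```

Materializing the pre-split data as a table means the same training/evaluation day assignment is reused by every model in this lab.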
Create the model
In this section, you create a logistic regression model where the label column is called ontime, and the remaining columns are used as input features to the model. Here, you do not want to randomly split the flight data: you need to avoid having correlated flights from the same day split between the training and test datasets. That's why, when you created the training data in the previous section, you pre-split the data and created a table that specifies which days should be used for training and which days for evaluation. The resulting model parameters are stored in a BigQuery model object called arr_delay_lm in the dataset dsongcp.

You can explicitly tell BigQuery ML to use a column in the training dataset to split the data. Add that column to your SELECT statement as a Boolean value by joining the flight data against the table of prespecified training days.
- Create a logistic regression model in BigQuery ML by running the following query in a new cell:
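The model-creation query is not included in this extract; a sketch consistent with the description above looks like this. Note that with data_split_method='CUSTOM', BigQuery ML holds out the rows where the split column is TRUE for evaluation, so the split column is derived by negating is_train_day (the column name is_eval_day and the training-table name are assumptions):

```sql
%%bigquery
CREATE OR REPLACE MODEL dsongcp.arr_delay_lm
OPTIONS(model_type='logistic_reg',
        input_label_cols=['ontime'],
        data_split_method='CUSTOM',
        data_split_col='is_eval_day')
AS
SELECT
  ontime, dep_delay, taxi_out, distance,
  NOT is_train_day AS is_eval_day  -- TRUE rows are withheld for evaluation
FROM dsongcp.flights_train_data
```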
In the above query, BigQuery trains a logistic regression model and puts the weights into the model arr_delay_lm. Obtain the training error (called the loss) by querying the result of the special function ML.TRAINING_INFO:
- Enter the following into a new cell and run it to obtain the training error:
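A minimal training-info query, assuming the model name above:

```sql
%%bigquery
SELECT iteration, loss, eval_loss, duration_ms
FROM ML.TRAINING_INFO(MODEL dsongcp.arr_delay_lm)
ORDER BY iteration
```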
Task 3. Evaluate the model
Evaluate the model by calling ML.EVALUATE in SQL. BigQuery then evaluates the model on the withheld data (where is_train_day is False), using a threshold of 0.7.
- Run the query below in a new cell to evaluate the model:
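An evaluation query along these lines matches the description above (the evaluation subquery and table name are assumptions):

```sql
%%bigquery
SELECT *
FROM ML.EVALUATE(
  MODEL dsongcp.arr_delay_lm,
  (SELECT ontime, dep_delay, taxi_out, distance
   FROM dsongcp.flights_train_data
   WHERE NOT is_train_day),       -- the withheld days only
  STRUCT(0.7 AS threshold))       -- classification threshold from the lab
```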
The resulting evaluation statistics are:
| precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|
| 0.964337 | 0.956535 | 0.935174 | 0.96042 | 0.167233 | 0.956276 |
The precision is how often the model is right when it reports a flight as being on time. The recall is the fraction of on-time flights correctly classified. The area under the Receiver Operating Characteristic curve (roc_auc) is a threshold-independent measure of classifier performance.
Task 4. Make prediction from the model
The ML.WEIGHTS function lets you see the underlying weights a model uses during prediction. This function applies to linear and logistic regression models.
- Run the following query in a new cell to obtain the weights by calling ML.WEIGHTS:
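The call itself is a one-liner; only the model name is taken from the text above:

```sql
%%bigquery
SELECT *
FROM ML.WEIGHTS(MODEL dsongcp.arr_delay_lm)
```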
The output is similar to the following:
However, there is usually no point in getting just the weights. Instead, what you want is the predicted value for some set of inputs.
- Run the query below in a new cell to carry out a prediction:
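A prediction query consistent with the output shown below feeds a single hand-built row of features to ML.PREDICT (the feature values match the sample output; the query text is a sketch):

```sql
%%bigquery
SELECT *
FROM ML.PREDICT(
  MODEL dsongcp.arr_delay_lm,
  (SELECT 12.0 AS dep_delay,   -- minutes of departure delay
          14.0 AS taxi_out,    -- minutes of taxi-out time
          1231  AS distance))  -- miles
```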
Output:
| predicted_ontime | predicted_ontime_probs | dep_delay | taxi_out | distance |
|---|---|---|---|---|
| ontime | [{'label': 'ontime', 'prob': 0.850350772376717...}] | 12.0 | 14.0 | 1231 |
While you can use ML.PREDICT to carry out predictions, each prediction is subject to the typical BigQuery latency of a second or so. Therefore, ML.PREDICT is typically used for batch predictions over large datasets. For online prediction (i.e., exposing the prediction service as a microservice using REST), you can extract the model as a TensorFlow model and deploy it to Vertex AI.

You can also use ML.PREDICT to compute some other metric from the predictions.

- Run the query below in a new cell to compute the Root Mean Squared Error (RMSE):
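A sketch of such a query, consistent with the UNNEST explanation that follows (label values and table names are assumptions; the lab's full query also computes the correct_*/total_* columns shown in the output below):

```sql
%%bigquery
SELECT
  SQRT(AVG((label - prob) * (label - prob))) AS rmse
FROM (
  SELECT
    IF(ontime = 'ontime', 1.0, 0.0) AS label,
    p.prob AS prob
  FROM ML.PREDICT(
         MODEL dsongcp.arr_delay_lm,
         (SELECT ontime, dep_delay, taxi_out, distance
          FROM dsongcp.flights_train_data
          WHERE NOT is_train_day)),
       UNNEST(predicted_ontime_probs) AS p  -- one entry per category
  WHERE p.label = 'ontime'                  -- keep the on-time probability
)
```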
The output is similar to the following:
| correct_cancel | total_noncancel | correct_noncancel | total_cancel | rmse |
|---|---|---|---|---|
| 0.836363 | 1301948 | 0.964337 | | 0.213091 |
The query above pulls out the probability field from the predictions (it’s an array, one for each category, hence the UNNEST) and uses it to compute the RMSE. The resulting RMSE was 0.2131.
Task 5. Create, evaluate, and predict with the model by adding additional airport information
Create the BigQuery ML logistic regression model arr_delay_airports_lm by adding airport information to the model (note the two additional columns: origin and dest). This seemingly simple change adds two categorical variables that, when one-hot encoded, add 600+ new columns to the model.
- To showcase the scalability of BigQuery, add the two fields, the origin and dest airports:
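A sketch of the expanded model, assuming the training table also carries the origin and dest columns; names follow the pattern of the first model:

```sql
%%bigquery
CREATE OR REPLACE MODEL dsongcp.arr_delay_airports_lm
OPTIONS(model_type='logistic_reg',
        input_label_cols=['ontime'],
        data_split_method='CUSTOM',
        data_split_col='is_eval_day')
AS
SELECT
  ontime, dep_delay, taxi_out, distance,
  origin, dest,                    -- the two new categorical columns
  NOT is_train_day AS is_eval_day  -- TRUE rows are withheld for evaluation
FROM dsongcp.flights_train_data
```

BigQuery ML one-hot encodes STRING feature columns automatically, which is where the 600+ extra model columns come from.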
- Evaluate the model arr_delay_airports_lm:
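The evaluation mirrors the earlier one, with the two extra columns passed through (subquery and table name are assumptions):

```sql
%%bigquery
SELECT *
FROM ML.EVALUATE(
  MODEL dsongcp.arr_delay_airports_lm,
  (SELECT ontime, dep_delay, taxi_out, distance, origin, dest
   FROM dsongcp.flights_train_data
   WHERE NOT is_train_day),
  STRUCT(0.7 AS threshold))
```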
Output:
| precision | recall | accuracy | f1_score | log_loss | roc_auc |
|---|---|---|---|---|---|
| 0.967151 | 0.957706 | 0.938477 | 0.962405 | 0.165557 | 0.960855 |
- Make a prediction from the model arr_delay_airports_lm:
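The RMSE computation is the same as for the first model, pointed at arr_delay_airports_lm (a sketch under the same assumptions as before):

```sql
%%bigquery
SELECT
  SQRT(AVG((label - prob) * (label - prob))) AS rmse
FROM (
  SELECT
    IF(ontime = 'ontime', 1.0, 0.0) AS label,
    p.prob AS prob
  FROM ML.PREDICT(
         MODEL dsongcp.arr_delay_airports_lm,
         (SELECT ontime, dep_delay, taxi_out, distance, origin, dest
          FROM dsongcp.flights_train_data
          WHERE NOT is_train_day)),
       UNNEST(predicted_ontime_probs) AS p
  WHERE p.label = 'ontime'
)
```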
The resulting prediction statistics are:
| correct_cancel | total_noncancel | correct_noncancel | total_cancel | rmse |
|---|---|---|---|---|
| 0.84953 | 1299749 | 0.967151 | | 0.209839 |
The model that includes the airport information has an RMSE of 0.2098, which is an improvement over the original model created above.
Congratulations!
BigQuery ML provides a simple and powerful SQL interface for machine learning. You created a classifier model for predicting flight delays using BigQuery ML. Now you know how to use BigQuery ML to build and evaluate models, and to make batch predictions, with just SQL.
Finish your quest
This self-paced lab is part of the Data Science on Google Cloud: Machine Learning quest. A quest is a series of related labs that form a learning path. Completing this quest earns a badge to recognize your achievement. You can make your badges public and link to them in your online resume or social media account. Enroll in this quest and get immediate completion credit. See the Google Cloud Skills Boost catalog to see all available quests.
Take your next lab
Continue your quest with Machine Learning with TensorFlow in Vertex AI, or MLOps with Vertex AI.
Next steps / learn more
The source of this lab:
Explore other notebooks. Repeat the steps in this lab for the following notebooks (in order).
Google Cloud training and certification
...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.
Manual Last Updated September 13, 2022
Lab Last Tested July 8, 2022
Copyright 2023 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.