Visualize Real Time Geospatial Data with Google Data Studio

1 hour 15 minutes 7 Credits

GSP201

Google Cloud Self-Paced Labs

Overview

This lab demonstrates how to use Google Dataflow to process a real-time stream of events simulated from a real-world historical data set, store the results in BigQuery, and then use Google Data Studio to visualize the real-time geospatial data.

Cloud Dataflow is a fully-managed service for transforming and enriching data in stream (real time) and batch (historical) modes via Java and Python APIs with the Apache Beam SDK. Cloud Dataflow provides a serverless architecture that can be used to shard and process very large batch data sets, or high volume live streams of data, in parallel.
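
The pipeline you deploy later in this lab is written in Java, but the same streaming pattern can be sketched with the Beam Python SDK. The snippet below is only an illustration; the project, topic, bucket, region, and table names are placeholders, not resources used in this lab.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder options: a real job needs your own project, region, and bucket.
options = PipelineOptions(
    streaming=True,
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp')

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadEvents' >> beam.io.ReadFromPubSub(
           topic='projects/my-project/topics/flight-events')
     | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
     # A real pipeline would parse each event and aggregate delays here.
     | 'ToRow' >> beam.Map(lambda line: {'event': line})
     | 'Write' >> beam.io.WriteToBigQuery(
           'my-project:flights.raw_events',
           schema='event:STRING'))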

BigQuery is a RESTful web service that enables interactive analysis of massive datasets, working in conjunction with Google Cloud Storage.
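
As an illustration of that interactive query API (not a lab step), the Python client library can run a standard SQL query and iterate over the results. This sketch assumes credentials are already configured and that a table like the flights.streaming_delays table created later in this lab exists.

from google.cloud import bigquery

# Assumes Application Default Credentials and an existing
# flights.streaming_delays table (created later in this lab).
client = bigquery.Client()
query = """
    SELECT airport, AVG(dep_delay) AS avg_dep_delay
    FROM flights.streaming_delays
    GROUP BY airport
    ORDER BY avg_dep_delay DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.airport, row.avg_dep_delay)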

The data set that is used provides historic information about internal flights in the United States retrieved from the US Bureau of Transport Statistics website. This data set can be used to demonstrate a wide range of data science concepts and techniques and will be used in all of the other labs in the Data Science on Google Cloud Quest.

Objectives

  • Create a Google Dataflow processing job for streaming data

  • Generate real-time streaming data using Python

  • Analyze streaming data in BigQuery

  • Create a real-time geospatial dashboard for streaming data

Setup and requirements

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

What you need

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
  • Time to complete the lab.

Note: If you already have your own personal Google Cloud account or project, do not use it for this lab.

Note: If you are using a Chrome OS device, open an Incognito window to run this lab.

How to start your lab and sign in to the Google Cloud Console

  1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is a panel populated with the temporary credentials that you must use for this lab.

  2. Copy the username, and then click Open Google Console. The lab spins up resources, and then opens another tab that shows the Sign in page.

    Tip: Open the tabs in separate windows, side-by-side.

  3. In the Sign in page, paste the username that you copied from the left panel. Then copy and paste the password.

    Important: You must use the credentials from the left panel. Do not use your Google Cloud Training credentials. If you have your own Google Cloud account, do not use it for this lab to avoid incurring charges to your account.

  4. Click through the subsequent pages:

    • Accept the terms and conditions.
    • Do not add recovery options or two-factor authentication (because this is a temporary account).
    • Do not sign up for free trials.

After a few moments, the Cloud Console opens in this tab.

Activate Cloud Shell

Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources.

In the Cloud Console, in the top right toolbar, click the Activate Cloud Shell button.

Click Continue.

It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID.

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

You can list the active account name with this command:

gcloud auth list

(Output)

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

(Example output)

Credentialed accounts:
 - google1623327_student@qwiklabs.net

You can list the project ID with this command:

gcloud config list project

(Output)

[core]
project = <project_ID>

(Example output)

[core]
project = qwiklabs-gcp-44776a13dea667a6

Check project permissions

Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu, click IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Home.

If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.

  • In the Google Cloud console, on the Navigation menu, click Home.

  • Copy the project number (e.g. 729328892908).

  • On the Navigation menu, click IAM & Admin > IAM.

  • At the top of the IAM page, click Add.

  • For New principals, type:

{project-number}-compute@developer.gserviceaccount.com

Replace {project-number} with your project number.

  • For Role, select Project (or Basic) > Editor. Click Save.

Preparing your environment

This lab uses a set of code samples and scripts developed for the Data Science on Google Cloud book from O'Reilly Media, Inc. You will clone the sample repository used in Chapter 4 from GitHub to Cloud Shell and carry out all of the lab tasks from there.

Please wait until you see the green Lab Running indicator on the page where you started the lab. Your environment isn't ready until you see this indicator.

Clone the Data Science on Google Cloud repository

In Cloud Shell, enter the following commands to clone the repository:

git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp/

Change to the repository source directory for this lab:

cd ~/data-science-on-gcp/04_streaming

Create an isolated Python environment

Execute the following command to update the package list:

sudo apt-get update

Python virtual environments are used to isolate package installation from the system.

sudo apt-get install -y virtualenv

If prompted [Y/n], press Y and then Enter.

virtualenv -p python3 venv

Activate the virtual environment.

source venv/bin/activate

Install the Python packages that are required:

pip install google-cloud-bigquery
pip install google-cloud-pubsub

Create the real-time Google Dataflow stream processing job

Create a default OAuth application credential that will allow the Python real-time event simulation script to access the simulated event data in BigQuery:

gcloud auth application-default login

When prompted to continue, enter Y.

Click on the link that appears in the output to open the OAuth authentication page.

Select the credentials you signed into the lab with, then click ALLOW.

Copy the authentication code that is displayed and paste it into the verification code prompt, then press Enter. You should see a confirmation that those credentials will now be used by any application requesting Application Default Credentials.
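
You can optionally confirm from Python that the Application Default Credentials are in place before running the simulation. This is not a lab step; the google.auth package is installed as a dependency of the google-cloud packages you installed earlier.

import google.auth

# Loads the Application Default Credentials created by
# `gcloud auth application-default login`.
credentials, project = google.auth.default()
print('ADC found for project:', project)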

Navigate to the simulate directory:

cd ~/data-science-on-gcp/04_streaming/simulate

Create an environment variable for your Project ID.

export PROJECT_ID=$(gcloud info --format='value(config.project)')

Run this command to install the pytz library:

pip install pytz

Run this simulation script so that it creates the Google Pub/Sub topics:

python ./simulate.py --project $PROJECT_ID --startTime '2015-01-01 06:00:00 UTC' --endTime '2015-01-04 06:00:00 UTC' --speedFactor=30

Wait for a single event to be displayed, then press CTRL+c to quit.

Example event:

Simulation start time is 2015-01-01 00:00:00+00:00
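
Behind the scenes, simulate.py reads the historical flight data from BigQuery and publishes simulated events to Pub/Sub at the requested speed-up factor. The snippet below is only a simplified sketch of that publishing pattern, using the google-cloud-pubsub package installed earlier; the topic name and event payload are illustrative, not taken from the lab script.

from google.cloud import pubsub_v1

# Simplified sketch: publish one simulated event to a placeholder topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path('my-project', 'departed')  # placeholder names

event = b'2015-01-01 06:02:00 UTC,DEN,-3.0'  # illustrative event payload
future = publisher.publish(topic_path, event)
print('Published message ID:', future.result())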

Click Check my progress to verify the objective.

Create the real-time Google Dataflow stream processing job

Process stream data using Cloud Dataflow

Deploy a job to process stream data

Create a Cloud Storage bucket to hold the simulated event data:

export BUCKET=$PROJECT_ID-ml
gsutil mb gs://$BUCKET

Use Maven to deploy the Java stream processing job to Google Cloud Dataflow:

cd ../realtime/chapter4
mvn compile exec:java \
  -Dexec.mainClass=com.google.cloud.training.flights.AverageDelayPipeline \
  -Dexec.args="--project=$PROJECT_ID \
  --stagingLocation=gs://$BUCKET/staging/ \
  --averagingInterval=60 \
  --speedupFactor=30 \
  --runner=DataflowRunner"
cd ..

In the Cloud Console, click Navigation menu > Dataflow to open the Dataflow console.

Click on the name of the streaming Dataflow job to inspect it.

The ingestion paths for arrival and departure events can be seen, along with their associated delay, timestamp, and aggregation functions. The data that is output to BigQuery is aggregated by airport ID.
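
Expressed with the Beam Python SDK, the aggregation performed by the Java pipeline looks roughly like the transform sketched below: key each event by airport, apply a sliding window, and take the mean delay per airport. The window sizes and field names are illustrative and are not taken from the lab's Java code.

import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows
from apache_beam.transforms.combiners import Mean

def average_delay_by_airport(events):
    # events: a PCollection of dicts such as {'airport': 'DEN', 'dep_delay': -3.0, ...}
    return (events
            | 'KeyByAirport' >> beam.Map(lambda e: (e['airport'], e['dep_delay']))
            | 'SlidingWindow' >> beam.WindowInto(
                  SlidingWindows(size=60 * 60, period=5 * 60))
            | 'MeanDelay' >> Mean.PerKey())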

Start the Python simulation script

Back in Cloud Shell, replace the simulation script's project_path variable with your unique Pub/Sub project path and rerun the simulation script:

cd ../simulate
sed -i 's/project_path/\"projects\/'"$PROJECT_ID"'\"/g' simulate.py
python ./simulate.py --project $PROJECT_ID --startTime '2015-01-01 06:00:00 UTC' --endTime '2015-01-04 06:00:00 UTC' --speedFactor=30

Click Check my progress to verify the objective.

Process stream data using Cloud Dataflow

Prepare your data in BigQuery

Open BigQuery Console

In the Google Cloud Console, select Navigation menu > BigQuery:

The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and the release notes.

Click Done.

The BigQuery console opens.

Enter the following in the Query editor:

#standardsql
SELECT
  *
FROM
  flights.streaming_delays
WHERE
  airport = 'DEN'
ORDER BY
  timestamp

It takes 2-5 minutes for your Dataflow job to create the table and start putting data in it. If you click Run before then, you may initially see a Query Failed error.

You may want to watch Dataflow to see when data is getting written to BigQuery. Re-run the query a few times. When you see 5-6 records in the table, you can analyze the data.

Analyze the data as it arrives by constructing a query that first finds the latest timestamp, then returns the delay data for all events that happened within the last 30 minutes of it.

Click on COMPOSE NEW QUERY. Copy the updated query below and paste it into the query dialog field:

#standardsql
SELECT
  airport,
  arr_delay,
  dep_delay,
  timestamp,
  latitude,
  longitude,
  num_flights
FROM
  flights.streaming_delays
WHERE
  ABS(TIMESTAMP_DIFF(timestamp,
      (
      SELECT
        MAX(timestamp) latest
      FROM
        flights.streaming_delays ),
      MINUTE)) < 29
  AND num_flights > 10

Click Run.

You may or may not see results yet. However, you can build on this to provide a query that reports aggregate arrival and departure delay times for all airports.

Click on COMPOSE NEW QUERY. Copy the following query to the Query editor, replace [PROJECT_ID] with the Project ID for this lab, and then click Run:

#standardSQL
SELECT
  airport,
  last[safe_OFFSET(0)].*,
  CONCAT(CAST(last[safe_OFFSET(0)].latitude AS STRING), ",",
    CAST(last[safe_OFFSET(0)].longitude AS STRING)) AS location
FROM (
  SELECT
    airport,
    ARRAY_AGG(STRUCT(arr_delay,
        dep_delay,
        timestamp,
        latitude,
        longitude,
        num_flights)
    ORDER BY
      timestamp DESC
    LIMIT
      1) last
  FROM
    `[PROJECT_ID].flights.streaming_delays`
  GROUP BY
    airport )

This query enhances the aggregate delay information that the previous query provided and retains the latest update from each airport, thus maintaining visibility into airports with very low flight numbers. The query also provides a combined latitude and longitude value in a format that matches one of the geographic location formats recognized by Data Studio.

The query is in this format so you can convert it into a BigQuery database view.

Now click Save > Save view. Enter flightdelays as the Table name, select flights as the Dataset ID, and click Save. Next you'll work with your saved view in Google Data Studio.
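
(Optional) If you want to sanity-check the new view before moving on, you can query it from Cloud Shell with the Python BigQuery client installed earlier; you should see one latest row per airport. This is not a required lab step.

from google.cloud import bigquery

# Query the flightdelays view saved above.
client = bigquery.Client()
rows = client.query(
    'SELECT airport, location, num_flights FROM flights.flightdelays LIMIT 5'
).result()
for row in rows:
    print(row.airport, row.location, row.num_flights)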

Click Check my progress to verify the objective.

Prepare your data in BigQuery

Visualize your data in Data Studio

Click this link to open Google Data Studio: https://datastudio.google.com/.

Click Blank Report.

Click GET STARTED.

Check the acknowledgement box to accept the Terms of service and click Continue.

Click No for all of the notification email prefrences and then click Continue.

Again, click Blank Report.

Select the BigQuery tile.

Click Authorize for Data Studio to connect to your BigQuery project.

Select My Projects > [Project-ID] > flights > flightdelays.

Remember, [PROJECT-ID] is the Project ID in the left pane under Username and Password.

Click Add in the lower right. In the confirmation dialog, click on ADD TO REPORT.

Select Insert > Geo chart in the top ribbon, and then drop it on the canvas.

In the Available Fields menu, on the right, drag location to the Dimension block, replacing "Invalid dimension".

Click on the ABC icon for location to change the Type to Geo > Latitude, Longitude.

Drag arr_delay into Metric.

Click in the Zoom Area and select United States.

Click on the Text tool in the top ribbon and add a label to the chart. Type in Arrival Delay to identify this as the chart that displays the arrival delay results.

Copy the map and label and paste them onto the canvas. Click on the new map, which switches you back to the Data tab in the right menu. Change the metric to dep_delay. Change the label for this chart to Departure Delay.

Make a third copy of the map and label. Change the metric to num_flights. Change the label for this chart to Total Number of Flights.

You can now see that the average arrival delay is significantly smaller than the average departure delay, indicating that time lost due to departure delays is often recovered. The main hub airport locations are also easily identified in the third chart.

Congratulations!

Now you know how to use Google Dataflow to process streaming data and how to visualize real-time geospatial event data using Google Data Studio.

Finish Your Quest

This self-paced lab is part of the Qwiklabs Data Science on Google Cloud Quest. A Quest is a series of related labs that form a learning path. Completing this Quest earns you the badge above, to recognize your achievement. You can make your badge (or badges) public and link to them in your online resume or social media account. Enroll in this Quest and get immediate completion credit if you've taken this lab. See other available Qwiklabs Quests.

Take Your Next lab

Continue your Quest with Loading Data into BigQuery for Exploratory Data Analysis, or check out these suggestions:

Next steps / learn more

Here are some follow-up steps:

Google Cloud Training & Certification

...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.

Manual Last Updated September 08, 2021
Lab Last Tested September 08, 2021

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.