In this lab, you will learn how to import data from CSV text files into Cloud SQL and then carry out some basic data analysis using simple queries.
The dataset used in this lab is collected by the NYC Taxi and Limousine Commission and includes trip records from all trips completed in Yellow and Green taxis in NYC from 2009 to present, and all trips in for-hire vehicles (FHV) from 2015 to present. Records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.
This dataset can be used to demonstrate a wide range of data science concepts and techniques and will be used in several of the labs in the Data Engineering curriculum.
Objectives
Create a Cloud SQL instance
Create a Cloud SQL database
Import text data into Cloud SQL
Check the data for integrity
Setup and requirements
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
Activate Google Cloud Shell
Google Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on Google Cloud.
Google Cloud Shell provides command-line access to your Google Cloud resources.
In Cloud console, on the top right toolbar, click the Open Cloud Shell button.
Click Continue.
It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID.
gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.
You can list the active account name with this command:
gcloud auth list
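Create the taxi trips table by logging into the mysql command-line interface. The command below expects the MYSQLIP environment variable to hold the IP address of your Cloud SQL instance: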
mysql --host=$MYSQLIP --user=root \
--password --verbose
When prompted for a password, enter Passw0rd.
Paste the following content into the command line to create the schema for the trips table:
create database if not exists bts;
use bts;
drop table if exists trips;
create table trips (
  vendor_id VARCHAR(16),
  pickup_datetime DATETIME,
  dropoff_datetime DATETIME,
  passenger_count INT,
  trip_distance FLOAT,
  rate_code VARCHAR(16),
  store_and_fwd_flag VARCHAR(16),
  payment_type VARCHAR(16),
  fare_amount FLOAT,
  extra FLOAT,
  mta_tax FLOAT,
  tip_amount FLOAT,
  tolls_amount FLOAT,
  imp_surcharge FLOAT,
  total_amount FLOAT,
  pickup_location_id VARCHAR(16),
  dropoff_location_id VARCHAR(16)
);
Test completed task
Click Check my progress to verify the task you performed. If you have completed the task successfully, you will be granted an assessment score.
Create a bts database and trips table.
In the mysql command-line interface, check the table schema by entering the following command:
describe trips;
Query the trips table:
select distinct(pickup_location_id) from trips;
This will return an empty set as there is no data in the database yet.
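The two checks above (inspecting the schema, then confirming the table is empty) can be tried locally without a Cloud SQL instance. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL; `PRAGMA table_info` is SQLite's rough counterpart to `describe`, and the `COLUMNS` list and `create_trips_table` helper are names chosen for this illustration only.

```python
import sqlite3

# Column names and types copied from the lab's CREATE TABLE statement.
# SQLite accepts these MySQL type names as declared types.
COLUMNS = [
    ("vendor_id", "VARCHAR(16)"),
    ("pickup_datetime", "DATETIME"),
    ("dropoff_datetime", "DATETIME"),
    ("passenger_count", "INT"),
    ("trip_distance", "FLOAT"),
    ("rate_code", "VARCHAR(16)"),
    ("store_and_fwd_flag", "VARCHAR(16)"),
    ("payment_type", "VARCHAR(16)"),
    ("fare_amount", "FLOAT"),
    ("extra", "FLOAT"),
    ("mta_tax", "FLOAT"),
    ("tip_amount", "FLOAT"),
    ("tolls_amount", "FLOAT"),
    ("imp_surcharge", "FLOAT"),
    ("total_amount", "FLOAT"),
    ("pickup_location_id", "VARCHAR(16)"),
    ("dropoff_location_id", "VARCHAR(16)"),
]


def create_trips_table(conn):
    """Create an empty trips table with the lab's column layout."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in COLUMNS)
    conn.execute(f"CREATE TABLE trips ({cols})")


conn = sqlite3.connect(":memory:")
create_trips_table(conn)

# Each PRAGMA table_info row is (cid, name, type, notnull, default, pk).
described = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(trips)")]

# A freshly created table has no rows, so DISTINCT returns nothing.
distinct_ids = conn.execute(
    "SELECT DISTINCT pickup_location_id FROM trips"
).fetchall()

print(described)
print(distinct_ids)
```

This only mirrors the shape of the checks; type handling in SQLite differs from MySQL, so treat it as a way to experiment rather than a replacement for running the commands against the lab instance.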
Exit the mysql interactive console:
exit
Task 3. Add data to Cloud SQL instance
Now you'll copy the New York City taxi trips CSV files from Cloud Storage to your local environment. To keep resource usage low, you'll only be working with a subset of the data (~20,000 rows).
Connect to the mysql interactive console to load local infile data:
mysql --host=$MYSQLIP --user=root --password --local-infile
When prompted for a password, enter Passw0rd.
In the mysql interactive console select the database:
use bts;
Load the local CSV file data using local-infile:
LOAD DATA LOCAL INFILE 'trips.csv-1' INTO TABLE trips
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code,store_and_fwd_flag,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,imp_surcharge,total_amount,pickup_location_id,dropoff_location_id);
LOAD DATA LOCAL INFILE 'trips.csv-2' INTO TABLE trips
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,rate_code,store_and_fwd_flag,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,imp_surcharge,total_amount,pickup_location_id,dropoff_location_id);
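To make the clauses of the LOAD DATA statements above concrete, here is a rough Python sketch of the same idea: split each line on commas, skip the first (header) line as `IGNORE 1 LINES` does, and insert the remaining fields into the listed columns. It uses sqlite3 as a local stand-in for MySQL, the two "files" are invented one-row samples rather than real trip data, and `load_csv` is a hypothetical helper.

```python
import csv
import io
import sqlite3

# Column order taken from the lab's LOAD DATA statements.
COLUMNS = (
    "vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,"
    "rate_code,store_and_fwd_flag,payment_type,fare_amount,extra,mta_tax,"
    "tip_amount,tolls_amount,imp_surcharge,total_amount,"
    "pickup_location_id,dropoff_location_id"
).split(",")

HEADER = ",".join(COLUMNS)

# Invented sample contents standing in for trips.csv-1 and trips.csv-2.
FILE_1 = HEADER + "\n" + (
    "1,2018-01-01 00:10:00,2018-01-01 00:20:00,1,2.5,1,N,1,"
    "10.0,0.5,0.5,2.0,0.0,0.3,13.3,100,200\n"
)
FILE_2 = HEADER + "\n" + (
    "2,2018-01-02 08:00:00,2018-01-02 08:30:00,2,0.0,1,N,2,"
    "7.0,0.0,0.5,0.0,0.0,0.3,7.8,161,237\n"
)


def load_csv(conn, text):
    """Insert every data row of a CSV string into trips; return the row count."""
    reader = csv.reader(io.StringIO(text))  # fields terminated by ','
    next(reader)                            # same effect as IGNORE 1 LINES
    rows = [row for row in reader if row]   # drop blank trailing lines
    placeholders = ", ".join("?" * len(COLUMNS))
    conn.executemany(
        f"INSERT INTO trips ({', '.join(COLUMNS)}) VALUES ({placeholders})",
        rows,
    )
    return len(rows)


conn = sqlite3.connect(":memory:")
# Untyped columns keep the sketch short; values are stored as text here,
# whereas MySQL converts them to the declared column types.
conn.execute(f"CREATE TABLE trips ({', '.join(COLUMNS)})")

loaded = load_csv(conn, FILE_1) + load_csv(conn, FILE_2)
row_count = conn.execute("SELECT COUNT(*) FROM trips").fetchone()[0]
print(loaded, row_count)
```

Comparing the number of rows read with the number of rows in the table, as the last lines do, is a cheap first check that both files were loaded and that no header line slipped in as data.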
Task 4. Checking for data integrity
Whenever data is imported from a source it's always important to check for data integrity. Roughly, this means making sure the data meets your expectations.
Query the trips table for unique pickup location regions:
select distinct(pickup_location_id) from trips;
This should return 159 unique IDs.
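If you only want the number of unique IDs rather than the full list, `COUNT(DISTINCT ...)` returns it directly. A small sketch, using Python's sqlite3 module as a local stand-in for the lab database and a handful of invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (pickup_location_id VARCHAR(16))")

# Invented sample rows: three distinct IDs, one of them repeated.
conn.executemany(
    "INSERT INTO trips VALUES (?)",
    [("100",), ("100",), ("161",), ("237",)],
)

# The lab's query lists the distinct values one per row.
distinct_rows = conn.execute(
    "SELECT DISTINCT pickup_location_id FROM trips"
).fetchall()

# COUNT(DISTINCT ...) collapses that list into a single number.
distinct_count = conn.execute(
    "SELECT COUNT(DISTINCT pickup_location_id) FROM trips"
).fetchone()[0]

print(len(distinct_rows), distinct_count)
```

The same `COUNT(DISTINCT pickup_location_id)` expression is also valid in MySQL, so it can be run against the trips table in the lab.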
Let's start by digging into the trip_distance column. Enter the following query into the console:
select
max(trip_distance),
min(trip_distance)
from
trips;
One would expect the trip distance to be greater than 0 and less than, say, 1,000 miles. The maximum trip distance of 85 miles seems reasonable, but the minimum trip distance of 0 looks suspicious.
How many trips in the dataset have a trip distance of 0?
select count(*) from trips where trip_distance = 0;
There are 155 such trips in the database. These trips warrant further exploration. You'll find that these trips have non-zero payment amounts associated with them. Perhaps these are fraudulent transactions?
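One way to follow up on the zero-distance trips is to count how many of them still carry a charge. The sketch below shows the shape of such a query using Python's sqlite3 module as a local stand-in; the rows are invented for illustration, and pairing `trip_distance = 0` with `total_amount <> 0` is just one possible way to express the check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_distance FLOAT, total_amount FLOAT)")

# Invented sample rows, not real trip data.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [(0.0, 12.5), (0.0, 0.0), (3.2, 15.0), (0.0, 8.3)],
)

# All trips that report no distance travelled.
zero_distance = conn.execute(
    "SELECT COUNT(*) FROM trips WHERE trip_distance = 0"
).fetchone()[0]

# The subset of those trips that were nevertheless charged.
zero_but_charged = conn.execute(
    "SELECT COUNT(*) FROM trips "
    "WHERE trip_distance = 0 AND total_amount <> 0"
).fetchone()[0]

print(zero_distance, zero_but_charged)
```

Both WHERE clauses use standard SQL, so they can be pasted into the mysql console against the real trips table.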
Let's see if we can find more data that doesn't meet our expectations. We expect the fare_amount column to be positive. Enter the following query to see if this is true in the database:
select count(*) from trips where fare_amount < 0;
The query should return a count of 14. Again, these trips warrant further exploration. There may be a reasonable explanation for why the fares are negative. However, it's up to the data engineer to ensure there are no bugs in the data pipeline that would cause such a result.
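A possible next step in that exploration is to break the negative fares down by another column and see whether they cluster. The sketch below groups them by payment_type, using Python's sqlite3 module as a local stand-in for the lab database. The rows are invented, and grouping by payment type is only one of several breakdowns you might try.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (payment_type VARCHAR(16), fare_amount FLOAT)")

# Invented sample rows, not real trip data.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("1", 12.5), ("3", -4.5), ("4", -52.0), ("2", 7.0), ("3", -3.0)],
)

# Same check as in the lab: how many fares are negative?
negative_total = conn.execute(
    "SELECT COUNT(*) FROM trips WHERE fare_amount < 0"
).fetchone()[0]

# Break the negative fares down by payment type to look for a pattern.
negative_by_type = conn.execute(
    "SELECT payment_type, COUNT(*) FROM trips "
    "WHERE fare_amount < 0 "
    "GROUP BY payment_type ORDER BY payment_type"
).fetchall()

print(negative_total)
print(negative_by_type)
```

If most negative fares in the real data fell under a single payment type, that would point toward a business explanation rather than a pipeline bug, though the sample rows here say nothing about what the real data contains.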
Finally, let's investigate the payment_type column.
select
payment_type,
count(*)
from
trips
group by
payment_type;
The results of the query indicate that there are four different payment types, with:
Payment type = 1 has 13863 rows
Payment type = 2 has 6016 rows
Payment type = 3 has 113 rows
Payment type = 4 has 32 rows
Digging into the documentation, a payment type of 1 refers to credit card use, a payment type of 2 is cash, a payment type of 3 is no charge, and a payment type of 4 refers to a dispute. The figures make sense.
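The counts above can also be cross-checked against the size of the import. The following Python sketch attaches labels to the codes and computes each type's share of the total. The counts are the ones reported by the query, the labels for codes 1, 2, and 4 come from the text above, and the label for code 3 is taken from the TLC data dictionary, which is an assumption as far as this lab is concerned. `summarize` is a hypothetical helper.

```python
# Counts reported by the lab's GROUP BY query.
PAYMENT_COUNTS = {"1": 13863, "2": 6016, "3": 113, "4": 32}

# Labels for 1, 2 and 4 come from the lab text; the label for 3 is an
# assumption based on the TLC data dictionary.
PAYMENT_LABELS = {
    "1": "credit card",
    "2": "cash",
    "3": "no charge",
    "4": "dispute",
}


def summarize(counts, labels):
    """Return (code, label, count, percent of total) for each payment type."""
    total = sum(counts.values())
    return [
        (code, labels.get(code, "unknown"), n, round(100 * n / total, 2))
        for code, n in sorted(counts.items())
    ]


summary = summarize(PAYMENT_COUNTS, PAYMENT_LABELS)
total_rows = sum(PAYMENT_COUNTS.values())

for code, label, n, pct in summary:
    print(f"{code} ({label}): {n} rows, {pct}%")
print("total rows:", total_rows)
```

The four counts add up to 20,024 rows, which is consistent with the roughly 20,000 rows the lab says were imported, so no rows are missing a payment type.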
Exit the 'mysql' interactive console:
exit
End your lab
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
1 star = Very dissatisfied
2 stars = Dissatisfied
3 stars = Neutral
4 stars = Satisfied
5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.