
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
Data Catalog is a fully managed, scalable metadata management service in Google Cloud's Data Analytics family of products.
Managing data assets can be time consuming and expensive without the right tools. Data Catalog provides a centralized place where organizations can find, curate and describe their data assets.
There are two main ways you interact with Data Catalog:
In this lab, you will learn how to:
Very Important: Before starting this lab, log out of your personal or corporate Gmail account, or run this lab in an Incognito window. This prevents sign-in confusion while the lab is running.
If you have not done so already, click Start Lab.
Tip: It will take 3 - 5 minutes for the lab environment to auto-generate 2 Google Cloud Projects, 2 pre-populated Datasets, and 2 user accounts. You do not need to wait for the lab resources to finish provisioning before you continue reading this lab (you won't log in until after you read the scenario below).
Click Open Bike Console in the lab or, in a new Incognito browser window, navigate to the Cloud Console. Do not log in with any of the accounts provided yet. Continue reading the scenario first; you will be told later which account to use.
You are the head of a transportation business operating in New York City. You have teams of data analysts that query datasets you have collected about NYC travel (by bike and car).
Challenges:
Each data engineering team maintains their dataset in their own separate Google Cloud Project so they can better manage access and billing. While this is good for them, it makes these datasets less discoverable for your team of analysts.
To make matters more complex, you have different levels of data analysts on your BI team working for you:
To best simulate a true enterprise environment with multiple projects and datasets to catalog, your engineering team has given you access to existing resources (the lab preloads resources so you don't have to create them).
Your team has provided you logins as shown above to:
They added the following notes on access restriction:
Recall that your data engineering team provided you with two projects, each containing a different New York City dataset.
Confirm the Owner role can see and query all datasets.
Log in using the Owner (full admin) auto-generated email and password provided as part of this lab.
Accept the Terms and Conditions to use Google Cloud (if prompted).
Locate the NYC Motor Vehicle Collisions Project ID and find that string value in the Select a Project popup.
You will return later in this lab to use Data Catalog after you manually search and query the datasets first in BigQuery.
Let's now confirm that the Owner role can view the new_york_mv_collisions dataset.
In BigQuery, under Explorer, click your project name to toggle open the datasets you have access to.
Confirm you can see the new_york_mv_collisions dataset.
Click the new_york_mv_collisions dataset to toggle open the tables inside.
Click the nypd_mv_collisions table and explore the available fields in the schema.
The schema should look similar to the below:
While there isn't personally identifiable information like a phone number or email address in this table, you still need to use caution when sharing this dataset across the wider team.
The remainder of this lab focuses on ways to restrict access to datasets and to use Data Catalog to proactively tag datasets and tables with rich metadata for your organization.
Since you're logged in as a global owner, confirm you can see and access both projects and datasets. Confirm you can run the below query.
What were the 10 most common factors in NYC car crashes?
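One way to answer this question is a grouped count, sketched below. This is not the lab's own query: the column name contributing_factor_vehicle_1 is an assumption based on the public NYPD collisions schema, the 'Unspecified' filter is an illustrative choice, and the sample rows are made up. The sketch runs the SQL locally with Python's sqlite3 so you can see the shape of the result; in BigQuery you would point the FROM clause at new_york_mv_collisions.nypd_mv_collisions instead.

```python
import sqlite3

# Assumed column name; verify it against the schema shown in the BigQuery console.
FACTOR_COLUMN = "contributing_factor_vehicle_1"

# Plain SQL aggregation. For BigQuery, replace the table name with
# `new_york_mv_collisions.nypd_mv_collisions`.
TOP_FACTORS_SQL = f"""
SELECT {FACTOR_COLUMN} AS collision_factor,
       COUNT(*) AS num_collisions
FROM nypd_mv_collisions
WHERE {FACTOR_COLUMN} NOT IN ('Unspecified', '')
GROUP BY collision_factor
ORDER BY num_collisions DESC, collision_factor ASC
LIMIT 10
"""


def top_collision_factors(factors):
    """Load sample factor values into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE nypd_mv_collisions ({FACTOR_COLUMN} TEXT)")
    conn.executemany(
        "INSERT INTO nypd_mv_collisions VALUES (?)", [(f,) for f in factors]
    )
    result = conn.execute(TOP_FACTORS_SQL).fetchall()
    conn.close()
    return result


# Hypothetical sample values, not real lab data.
sample = (
    ["Driver Inattention/Distraction"] * 3
    + ["Unspecified"] * 5
    + ["Fatigued/Drowsy"]
    + [""]
)

for factor, count in top_collision_factors(sample):
    print(factor, count)
```

Filtering out unspecified and empty values keeps placeholder entries from dominating the ranking; drop the WHERE clause if you want to count them too.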
Click Check my progress to verify the objective.
Click Select a Project at the top of the page.
Click the All tab.
Find the Bike Share dataset by referring to the correct auto-generated project-id:
Open the new_york_citibike > citibike_trips table.
The NYC Citi Bike Public Dataset tracks each individual bike share trip (starting location, ending location) as well as other fields for each user.
Notice that the only three gender values provided in the dataset are unknown, male, and female, which may not be representative of all the gender values for bike share riders.
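A minimal sketch of how the most popular route per gender value could be computed is shown below. It is not the lab's own query. The column names gender, start_station_name, and end_station_name are assumptions based on the public Citi Bike trips schema, and the trips are invented sample rows. The SQL counts trips per (gender, start, end) combination, and a short Python loop keeps the busiest route for each gender.

```python
import sqlite3

# Count trips for every (gender, route) pair, busiest first.
# The extra ORDER BY columns only make tie-breaking deterministic.
ROUTES_BY_GENDER_SQL = """
SELECT gender, start_station_name, end_station_name, COUNT(*) AS num_trips
FROM citibike_trips
GROUP BY gender, start_station_name, end_station_name
ORDER BY num_trips DESC, start_station_name, end_station_name
"""


def most_popular_route_by_gender(trips):
    """Return {gender: (start, end, num_trips)} for the busiest route per gender."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE citibike_trips "
        "(gender TEXT, start_station_name TEXT, end_station_name TEXT)"
    )
    conn.executemany("INSERT INTO citibike_trips VALUES (?, ?, ?)", trips)
    best = {}
    for gender, start, end, num_trips in conn.execute(ROUTES_BY_GENDER_SQL):
        # Rows arrive busiest-first, so the first row seen per gender wins.
        best.setdefault(gender, (start, end, num_trips))
    conn.close()
    return best


# Hypothetical trips with made-up station names.
sample_trips = [
    ("male", "Station A", "Station B"),
    ("male", "Station A", "Station B"),
    ("male", "Station C", "Station D"),
    ("female", "Station C", "Station D"),
    ("female", "Station C", "Station D"),
    ("female", "Station A", "Station B"),
    ("unknown", "Station E", "Station F"),
]

print(most_popular_route_by_gender(sample_trips))
```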
Click Check my progress to verify the objective.
You will explore how to tag datasets and tables with sensitive data next.
So far in the lab you have been logged in as the Owner account, which your data engineering team has provided with the highest-level permissions.
You have asked your engineering teams to limit access for your Data Analyst users as follows:
Data Analysts should see:
Data Analysts should NOT see:
Click the profile icon.
Sign out.
Click Use another account.
Log back into Google Cloud using the Data Analyst User email and shared password.
Under Select a Project, confirm that you can see only one, not two, Qwiklabs auto-generated projects.
Select the Qwiklabs project you can access.
Navigate to BigQuery.
In BigQuery, even if a project is not pinned or visible in your Explorer section you can still query it if you have access. Try to query the NYC Collisions dataset directly as a Data Analyst user by using the project-id.
NYC Motor Vehicle Collisions Project:
Try to run the query.
Verify that you receive a not found error message.
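To query a table in another project, BigQuery standard SQL takes a fully qualified reference of the form `project.dataset.table`, wrapped in backticks. The sketch below builds such a query string; the row-count query and the placeholder project ID are illustrative assumptions, not the lab's own query.

```python
def qualified_table(project_id, dataset, table):
    """Return a backtick-quoted `project.dataset.table` reference."""
    return f"`{project_id}.{dataset}.{table}`"


def cross_project_count_query(project_id):
    """Build a simple row-count query against the collisions table."""
    table = qualified_table(
        project_id, "new_york_mv_collisions", "nypd_mv_collisions"
    )
    return f"SELECT COUNT(*) AS num_rows FROM {table}"


# Placeholder ID: substitute the NYC Motor Vehicle Collisions project ID from your lab.
print(cross_project_count_query("qwiklabs-gcp-YOUR-COLLISIONS-PROJECT"))
```

Run as the Data Analyst user, a query like this is expected to fail with the not found error described above, because that account has no access to the collisions project.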
You have now explored the different privileges and access granted to the Owner role (broadest set of privileges) and to Data Analysts (most restrictive) when it comes to accessing projects, datasets, and queries.
Next you will try to find a hidden dataset using the Data Catalog search functionality. Do you think it will show up for Data Analysts if BigQuery blocks you?
Now that you are familiar with the datasets and access levels granted to different roles, you will address the challenges posed earlier in the sample scenario:
Challenges:
To comply with recent regulatory requirements, you need a very clear way to flag which datasets have PII (Personally Identifiable Information) in them. You will address these challenges and complete this task with the Data Catalog service.
You may see a qwiklabs-resources project, which you can ignore. That project provides shared assets across all labs.
Filter for BigQuery and click OK.
Enter qwiklabs-gcp into Data Catalog's search bar to filter out external Qwiklabs resources.
Confirm your view as a Data Analyst looks similar to below:
Regardless of the project you are logged into, Data Catalog will surface ALL of the BigQuery datasets that your role has access to.
As a Data Analyst user, you will not see new_york_mv_collisions in Data Catalog even though it does exist (you queried it as an Owner):
Why is that? Next, explore how access control works at the Data Catalog level.
Before searching, discovering, or displaying Google Cloud resources, Data Catalog checks that the user has been granted an IAM role with the metadata read permissions required by BigQuery, Pub/Sub, or other source system to access the resource.
Example: Data Catalog checks that the user has been granted a role with the bigquery.tables.get permission before displaying BigQuery table metadata.
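The effect of this check can be pictured with a small toy model. This is only an illustration of the filtering idea, not how Data Catalog is implemented: the Entry type, the project names, and the permissions_by_project mapping are all hypothetical. The only detail taken from the text is the bigquery.tables.get permission.

```python
from dataclasses import dataclass

# Metadata read permission named in the text above.
REQUIRED_PERMISSION = "bigquery.tables.get"


@dataclass(frozen=True)
class Entry:
    """A catalog entry for one BigQuery table (hypothetical shape)."""
    project: str
    table: str


def visible_entries(entries, permissions_by_project):
    """Keep only entries in projects where the user holds the read permission."""
    return [
        entry
        for entry in entries
        if REQUIRED_PERMISSION in permissions_by_project.get(entry.project, set())
    ]


# Hypothetical project names standing in for the two lab projects.
catalog = [
    Entry("bike-share-project", "citibike_trips"),
    Entry("collisions-project", "nypd_mv_collisions"),
]

owner = {
    "bike-share-project": {REQUIRED_PERMISSION},
    "collisions-project": {REQUIRED_PERMISSION},
}
analyst = {"bike-share-project": {REQUIRED_PERMISSION}}

print([e.table for e in visible_entries(catalog, owner)])
print([e.table for e in visible_entries(catalog, analyst)])
```

In this model the analyst's search never returns the collisions table, which mirrors what you saw in the Data Catalog search results.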
Click the new_york_citibike table name entry. This is part of the ride share dataset you are allowed to view.
For BigQuery tables, Data Catalog allows you to tag:
Attempt to click on the Attach tag button.
Confirm you get a similar error:
It appears the Data Analyst role can search for metadata in Data Catalog but not attach new tags.
Next you'll see how Data Catalog tagging permissions and tag templates work.
Data Catalog tag templates help you create and manage common metadata about data assets in a single location. The tags are attached to the data asset, which means it can be discovered in the Data Catalog system. Using this feature, you can also build additional applications that consume this contextual metadata about a data asset.
To attach tags using an existing template, the user needs, at minimum, edit access to the resource in question (BigQuery for this lab) and the datacatalog.tagTemplateUser role (assuming a template has already been created). To learn more, refer to the Data Catalog IAM guide.
What if you need to create a new tag template? Then you would need, at minimum, the roles/datacatalog.tagTemplateCreator or roles/datacatalog.tagTemplateOwner role. Owner also allows you to delete existing templates and grants additional admin privileges.
Most common Data Catalog predefined Cloud IAM roles:
roles/datacatalog.tagTemplateViewer
roles/datacatalog.tagTemplateUser
roles/datacatalog.tagTemplateCreator
roles/datacatalog.tagTemplateOwner
Log in as the Owner account, which has the roles/datacatalog.tagTemplateOwner role. Select the NYC Bike Share Project that you were using before.
In Data Catalog:
Click Add Field.
Name the new field Contains PII, make it required, select the Boolean type, then click Done.
Click Add field.
Name the field PII Type, select the Enumerated type, then add the following values and click Done when you're finished:
Click Add field.
Name the field Data Owner Team, make it required, select the Enumerated type, add the following values, then click Done when you're finished:
You will see that there are no tags below the dataset name. Click Attach Tags.
Choose the template that you created earlier then click OK.
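The template built in the earlier steps has three fields: Contains PII (required Boolean), PII Type (Enumerated), and Data Owner Team (required Enumerated). The sketch below models that structure in plain Python to make it easier to reason about. It does not call the Data Catalog API. The field_id helper assumes field IDs are the lowercased display name joined with underscores, which matches the contains_pii ID used in the search step later in this lab but is otherwise an assumption, and the enumerated values are placeholders rather than the lab's actual values.

```python
import re
from dataclasses import dataclass, field


def field_id(display_name):
    """Derive an ID such as 'contains_pii' from 'Contains PII' (assumed rule)."""
    return re.sub(r"[^a-z0-9]+", "_", display_name.lower()).strip("_")


@dataclass
class TemplateField:
    """One field of a tag template, as modelled in this sketch."""
    display_name: str
    field_type: str  # "BOOL" or "ENUM" in this sketch
    required: bool = False
    allowed_values: list = field(default_factory=list)

    @property
    def id(self):
        return field_id(self.display_name)


def build_template(pii_types, owner_teams):
    """Return the three fields described in the lab steps."""
    return [
        TemplateField("Contains PII", "BOOL", required=True),
        TemplateField("PII Type", "ENUM", allowed_values=list(pii_types)),
        TemplateField(
            "Data Owner Team", "ENUM", required=True,
            allowed_values=list(owner_teams),
        ),
    ]


# Placeholder enumerated values; use the values listed in your lab instead.
template = build_template(
    pii_types=["EXAMPLE_TYPE_A", "EXAMPLE_TYPE_B"],
    owner_teams=["EXAMPLE_TEAM"],
)
for f in template:
    print(f.id, f.field_type, "required" if f.required else "optional")
```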
For more granular asset tagging, you can apply tags at the table and column level.
nypd_mv_collisions.
Click Check my progress to verify the objective.
Now that you have tagged your data assets, you can search your catalog by the tags you just added.
Search for the following, changing the project ID prefix to your current Qwiklabs Project ID:
tag:qwiklabs-YOUR-PROJECT-HERE.new_york_datasets.contains_pii
For other examples of how to quickly search across your catalog, follow the Search and view data assets with Data Catalog guide.
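The tag search string used in this step follows the pattern tag:PROJECT_ID.TEMPLATE_ID.FIELD_ID. A small helper like the one below (a hypothetical convenience function, not part of any Google Cloud library) can assemble it so the project ID is substituted consistently.

```python
def tag_search_query(project_id, template_id, field_id):
    """Build a Data Catalog search string of the form tag:project.template.field."""
    parts = (project_id, template_id, field_id)
    if any(not part or " " in part for part in parts):
        raise ValueError("each part must be non-empty and contain no spaces")
    return "tag:" + ".".join(parts)


# Replace the placeholder with your current Qwiklabs Project ID.
print(tag_search_query(
    "qwiklabs-YOUR-PROJECT-HERE", "new_york_datasets", "contains_pii"
))
```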
You've learned how to explore, search, and tag data in a project using Data Catalog. You also learned the value of restricting access to datasets and flagging fields with PII to better surface rich metadata for your teams.
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.