
Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
In this lab, you complete the following tasks:
- Creating the Cloud Storage buckets
- Pushing the source code to Cloud Source Repositories
- Creating Cloud Build pipelines
- Creating the production pipeline
- Configuring a build trigger
- Testing the trigger
In this lab, you set up a continuous integration/continuous deployment (CI/CD) pipeline for processing data by implementing CI/CD methods with managed products on Google Cloud. Data scientists and engineers can adapt the methodologies from CI/CD practices to help ensure the high quality, maintainability, and adaptability of data processes and workflows. The methods that you apply in this lab include automatically building, deploying, and testing the data-processing workflow.
In this lab, you use the following Google Cloud products: Cloud Build, Cloud Source Repositories, Cloud Composer, Dataflow, and Cloud Storage.
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab Details panel, which contains the credentials (Username and Password) and other details that you need for this lab.
Click Open Google Cloud console (or right-click and select Open Link in Incognito Window if you are running the Chrome browser).
The lab spins up resources, and then opens another tab that shows the Sign in page.
Tip: Arrange the tabs in separate windows, side-by-side.
If necessary, copy the Username below and paste it into the Sign in dialog.
You can also find the Username in the Lab Details panel.
Click Next.
Copy the Password below and paste it into the Welcome dialog.
You can also find the Password in the Lab Details panel.
Click Next.
Click through the subsequent pages:
After a few moments, the Google Cloud console opens in this tab.
Google Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5 GB home directory and runs on Google Cloud.
Google Cloud Shell provides command-line access to your Google Cloud resources.
In Cloud console, on the top right toolbar, click the Open Cloud Shell button.
Click Continue.
It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID.
gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.
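For example, you can confirm the active account and the current project with the following standard gcloud commands (optional, and not specific to this lab):

```bash
# List the account that Cloud Shell is currently authenticated as.
gcloud auth list

# Show the project ID that gcloud is configured to use.
gcloud config list project
```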
At a high level, the CI/CD pipeline consists of the following steps:
The following diagram shows a detailed view of the CI/CD pipeline steps.
In this lab, the deployments to the test and production environments are separated into two different Cloud Build pipelines: a test pipeline and a production pipeline.
In the preceding diagram, the test pipeline consists of the following steps:
To completely separate the environments, you need multiple Cloud Composer environments created in different projects, which are by default separated from each other. This separation helps to secure your production environment. This approach is outside the scope of this lab. For more information about how to access resources across multiple Google Cloud projects, see Setting service account permissions.
The instructions for how Cloud Composer runs the data-processing workflow are defined in a Directed acyclic graph (DAG) written in Python. In the DAG, all the steps of the data-processing workflow are defined together with the dependencies between them.
The CI/CD pipeline automatically deploys the DAG definition from Cloud Source Repositories to Cloud Composer in each build. This process ensures that Cloud Composer is always up to date with the latest workflow definition without needing any human intervention.
In the DAG definition for the test environment, an end-to-end test step is defined in addition to the data-processing workflow. The test step helps make sure that the data-processing workflow runs correctly.
The data-processing workflow is illustrated in the following diagram.
The data-processing workflow consists of the following steps:
Run the WordCount data process in Dataflow.
Download the output files from the WordCount process. The WordCount process outputs three files:
download_result_1
download_result_2
download_result_3
Download the reference file, called download_ref_string.
Verify the result against the reference file. This integration test aggregates all three results and compares the aggregated results with the reference file.
Using a task-orchestration framework such as Cloud Composer to manage the data-processing workflow helps alleviate the code complexity of the workflow.
In addition to the integration test that verifies the data-processing workflow from end to end, there are two unit tests in this lab. The unit tests are automatic tests on the data-processing code and the data-processing workflow code. The test on the data-processing code is written in Java and runs automatically during the Maven build process. The test on the data-processing workflow code is written in Python and runs as an independent build step.
The sample code is in two folders:
- The env-setup folder contains shell scripts for the initial setup of the Google Cloud environment.
- The source-code folder contains code that is developed over time, needs to be source controlled, and triggers automatic build and test processes. This folder contains the following subfolders:
  - The data-processing-code folder contains the Apache Beam process source code.
  - The workflow-dag folder contains the Composer DAG definitions for the data-processing workflows with the steps to design, implement, and test the Dataflow process.
  - The build-pipeline folder contains two Cloud Build configurations: one for the test pipeline and the other for the production pipeline. This folder also contains a support script for the pipelines.

For the purpose of this lab, the source code files for data processing and for the DAG workflow are in different folders in the same source code repository. In a production environment, the source code files are usually in their own source code repositories and are managed by different teams.
In this lab, you run all commands in Cloud Shell. Cloud Shell appears as a window at the bottom of the Google Cloud console.
In the Cloud console, open Cloud Shell.
Clone the sample code repository:
Navigate to the directory that contains the sample files for this lab:
Update the region in the file set_env.sh using the sed command.
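A minimal sketch of this edit is shown below. It assumes that set_env.sh contains a hard-coded region such as us-central1 and that your assigned region is exported in a REGION variable; both are assumptions for illustration.

```bash
# Replace a hard-coded region in set_env.sh with the region assigned to your lab.
# The placeholder region and the REGION variable are assumptions; use the
# values given in your lab instructions.
sed -i "s/us-central1/${REGION}/g" set_env.sh
```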
Run a script to set environment variables:
The script sets the following environment variables:
Because environment variables aren't retained between sessions, if your Cloud Shell session shuts down or disconnects while you are working through the lab, you need to reset the environment variables.
Add the logging option to the YAML file:
Update the pipeline script:
To ensure access to the necessary APIs, restart the connection to the Cloud Composer API.
In the Google Cloud console, enter Cloud Composer API in the top search bar, then click on the result for Cloud Composer API.
Click Manage.
Click Disable API.
If asked to confirm, click Disable.
Click Enable to re-enable the API. When the API has been enabled again, the page shows the option to disable it.
In Cloud Shell, run the following to create a Cloud Composer environment:
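The lab provides the exact command; a sketch, assuming the environment name and region are exported as COMPOSER_ENV_NAME and COMPOSER_REGION by the setup script, looks like this:

```bash
# Create a Cloud Composer environment; provisioning can take 20 minutes or more.
# COMPOSER_ENV_NAME and COMPOSER_REGION are assumed to be set by set_env.sh.
gcloud composer environments create "${COMPOSER_ENV_NAME}" \
    --location "${COMPOSER_REGION}"
```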
When the command completes, you can verify the environment in the Google Cloud console.
Run a script to set the variables in the Cloud Composer environment. The variables are needed for the data-processing DAGs.
The script sets the following environment variables:
Cloud Composer uses a Cloud Storage bucket to store DAGs. Moving a DAG definition file to the bucket triggers Cloud Composer to automatically read the files. You created the Cloud Storage bucket for Cloud Composer when you created the Cloud Composer environment. In the following procedure, you extract the URL for the buckets, and then configure your CI/CD pipeline to automatically deploy DAG definitions to the Cloud Storage bucket.
In Cloud Shell, export the URL for the bucket as an environment variable:
Export the name of the service account that Cloud Composer uses in order to have access to the Cloud Storage buckets:
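Both values can be read from the environment description. A sketch, assuming the same COMPOSER_ENV_NAME and COMPOSER_REGION variables as before:

```bash
# URL (gs:// prefix) of the bucket where Cloud Composer reads DAG files.
export COMPOSER_DAG_BUCKET=$(gcloud composer environments describe "${COMPOSER_ENV_NAME}" \
    --location "${COMPOSER_REGION}" \
    --format="get(config.dagGcsPrefix)")

# Service account used by the Cloud Composer environment.
export COMPOSER_SERVICE_ACCOUNT=$(gcloud composer environments describe "${COMPOSER_ENV_NAME}" \
    --location "${COMPOSER_REGION}" \
    --format="get(config.nodeConfig.serviceAccount)")
```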
In this section, you create a set of Cloud Storage buckets to store the following:
To create the Cloud Storage buckets, complete the following step:
In Cloud Shell, create Cloud Storage buckets and give the Cloud Composer service account permission to run the data-processing workflows:
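The bucket names used by the lab script are not shown here; conceptually, each bucket is created and the Composer service account is granted object access, roughly as in this sketch (the bucket name and variables are placeholders):

```bash
# Create a bucket in the Composer region (the bucket name is a placeholder).
gsutil mb -l "${COMPOSER_REGION}" "gs://${PROJECT_ID}-dataflow-jars"

# Allow the Cloud Composer service account to read and write objects in it.
gsutil iam ch \
    "serviceAccount:${COMPOSER_SERVICE_ACCOUNT}:roles/storage.objectAdmin" \
    "gs://${PROJECT_ID}-dataflow-jars"
```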
Click Check my progress to verify the objective.
In this lab, you have one source code base that you need to put into version control. The following step shows how a code base is developed and changes over time. Whenever changes are pushed to the repository, the pipeline to build, deploy, and test is triggered.
In Cloud Shell, push the source-code folder to Cloud Source Repositories:
These are standard commands to initialize Git in a new directory and push the content to a remote repository.
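A sketch of these commands, assuming the repository is named data-pipeline-source (the name referenced later in the build trigger), that you run them from inside the source-code folder, and that PROJECT_ID is set by the setup script:

```bash
# Create the repository in Cloud Source Repositories.
gcloud source repos create data-pipeline-source

# Initialize Git, point it at the new repository, and push the code.
git init
git config credential.helper gcloud.sh   # authenticate Git with gcloud credentials
git remote add google \
    "https://source.developers.google.com/p/${PROJECT_ID}/r/data-pipeline-source"
git add .
git commit -m "Initial commit of the data-processing source code"
git push --all google
```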
Click Check my progress to verify the objective.
In this section, you create the build pipelines that build, deploy, and test the data-processing workflow.
The build and test pipeline steps are configured in the YAML configuration file. In this lab, you use prebuilt builder images for git, maven, gsutil, and gcloud to run the tasks in each build step.
You use configuration variable substitutions to define the environment settings at build time. The source code repository location is defined by variable substitutions, as well as the locations of the Cloud Storage buckets. The build needs this information to deploy the JAR file, the test files, and the DAG definition.
In Cloud Shell, submit the build pipeline configuration file to create the pipeline in Cloud Build:
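The exact command and substitution values are supplied by the lab. A sketch, assuming the build is submitted from the source-code folder and that the bucket and Composer values were exported by the setup script (the shell variable names on the right-hand side are assumptions for illustration):

```bash
# Submit the test build pipeline to Cloud Build.
# The substitution names mirror the _COMPOSER_* and _DATAFLOW_* variables
# listed later in this lab; the shell variable names are assumptions.
gcloud builds submit --config=build-pipeline/build_deploy_test.yaml \
    --substitutions=\
_DATAFLOW_JAR_BUCKET="${DATAFLOW_JAR_BUCKET_TEST}",\
_COMPOSER_INPUT_BUCKET="${INPUT_BUCKET_TEST}",\
_COMPOSER_REF_BUCKET="${REF_BUCKET_TEST}",\
_COMPOSER_DAG_BUCKET="${COMPOSER_DAG_BUCKET}",\
_COMPOSER_ENV_NAME="${COMPOSER_ENV_NAME}",\
_COMPOSER_REGION="${COMPOSER_REGION}",\
_COMPOSER_DAG_NAME_TEST="${COMPOSER_DAG_NAME_TEST}"
```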
This command instructs Cloud Build to run a build with the following steps:
Build and deploy the WordCount self-executing JAR file.
Deploy and set up the data-processing workflow on Cloud Composer.
Run the data-processing workflow in the test environment to trigger the test-processing workflow.
After you submit the build file, verify the build steps.
In the Cloud console, go to the Build History page to view a list of all past and currently running builds.
Click the build that is currently running.
On the Build details page, verify that the build steps match the previously described steps.
On the Build details page, the Status field of the build says Build successful when the build finishes.
In Cloud Shell, verify that the WordCount sample JAR file was copied into the correct bucket:
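For example, assuming the Dataflow JAR bucket name is exported in DATAFLOW_JAR_BUCKET_TEST (an assumption; use the bucket created earlier):

```bash
# List the contents of the Dataflow JAR bucket to confirm that the
# self-executing WordCount JAR file was copied there by the build.
gsutil ls "gs://${DATAFLOW_JAR_BUCKET_TEST}/"
```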
The output lists the WordCount JAR file that the build copied to the bucket.
Get the URL to your Cloud Composer web interface. Make a note of the URL because it's used in the next step.
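A sketch of how to read the Airflow web interface URL from the environment description:

```bash
# Print the Airflow web UI URL for the Cloud Composer environment.
gcloud composer environments describe "${COMPOSER_ENV_NAME}" \
    --location "${COMPOSER_REGION}" \
    --format="get(config.airflowUri)"
```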
Use the URL from the previous step to go to the Cloud Composer UI to verify a successful DAG run. You can also click the Airflow Webserver link from the Composer page. If the DAG Runs column doesn't display any information, wait a few minutes and reload the page.
To verify that the data-processing workflow DAG test_word_count is deployed and is in running mode, hold the pointer over the light-green circle below DAG Runs and verify that it says Running.
To see the running data-processing workflow as a graph, click the light-green circle, and then on the Dag Runs page, click Dag Id: test_word_count.
Reload the Graph View page to update the state of the current DAG run. It usually takes three to five minutes for the workflow to finish. To verify that the DAG run finishes successfully, hold the pointer over each task and verify that the tooltip says State: success. The last task, named do_comparison, is the integration test that verifies the process output against the reference file.
Note: There is a known issue with the do_comparison and publish_test_complete tasks in the test_word_count DAG; you might see the status of either of these tasks as failed.
If the DAG run fails, trigger another DAG run using the following steps:
On the DAGs page, in the test_word_count row, click Trigger Dag.
Click Check my progress to verify the objective.
When the test processing workflow runs successfully, you can promote the current version of the workflow to production. There are several ways to deploy the workflow to production:
The automatic approaches are beyond the scope of this lab. For more information, refer to Release Engineering.
In this lab, you do a manual deployment to production by running the Cloud Build production deployment build. The production deployment build follows these steps:
Variable substitutions define the name of the latest JAR file that is deployed to production, along with the Cloud Storage buckets used by the production processing workflow. To create the Cloud Build pipeline that deploys the production Airflow workflow, complete the following steps:
In Cloud Shell, read the filename of the latest JAR file by printing the Cloud Composer variable for the JAR filename:
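A sketch of how such a variable can be read, assuming the setup stored the JAR filename in a Composer (Airflow) variable named dataflow_jar_file_test; both the variable name and the Airflow 2 CLI syntax shown here are assumptions (Airflow 1 environments use variables -- --get NAME instead):

```bash
# Print the value of an Airflow variable stored in the Composer environment.
# The variable name dataflow_jar_file_test is an assumption for illustration.
gcloud composer environments run "${COMPOSER_ENV_NAME}" \
    --location "${COMPOSER_REGION}" \
    variables get -- dataflow_jar_file_test
```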
Use the build pipeline configuration file, deploy_prod.yaml, to create the pipeline in Cloud Build:
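A sketch of the submit command; the substitutions that deploy_prod.yaml actually requires are defined in that file, so the names shown here are placeholders for illustration:

```bash
# Submit the production deployment pipeline to Cloud Build.
# The substitution names and shell variables below are assumptions.
gcloud builds submit --config=build-pipeline/deploy_prod.yaml \
    --substitutions=\
_COMPOSER_ENV_NAME="${COMPOSER_ENV_NAME}",\
_COMPOSER_REGION="${COMPOSER_REGION}",\
_DATAFLOW_JAR_FILE_LATEST="${DATAFLOW_JAR_FILE_LATEST}"
```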
Get the URL for your Cloud Composer UI:
To verify that the production data-processing workflow DAG is deployed, go to the URL that you retrieved in the previous step and verify that the prod_word_count DAG is in the list of DAGs.
On the DAGs page, in the prod_word_count row, click Trigger Dag.
Reload the page to update the DAG run status. To verify that the production data-processing workflow DAG is deployed and is in running mode, hold the pointer over the light-green circle below DAG Runs and verify that it says Running.
After the run succeeds, hold the pointer over the dark-green circle below the DAG runs column and verify that it says Success.
In Cloud Shell, list the result files in the Cloud Storage bucket:
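For example, assuming the production result bucket name is exported in RESULT_BUCKET_PROD (an assumption for illustration):

```bash
# List the word-count result files written by the production workflow.
gsutil ls "gs://${RESULT_BUCKET_PROD}/"
```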
The output lists the result files produced by the production word-count process.
Click Check my progress to verify the objective.
You set up a Cloud Build trigger that triggers a new build when changes are pushed to the master branch of the source repository.
In Cloud Shell, run the following command to get all the substitution variables needed for the build. Make a note of these values because they are needed in a later step.
In the Cloud console, go to the Build Triggers page.
Click Create trigger.
To configure trigger settings, complete the following steps:
- Name: Trigger build in test environment
- Repository: data-pipeline-source (Cloud Source Repositories)
- Branch: ^master$
- Cloud Build configuration file location: build-pipeline/build_deploy_test.yaml
- In the Advanced field, replace the variables with the values from your environment that you got from the earlier step. Add the following one at a time and click + ADD VARIABLE for each of the name-value pairs:
_DATAFLOW_JAR_BUCKET
_COMPOSER_INPUT_BUCKET
_COMPOSER_REF_BUCKET
_COMPOSER_DAG_BUCKET
_COMPOSER_ENV_NAME
_COMPOSER_REGION
_COMPOSER_DAG_NAME_TEST
For Service account, select xxxxxxx-compute@developer.gserviceaccount.com.
Click Create.
Click Check my progress to verify the objective.
To test the trigger, you add a new word to the test input file and make the corresponding adjustment to the test reference file. You verify that the build pipeline is triggered by a commit push to Cloud Source Repositories and that the data-processing workflow runs correctly with the updated test files.
In Cloud Shell, add a test word at the end of the test file:
Update the test result reference file, ref.txt, to match the changes done in the test input file:
Commit and push changes to Cloud Source Repositories:
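A sketch of the standard Git commands for this step, assuming the remote was named google when you first pushed the code:

```bash
# Stage, commit, and push the updated test files; the push triggers the build.
git add .
git commit -m "Add a new test word and update the reference file"
git push google master
```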
In the Cloud console, go to the Cloud Build History page.
To verify that a new build was triggered by the push to the master branch, check that the Trigger column for the currently running build says Push to master branch.
In Cloud Shell, get the URL for your Cloud Composer web interface:
After the build finishes, go to the URL from the previous command to verify that the test_word_count DAG is running.
Wait until the DAG run finishes, which is indicated when the light-green circle in the DAG Runs column goes away. It usually takes three to five minutes for the process to finish.
Note: There is a known issue with the do_comparison task in the test_word_count DAG; you might see its status as failed. If it fails, trigger another DAG run.
In Cloud Shell, download the test result files:
Verify that the newly added word is in one of the result files:
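A sketch, assuming the test result bucket name is exported in RESULT_BUCKET_TEST and that the word you added was, for example, test-word (both are assumptions; the layout of the result files is defined by the workflow):

```bash
# Download all result files from the test result bucket to a local folder.
mkdir -p ~/result-download
gsutil cp "gs://${RESULT_BUCKET_TEST}/*" ~/result-download/

# Search the downloaded files for the newly added word.
grep -r "test-word" ~/result-download/
```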
The output shows the newly added word together with its count in one of the result files.
Click Check my progress to verify the objective.
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.