In this lab you will learn how to migrate Apache Spark code to Cloud Dataproc. You will follow a sequence of steps progressively moving more of the job components over to Google Cloud services:
Run original Spark code on Cloud Dataproc (Lift and Shift)
Replace HDFS with Cloud Storage (cloud-native)
Automate everything so it runs on job-specific clusters (cloud-optimized)
What you'll learn
In this lab you will learn how to:
Migrate existing Spark jobs to Cloud Dataproc
Modify Spark jobs to use Cloud Storage instead of HDFS
Optimize Spark jobs to run on job-specific clusters
What you'll use
Cloud Dataproc
Apache Spark
Setup and requirements
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
Activate Google Cloud Shell
Google Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5 GB home directory and runs on Google Cloud.
Google Cloud Shell provides command-line access to your Google Cloud resources.
In the Cloud Console, on the top-right toolbar, click the Open Cloud Shell button.
Click Continue.
It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID.
gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.
You can list the active account name with the gcloud auth list command.
Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).
In the Google Cloud console, on the Navigation menu, select IAM & Admin > IAM.
Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud Overview > Dashboard.
Note: If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.
In the Google Cloud console, on the Navigation menu, click Cloud Overview > Dashboard.
Copy the project number (e.g. 729328892908).
On the Navigation menu, select IAM & Admin > IAM.
At the top of the roles table, below View by Principals, click Grant Access.
For New principals, enter {project-number}-compute@developer.gserviceaccount.com, replacing {project-number} with your project number.
For Role, select Project (or Basic) > Editor.
Click Save.
Scenario
You are migrating an existing Spark workload to Cloud Dataproc and then progressively modifying the Spark code to make use of Google Cloud native features and services.
Task 1. Lift and shift
Migrate existing Spark jobs to Cloud Dataproc
You will create a new Cloud Dataproc cluster and then run an imported Jupyter notebook that uses the cluster's default local Hadoop Distributed File System (HDFS) to store source data, and then process that data just as you would on any Hadoop cluster using Spark. This demonstrates that many existing analytics workloads, such as Jupyter notebooks containing Spark code, require no changes when they are migrated to a Cloud Dataproc environment.
Configure and start a Cloud Dataproc cluster
In the Google Cloud console, on the Navigation menu, in the Analytics section, click Dataproc.
In the left menu, click Clusters, and then click Create cluster.
Enter sparktodp for Cluster Name.
Set the Region and Zone to the values assigned for your lab.
In the Versioning section, click Change and select 2.1 (Debian 11, Hadoop 3.3, Spark 3.3).
This version includes Python3, which is required for the sample code used in this lab.
Click Select.
In the Components > Component gateway section, select Enable component gateway.
Under Optional components, select Jupyter Notebook.
In the list on the left side, below Set up cluster, click Configure nodes (optional).
Under Manager node:
First, set Primary disk type to Standard Persistent Disk.
Change Series to E2.
Set Machine Type to e2-standard-2 (2 vCPU, 8 GB memory).
Set Primary disk size to 30 GB.
Under Worker nodes:
First, set Primary disk type to Standard Persistent Disk.
Change Series to E2.
Set Machine Type to e2-standard-2 (2 vCPU, 8 GB memory).
Set Primary disk size to 30 GB.
Click Create.
The cluster should start in a few minutes. Please wait until the Cloud Dataproc Cluster is fully deployed to proceed to the next step.
Clone the source repository for the lab
In the Cloud Shell you clone the Git repository for the lab and copy the required notebook files to the Cloud Storage bucket used by Cloud Dataproc as the home directory for Jupyter notebooks.
To clone the Git repository for the lab, enter the following command in Cloud Shell:
As soon as the cluster has fully started up, you can connect to its web interfaces. Click the refresh button to check, as the cluster may already be fully deployed by the time you reach this stage.
On the Dataproc Clusters page wait for the cluster to finish starting and then click the name of your cluster to open the Cluster details page.
Click Web Interfaces.
Click the Jupyter link to open a new Jupyter tab in your browser.
This opens the Jupyter home page. Here you can see the contents of the /notebooks/jupyter directory in Cloud Storage that now includes the sample Jupyter notebooks used in this lab.
Under the Files tab, click the GCS folder and then click the 01_spark.ipynb notebook to open it.
Click Cell and then Run All to run all of the cells in the notebook.
Page back up to the top of the notebook and follow along as the notebook runs each cell and outputs the results below it.
You can now step down through the cells and examine the code as it is processed so that you can see what the notebook is doing. In particular, pay attention to where the data is saved and processed from.
The first code cell fetches the source data file, which is an extract from the KDD Cup competition held in conjunction with the Knowledge Discovery and Data Mining (KDD) conference in 1999. The data relates to computer intrusion detection events.
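These opening cells (the ones you remove later in Task 2) download the archive and copy it into the cluster's HDFS. A minimal sketch of that pattern is shown below; the host name is a placeholder rather than the actual download URL used by the notebook, and the file name is shown for illustration:

!wget -q https://[training-data-host]/kddcup.data_10_percent.gz
!hadoop fs -put kddcup.data_10_percent.gz /
!hadoop fs -ls /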
SparkSQL can also be used to query the parsed data stored in the Dataframe.
In cell In [7] a temporary table (connections) is registered that is then referenced inside the subsequent SparkSQL query:
df.registerTempTable("connections")
attack_stats = sqlContext.sql("""
    SELECT
        protocol_type,
        CASE label
            WHEN 'normal.' THEN 'no attack'
            ELSE 'attack'
        END AS state,
        COUNT(*) as total_freq,
        ROUND(AVG(src_bytes), 2) as mean_src_bytes,
        ROUND(AVG(dst_bytes), 2) as mean_dst_bytes,
        ROUND(AVG(duration), 2) as mean_duration,
        SUM(num_failed_logins) as total_failed_logins,
        SUM(num_compromised) as total_compromised,
        SUM(num_file_creations) as total_file_creations,
        SUM(su_attempted) as total_root_attempts,
        SUM(num_root) as total_root_acceses
    FROM connections
    GROUP BY protocol_type, state
    ORDER BY 3 DESC
    """)
attack_stats.show()
When the query has finished, the notebook displays the results as a truncated table of attack statistics grouped by protocol and state. You can also display this data visually using bar charts.
The last cell, In [8], uses the %matplotlib inline Jupyter magic function to redirect matplotlib to render a graphic figure inline in the notebook instead of just dumping the data into a variable. This cell displays a bar chart using the attack_stats query from the previous step.
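The plotting cell itself is not reproduced here. As a sketch, assuming the attack_stats DataFrame from the previous cell (the plot options and figure size are illustrative):

%matplotlib inline
ax = attack_stats.toPandas().plot.bar(
    x='protocol_type', subplots=True, figsize=(10, 25))

Converting the aggregated result to a pandas DataFrame with toPandas() is reasonable here because the grouped output is small enough to collect onto the driver.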
Once all cells in the notebook have run successfully, the final cell renders the bar chart output. You can scroll down in your notebook to see the complete set of charts.
Task 2. Separate compute and storage
Modify Spark jobs to use Cloud Storage instead of HDFS
Taking this original 'Lift & Shift' sample notebook, you will now create a copy that decouples the storage requirements for the job from the compute requirements. In this case, all you have to do is replace the Hadoop file system calls with Cloud Storage calls by replacing hdfs:// storage references with gs:// references in the code and adjusting folder names as necessary.
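As a minimal illustration of the kind of change involved (the paths and file name are illustrative; your notebook's actual references may differ slightly):

# Before: data read from the cluster-local HDFS file system
data_file = "hdfs:///kddcup.data_10_percent.gz"

# After: the same data read directly from Cloud Storage
data_file = "gs://[Your-Bucket-Name]/kddcup.data_10_percent.gz"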
You start by using Cloud Shell to place a copy of the source data in a new Cloud Storage bucket.
In the Cloud Shell create a new storage bucket for your source data:
export PROJECT_ID=$(gcloud info --format='value(config.project)')
gcloud storage buckets create gs://$PROJECT_ID
In the Cloud Shell copy the source data into the bucket:
Make sure that the last command completes and the file has been copied to your new storage bucket.
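If you prefer to verify programmatically rather than in the console, a short sketch using the google-cloud-storage Python client (run, for example, with python3 in Cloud Shell) could look like this; it assumes the bucket is named after your project ID, as created above:

from google.cloud import storage

client = storage.Client()
bucket_name = client.project  # the bucket was created with the project ID as its name
for blob in client.list_blobs(bucket_name):
    print(blob.name, blob.size)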
Switch back to the 01_spark Jupyter Notebook tab in your browser.
Click File and then select Make a Copy.
When the copy opens, click the 01_spark-Copy1 title and rename it to De-couple-storage.
Open the Jupyter tab for 01_spark.
Click File and then Save and checkpoint to save the notebook.
Click File and then Close and Halt to shut down the notebook.
If you are prompted to confirm that you want to close the notebook click Leave or Cancel.
Switch back to the De-couple-storage Jupyter Notebook tab in your browser, if necessary.
You no longer need the cells that download and copy the data onto the cluster's internal HDFS file system so you will remove those first.
To delete a cell, you click in the cell to select it and then click the cut selected cells icon (the scissors) on the notebook toolbar.
Delete the initial comment cells and the first three code cells ( In [1], In [2], and In [3]) so that the notebook now starts with the section Reading in Data.
You will now change the code in the first cell (still labeled In [4], unless you have rerun the notebook) that defines the data file source location and reads in the source data. The cell currently reads the source data from the cluster's local HDFS file system.
Replace the contents of cell In [4] with the following code. The only change is to create a variable to store a Cloud Storage bucket name and then to point data_file to the bucket we used to store the source data on Cloud Storage:
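The replacement cell is not reproduced in this copy of the lab. A sketch of the pattern, with the application name and variable names chosen for illustration (you replace [Your-Bucket-Name] in the next steps), is:

from pyspark.sql import SparkSession, SQLContext, Row

gcs_bucket = '[Your-Bucket-Name]'
spark = SparkSession.builder.appName("kdd").getOrCreate()
sc = spark.sparkContext
data_file = "gs://" + gcs_bucket + "/kddcup.data_10_percent.gz"
raw_rdd = sc.textFile(data_file).cache()
raw_rdd.take(5)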
When you have replaced the code, the first cell defines the bucket variable and points data_file at a gs:// path, with your lab project ID as the bucket name.
In the cell you just updated, replace the placeholder [Your-Bucket-Name] with the name of the storage bucket you created in the first step of this section. You created that bucket using the Project ID as the name, which you can copy here from the Qwiklabs lab login information panel on the left of this screen. Replace all of the placeholder text, including the brackets [].
Click Cell and then Run All to run all of the cells in the notebook.
You will see exactly the same output as you did when the file was loaded and run from internal cluster storage. Moving the source data files to Cloud Storage only requires that you repoint your storage source reference from hdfs:// to gs://.
Task 3. Deploy Spark jobs
Optimize Spark jobs to run on job-specific clusters
You now create a standalone Python file that can be deployed as a Cloud Dataproc job and that performs the same functions as this notebook. To do this, you add magic commands to the Python cells in a copy of this notebook to write the cell contents out to a file. You will also add an input parameter handler to set the storage bucket location when the Python script is called, which makes the code more portable.
In the De-couple-storage Jupyter Notebook menu, click File and select Make a Copy.
When the copy opens, click the De-couple-storage-Copy1 title and rename it to PySpark-analysis-file.
Open the Jupyter tab for De-couple-storage.
Click File and then Save and checkpoint to save the notebook.
Click File and then Close and Halt to shut down the notebook.
If you are prompted to confirm that you want to close the notebook click Leave or Cancel.
Switch back to the PySpark-analysis-file Jupyter Notebook tab in your browser, if necessary.
Click the first cell at the top of the notebook.
Click Insert and select Insert Cell Above.
Paste the following library import and parameter handling code into this new first code cell:
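The cell contents are not reproduced in this copy of the lab. A sketch consistent with how the rest of the notebook uses the BUCKET variable, and with the --bucket argument passed later in the Test automation section, is:

%%writefile spark_analysis.py

import matplotlib
matplotlib.use('agg')  # non-interactive backend so plotting works outside Jupyter

import argparse

# Read the storage bucket location passed in when the script is called
parser = argparse.ArgumentParser()
parser.add_argument("--bucket", help="bucket for input and output")
args = parser.parse_args()

BUCKET = args.bucket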
The %%writefile spark_analysis.py Jupyter magic command creates a new output file to contain your standalone Python script. You will add a variation of this to the remaining cells to append the contents of each cell to the standalone script file.
This code also imports the matplotlib module and explicitly sets the default plotting backend via matplotlib.use('agg') so that the plotting code runs outside of a Jupyter notebook.
For the remaining cells, insert %%writefile -a spark_analysis.py at the start of each Python code cell. These are the five cells labelled In [x].
%%writefile -a spark_analysis.py
For example the next cell should now look as follows:
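That example is not reproduced in this copy. As a sketch, the reading-in-data cell gains the append magic at the top while keeping its PySpark code below it; the body shown here is illustrative, and note that in the standalone script the bucket name comes from the BUCKET command-line parameter rather than a hard-coded value:

%%writefile -a spark_analysis.py

from pyspark.sql import SparkSession, SQLContext, Row

spark = SparkSession.builder.appName("kdd").getOrCreate()
sc = spark.sparkContext
data_file = "gs://{}/kddcup.data_10_percent.gz".format(BUCKET)
raw_rdd = sc.textFile(data_file).cache()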
Repeat this step, inserting %%writefile -a spark_analysis.py at the start of each code cell until you reach the end.
In the last cell, where the Pandas bar chart is plotted, remove the %matplotlib inline magic command.
Note: You must remove this inline matplotlib Jupyter magic directive or your script will fail when you run it.
Make sure you have selected the last code cell in the notebook then, in the menu bar, click Insert and select Insert Cell Below.
Paste the following code into the new cell:
%%writefile -a spark_analysis.py
ax[0].get_figure().savefig('report.png');
Add another new cell at the end of the notebook and paste in the following:
%%writefile -a spark_analysis.py
import google.cloud.storage as gcs
bucket = gcs.Client().get_bucket(BUCKET)
for blob in bucket.list_blobs(prefix='sparktodp/'):
    blob.delete()
bucket.blob('sparktodp/report.png').upload_from_filename('report.png')
Add a new cell at the end of the notebook and paste in the following:
%%writefile -a spark_analysis.py
connections_by_protocol.write.format("csv").mode("overwrite").save(
"gs://{}/sparktodp/connections_by_protocol".format(BUCKET))
Test automation
You now test that the PySpark code runs successfully as a file by calling the local copy from inside the notebook, passing in a parameter that identifies the storage bucket you created earlier, which stores the input data for this job. The same bucket will be used to store the report data files produced by the script.
In the PySpark-analysis-file notebook add a new cell at the end of the notebook and paste in the following:
BUCKET_list = !gcloud info --format='value(config.project)'
BUCKET=BUCKET_list[0]
print('Writing to {}'.format(BUCKET))
!/opt/conda/miniconda3/bin/python spark_analysis.py --bucket=$BUCKET
This code assumes that you have followed the earlier instructions and created a Cloud Storage Bucket using your lab Project ID as the Storage Bucket name. If you used a different name modify this code to set the BUCKET variable to the name you used.
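For example, if your bucket has a different name, the first lines of that cell could set it explicitly instead (the value shown is hypothetical):

BUCKET = 'your-bucket-name'  # hypothetical: replace with the bucket you actually created
print('Writing to {}'.format(BUCKET))
!/opt/conda/miniconda3/bin/python spark_analysis.py --bucket=$BUCKET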
Add a new cell at the end of the notebook and paste in the following:
!gcloud storage ls gs://$BUCKET/sparktodp/**
This lists the script output files that have been saved to your Cloud Storage bucket.
To save a copy of the Python file to persistent storage, add a new cell and paste in the following:
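The cell contents are not reproduced in this copy of the lab. A sketch consistent with the sparktodp/ bucket layout used elsewhere in this task is:

!gcloud storage cp spark_analysis.py gs://$BUCKET/sparktodp/spark_analysis.py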
Click Cell and then Run All to run all of the cells in the notebook.
If the notebook successfully creates and runs the Python file you should see output similar to the following for the last two cells. This indicates that the script has run to completion saving the output to the Cloud Storage bucket you created earlier in the lab.
Note: The most likely source of an error at this stage is that you did not remove the matplotlib directive in In [7]. Recheck that you have modified all of the cells as per the instructions above, and have not skipped any steps.
Run the Analysis Job from Cloud Shell
Switch back to your Cloud Shell and copy the Python script, spark_analysis.py, from your Cloud Storage bucket so that you can run it as a Cloud Dataproc job. Then create a launch script called submit_onejob.sh (for example, using the nano editor) with the following contents:
#!/bin/bash
gcloud dataproc jobs submit pyspark \
--cluster sparktodp \
--region [your-lab-region] \
spark_analysis.py \
-- --bucket=$1
Press CTRL+X, then Y, and then Enter to save the file and exit.
Make the script executable:
chmod +x submit_onejob.sh
Launch the PySpark Analysis job:
./submit_onejob.sh $PROJECT_ID
In the Cloud Console tab navigate to the Dataproc > Clusters page if it is not already open.
Click Jobs.
Click the name of the job that is listed. You can monitor progress here as well as from Cloud Shell. Wait for the job to complete successfully.
Navigate to your storage bucket and note that the output report, /sparktodp/report.png, has an updated timestamp indicating that the stand-alone job has completed successfully.
The storage bucket used by this job for input and output data storage is the bucket that uses just the Project ID as its name.
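If you want to confirm this from code rather than the console, a small verification sketch using the google-cloud-storage Python client (assuming, as above, that the bucket is named after your project ID) is:

from google.cloud import storage

client = storage.Client()
blob = client.bucket(client.project).get_blob('sparktodp/report.png')
print(blob.updated)  # timestamp of the most recent upload by the stand-alone job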
Navigate back to the Dataproc > Clusters page.
Select the sparktodp cluster and click Delete. You don't need it any more.
Click CONFIRM.
Close the Jupyter tabs in your browser.
End your lab
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
1 star = Very dissatisfied
2 stars = Dissatisfied
3 stars = Neutral
4 stars = Satisfied
5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.