Troubleshooting Workloads on GKE for Site Reliability Engineers
Site Reliability Engineers (SRE) have a broad set of responsibilities, and managing incidents is a critical part of their role. You will learn how to take advantage of the integrated capabilities of Google Cloud's operations suite that includes logging, monitoring, and rich, out-of-the-box dashboards.
The troubleshooting process is an “iterative” approach in which SREs form a hypothesis about the potential root cause of an incident, then filter, search, and navigate through large volumes of telemetry data collected from their systems to validate or invalidate that hypothesis. If a hypothesis is invalid, SREs form another hypothesis and perform another iteration until they can isolate a root cause. To learn more about SREs, see Google Site Reliability Engineering on the Google website.
In this lab, you will learn how to navigate that iterative journey efficiently and effectively using Google Cloud's operations tools!
What you'll learn
In this lab, you will learn how to:
Navigate resource pages of Google Kubernetes Engine (GKE)
Leverage the GKE dashboard to quickly view operational data
Create logs-based metrics to capture specific issues
Create a Service Level Objective (SLO)
Define an Alert to notify SRE staff of incidents
Setup and requirements
Before you click the Start Lab button
Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.
This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.
To complete this lab, you need:
- Access to a standard internet browser (Chrome browser recommended).
- Time to complete the lab. Remember, once you start, you cannot pause a lab.
How to start your lab and sign in to the Google Cloud Console
Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab Details panel with the following:
- The Open Google Console button
- Time remaining
- The temporary credentials that you must use for this lab
- Other information, if needed, to step through this lab
Click Open Google Console. The lab spins up resources, and then opens another tab that shows the Sign in page.
Tip: Arrange the tabs in separate windows, side-by-side.
Note: If you see the Choose an account dialog, click Use Another Account.
If necessary, copy the Username from the Lab Details panel and paste it into the Sign in dialog. Click Next.
Copy the Password from the Lab Details panel and paste it into the Welcome dialog. Click Next.
Important: You must use the credentials from the left panel. Do not use your Google Cloud Skills Boost credentials. Note: Using your own Google Cloud account for this lab may incur extra charges.
Click through the subsequent pages:
- Accept the terms and conditions.
- Do not add recovery options or two-factor authentication (because this is a temporary account).
- Do not sign up for free trials.
After a few moments, the Cloud Console opens in this tab.
Activate Cloud Shell
Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5 GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources.
Click Activate Cloud Shell at the top of the Google Cloud console.
It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. The output contains a line that declares the PROJECT_ID for this session:
gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.
(Optional) You can list the active account name with this command:
(Optional) You can list the project ID with this command:
For full documentation of gcloud, refer to the gcloud CLI overview guide.
Your organization has deployed a multi-tier microservices application. It is a web-based e-commerce application called "Hipster Shop", where users can browse for vintage items, add them to their cart and purchase them. Hipster Shop is composed of many microservices, written in different languages, that communicate via gRPC and REST APIs. The architecture of the deployment is optimized for learning purposes and includes modern technologies as part of the stack: Kubernetes, Istio, Cloud Operations, App Engine, gRPC, OpenTelemetry, and similar cloud-native technologies.
As a member of the Site Reliability Engineering (SRE) team, you are contacted when end users report issues viewing products and adding them to their cart. You will explore the various services deployed to determine the root cause of the issue and set up a Service Level Objective (SLO) to prevent similar incidents from occurring in the future. Learn more about this in the following blog article: SLOs, SLIs, SLAs, oh my—CRE life lessons.
Task 1. Navigating Google Kubernetes Engine (GKE) resource pages
For the first part of this lab, you will view Google Kubernetes Engine (GKE) resource pages, then navigate to various metrics dashboards to investigate the issue reported by end users in more detail.
In Cloud Console, from the Navigation menu go to Kubernetes Engine > Clusters.
Confirm that you see the following Kubernetes cluster available: cloud-ops-sandbox. Validate that the cluster has a green check mark next to it to indicate it is up and running.
Click on the cloud-ops-sandbox link under the Name column to navigate to the cluster's Details tab.
Click on the Nodes tab to see all the nodes in the cluster. Validate that there is a single node pool.
Under the Node Pools section of the Nodes tab, click on the link for the first node pool in the table under the Name column to view more details about it. Scroll down to the bottom of the page and click on the link for the first node in the table at the bottom of the page.
On the resulting node details page, note the metrics of the node that are available. These should be listed under the Resource summary section and include CPU and Memory Usage as well as others. This lab generates load during provisioning and you should see metrics activity but no obvious spikes or metrics above the "requested" limit for the graphs displayed in the Summary section.
To investigate further, rather than navigating to each individual node to view its metrics, click on the three dots in the top right corner of the CPU tile and select View in Metrics Explorer.
On the Metrics Explorer page, you will see the metrics associated with the node that you just navigated from. There are three filters configured in the Metrics explorer under the Filters section.
Remove the filter for the nodename by expanding the filter and clicking the delete icon.
Under the How do you want to view that data section, set Group by to
Once the filters are set the visualization will update and you will be able to view the same metrics for all of the nodes in the node pool of the
Task 2. Accessing operational data through GKE Dashboards
In this next section, you will explore how to quickly navigate to detailed operational data of various resources deployed to GKE via the GKE Dashboard.
Recall that website users have reported that they cannot view product details or add items to their cart. You can verify this by opening the website.
- Navigate to Navigation menu > Kubernetes Engine > Services & Ingress. Click on the Endpoint (an IP address) for the
- Click on any product displayed on the landing page to reproduce the error reported.
Upon reproducing the error, notice that the stack trace mentions the application "failed to get product recommendations".
Investigate the recommendationservice deployed to GKE:
- Navigate to Cloud Monitoring: in Cloud Console, from the Navigation menu, go to Monitoring > Dashboards.
- When the Dashboards landing page opens, click GKE.
You should see a dashboard view that provides relevant Cluster, Namespace, Workload, Service, Pod and Container related metrics for GKE resources found in the project in an aggregated view.
For this lab's scenario, you will want to view logs and metrics related to the recommendationservice, as end users are seeing errors related to product recommendations when viewing a product's landing page. You will create filters for the cloud-ops-sandbox cluster to investigate possible symptoms and diagnose the issue further.
In the next steps, you add filters to your GKE Dashboard.
Click on the Add Filter button at the top of the GKE Dashboard page.
From the available filters, select Workloads > recommendationservice.
- Click on the Apply button once the correct filter is selected. The Filter section of the GKE Dashboard page should look similar to the following image.
This view allows you to focus your attention on the problematic
Under the Workloads section, click on the recommendationservice to reveal the Deployment details pane. This view presents details on Alerts, SLOs, Events, Metrics and Logs. At this point in the lab, no SLOs are present. You will add an SLO here in the next part of this lab.
Click on the Metrics tab to view metrics related to the recommendationservice. You can change the Metrics drop-down selection to alter the visualization data provided and view different metrics available for this service.
Click on the Logs tab to view logs related to the recommendationservice. You can filter the available logs by using the Severity drop-down, which corresponds to the log level of the entries available. This is useful in an SRE context to find errors recorded in the logs and use the entries to troubleshoot issues.
Set the Severity to Error in order to filter the
- At this point, the error related to the problematic code should be obvious. Look for the phrase invalid literal for int() with base 10: '5.0' in the items in the result set. This error appearing in the filtered recommendationservice logs confirms that the service has a bug in the code.
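The logged message is the standard Python ValueError raised when int() is given text that contains a decimal point. The lab does not show the recommendationservice source, so the following is only a minimal sketch of how such an error can arise; parse_strict and parse_tolerant are hypothetical names, and the tolerant variant is one possible fix, not necessarily the one deployed by the lab.

```python
# Hypothetical helpers for illustration; not taken from the recommendationservice code.

def parse_strict(raw: str) -> int:
    """Convert text to int directly; fails on text such as '5.0'."""
    return int(raw)


def parse_tolerant(raw: str) -> int:
    """Accept whole numbers written with a decimal point, such as '5.0'."""
    value = float(raw)
    if not value.is_integer():
        raise ValueError(f"expected a whole number, got {raw!r}")
    return int(value)


try:
    parse_strict("5.0")
    error_message = None
except ValueError as exc:
    # Same text as the phrase found in the service logs.
    error_message = str(exc)

print(error_message)          # invalid literal for int() with base 10: '5.0'
print(parse_tolerant("5.0"))  # 5
```

Going through float() first is a deliberate choice in this sketch: it accepts '5.0' while still rejecting values such as '5.5' that are not whole numbers.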
You will re-deploy the recommendationservice microservice to ensure that the error is no longer present.
- In Cloud Shell, run the following command:
- Then run:
Navigate to Navigation Menu > Kubernetes Engine > Clusters. Select the three dots to the right of the cloud-ops-sandbox cluster and select the option to Connect.
On the Connect to the cluster modal dialog, click the RUN IN CLOUD SHELL button. Press Enter to run the command once populated in Cloud Shell.
Last, run the restore command to update the service:
To verify the application is working correctly, navigate to Navigation Menu > Kubernetes Engine > Services & Ingress.
Click on the Endpoint for
This will take you to the Hipster Shop website used for this lab exercise. Click on any product to verify that it loads without throwing any errors.
In this section of the lab, you explored the available logs and metrics in the GKE dashboard to diagnose an issue with the application workload deployed by the DevOps team. You were able to pinpoint the exact cause of an issue and remediate it by re-deploying the problematic microservice with a bug fix.
Task 3. Proactive monitoring with logs-based metrics
To ensure that the updated recommendationservice code is working as expected, and to keep similar incidents from recurring unnoticed, you decide to create a logs-based metric to monitor the logs and notify the SRE team when similar incidents occur in the future.
In this section, you will create a logs-based metric specific to the error noticed in the previous sections.
Using logs-based metrics you can define a metric that tracks errors in the logs to proactively respond to similar problems and symptoms before they are noticed by end users.
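Conceptually, a counter-type logs-based metric counts the log entries that match a filter. The sketch below illustrates that idea on in-memory data only. It does not call the Cloud Logging API, and the LogEntry fields, the count_errors helper, and the sample entries are all hypothetical; the lab's actual filter is not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LogEntry:
    """Hypothetical, simplified log record used only for this illustration."""
    severity: str
    container: str
    message: str


def count_errors(entries: Iterable[LogEntry], container: str, phrase: str) -> int:
    """Count ERROR entries from one container whose message contains the phrase."""
    return sum(
        1
        for entry in entries
        if entry.severity == "ERROR"
        and entry.container == container
        and phrase in entry.message
    )


# Made-up sample data, not real lab output.
sample_entries = [
    LogEntry("INFO", "recommendationservice", "request served"),
    LogEntry("ERROR", "recommendationservice",
             "invalid literal for int() with base 10: '5.0'"),
    LogEntry("ERROR", "frontend", "failed to get product recommendations"),
    LogEntry("ERROR", "recommendationservice",
             "invalid literal for int() with base 10: '5.0'"),
]

error_count = count_errors(
    sample_entries, "recommendationservice", "invalid literal for int()"
)
print(error_count)  # 2
```

A rising count for a metric like this is the signal that an alert or SLO can later be built on.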
- From Cloud Console, click on the Navigation Menu > Logging > Logs Explorer.
In the Query results section, click on +Create metric. This will open a new tab to create a logs-based metric.
Enter the following options on the Create logs metric page:
- Metric Type: Counter
- Log metric name: Error_Rate_SLI
- Filter Selection: (Copy and paste the filter below)
- Click Create Metric.
Click Check my progress to verify the objective.
Task 4. Creating an SLO
After creating a logs-based metric that closely describes the user experience, the SRE team will use it to measure user happiness. These metrics are the SLIs and will be used to define a Service Level Objective (SLO) on the recommendationservice. You use an SLO to specify service-level objectives for performance metrics. An SLO is a measurable goal for performance over a period of time. For more guidance on SLO design and the filters you will use below, see Designing and using SLOs on the Google Cloud Anthos website.
Cloud Operations Suite provides service-oriented monitoring, which means that you configure SLIs, SLOs, and Burn Rate Alerts for a
Navigate to Navigation menu > Monitoring > Services. The resulting page will display a list of all services deployed to GKE for the application workload.
Click on the recommendationservice service from the list of available services, which will take you to the Service details page.
Click on + Create SLO towards the top right of the page.
On Step 1 you will be presented with a dialog for creating a new SLI. Set the following parameters:
Choose a metric: Other
Request-based or windows-based: Request Based
On Step 2, Define SLI details, the Performance Metric must be set to the following value: custom.googleapis.com/opencensus/grpc.io/client/roundtrip_latency. This will show the roundtrip latency of requests made by the client to the recommendation service.
Set the Performance metric to Less than -∞ to 100 ms.
Click on Continue.
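A request-based SLI with a latency threshold can be read as the fraction of requests that complete within the threshold. The following is a minimal sketch of that arithmetic, not how Cloud Monitoring computes it internally. The request_based_sli helper and the sample latencies are hypothetical, and treating exactly 100 ms as a bad request is an assumption.

```python
from typing import Sequence


def request_based_sli(latencies_ms: Sequence[float], threshold_ms: float = 100.0) -> float:
    """Return the fraction of requests faster than the threshold.

    Assumption: a request at exactly the threshold counts as bad.
    """
    if not latencies_ms:
        raise ValueError("at least one request is required")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


# Made-up roundtrip latencies in milliseconds.
sample_latencies = [12, 48, 95, 100, 250, 30, 70, 400, 99, 15]

sli = request_based_sli(sample_latencies)
print(f"{sli:.0%} of requests were under 100 ms")  # 70% of requests were under 100 ms
```

With a 99% performance goal, a measurement like this one would be far below target, which is the kind of result the lab's load test produces.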
After configuring the SLI, on Step 3, Set your service-level objective (SLO), you will define the SLO. An SLO includes the Performance goal (the reliability target) and the Compliance period (the measurement window). To learn more, see Choosing an appropriate time window in Google's Site Reliability Workbook. Make the following selections:
Period type: Calendar
Period length: Calendar month
Performance Goal: 99%
Click Create SLO on the last step of the wizard to complete the SLO creation process.
This will bring you back to the Monitoring > Services landing page. You should be able to see an SLO violation under the Current status of the SLO section.
- Click on the entry listed and select the Error budget tab once expanded.
Click Check my progress to verify the objective.
Error budget fraction represents the actual percentage of error budget remaining for the compliance period. In the SLO defined, there is a period of one calendar month and a performance goal of 99% or better.
As denoted by the percentage, the error preventing product pages from loading properly in this fictitious scenario severely degraded the service-level objective defined. This may not be the case in a real world scenario as this lab ran a load test against the Kubernetes cluster hosting the application workload.
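The relationship between the performance goal and the remaining error budget can be sketched with simple arithmetic. This is an illustration under stated assumptions rather than the exact calculation Cloud Monitoring performs; the error_budget_remaining helper and the request counts are hypothetical.

```python
def error_budget_remaining(total_requests: int, bad_requests: int, goal: float) -> float:
    """Return the fraction of the error budget still unspent.

    A goal of 0.99 allows 1% of requests to be bad. A negative result means
    the budget for the compliance period is overspent.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < goal < 1.0:
        raise ValueError("goal must be between 0 and 1, exclusive")
    allowed_bad = (1.0 - goal) * total_requests
    return 1.0 - bad_requests / allowed_bad


# Made-up monthly totals: 10,000 requests with a 99% goal allow 100 bad requests.
healthy = error_budget_remaining(total_requests=10_000, bad_requests=40, goal=0.99)
violated = error_budget_remaining(total_requests=10_000, bad_requests=250, goal=0.99)

print(f"healthy month:  {healthy:.0%} of budget left")
print(f"violated month: {violated:.0%} of budget left")
```

In the second case the service produced two and a half times the allowed number of bad requests, so the remaining budget is negative, which corresponds to a violated SLO.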
Task 5. Define an alert on the SLO
To proactively notify the SRE team of any violations of the SLO, it is a best practice to define an alert that triggers when the SLO is violated. The alert can invoke a notification channel of your choice, including email, SMS, PagerDuty, Slack, a webhook, or a subscription to a Pub/Sub topic.
Navigate to Navigation menu > Monitoring > Services.
Click on the recommendationservice service from the list of services available.
Under the section Current status of 1 SLO, you should see the SLO created in the last task. You may have to expand the browser window listing the SLO to see other options.
Click the CREATE SLO ALERT button present on the SLO. This will allow you to define an Alert policy when the SLO is violated.
On the Create SLO burn rate alert policy modal, you will see the Lookback duration and Burn rate threshold fields on Step 1 of the wizard. The lookback duration specifies how far back from the present time the alerting policy looks when it measures the burn rate. The burn rate threshold specifies how fast the error budget may be consumed, relative to the SLO, before the alert fires: a burn rate of 1 uses up the error budget exactly over the compliance period, so a threshold of 10 fires when the budget is being consumed ten times faster than that.
- Leave the default values:
Lookback duration: 60 minutes
Burn rate threshold: 10
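To make these two settings concrete, burn rate can be sketched as the error rate observed over the lookback window divided by the error rate the SLO allows. The helpers and request counts below are hypothetical, and firing strictly above the threshold is an assumption; the alerting policy itself is evaluated by Cloud Monitoring, not by code like this.

```python
def burn_rate(bad_requests: int, total_requests: int, goal: float) -> float:
    """Return how many times faster than allowed the error budget is being spent."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < goal < 1.0:
        raise ValueError("goal must be between 0 and 1, exclusive")
    observed_error_rate = bad_requests / total_requests
    allowed_error_rate = 1.0 - goal
    return observed_error_rate / allowed_error_rate


def should_alert(rate: float, threshold: float = 10.0) -> bool:
    """Assumption: the alert fires when the burn rate exceeds the threshold."""
    return rate > threshold


# Made-up counts for a 60-minute lookback window against a 99% goal.
incident_rate = burn_rate(bad_requests=150, total_requests=1_200, goal=0.99)
quiet_rate = burn_rate(bad_requests=6, total_requests=1_200, goal=0.99)

print(f"incident: burn rate {incident_rate:.1f}, alert={should_alert(incident_rate)}")
print(f"quiet:    burn rate {quiet_rate:.1f}, alert={should_alert(quiet_rate)}")
```

At a sustained burn rate of 10, the error budget for a 30-day compliance period would be used up in about 3 days, which is why a threshold of this size is treated as worth notifying someone about.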
On Step 2, you can define a notification channel to receive the alert when the violation is observed. For the purposes of this lab, you can optionally supply an email address or SMS channel to receive a notification.
- Click Next.
Step 3 is optional and allows you to supply any information to the end user receiving the notification so that they have immediate context as to what the issue may be and ways to mitigate the problem.
- Click Save.
Click Check my progress to verify the objective.
(Optional) Remove your alerting policy
If you set up an email alert as part of your alerting policy, there is a chance that you will receive a few emails about your resources even after the lab is completed.
To avoid this, remove the alerting policy before you complete your lab.
In this lab, you explored the Cloud Operations suite, which allows Site Reliability Engineers (SRE) to investigate and diagnose issues experienced with deployed workloads. To increase the reliability of workloads, you explored how to navigate resource pages of GKE, view operational data from GKE dashboards, create logs-based metrics to capture specific issues, and respond to incidents proactively by setting service level objectives and alerts that notify the SRE team about issues before they cause outages.
Finish your quest
This self-paced lab is part of the Google Cloud's Operations Suite on GKE, Cloud Architecture, DevOps Essentials quests and the Measure Site Reliability using Cloud Operations Suite skill badge quest. A quest is a series of related labs that form a learning path. Enroll in a quest and get immediate completion credit if you've taken this lab. See other available quests.
Take your next lab
- Make sure to bookmark the Google Cloud Operations Suite documentation.
- Practice and learn Cloud Operations on Google Cloud with the Cloud Operations Sandbox.
Manual Last Updated: July 13, 2022
Lab Last Tested: July 13, 2022
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.