Troubleshooting Workloads on GKE for Site Reliability Engineers

1 hour 30 minutes 1 Credit

GSP902

Note: This lab usually takes less than a minute to launch. However, during periods of high demand, it can take up to 20 minutes to provision the resources used for this lab. Provisioning time will not deduct from your time to take the lab.

Overview

Site Reliability Engineers (SREs) have a broad set of responsibilities, and managing incidents is a critical part of their role. You will learn how to take advantage of the integrated capabilities of Google Cloud's operations suite, which includes logging, monitoring, and rich, out-of-the-box dashboards.

The troubleshooting process is an “iterative” approach where SREs form a hypothesis about the potential root cause of an incident, then filter, search, and navigate through large volumes of telemetry data collected from their systems to validate or invalidate their hypothesis. If a hypothesis is invalid, SREs form another hypothesis and perform another iteration until they can isolate a root cause. To learn more about SRE at Google, see Google Site Reliability Engineering on the Google website.

In this lab, you will learn how to navigate that iterative journey efficiently and effectively using Google Cloud's operations tools!

What you'll learn

In this lab, you will learn how to:

  • Navigate resource pages of Google Kubernetes Engine (GKE)

  • Leverage the GKE dashboard to quickly view operational data

  • Create logs-based metrics to capture specific issues

  • Create a Service Level Objective (SLO)

  • Define an Alert to notify SRE staff of incidents

Setup and requirements

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
Note: Use an Incognito or private browser window to run this lab. This prevents any conflicts between your personal account and the Student account, which might otherwise cause extra charges to be incurred on your personal account.
  • Time to complete the lab---remember, once you start, you cannot pause a lab.
Note: If you already have your own personal Google Cloud account or project, do not use it for this lab to avoid extra charges to your account.

How to start your lab and sign in to the Google Cloud Console

  1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab Details panel with the following:

    • The Open Google Console button
    • Time remaining
    • The temporary credentials that you must use for this lab
    • Other information, if needed, to step through this lab
  2. Click Open Google Console. The lab spins up resources, and then opens another tab that shows the Sign in page.

    Tip: Arrange the tabs in separate windows, side-by-side.

    Note: If you see the Choose an account dialog, click Use Another Account.
  3. If necessary, copy the Username from the Lab Details panel and paste it into the Sign in dialog. Click Next.

  4. Copy the Password from the Lab Details panel and paste it into the Welcome dialog. Click Next.

    Important: You must use the credentials from the left panel. Do not use your Google Cloud Skills Boost credentials. Note: Using your own Google Cloud account for this lab may incur extra charges.
  5. Click through the subsequent pages:

    • Accept the terms and conditions.
    • Do not add recovery options or two-factor authentication (because this is a temporary account).
    • Do not sign up for free trials.

After a few moments, the Cloud Console opens in this tab.

Note: You can view the menu with a list of Google Cloud Products and Services by clicking the Navigation menu at the top-left.

Activate Cloud Shell

Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on the Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources.

  1. Click Activate Cloud Shell at the top of the Google Cloud console.

When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. The output contains a line that declares the PROJECT_ID for this session:

Your Cloud Platform project in this session is set to YOUR_PROJECT_ID

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

  2. (Optional) You can list the active account name with this command:

gcloud auth list
  3. Click Authorize.

  4. Your output should now look like this:

Output:

ACTIVE: *
ACCOUNT: student-01-xxxxxxxxxxxx@qwiklabs.net

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
  5. (Optional) You can list the project ID with this command:

gcloud config list project

Output:

[core]
project = <project_ID>

Example output:

[core]
project = qwiklabs-gcp-44776a13dea667a6

Note: For full documentation of gcloud, refer to the gcloud CLI overview guide in Google Cloud.

Scenario

Your organization has deployed a multi-tier microservices application. It is a web-based e-commerce application called "Hipster Shop", where users can browse for vintage items, add them to their cart and purchase them. Hipster Shop is composed of many microservices, written in different languages, that communicate via gRPC and REST APIs. The architecture of the deployment is optimized for learning purposes and includes modern technologies as part of the stack: Kubernetes, Istio, Cloud Operations, App Engine, gRPC, OpenTelemetry, and similar cloud-native technologies.

As a member of the Site Reliability Engineering (SRE) team, you are contacted when end users report issues viewing products and adding them to their cart. You will explore the various services deployed to determine the root cause of the issue and set up a Service Level Objective (SLO) to prevent similar incidents from occurring in the future. Learn more about this in the following blog article: SLOs, SLIs, SLAs, oh my—CRE life lessons.

Task 1. Navigating Google Kubernetes Engine (GKE) resource pages

For the first part of this lab, you will view Google Kubernetes Engine (GKE) resource pages, then navigate to various metrics dashboards to investigate the issue reported by end users in more detail.

  1. In Cloud Console, from the Navigation menu go to Kubernetes Engine > Clusters.

  2. Confirm that you see the following Kubernetes cluster available: cloud-ops-sandbox. Validate that the cluster has a green checkmark next to it, indicating that it is up and running.

  3. Click on the cloud-ops-sandbox link under the Name column to navigate to the cluster's Details tab.

  4. Click on the Nodes tab to see all the nodes in the cluster. Validate that there is a single node pool.

  5. Under the Node Pools section of the Nodes tab, click the link in the Name column for the first node pool to view more details about the node pool. Then scroll down to the bottom of the page and click the link for the first node in the table at the bottom of the page.

  6. On the resulting node details page, note the metrics of the node that are available. These should be listed under the Resource summary section and include CPU and Memory Usage as well as others. This lab generates load during provisioning and you should see metrics activity but no obvious spikes or metrics above the "requested" limit for the graphs displayed in the Summary section.

  7. To investigate further, rather than navigating to each individual node to view its metrics, click on the three dots in the top right corner of the CPU tile and select View in Metrics Explorer.

On the Metrics Explorer page, you will see the metrics associated with the node that you just navigated from. There are three filters configured in the Metrics explorer under the Filters section.

  8. Remove the filter for the nodename by expanding the filter and clicking the delete icon.

  9. Under the How do you want to view that data section, set Group by to node_name.

Once the filters are set, the visualization updates and you can view the same metrics for all of the nodes in the node pool of the cloud-ops-sandbox cluster.
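
If you prefer a command-line cross-check of per-node usage, the sketch below is a rough equivalent. It assumes the cluster is named cloud-ops-sandbox, as in this lab, and looks up the cluster's location first, since the zone can vary between lab environments:

# Look up the cluster's location, then fetch kubectl credentials.
# The sandbox cluster is zonal in this lab; use --region instead if yours is regional.
CLUSTER_LOCATION=$(gcloud container clusters list --filter="name=cloud-ops-sandbox" --format="value(location)")
gcloud container clusters get-credentials cloud-ops-sandbox --zone "$CLUSTER_LOCATION"

# Show current CPU and memory usage per node (uses the metrics API that GKE provides).
kubectl top nodes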

Note: You may notice two additional metrics displayed as well: limit cores and request cores. Limit cores is the CPU core limit of the container running on the node, while request cores is the number of CPU cores requested by the container running on the node. You can find out more about Kubernetes metrics on the following documentation page: Kubernetes metrics.
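
To see how these requests and limits roll up on a node from the command line, a minimal sketch (assuming kubectl credentials are already configured for the cluster, for example with the commands shown earlier) is:

# List node names, then show the "Allocated resources" section (CPU/memory requests and limits) for the first node.
kubectl get nodes
kubectl describe node "$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')" | grep -A 8 "Allocated resources"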

Task 2. Accessing operational data through GKE Dashboards

In this next section, you will explore how to quickly navigate to detailed operational data of various resources deployed to GKE via the GKE Dashboard.

Recall that website users have reported that they cannot view product details or add items to their cart. You can verify this by opening the website.

  1. Navigate to Navigation menu > Kubernetes Engine > Services & Ingress. Click on the Endpoint (an IP address) for the frontend-external service.
  2. Click on any product displayed on the landing page to reproduce the error reported.

Upon reproducing the error, notice that the stack trace mentions the application "failed to get product recommendations".
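
You can also reproduce the check from Cloud Shell. The sketch below assumes kubectl is pointed at the lab cluster (for example, via the get-credentials command shown earlier) and that frontend-external is the exposed service; the product ID is only an illustrative example, since any product page triggers the same code path:

# Grab the external IP of the storefront and request a product page, printing only the HTTP status code.
EXTERNAL_IP=$(kubectl get service frontend-external -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -s -o /dev/null -w "%{http_code}\n" "http://${EXTERNAL_IP}/product/OLJCESPC7Z"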

Investigate the recommendationservice deployed to GKE:

  1. Navigate to Cloud Monitoring: from the Cloud Console Navigation menu, go to Monitoring > Dashboards.
Note: To avoid scrolling down the left menu to reach the Monitoring section for the rest of this lab, hover over the Monitoring section of the left menu and select the pin icon that appears. This will place the Monitoring section at the top of the left menu for future navigation.
  2. When the Dashboards landing page opens, click GKE.

You should see a dashboard view that provides relevant Cluster, Namespace, Workload, Service, Pod and Container related metrics for GKE resources found in the project in an aggregated view.

For this lab's scenario, you will want to view logs and metrics related to the recommendationservice as end users are seeing errors related to product recommendations when viewing a product's landing page. You will create filters for the cloud-ops-sandbox cluster to investigate possible symptoms and diagnose the issue further.

In the next steps, you add filters to your GKE Dashboard.

  1. Click on the Add Filter button at the top of the GKE Dashboard page.

  2. From the available filters, select Workloads > recommendationservice.

  3. Click on the Apply button once the correct filter is selected.

This view allows you to focus your attention on the problematic recommendationservice microservice.

  4. Under the Workloads section, click on the recommendationservice to reveal the Deployment details pane. This view presents details on Alerts, SLOs, Events, Metrics and Logs. At this point in the lab, no SLOs are present. You will add an SLO here in the next part of this lab.

  5. Click on the Metrics tab to view metrics related to the recommendationservice. You can change the Metrics drop down selection to alter the visualization data provided and view different metrics available for this service.

  6. Click on the Logs tab to view logs related to the recommendationservice. You can filter the available logs by using the Severity drop down to select the log level of the entries. This is useful in an SRE context to find errors recorded in the logs and leverage the entries to troubleshoot issues.

  7. Set the Severity to Error in order to filter the recommendationservice logs.

  8. At this point, the error related to the problematic code should be obvious. Look for the phrase invalid literal for int() with base 10: '5.0' in the items in the result set. This error appearing in the filtered recommendationservice logs confirms that the service has a bug in the code.
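
If you want to pull the same error entries from Cloud Shell instead of the Logs tab, a rough equivalent is the query below (it uses the same resource labels you filtered on in the dashboard, with the label key quoted for the CLI):

gcloud logging read 'resource.type="k8s_container" AND resource.labels.cluster_name="cloud-ops-sandbox" AND labels."k8s-pod/app"="recommendationservice" AND severity>=ERROR' --limit 5 --format "value(textPayload)"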

You will re-deploy the recommendationservice microservice to ensure that the error is no longer present.

Note: For simplicity, you will simulate deploying a new version of the application using kubectl.
  1. In Cloud Shell, run the following command:
git clone --depth 1 --branch csb_1220 https://github.com/GoogleCloudPlatform/cloud-ops-sandbox.git
  2. Then run:
cd cloud-ops-sandbox/sre-recipes
  3. Navigate to Navigation Menu > Kubernetes Engine > Clusters. Select the three dots to the right of the cloud-ops-sandbox cluster and select the option to Connect.

  4. On the Connect to the cluster modal dialog, click the RUN IN CLOUD SHELL button. Press Enter to run the command once populated in Cloud Shell.

  5. Last, run the restore command to update the service:

./sandboxctl sre-recipes restore "recipe3"
  6. To verify the application is working correctly, navigate to Navigation Menu > Kubernetes Engine > Services & Ingress.

  7. Click on the Endpoint for the frontend-external service.

This will take you to the Hipster Shop website used for this lab exercise. Click on any product to verify that it loads without throwing any errors.
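
You can also confirm the rollout from Cloud Shell. A minimal check, assuming the deployment and its pods are labeled recommendationservice as in this lab, is:

# Wait for the updated deployment to finish rolling out, then confirm its pods are Running.
kubectl rollout status deployment/recommendationservice
kubectl get pods -l app=recommendationservice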

In this section of the lab, you explored the available logs and metrics in the GKE dashboard to diagnose an issue with the application workload deployed by the DevOps team. You were able to pinpoint the exact cause of an issue and remediate it by re-deploying the problematic microservice with a bug fix.

Task 3. Proactive monitoring with logs-based metrics

To ensure that the updated recommendationservice code is working as expected, and to catch similar incidents early in the future, you decide to create a logs-based metric that monitors the logs and notifies the SRE team when a similar incident occurs.

In this section, you will create a logs-based metric specific to the error noticed in the previous sections.

Using logs-based metrics, you can define a metric that tracks errors in the logs so that you can respond proactively to similar problems and symptoms before end users notice them.

  1. From Cloud Console, click on the Navigation Menu > Logging > Logs Explorer.
Note: To avoid scrolling down the Navigation menu to reach the Logging section for the rest of this lab, hover over the Logging section and select the pin icon that appears. This will place the Logging section at the top of the Navigation menu for future navigation.
  2. In the Query results section, click +Create metric. This will open a new tab to create a logs-based metric.

  3. Enter the following options on the Create logs metric page:

  • Metric Type: Counter
  • Log metric name: Error_Rate_SLI
  • Filter Selection: (Copy and paste the filter below)
resource.labels.cluster_name="cloud-ops-sandbox" AND resource.labels.namespace_name="default" AND resource.type="k8s_container" AND labels.k8s-pod/app="recommendationservice" AND severity>=ERROR

Note: In the next section, you will leverage a different metric centered on Availability to proactively notify the SRE team of issues. However, note that the custom logs-based metric defined in this section could also be used to generate an alert when its filter condition is met.
  4. Click Create Metric.
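
If you ever need to script this step outside the lab's graded path, a roughly equivalent gcloud command is sketched below; it reuses the filter above, with the label key quoted for the CLI:

gcloud logging metrics create Error_Rate_SLI \
  --description="Errors logged by recommendationservice" \
  --log-filter='resource.labels.cluster_name="cloud-ops-sandbox" AND resource.labels.namespace_name="default" AND resource.type="k8s_container" AND labels."k8s-pod/app"="recommendationservice" AND severity>=ERROR'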

Click Check my progress to verify the objective. Create a log metric

Task 4. Creating an SLO

After creating a logs-based metric that closely describes the user experience, the SRE team will use it to measure user happiness. Metrics like these are Service Level Indicators (SLIs) and will be used to define a Service Level Objective (SLO) on the recommendationservice. An SLO specifies a measurable performance goal for a service over a period of time. For more guidance on SLO design and the filters you will use below, see Designing and using SLOs in the Google Cloud and Anthos documentation.

Cloud Operations Suite provides service-oriented monitoring, which means that you configure SLIs, SLOs, and Burn Rate Alerts for a service.

  1. Navigate to Navigation menu > Monitoring > Services. The resulting page will display a list of all services deployed to GKE for the application workload.

  2. Select the recommendationservice service from the list of available services, which takes you to the Service details page.

  3. If you do not see the recommendationservice in the list of services on the page, click + Define Service towards the top of the page.

  4. In the Service Candidates section, select the recommendationservice. You may need to move to the next page of line items to see the service. Click on Submit once selected.

  5. Now you will be able to define the Service Level Objective of the recommendationservice. If you are on the Define Service input dialog of the Services landing page, you should see a Create SLO button. Click the Create SLO button.

  6. Set the following parameters:

  • Choose a metric: Other

  • Request-based or windows-based: Request Based

  7. Click Continue.

  8. On Step 2, Define SLI details, set the Performance Metric to the following value: custom.googleapis.com/opencensus/grpc.io/client/roundtrip_latency. This metric captures the roundtrip latency of requests made by clients to the recommendation service.

Set the performance threshold to Less than, covering the range -∞ to 100 ms, so that requests completing in under 100 ms count as good service.

  9. Click on Continue.

  10. After configuring the SLI, on Step 3, Set your service-level objective (SLO), you will define the SLO. An SLO includes the Performance goal (the reliability target) and the Compliance period (the measurement window). To learn more, see Choosing an appropriate time window in Google's Site Reliability Workbook. Make the following selections:

  • Period type: Calendar

  • Period length: Calendar month

  • Performance Goal: 99%

  11. Click Continue.

  12. Click Create SLO on the last step of the wizard to complete the SLO creation process.

This will bring you back to the Monitoring > Services landing page. You should be able to see an SLO violation under the Current status of the SLO section.

  13. Click on the entry listed and select the Error budget tab once it expands.

Click Check my progress to verify the objective. Create a Service Level Objective(SLO)

The Error budget fraction represents the actual percentage of error budget remaining for the compliance period. In the SLO defined, there is a period of one calendar month and a performance goal of 99% or better.
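
For example, a 99% goal over a calendar month leaves an error budget of 1% of requests: if the service receives 1,000,000 requests in that month, roughly 10,000 of them can fail before the SLO is violated.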

As denoted by the percentage, the error preventing product pages from loading properly in this fictitious scenario severely degraded the defined service-level objective. The impact may be less dramatic in a real-world scenario; this lab ran a load test against the Kubernetes cluster hosting the application workload, so errors accumulated quickly.

Task 5. Define an alert on the SLO

To proactively notify the SRE team of any violations of the SLO, it is a best practice to define an alert that triggers when the SLO is violated. The alert can invoke a notification channel of your choice, including email, SMS, PagerDuty, Slack, a webhook, or a subscription to a Pub/Sub topic.
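
If you prefer to create a notification channel from the command line, a hedged sketch is shown below; the beta monitoring surface of gcloud and the example address are assumptions, not part of this lab's steps:

# Create an email notification channel that alerting policies can reference.
gcloud beta monitoring channels create \
  --display-name="SRE on-call email" \
  --type=email \
  --channel-labels=email_address=sre-oncall@example.com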

  1. Navigate to Navigation menu > Monitoring > Services.

  2. Click on the recommendationservice service from the list of services available.

  3. Under the section Current status of 1 SLO, you should see the SLO created in the last task. You may have to expand the browser window listing the SLO to see other options.

  4. Click the CREATE SLO ALERT button present on the SLO. This allows you to define an alerting policy that fires when the SLO is violated.

On the Create SLO burn rate alert policy modal, Step 1 of the wizard shows the Lookback duration and Burn rate threshold fields. The lookback duration specifies how far back from the present the alerting policy looks when it evaluates the burn rate. The burn rate threshold specifies how fast the error budget must be consumed, relative to the rate that would exactly exhaust it by the end of the compliance period, before the alert fires.
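
For example, with a one-month compliance period, a burn rate of 1 means the error budget is being consumed at exactly the rate that would use it all up by the end of the month. A burn rate of 10 sustained over the lookback window means the budget is being spent ten times that fast, a pace that would exhaust the entire monthly budget in roughly three days.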

  5. Leave the default values:
  • Lookback duration: 60 minutes

  • Burn rate threshold: 10

  6. Click Next.

  7. On Step 2, you can define a notification channel to receive the alert when the violation is observed. For the purposes of this lab, you can optionally supply an email address or SMS channel to receive a notification.

  8. Click Next.

Step 3 is optional and allows you to supply any information to the end user receiving the notification so that they have immediate context as to what the issue may be and ways to mitigate the problem.

  9. Click Save.

Click Check my progress to verify the objective. Create an Alert on the Service Level Objective (SLO)

(Optional) Remove your alerting policy

If you set up an email alert as part of your alerting policy, there is a chance that you will receive a few emails about your resources even after the lab is completed.

To avoid this, remove the alerting policy before you complete your lab.
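
If you would rather remove it from Cloud Shell than the console, a hedged sketch is shown below; the alpha monitoring policies surface may vary by gcloud version, and POLICY_NAME is a placeholder for the full policy resource name returned by the list command:

# List alerting policies, then delete the one created for the SLO alert.
gcloud alpha monitoring policies list --format="value(name,displayName)"
gcloud alpha monitoring policies delete POLICY_NAME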

Congratulations!

In this lab, you explored the Cloud Operations suite, which allows Site Reliability Engineers (SREs) to investigate and diagnose issues experienced by deployed workloads. To increase the reliability of workloads, you explored how to navigate resource pages of GKE, view operational data from GKE dashboards, create logs-based metrics to capture specific issues, and respond proactively to incidents by setting service level objectives and alerts that notify the SRE team about issues before they cause outages.

Finish your quest

This self-paced lab is part of the Google Cloud's Operations Suite on GKE, Cloud Architecture, DevOps Essentials quests and the Measure Site Reliability using Cloud Operations Suite skill badge quest. A quest is a series of related labs that form a learning path. Enroll in a quest and get immediate completion credit if you've taken this lab. See other available quests.

Take your next lab

Try out Monitoring and Logging for Cloud Functions

Next steps

Google Cloud training and certification

...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.

Manual Last Updated: January 6, 2022

Lab Last Tested: January 6, 2022

Copyright 2023 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.