dgt_sre08 – SRE Introduction

Module Title: dgt_sre08 – SRE Introduction

Description:

Welcome to “SRE Introduction,” an introductory module for anyone eager to explore the field of Site Reliability Engineering (SRE). The module is tailored for both technical and non-technical participants who want to understand the principles, practices, and culture behind keeping modern software systems highly available and reliable.

Module Objectives:

  1. Understand the Role of SRE: Discover what it means to be a Site Reliability Engineer and how this role integrates with other teams like development and operations.

  2. Learn Core SRE Principles: Delve into key concepts such as error budgets, service level objectives (SLOs), and site reliability metrics.

  3. Explore Automation and Tooling: Gain insights into the tools and technologies used by SREs to automate processes and manage complex systems effectively.

  4. Cultivate a Reliability Culture: Understand how to foster a culture of shared responsibility for system uptime and performance within an organization.

  5. Problem-Solving Skills: Develop critical thinking skills necessary for troubleshooting and resolving issues in production environments.

Module Content:

  • Introduction to SRE: Historical context, evolution, and the importance of Site Reliability Engineering.

  • Key Concepts & Metrics: Understanding Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

  • Automation Practices: Tools for monitoring, alerting, and incident management; scripting for automation.

  • Capacity Planning and Incident Management: Strategies to ensure system resilience under load and effective incident response.

  • Culture of Reliability: Building systems with reliability in mind from the outset and promoting a culture that values uptime.
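To give a feel for the "Key Concepts & Metrics" topic, here is a minimal sketch of how an availability SLI and the remaining error budget might be computed. The function names, the request counts, and the 99.9% target are illustrative assumptions, not part of the module material.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the availability SLI: the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(good_requests: int, total_requests: int,
                           slo_target: float) -> float:
    """Return the fraction of the error budget still unspent.

    1.0 means no failures so far, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been violated.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The SLO permits (1 - target) of all requests to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 500 of them failed, SLO of 99.9%.
    sli = availability_sli(999_500, 1_000_000)
    remaining = error_budget_remaining(999_500, 1_000_000, slo_target=0.999)
    print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.1%}")
```

In this hypothetical case the SLO allows about 1,000 failed requests, so 500 failures consume roughly half of the error budget.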

Consideration:

As part of this module, we will reference “Site Reliability Engineering: How Google Runs Production Systems,” often considered the foundational text for understanding SRE. This book provides real-world insights into how one of the world's largest tech companies approaches reliability engineering, offering practical advice and case studies that are valuable to both newcomers and experienced practitioners in the field.

Target Audience:

This module is ideal for software engineers, system administrators, IT professionals, and anyone interested in enhancing their understanding of SRE practices. Prior experience with programming or systems administration may be beneficial but is not required.

Join us on this journey to become proficient in Site Reliability Engineering and learn how to build resilient systems that deliver exceptional user experiences. Enroll today and take the first step towards mastering the art and science of reliability engineering!

Students can push their exercises to the Academy DevOps & SRE Git project. For this module, create a folder named after your username in the following subfolder: https://github.com/Garanti-Del-Talento/gdt_academy