Module Title: dgt_sre08 – High Availability
Description:
Welcome to “dgt_sre08 – High Availability,” a comprehensive module designed for Site Reliability Engineers (SREs) who aim to enhance the resilience and reliability of cluster-based systems. This module delves into the critical aspects of high availability, providing participants with the knowledge and tools necessary to build robust systems that keep delivering service in the face of failures.
Key Learning Objectives:
- Understanding High Availability:
  - Explore what it means for a system to be highly available, focusing on cluster-based architectures.
  - Learn about the metrics and benchmarks used to measure availability and how they apply to real-world scenarios.
- Quorum and Fencing Concepts:
  - Gain a deep understanding of quorum in distributed systems, including its role in ensuring consistency and reliability.
  - Study fencing mechanisms that prevent split-brain scenarios and ensure data integrity during failures.
- Resilience to Faults:
  - Discover strategies for designing systems that are resilient to various types of faults, from hardware malfunctions to network partitions.
  - Learn how to implement failover techniques and redundancy to minimize downtime and maintain service continuity.
- Service Continuity:
  - Explore best practices for maintaining uninterrupted service delivery, even when parts of the system experience failures.
  - Investigate monitoring and alerting systems that help detect issues early and automate recovery processes.
- Hands-On Experience:
  - Engage in practical exercises and case studies to apply theoretical knowledge to real-world scenarios.
  - Use tools and technologies commonly employed by SREs to design, implement, and test high availability solutions.
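As a small taste of the availability metrics and quorum topics listed above, the sketch below turns an availability target into a downtime budget and checks whether one side of a partitioned cluster still holds a majority. It is only an illustration: the function names are hypothetical and not part of the module material, and it assumes a simple strict-majority quorum with one vote per node.

```python
def downtime_budget_minutes(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the minutes of downtime allowed by an availability target over a period."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


def quorum_size(nodes: int) -> int:
    """Return the smallest strict majority of a cluster (one vote per node assumed)."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def has_quorum(reachable: int, nodes: int) -> bool:
    """Return True if a partition that can see `reachable` nodes holds a majority."""
    return reachable >= quorum_size(nodes)


if __name__ == "__main__":
    # "Three nines" over one year leaves roughly 525.6 minutes of downtime.
    print(round(downtime_budget_minutes(99.9), 1))

    # A 5-node cluster split 3/2: only the 3-node side keeps quorum.
    print(has_quorum(3, 5), has_quorum(2, 5))

    # A 4-node cluster split 2/2: neither side has a majority.
    print(has_quorum(2, 4))
```

The even split in the last case is one reason clusters are often sized with an odd number of voting members; fencing then ensures that a side which lost quorum cannot keep writing to shared data.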
By the end of this module, participants will have a solid understanding of the principles and practices that underpin high availability. They will be equipped with the skills needed to design systems that meet or exceed their reliability targets and keep services running when individual components fail. Whether you are new to the field or looking to deepen your expertise, “dgt_sre08 – High Availability” is an essential step towards mastering the art of building fault-tolerant systems.
Who Should Attend:
– Site Reliability Engineers (SREs)
– System Administrators
– DevOps Professionals
– IT Managers and Architects interested in enhancing system reliability
Join us on this journey to master high availability, ensuring your systems are resilient, reliable, and ready for any challenge.
Students can push their exercises to the Academy DevOps & SRE Git project. For this module, create a folder named after your username in the following subfolder: https://github.com/Garanti-Del-Talento/gdt_academy