Overview
Site Reliability Engineering (SRE) refers to the application of software engineering practices to the management of IT infrastructure and operations.
This instructor-led, live training (online or onsite) is aimed at technical professionals who wish to apply software engineering tools and techniques to manage IT systems more efficiently.
By the end of this training, participants will be able to:
- Apply a disciplined software engineering approach to solve IT operations problems.
- Create software to manage systems and automate IT operations tasks.
- Develop systems to increase site reliability and performance.
- Bridge the work of development and operations by applying a software engineering mindset to system administration.
Format of the Course
- Interactive lecture and discussion.
- Lots of exercises and practice.
- Hands-on implementation in a live-lab environment.
Course Customization Options
- To request customized training for this course, please contact us.
Requirements
- A general understanding of IT infrastructure.
- A general idea of the software development process.
- Programming or scripting experience in any language.
Audience
- Developers
- System administrators
- Software architects
- DevOps engineers
- IT managers
Course Outline
Introduction
- How SRE marries traditional IT and software development.
- The need for automation and observability.
- The roles of software engineers vs system administrators.
- Site Reliability Engineers vs DevOps engineers.
Overview of an IT System
- System architecture, on-premise and in the cloud.
Overview of SRE Principles and Practices
- Infrastructure as Code.
- The role of containerization and orchestration (Docker, Kubernetes, etc.).
- Continuous Integration, Continuous Deployment and Continuous Delivery.
- Observability.
Evaluating an IT System
- Taking stock of the team and organizational resources.
- Mapping out the systems and processes.
- Estimating the potential impact of SRE.
- The role of the software engineering team.
- The role of the operations team.
- The role of management.
Maintaining the Reliability of a System
- Describing and measuring the desired reliability of a service.
- Understanding Service Level Objectives (SLOs).
- Understanding Service Level Indicators (SLIs) and Service Level Agreements (SLAs).
- Working with Error Budgets.
- Developing an SLO.
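
As an illustration of how SLIs, SLOs, and error budgets relate, here is a minimal Python sketch. It assumes a request-based availability SLI (good requests divided by total requests) and a 30-day window; the function names and sample figures are hypothetical and are not part of the course material.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime (in minutes) that an availability SLO allows over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_requests: int, total_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_bad = (1.0 - slo_target) * total_requests  # failures the SLO tolerates
    actual_bad = total_requests - good_requests        # failures observed so far
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical figures: a 99.9% SLO with 500 failures in 1,000,000 requests.
    print(f"SLI: {availability_sli(999_500, 1_000_000):.4%}")
    print(f"Budget per 30 days: {error_budget_minutes(0.999):.1f} minutes")
    print(f"Budget remaining: {budget_remaining(0.999, 999_500, 1_000_000):.0%}")
```

With a 99.9% target, the budget works out to roughly 43 minutes of downtime per 30 days. A team that has spent most of its budget might slow down releases, while a team with budget to spare can afford to take more risk.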
Optimizing System Administration
- Setting up a development environment.
- Evaluating SRE tools.
- Prioritizing tasks for automation.
- Writing software.
Deploying “Infrastructure as Code”
- Testing and iterating code.
- Making a system anti-fragile.
- Learning from failure.
Monitoring a System
- Observing system performance.
- SRE tools and techniques.
The Future of SRE
Summary and Conclusion