Overview
The Site Reliability Engineer is responsible for the reliability, performance, and scalability of the organization's infrastructure and applications. The role keeps technology systems running smoothly and efficiently, and ensures that they meet the availability and performance standards required by both internal and external users.
Key responsibilities
- Design and implement automation for various processes to improve efficiency and reliability
- Develop monitoring solutions to ensure the health and performance of systems
- Participate in on-call rotations and handle incident response, troubleshooting, and resolution
- Create and maintain scripts for operational tasks and automation
- Conduct capacity planning and manage the scalability of the systems
- Collaborate with development teams to improve system reliability and performance
- Deploy and maintain cloud services and infrastructure
- Define and implement service level objectives and indicators
- Ensure security best practices are followed in all aspects of infrastructure and services
- Perform system and application performance tuning and capacity forecasting
- Conduct post-incident reviews and implement preventive measures
- Participate in the design and implementation of disaster recovery plans
- Document procedures, configurations, and processes
- Contribute to the continuous improvement of processes and tools
- Stay current with industry trends and best practices
Required qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field
- Proven experience as a Site Reliability Engineer or in a similar role
- Strong understanding of software development, system administration, and networking
- Proficiency in scripting (e.g., Python, Shell, Perl)
- Experience with monitoring and alerting tools (e.g., Nagios, Datadog, Prometheus)
- Expertise in cloud services and infrastructure (e.g., AWS, GCP, Azure)
- Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes)
- Experience with CI/CD pipelines and configuration management tools (e.g., Jenkins, Ansible)
- Solid understanding of TCP/IP, HTTP, DNS, and other network protocols
- Ability to analyze and troubleshoot complex systems and applications
- Experience with incident management and on-call responsibilities
- Familiarity with security best practices and tools
- Excellent communication and collaboration skills
- Certifications such as AWS Certified SysOps Administrator or Google Professional Cloud DevOps Engineer are a plus
- A mindset of continuous learning and self-improvement