Overview:
The Senior Site Reliability Engineer ensures the reliability, scalability, and performance of our systems and services. The role designs and implements tools and automation that improve system reliability, monitoring, and incident response.
Key Responsibilities:
- Develop and maintain infrastructure as code using tools like Terraform and Ansible
- Implement and maintain monitoring, alerting, and reporting systems
- Collaborate with cross-functional teams to improve system reliability and performance
- Perform system capacity planning and demand forecasting
- Automate routine operational tasks and processes
- Participate in incident response and on-call rotation
- Optimize the performance and efficiency of systems and platforms
- Analyze system failures and document their root causes
- Implement and manage CI/CD pipelines
- Conduct periodic performance and security audits
- Lead efforts to improve overall system architecture
- Troubleshoot and resolve complex technical issues
- Collaborate with development teams to improve application deployment processes
- Ensure compliance with security and data protection best practices
Required Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or a related field
- 6+ years of experience in a site reliability engineering or related role
- Strong experience with Linux system administration and troubleshooting
- Proficiency in scripting and programming languages such as Python, Shell, or Go
- Experience with automation and configuration management tools like Puppet, Chef, or Ansible
- Solid understanding of networking concepts and protocols
- Expertise in cloud computing platforms such as AWS, Azure, or GCP
- Proven track record of designing and implementing scalable, reliable, and maintainable systems
- Experience with containerization and orchestration tools like Docker and Kubernetes
- Knowledge of continuous integration and continuous deployment (CI/CD) practices and tools
- Excellent problem-solving and troubleshooting skills
- Strong communication and collaboration abilities
- Relevant certifications such as AWS Certified DevOps Engineer - Professional, Certified Kubernetes Administrator (CKA), or similar
- Ability to work effectively in a fast-paced, dynamic environment
- Experience with incident management and on-call support