Job Description
We are looking for a skilled Senior Site Reliability Engineer (SRE) with deep expertise in Prometheus, Grafana, and Kubernetes to join our remote team. In this role, you will manage and optimize the infrastructure supporting a large-scale hardware monitoring project, ensuring high availability, reliability, and scalability across thousands of hardware servers.
Key Responsibilities
- Monitoring and Observability: Design, implement, and maintain comprehensive monitoring systems using Prometheus and Grafana to track and visualize metrics from thousands of hardware servers.
- Kubernetes Orchestration: Deploy, manage, and optimize applications on Kubernetes clusters, ensuring optimal performance and scalability.
- Automation and Scripting: Develop and implement automation for routine tasks, including alerting, system monitoring, and response mechanisms.
- Incident Management: Troubleshoot, diagnose, and resolve infrastructure incidents, ensuring the uptime and reliability of services.
- Performance Tuning: Optimize system performance, ensuring efficient data storage, querying, and alerting in Prometheus and Grafana environments.
- CI/CD Integration: Collaborate with development teams to integrate monitoring into the CI/CD pipeline and ensure smooth deployments.
- Capacity Planning: Perform capacity analysis and ensure that systems are appropriately scaled to handle increasing load.
- Post-Deployment Support: Support the monitoring solution once it is implemented, including troubleshooting incidents.
Required Skills
- Grafana: Advanced experience in setting up Grafana dashboards for real-time monitoring and alerting.
- Prometheus: Proficient in configuring, tuning, and managing Prometheus for large-scale environments.
- Kubernetes: Strong hands-on experience with managing Kubernetes clusters, deployments, and container orchestration.
- Scripting: Proficiency in scripting languages such as Python or Bash to automate tasks.
- Alerting & Incident Management: Experience setting up advanced alerting and incident management processes.
- Infrastructure as Code (IaC): Experience with tools like Helm.
- CI/CD Pipelines: Knowledge of CI/CD tools and automation frameworks for seamless deployment.
Preferred Skills
- Familiarity with external storage for Prometheus (e.g., Mimir) as a high-scale storage backend.
- Experience with cloud platforms (e.g., AWS, GCP, Azure) for deploying infrastructure.
- Knowledge of microservices architecture and REST APIs.
Qualifications
- 6+ years of hands-on experience as an SRE, DevOps Engineer, or in a similar role managing complex infrastructure systems.
- 2+ years of hands-on experience implementing Grafana dashboards and integrating alerts with various tools.
- Strong understanding of DevOps practices and infrastructure automation.
- Proven experience in large-scale monitoring systems and high-availability environments.
- Excellent troubleshooting, analytical, and problem-solving skills.