Site Reliability Engineer (Grafana)

Bay Area Tek Solutions LLC • Seattle, WA, US • 5m ago

Job Description

We are looking for a skilled Senior Site Reliability Engineer (SRE) with deep expertise in Prometheus, Grafana, and Kubernetes to join our remote team. In this role, you will manage and optimize the infrastructure supporting a large-scale hardware monitoring project, ensuring high availability, reliability, and scalability across thousands of hardware servers.

Key Responsibilities

Monitoring and Observability: Design, implement, and maintain comprehensive monitoring systems using Prometheus and Grafana to track and visualize metrics from thousands of hardware servers.
Kubernetes Orchestration: Deploy, manage, and optimize applications on Kubernetes clusters, ensuring optimal performance and scalability.
Automation and Scripting: Develop and implement automation for routine tasks, including alerting, system monitoring, and response mechanisms.
Incident Management: Troubleshoot, diagnose, and resolve infrastructure incidents, ensuring the uptime and reliability of services.
Performance Tuning: Optimize system performance, ensuring efficient data storage, querying, and alerting in Prometheus and Grafana environments.
CI/CD Integration: Collaborate with development teams to integrate monitoring into the CI/CD pipeline and ensure smooth deployments.
Capacity Planning: Perform capacity analysis and ensure that systems are appropriately scaled to handle increasing load.
Post-Deployment Support: Support the monitoring solution after it is implemented, including troubleshooting incidents.

Required Skills

Grafana: Advanced experience in setting up Grafana dashboards for real-time monitoring and alerting.
Prometheus: Proficient in configuring, tuning, and managing Prometheus for large-scale environments.
Kubernetes: Strong hands-on experience with managing Kubernetes clusters, deployments, and container orchestration.
Scripting: Proficiency in scripting languages such as Python or Bash to automate tasks.
Alerting & Incident Management: Experience setting up advanced alerting and incident management processes.
Infrastructure as Code (IaC): Experience with tools like Helm.
CI/CD Pipelines: Knowledge of CI/CD tools and automation frameworks for seamless deployment.

Preferred Skills

Familiarity with external storage backends for Prometheus (e.g., Mimir) in high-scale environments.
Experience deploying infrastructure on a cloud platform (e.g., AWS, GCP, or Azure).
Knowledge of microservices architecture and REST APIs.

Qualifications

6+ years of hands-on experience as an SRE, DevOps Engineer, or in a similar role managing complex infrastructure systems.
2+ years of hands-on experience implementing Grafana dashboards and integrating alerts with various tools.
Strong understanding of DevOps practices and infrastructure automation.
Proven experience in large-scale monitoring systems and high-availability environments.
Excellent troubleshooting, analytical, and problem-solving skills.