Senior Site Reliability Engineer

Inmar Intelligence • North Carolina, United States • 3w ago

As a Site Reliability Engineer (SRE), you'll play a pivotal role in ensuring the health, reliability, performance, and scalability of our applications. You'll bridge the gap between development and operations, leveraging your technical expertise and problem-solving skills to triage production issues, automate operations, optimize processes, and maintain high availability.

Key Responsibilities:

•Steward of Application Health: Work closely with application developers to design resilient, scalable, and maintainable applications, ensuring they meet operational requirements and minimize downtime.

•Collaboration: Participate in code review; mentor and train peers; advocate DevOps principles to application developers.

•Infrastructure Automation: Develop and maintain automation and tools to streamline deployments, configuration management, and infrastructure provisioning.

•Monitoring and Alerting: Implement robust monitoring systems to proactively identify and address performance bottlenecks, anomalies, and security threats.

•Capacity Planning: Forecast resource needs and optimize infrastructure utilization to ensure high availability and performance.

•Change Management: Collaborate with development teams to ensure smooth deployment of new features and updates, minimizing disruptions.

•Security and Compliance: Adhere to security best practices and implement measures to protect systems from vulnerabilities and threats.

•Guardian of SLA: Actively monitor and maintain the health and performance of applications, ensuring they meet Service Level Agreements. Respond to, triage, and mitigate emergent problems in production.

•On-Call Support: Participate in the on-call rotation to provide timely support and resolution for critical issues.

Required Skills and Experience:

•Strong programming skills in languages such as Python, Ruby, Bash, Rust, or Go

•Experience with cloud platforms (AWS, GCP, Azure) and infrastructure as code tools (CloudFormation, Terraform, Ansible, Chef)

•Deep understanding of containerization technologies (Docker, Kubernetes)

•Proficiency with Linux (Ubuntu)

•Proficiency in monitoring and alerting tools (CloudWatch, Prometheus, Grafana)

•Knowledge of DevOps practices and methodologies (CI/CD, Agile)

•Excellent problem-solving and troubleshooting skills

•Strong communication and collaboration abilities

Preferred Skills and Experience:

•Knowledge of asynchronous processing (Kafka, Celery)