Site Reliability Engineer

Jose Merciline • Jersey City, New Jersey, United States • 3w ago

Responsibilities:

Senior Site Reliability Engineer with a proven track record of delivering software infrastructure while working closely with software engineering teams.

You will develop, maintain and scale our infrastructure for deploying and monitoring our software using the latest tools and methodologies, including agile, CI/CD, and infrastructure as code.

Contribute to the advancement of both software development and cloud infrastructure efforts

Partner with developers to apply best practices that ensure fully working test and production environments, using logging/monitoring tools, alerting/notification tools, and any other tools that help reduce time-to-detect and time-to-mitigate, as well as tools for disaster recovery, high availability, and business continuity

Design, build and maintain CI/CD, testing, and operations infrastructure for our systems

Create documentation, run-books, and operational standards with a focus on automation

Support production systems and respond to operational incidents

Required Skills:

You have 5+ years of relevant experience in Site Reliability Engineering (SRE) / DevOps roles

You have 3+ years of experience with a public cloud provider (AWS, Azure, GCP)

You have previous experience with Kubernetes (K8s) and its related ecosystem of tooling

You have a solid command of Linux systems and the networking stack

You have experience debugging distributed systems (Datadog, APM, monitoring and alerting)

You have experience automating everything you can with infrastructure automation tools (e.g. Terraform)

Automation skills in shell (Bash), Python, and/or other languages

You have experience implementing and mentoring others on best practices, including automated testing, security, continuous integration, and continuous deployment (CI/CD)

Strong knowledge of TCP/IP networking
