Site Reliability Engineer (SRE)

Futurex • Bulverde, TX, US

The Site Reliability Engineer (SRE) will play a crucial role in ensuring the reliability, scalability, and performance of our systems and services. Working closely with cross-functional teams, the SRE will design, implement, and maintain tools and processes to monitor, manage, and automate our infrastructure. The ideal candidate is passionate about building robust and resilient systems, with a strong focus on automation and continuous improvement.

Responsibilities:

System Monitoring and Incident Response:
Design and implement monitoring solutions to detect and mitigate system issues proactively
Respond to alerts and incidents promptly, troubleshoot issues, and implement effective solutions to minimize downtime
Infrastructure Automation:
Develop and maintain automation scripts and tools to streamline deployment, configuration, and scaling of infrastructure components
Implement Infrastructure as Code (IaC) practices to manage and provision infrastructure resources efficiently
Performance Optimization:
Identify performance bottlenecks and inefficiencies in the system and work collaboratively with development teams to optimize performance
Conduct capacity planning and scalability assessments to ensure our systems can handle current and future demands
Reliability Engineering:
Design and implement fault-tolerant and resilient architectures to ensure high availability of services
Conduct post-mortem analysis of incidents to identify root causes and implement preventive measures
Continuous Improvement:
Stay current with industry best practices and emerging technologies related to site reliability and infrastructure automation
Drive initiatives to continuously improve the reliability, scalability, and performance of our systems

Requirements:

Bachelor's degree in Computer Science, Engineering, or related field (or equivalent experience)
Proven experience in a Site Reliability Engineer, DevOps Engineer, or similar role
Proficiency in scripting and automation using languages such as Python, Bash, or PowerShell
Strong understanding of cloud computing platforms (e.g., AWS, Azure, GCP) and container orchestration technologies (e.g., Kubernetes)
Experience with configuration management tools (e.g., Ansible, Puppet, Chef) and version control systems (e.g., Git)
Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack)
Excellent problem-solving skills and the ability to troubleshoot complex issues in a production environment
Strong communication and collaboration skills, with the ability to work effectively in a cross-functional team environment

Benefits:

Health, dental, vision, life, and short/long-term disability insurance
Paid vacation, holidays, and sick leave
Competitive compensation and opportunities for advancement
Retirement plan with employer contribution match
Welcoming, family-style corporate culture uniquely suited to motivated, entrepreneurial individuals who thrive in a fast-paced environment
Named one of San Antonio's "Best Places to Work" for nine consecutive years