The Site Reliability Engineer (SRE) will play a crucial role in ensuring the reliability, scalability, and performance of our systems and services. Working closely with cross-functional teams, the SRE will design, implement, and maintain tools and processes to monitor, manage, and automate our infrastructure. The ideal candidate is passionate about building robust and resilient systems, with a strong focus on automation and continuous improvement.
Responsibilities
- System Monitoring and Incident Response:
  - Design and implement monitoring solutions to detect and mitigate system issues proactively
  - Respond to alerts and incidents promptly, troubleshoot issues, and implement effective solutions to minimize downtime
- Infrastructure Automation:
  - Develop and maintain automation scripts and tools to streamline deployment, configuration, and scaling of infrastructure components
  - Implement Infrastructure as Code (IaC) practices to manage and provision infrastructure resources efficiently
- Performance Optimization:
  - Identify performance bottlenecks and inefficiencies in the system and work collaboratively with development teams to optimize performance
  - Conduct capacity planning and scalability assessments to ensure our systems can handle current and future demands
- Reliability Engineering:
  - Design and implement fault-tolerant and resilient architectures to ensure high availability of services
  - Conduct post-mortem analysis of incidents to identify root causes and implement preventive measures
- Continuous Improvement:
  - Stay current with industry best practices and emerging technologies related to site reliability and infrastructure automation
  - Drive initiatives to continuously improve the reliability, scalability, and performance of our systems
Requirements
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience)
- Proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role
- Proficiency in scripting and automation using languages such as Python, Bash, or PowerShell
- Strong understanding of cloud computing platforms (e.g., AWS, Azure, GCP) and container orchestration technologies (e.g., Kubernetes)
- Experience with configuration management tools (e.g., Ansible, Puppet, Chef) and version control systems (e.g., Git)
- Familiarity with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack)
- Excellent problem-solving skills and the ability to troubleshoot complex issues in a production environment
- Strong communication and collaboration skills, with the ability to work effectively in a cross-functional team environment
Benefits
- Health, dental, vision, life, and short/long-term disability insurance
- Paid vacation, holidays, and sick leave
- Competitive compensation and opportunities for advancement
- Retirement plan with employer contribution match
- Welcoming, family-style corporate culture suited to entrepreneurial, motivated individuals who thrive in a fast-paced environment
- One of San Antonio's "Best Places to Work" for nine consecutive years