Responsibilities:
• Perform deep dives into both systemic and latent reliability issues; partner with software and systems engineers across the organization to produce and roll out fixes.
• Troubleshoot issues across the entire stack. Solve problems relating to mission-critical services and build automation to prevent recurrence, with the goal of automating the response to all non-exceptional service conditions.
• Identify and drive opportunities to improve automation.
• Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
• Participate in periodic on-call duties.
• Represent the SRE team in design reviews and operational readiness exercises for new and existing services.
Minimum qualifications:
• BS degree in Computer Science or related technical field, or equivalent practical experience.
• 5+ years of experience managing services in an internet-scale *nix environment.
• Practical knowledge of various aspects of service design, including messaging protocols and their behavior, caching strategies, and software design practices.
• Experience with one or more of Java, Tomcat, Elasticsearch, or MySQL, or scripting experience in shell and Python.
• Experience working with Unix/Linux systems from kernel to shell and beyond, with experience working with system libraries, file systems, and client-server protocols.
• Strong hands-on experience with configuration management tools such as Ansible, Puppet, or Chef.
• Knowledge of network theory and protocols (e.g., TCP/IP, UDP, ICMP), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
• Must work well with, and be able to influence, a wide range of personalities at all levels.
• Ability to prioritize tasks and work independently.
• Must be adaptable and able to focus on the simplest, most efficient, and most reliable solutions.
• Track record of successful practical problem solving, with excellent written communication, interpersonal, and documentation skills.
Desired qualifications:
• Expertise in designing, analyzing and troubleshooting large-scale distributed systems.
• In-depth knowledge of operating systems (processes, threads, concurrency issues, locks, mutexes, semaphores, monitors and how they work).
• Familiarity with algorithms, data structures and complexity analysis.
• Hands-on experience with Java and Apache optimization, performance tuning, and configuration.
• Systematic problem-solving approach, coupled with a strong sense of ownership and drive.