Description
Solve complex problems related to infrastructure cloud services and build automation to prevent problem recurrence. Design, write, and deploy software to improve the availability, scalability, and efficiency of Oracle products and services. Design and develop architectures, standards, and methods for large-scale distributed systems. Facilitate service capacity planning and demand forecasting, software performance analysis, and system tuning.
Responsibilities
About Oracle SaaS Cloud SRE
Oracle SaaS Cloud SRE plays a critical role in delivering and supporting best-of-breed cloud solutions to Oracle customers.
Oracle Cloud is the industry's broadest and most integrated public cloud. It offers best-in-class services across software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS), and even lets you put Oracle Cloud in your own data center. Oracle Cloud helps organizations drive innovation and business transformation by increasing business agility, lowering costs, and reducing IT complexity.
The Oracle Cloud has shown strong adoption, supporting 70+ million users and 30+ billion transactions each day. It runs in 19 data centers around the world.
Our team delivers cross-team visibility and execution on the most challenging reliability issues impacting Oracle's SaaS customers. We engage closely with service owners and stakeholders to deeply understand and address critical issues that impair service experience.
About the Job
This is a unique opportunity to join a rapidly growing world-class team and improve the cutting-edge technologies and infrastructure that make up the Oracle Cloud solutions. As part of the SRE team, you will be continually challenged and will have an opportunity to contribute to Oracle Cloud's success every day, working closely with our development partners.
As a Site Reliability Engineer, you will solve exciting technical challenges by analyzing, troubleshooting, and designing vital Oracle Cloud services, platforms, and infrastructure while always thinking about reliability, scalability, resilience, security, and performance.
What You'll Do
- Service Accountability – You will be part of the SRE team, whose mission, shared with our Development partners, is the full-stack reliability of a collection of services and technology areas.
- Ownership Scope – As an SRE, you will understand the end-to-end configuration, technical dependencies, and overall behavioral characteristics of the production services you collaborate with. In partnership with your Development colleagues, you will have the responsibility to ensure that services are designed and delivered to be mission critical with a focus on security, resiliency, scale, and performance.
- Operations Engineering – You will understand and be able to communicate the scale, capacity, security, performance attributes, and requirements of the services you own. We are subject matter experts, able to understand and communicate every characteristic of our service stack, such as:
  - degradation and behavior under load of the services and their dependencies
  - end-to-end tuning needs, optimizing resource utilization as load patterns fluctuate
  - instrumentation and metrics that clearly describe the service behaviors
  - scaling requirements and patterns
  - resiliency and recoverability, ensuring that backup/restore and disaster recovery capabilities are implemented, tested, and maintained
- Automation – You will have a clear understanding of automation and orchestration principles, and will be eager to help automate, wherever and whenever the possibility arises, while simultaneously eliminating technical debt. Automation must be part of your DNA.
- Technical Experts – You will have the deep understanding of service topologies and their dependencies that is required to troubleshoot issues and define mitigations. You will bring this expertise to bear in driving reliability improvements in the services you engage with.
- Database knowledge – Databases are foundational to the Oracle SaaS Cloud services, so you will bring a deep understanding of troubleshooting and tuning for Oracle RDBMS systems.
- Broad Interests – SREs are a rare mix of sysadmins and Development Engineers, and as such, can understand and explain how product architecture decisions affect a service's ability to run as a distributed system. They are driven by professional curiosity and a desire to develop a deep understanding of their services and their dependencies.
- Cross-team collaboration – You will engage with and present to a wide variety of audiences, ranging from individual contributors and teams to executive leadership.
What You Need to Have
A BS or MS in Computer Science, or equivalent
Knowledge of:
- Database Architecture and Internals
- Server hardware configuration
- Linux internals
- Standard Internet services, such as DNS, HTTP, etc.
- Database performance metrics, with the fluency to interpret performance reports
- Oracle FMW database administration
- Exadata architecture, design, and best practices
- Cloud computing patterns
- IT Security and compliance
- 5+ years of experience running large-scale databases
- Most importantly, the aptitude to be a good team player and the willingness to learn and implement new Cloud technologies as needed
- Methodical approach to troubleshooting complex problems
What the Perfect Candidate Will Have
Understanding of:
- Oracle DBA
- Oracle SOA and BPEL
- Oracle Fusion Middleware
- Oracle Enterprise Manager
- Defining and documenting technical architecture of complex and highly scalable products