Site Reliability Engineer
About this position:
Bluetent’s DevOps team is growing! The Site Reliability Engineer will work with the CTO and Cloud Systems Engineer to ensure that our global service platform is always ready to meet growing business needs and opportunities. This role combines systems engineering and software skills to build and run applications on a cutting-edge, cloud-native infrastructure using Kubernetes, Docker and more.
The Site Reliability Engineer will help improve and maintain our platform's SLOs, support the development team with CI/CD tools, and run the production environment by monitoring the availability and health of workloads.
In this role, you will:
- Support the development, testing, deployment, monitoring and maintenance of Bluetent’s large-scale, distributed, fault-tolerant platform software, marketing and eCommerce services.
- Participate in the automation, monitoring and maintenance of Bluetent’s cloud-native and multi-cloud Kubernetes clusters.
- Develop tools and automated solutions in support of platform services.
- Monitor and manage internal DevOps cases from intake to resolution in support of client services, software implementation, technical support and product/platform engineering.
- Troubleshoot performance, reliability, and scalability issues.
- Collaborate with application developers in the improvement of developer experience and toolsets.
- Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation and refinement.
- Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
- Practice sustainable incident response and blameless postmortems.
- Maintain and administer data stores ensuring proper backup, replication and failover strategies.
- Ensure proper security, monitoring, alerting and reporting for production infrastructure.
- Take broad, conceptual ideas and turn them into functional architecture and software designs that solve customers' use cases.
- Troubleshoot and resolve issues related to application development, deployment and operations.
The skills/qualifications we are looking for are:
Minimum qualifications:
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
- Two years of full-time professional experience, or equivalent, with at least a few of the following:
  - Linux/Unix server administration
  - Cloud systems administration (AWS, Google Cloud, Azure, etc.)
  - CI/CD automation (Jenkins, Jenkins X, CircleCI, etc.)
  - Kubernetes cluster administration in AWS or Google Cloud (kubectl, Helm, Minikube, etc.)
  - Application containers (Docker, Docker for Mac)
  - Scripting/programming languages (PHP, Bash, Go, Node.js, Perl, Python, etc.)
  - Databases, NoSQL, queues, pub/sub and cache (MySQL, MariaDB, AWS Aurora DB, Google Cloud SQL, Scylla, Apache Solr, ElastiCache, Redis, Kafka, Amazon SQS, etc.)
  - Websites, web services, microservices (HTTP(S), Drupal, WordPress, Nginx, Apache HTTPD, Kong, SOAP, OAS/Swagger, JSON, CSS, JS, CDN, gRPC, Protocol Buffers, AWS Lambda, Serverless, Let's Encrypt, etc.)
  - Application monitoring and profiling (CloudWatch, Stackdriver, Graylog, Grafana, Prometheus, New Relic, etc.)
Other qualifications:
- BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics).
- Ability to debug, profile and optimize code and application design, and to automate routine tasks.
- Experience creating and maintaining Jenkins pipelines.
- Experience with algorithms, data structures, complexity analysis and software design.
- Excellent verbal and written communication skills.