Site Reliability Engineer

Graylog, Inc. • Remote (Colorado, United States, US) • 4d ago

Graylog: Empowering Threat Detection, Investigation, & Response Solutions with Cutting-Edge Technology

Graylog specializes in delivering top-notch Threat Detection, Investigation, & Response (TDIR) solutions, backed by our latest addition, the Graylog API security platform. As a renowned centralized log management (CLM) and Security Information and Event Management (SIEM) provider, we offer unparalleled speed and efficiency in log analysis across critical areas such as security, compliance, operations, and DevOps.

Our enterprise solution enables organizations globally to capture, store, and analyze terabytes of machine data in near real time, while our open-source product has been deployed in more than 50,000 installations worldwide, empowering individuals and small teams to perform basic log consolidation, analysis, and search functions at no cost.

We're a remote-friendly company with locations in Hamburg, Munich, London, and Boulder, and headquarters in Houston, TX. If you live near an office and want to be part of said office, great. Nearish to an office and want the ability to hot desk? No problem. And if you're not near an office and wish to work remotely, all good!

Recent achievements for Graylog include a place in the 2021 Deloitte Technology Fast 500™; two of the most prestigious cybersecurity awards, in SIEM and DevSecOps, from Cyber Defence Magazine at RSA in 2023; and, in 2024, gold as the Globee Winner for Security Information & Event Management and as the Globee Winner for Threat Hunting, Detection, Intelligence, and Response.

Graylog has recently been named a “Leader” and “Fast Mover” in GigaOM’s 2024 Radar Report for SIEM.

Who we’re looking for:

We’re currently recruiting for a Site Reliability Engineer to join our multinational cloud services team.

As a Site Reliability Engineer here at Graylog, you will provide architectural guidance and technical solutions for adapting our product to a cloud offering with 24x7 support, with a focus on delivering a product that is highly available, resilient, secure, scalable, and cost-efficient, and that consistently delivers valuable product outcomes to consumers.

Our Site Reliability Engineers work with state-of-the-art technologies, and we ensure you have the right tools to make a significant impact in managing our systems, drive their continuous improvement, and shape the future of our cloud strategy.

We believe that the best ideas can come from anywhere, and we value your input and initiative. Here, you will not just be a guardian of our infrastructure; you’ll be an innovator, a problem-solver, and a leader.

This role is a full-time permanent position based in North America and will report to our Engineering Manager, Site Reliability.

Additional responsibilities will include, but are not limited to:

Cloud Infrastructure Management: Writing pull requests (PRs) that improve and optimize our AWS+Terraform+Kubernetes setup, centred on ensuring its high availability, scalability, and resilience
Security & Compliance: Implementing security measures, auditing the cloud environment, and ensuring adherence to compliance standards
Tool Development: Expanding our internal tool base, focusing on Infrastructure as Code and configuration management improvements
Issue Resolution: Collaborating with teams to identify and resolve infrastructure-related issues swiftly, minimizing any impact on product performance
Cloud Strategy Advocacy: Championing cloud strategies that align with and advance our business objectives, especially during pitch cycles and other planning meetings
Knowledge Sharing: Connecting with Cloud Engineers, Site Reliability Engineers, and application engineers, documenting key decisions where possible and making sure critical knowledge isn't siloed in a single spot in the organization

What you can expect your first 12 months to look like:

Infrastructure Knowledge: Within six months, acquire expert understanding of, and submit an approved peer-reviewed pull request (APRPR) for, each of the following technologies: Terraform, Flux, Kustomize, and Argo
Stability Improvements: In the first six to nine months, deliver a proof of concept (POC) for a technology improvement centred on improving or maintaining uptime, reducing reliance on single points of failure, or reducing the Time to Recovery after an incident
Signal and Metrics Improvement: Within six months, contribute to at least one cycle of signal and metrics improvement and show that the overall number of alerts decreased in the following cycle and/or that a requested metric or set of metrics has been made available for use
Security and Compliance: In the first 12 months, contribute to at least one of the following: AWS Product and Architecture Review, SOC 2 compliance review, Disaster Recovery (DR) plan review and drill, Security Penetration Test (Pen Test) review and remediation

A little bit about you:

Cloud Infrastructure Management: Proficiency in managing cloud infrastructures, especially AWS, along with associated tools like Terraform and Kubernetes, ensuring high availability, scalability, and resilience
Experience with Infrastructure as Code (IaC): Hands-on experience with IaC tools and techniques, including configuration management and cloud provisioning
Software Development: Basic programming skills in at least one language, such as Python, for tool development and automation tasks
Security Best Practices: Knowledge of security protocols and compliance requirements specific to cloud environments, with experience in implementing security measures
Troubleshooting & Issue Resolution: Experience in diagnosing and resolving infrastructure-related issues, working closely with development and support teams
Monitoring and Metrics: Familiarity with cloud monitoring tools and performance metrics to continuously evaluate and improve the infrastructure
CI/CD Practices: Understanding of continuous integration and continuous deployment practices for efficient and reliable product releases
Documentation & Communication: Ability to document technical processes clearly and effectively communicate architectural decisions and changes to various stakeholders

Just some of the reasons to join Graylog:

Management team with deep programming, technical, and product experience
Opportunity to work with a globally distributed and diverse team
Grow and develop professionally and personally in a fast-growing environment
Choice of the latest equipment to help you succeed
Monthly allowance to help cover your commute costs and outfit your work-from-home environment

Here at Graylog, you'll find a diverse group of experienced professionals who love to have fun while meeting the needs of our customers with the best solution and customer service available.

Our values:

Openness - As a global company, we encourage our people to bring their backgrounds, ideas, and perspectives to our collective work. We lead with integrity and are committed to doing what is best for the Graylog community.

Collaboration - Through mutual respect, trust, and candid communication across all teams, we deliver the best ideas and results.

Useful Innovation - We take calculated risks to find new ways to innovate. By continuously improving ourselves, processes, and technologies, we deliver the best solution for our customers.

Ownership - As owners, we take the initiative to solve internal and external problems while supporting peer success and holding ourselves accountable for delivering the best work. We do this from a place of high trust.

Do the Right Thing! - Comfort and safety come from knowing that everyone will do the right thing, even when nobody's looking.

For further information, please submit an application, and a member of the Graylog People Team will be in touch.