Data Platform Reliability: Maintain the health, performance, and uptime of data infrastructure (including data lakes, warehouses, and pipelines) through proactive monitoring and automated systems.
Monitor and Optimize Data Pipelines: Build observability into data pipelines to identify and resolve bottlenecks or failure points in real time. Use A/B testing and canary releases to test pipeline changes safely in production (a brief observability sketch appears at the end of this section).
Automate Incident Response: Develop automated incident response systems and playbooks that detect anomalies, resolve common issues, and reduce Mean Time to Recovery (MTTR); a detect-and-remediate example is sketched below.
Scalable Data Infrastructure: Partner with data engineers to ensure systems scale efficiently, managing storage and compute resources within cloud environments (AWS, Azure, GCP).
DataOps Implementation: Work within an Agile framework to apply DataOps methodologies, driving continuous integration, delivery, and deployment for data systems.
SLA/SLI/SLO Management: Establish and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for data platforms. Monitor Service Level Agreements (SLAs) and ensure data systems meet defined expectations for uptime and performance; an SLI/error-budget example is included at the end of this section.
Reliability Automation: Build and manage tools for automated testing, health checks, and disaster recovery processes to ensure reliable data ingestion, transformation, and storage; a sample data-freshness check is shown below.
Collaboration: Work closely with data engineering, analytics, and business teams to ensure data systems meet organizational needs while maintaining reliability and agility.
Cost and Performance Optimization: Identify opportunities to optimize costs for storage and compute resources while maintaining or improving performance.
Incident and Problem Management: Lead incident resolution for data outages or pipeline failures, conduct post-mortems, and implement preventative measures to reduce future incidents.
Continuous Improvement: Identify and implement ongoing improvements for data reliability, observability, and operational excellence within the platform.
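The sketches below are illustrative only: minimal Python examples of the kinds of automation described above, written against generic placeholders rather than any specific stack or vendor API.

First, step-level pipeline observability: a decorator that records the duration and outcome of each pipeline step so bottlenecks and failure points surface as metrics. The emit_metric function and the transform_orders step are hypothetical stand-ins for a real metrics backend and a real pipeline step.

import functools
import time
from typing import Any, Callable

def emit_metric(name: str, value: float, tags: dict[str, str]) -> None:
    # Placeholder: in practice this would publish to the metrics backend in use
    # (e.g. Prometheus, StatsD, CloudWatch).
    print(f"{name}={value:.3f} tags={tags}")

def observed_step(step_name: str) -> Callable:
    """Wrap a pipeline step so its duration and success/failure are always recorded."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.monotonic()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                emit_metric("pipeline.step.duration_s", time.monotonic() - start,
                            {"step": step_name, "status": status})
        return wrapper
    return decorator

@observed_step("transform_orders")
def transform_orders(rows: list[dict]) -> list[dict]:
    # Hypothetical transformation step, used only to demonstrate the decorator.
    return [{**r, "amount_usd": r["amount_cents"] / 100} for r in rows]

if __name__ == "__main__":
    transform_orders([{"order_id": 1, "amount_cents": 1250}])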
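Next, automated incident response: flag an anomalous pipeline lag and run the playbook step for it, escalating to a human only if the automated fix fails. The z-score threshold and the restart_consumer and page_on_call hooks are assumptions for the sketch, not a specific tool's interface.

import statistics
from typing import Callable

def detect_lag_anomaly(recent_lag_s: list[float], current_lag_s: float, z_threshold: float = 3.0) -> bool:
    """Flag the current lag if it sits well above the recent mean (simple z-score check)."""
    if len(recent_lag_s) < 2:
        return False
    mean = statistics.mean(recent_lag_s)
    stdev = statistics.stdev(recent_lag_s)
    if stdev == 0:
        return current_lag_s > mean
    return (current_lag_s - mean) / stdev > z_threshold

def run_playbook(anomaly: bool,
                 restart_consumer: Callable[[], None],
                 page_on_call: Callable[[str], None]) -> None:
    """Try the common automated remediation first; page a human only if it fails."""
    if not anomaly:
        return
    try:
        restart_consumer()
    except Exception as exc:
        page_on_call(f"Pipeline lag anomaly detected and automated restart failed: {exc}")

if __name__ == "__main__":
    recent = [30.0, 32.0, 29.0, 31.0, 30.5]
    anomaly = detect_lag_anomaly(recent, current_lag_s=240.0)
    run_playbook(anomaly,
                 restart_consumer=lambda: print("restarting lagging consumer..."),
                 page_on_call=lambda msg: print("PAGE:", msg))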
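For SLA/SLI/SLO management, a minimal example of turning run counts into an availability SLI and an error-budget figure; the 99.9% target and the run counts are made up for illustration.

from dataclasses import dataclass

@dataclass
class Slo:
    name: str
    target: float  # e.g. 0.999 means 99.9% of pipeline runs must succeed

def availability_sli(successful_runs: int, total_runs: int) -> float:
    """SLI: fraction of pipeline runs in the window that completed successfully."""
    return successful_runs / total_runs if total_runs else 1.0

def error_budget_remaining(slo: Slo, successful_runs: int, total_runs: int) -> float:
    """Fraction of the error budget left in the window; negative means the SLO is breached."""
    allowed_failures = (1.0 - slo.target) * total_runs
    actual_failures = total_runs - successful_runs
    if allowed_failures == 0:
        return 1.0 if actual_failures == 0 else -1.0
    return 1.0 - actual_failures / allowed_failures

if __name__ == "__main__":
    slo = Slo(name="daily-ingestion-availability", target=0.999)
    print(availability_sli(997, 1000))             # 0.997, below the 0.999 target
    print(error_budget_remaining(slo, 997, 1000))  # -2.0, budget fully spent and overdrawn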
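Finally, a basic data-freshness health check of the kind used for reliability automation, assuming each table exposes a last-loaded timestamp; the table names and thresholds are illustrative only.

from datetime import datetime, timedelta, timezone

# Illustrative freshness thresholds per table.
FRESHNESS_THRESHOLDS = {
    "orders": timedelta(hours=1),          # hourly ingestion
    "daily_revenue": timedelta(hours=26),  # daily batch with a small buffer
}

def is_stale(table: str, last_loaded_at: datetime, now: datetime | None = None) -> bool:
    """Return True if the table's newest load is older than its freshness threshold."""
    now = now or datetime.now(timezone.utc)
    return (now - last_loaded_at) > FRESHNESS_THRESHOLDS[table]

if __name__ == "__main__":
    last_load = datetime.now(timezone.utc) - timedelta(hours=3)
    print(is_stale("orders", last_load))   # True: the hourly orders feed has fallen behind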