Overview
Are you passionate about enabling large-scale compute efforts and improving the software infrastructure that supports seamless research experiences? As a Software Engineer, Infrastructure, you will play a critical role in scaling our systems and creating robust tools for our team. You will build and manage tools, debug distributed systems, and design improvements to the systems that manage secrets, configurations, and stateful components. Your work will keep our infrastructure resilient and reliable, allowing other engineers to work more effectively.
Responsibilities
- Tool Development: Build and manage wrapper tools that allow code written for single hosts to scale seamlessly to large GPU clusters
- Debugging: Debug distributed exceptions and improve our logging and tracing stack to enhance system reliability
- System Improvements: Design and implement improvements to systems managing secrets, configurations, ongoing jobs, and other stateful components
- Prototyping: Search out and prototype additions to our software stack that address pain points in typical workflows
- Code Enhancement: Dive into open-source or third-party code, including C/C++ libraries, to add debugging information or enhance performance
- Collaborative Design: Work collaboratively with team members to debug, provide guidance, and design resilient software solutions
- Abstraction Creation: Develop good abstractions that let other engineers work at a higher level and integrate cleanly with the rest of the stack
Example Projects
- Develop tools to scale single-host code to large GPU clusters
- Enhance the logging and tracing stack for better debugging of distributed systems
- Design systems for managing secrets, configurations, and ongoing jobs
- Prototype new software stack additions to improve workflow efficiencies
- Enhance performance and debugging capabilities in open-source or third-party code
- Collaborate on designing resilient software that handles hardware and network failures effectively
Requirements
- Software Engineering Expertise: Proficient in reading and writing Python, Bash, and other scripting languages
- Infrastructure Focus: Passionate about creating good tooling for developers to interact with infrastructure in an automated way
- DevOps Experience: Experienced with DevOps and capable of making informed decisions about various approaches and technologies
- Detail-Oriented: Careful and thorough, with a strong emphasis on robustness and correctness in scientific infrastructure
Required Skills:
- Python
- Bash
- DevOps
- Debugging
- Logging
- Tracing
- Prototyping
- Collaboration
- Scalability
- Reliability
Benefits
- Competitive Salary: $190,000 - $350,000 annually
- Health Insurance: Comprehensive medical, dental, and vision coverage
- Retirement Plans: 401(k) plan with company matching
- Paid Time Off: Generous PTO policy including vacation, sick leave, and holidays
- Professional Development: Opportunities for continuous learning and career growth, including access to conferences, workshops, and online courses
- Flexible Work Arrangements: Options for remote work and flexible scheduling to support work-life balance
- Parental Leave: Paid parental leave for new parents
- Wellness Programs: Access to mental health resources, wellness programs, and fitness reimbursements
- Employee Assistance Program: Support for personal and professional issues through our EAP
- Stock Options: Equity options to share in the company's success
- Commuter Benefits: Pre-tax commuter benefits for public transportation and parking
- Technology Stipend: Annual stipend for tech equipment and home office setup
- Company Events: Regular team-building activities, social events, and company retreats
- Diversity and Inclusion: Commitment to fostering an inclusive and diverse workplace