The Incident & Disaster Recovery Commander combines the leadership and decision-making responsibilities of an Incident Commander with the technical expertise required for Disaster Recovery Technician roles. This individual is responsible for managing the response to critical incidents and disasters, ensuring the swift restoration of services, and executing disaster recovery procedures. They lead cross-functional teams during emergency situations, mitigate disruptions, and oversee the recovery of IT infrastructure and business operations. This role demands a proactive, organized individual with a strong technical background and the ability to manage high-stakes incidents and technical recovery processes simultaneously.
Incident Command Leadership
- Lead and manage the response to major incidents or disasters, including IT outages, natural disasters, cyberattacks, and other critical events.
- Take command of the incident from the onset, directing the immediate response and ensuring coordination across teams and Business Units (e.g., Technical Operations Center, IT Services, AppEngineering, InfoSec, Customer Care, HR).
- Establish and maintain clear communication with senior leadership, key stakeholders, and teams, providing updates and strategic direction.
- Assess the severity of the incident, prioritize resources, and manage escalation protocols to mitigate the impact.
Disaster Recovery Execution
- Oversee and execute disaster recovery plans for IT infrastructure, applications, and critical systems, ensuring data integrity and system availability.
- Manage recovery efforts for IT systems, networks, and business operations, ensuring minimal downtime and business continuity.
- Work with the Systems team to perform regular backups and test recovery procedures to ensure the effectiveness of disaster recovery strategies.
- Identify potential vulnerabilities and implement improvements to reduce recovery time and potential data loss during an incident.
Crisis Management And Coordination
- Coordinate and mobilize response teams to address immediate operational needs during a disaster or incident.
- Maintain situational awareness through monitoring tools and communications, adjusting response strategies as the situation evolves.
- Ensure effective allocation of resources, including personnel, technology, and external vendors, during the incident resolution process.
- Manage communications with external agencies or third-party vendors involved in the crisis resolution or recovery efforts.
Post-Incident Analysis And Reporting
- Within 24 hours of incident mitigation, confirm with stakeholders the impact on system uptime (bonus) metrics.
- Within 48 hours of incident resolution, lead post-incident reviews (PIRs) and root cause analysis (RCA) to assess the effectiveness of the response and recovery efforts, documenting lessons learned and recommending improvements.
- Create detailed reports on incident timelines, recovery actions, root cause analysis (RCA), and recovery outcomes.
- Ensure that key metrics and KPIs related to major incidents (e.g., response time, resolution time, stakeholder satisfaction) are tracked and reported to the Site Reliability Engineer (SRE).
- Analyze incident trends and weaknesses in the disaster recovery processes, working with teams to refine and strengthen future response strategies.
- Provide metrics
Risk Management And Preparedness
- Identify and assess potential risks to business operations, data, and IT infrastructure, implementing mitigation strategies where necessary.
- Lead and participate in disaster recovery drills and simulations to ensure teams are prepared for a wide range of potential crises.
- Continuously update and maintain disaster recovery documentation, recovery plans, and business continuity strategies.
- Ensure that recovery plans are aligned with industry standards, regulatory requirements, and best practices.
Training And Awareness
- Provide training and guidance to staff members on incident management procedures, disaster recovery protocols, and the use of recovery tools.
- Foster a culture of preparedness within the organization, ensuring teams understand their roles in incident command and disaster recovery.
- Conduct workshops and seminars to improve the overall knowledge of incident management and disaster recovery processes.
Qualifications
Education: Bachelor’s degree in Information Technology, Emergency Management, Business Continuity, or a related field.
Certifications: ITIL, Disaster Recovery Institute International (DRII), Certified Business Continuity Professional (CBCP), CompTIA Network+, or equivalent certifications.
Experience
- 3-5+ years of experience in incident management, disaster recovery, or IT operations.
- Proven experience leading or coordinating incident response efforts, especially in high-pressure environments.
- Hands-on experience with disaster recovery systems, backup tools, and business continuity solutions.
- Experience with incident management software (e.g., ServiceNow, Jira) and recovery tools (e.g., Cohesity, Veeam, Acronis).
Skills
- Strong leadership and decision-making skills, with the ability to manage teams during high-stress situations.
- Excellent communication skills, with the ability to interact with both technical teams and senior leadership.
- In-depth technical knowledge of IT infrastructure, disaster recovery processes, and recovery strategies.
- Ability to prioritize tasks, manage multiple incidents simultaneously, and handle crisis situations effectively.
- Strong analytical and problem-solving abilities to identify and resolve underlying issues during and after an incident.
Working Conditions
- Flexibility to work irregular hours, including nights, weekends, and holidays, during major incidents or disasters.
- Ability to be on-call and respond quickly to crisis situations as they arise.
- May require travel to disaster recovery sites, operational locations, or affected areas during incidents.
Primary Location
United States-Arizona-Scottsdale
Job
Information Technology
Schedule
Full-time
Travel
No
Job Posting
Dec 18, 2024
Unposting Date
Dec 24, 2024