Site Reliability Engineering Incident Manager

Sorry, this job was removed at 2:43 p.m. (EST) on Thursday, February 3, 2022

The SRE Incident Manager will be responsible for overall operational efficiency and for improving the customer experience on the Dragon Medical One Azure platform and services. The Site Reliability Engineering (SRE) team is a multidisciplinary engineering organization tasked with leading quality and reliability holistically across the Azure platform. One of our responsibilities is to meet the 99.99% uptime target.
Responsibilities

  • Responsible for leading, coordinating, and directing all facets of the incident response effort during the critical incident conference call.
  • Responsible for driving complex multiservice outages to resolution in a timely and effective manner through coordination of the internal SRE team and key stakeholders.
  • Work effectively under pressure, drawing on broad technical, analytical, and problem-solving expertise, the ability to collaborate confidently with varied partners, and strong written and spoken communication.
  • Responsible for building and evolving the practice of Incident Management across the Azure platform, using Post Incident Reviews and developing processes and systems automation that leverage the related telemetry and metrics to identify and drive platform improvements globally.
  • The successful candidate will be able to demonstrate breadth while managing complex, highly available services with a deep understanding of the underlying components (e.g. the application and its interaction with Azure infrastructure), and will work directly with Technical Customer Support, the SRC team (first level support), Engineering, business, and management teams. In this role, you will be surrounded by SRE Cloud Architects, data scientists, SRE engineers, and colleagues who obsess over quality and over improving the customer and platform experience.
  • During the incident conference bridge, this role:
    • Owns all communication during a major system outage, ensuring stakeholders in the critical incident call (e.g. SRE engineers, Technical support, business, management, etc.) are kept updated until the incident is resolved.
    • Drives continuous swift momentum towards mitigation, asking technical questions, offering suggestions around troubleshooting direction, as well as providing clear and concise communication to stakeholders and our SRE engineers & management.
  • Post Incident:
    • Leads Post Incident Reviews and Problem Management meetings with key stakeholders and service owners to review events and opportunities for ongoing improvement in both technical and procedural areas.
    • Works with the Technical Support, first level support (SRC), and Customer Success Organization (CSO) teams to provide the necessary information about the incident so that a communication report can be generated for customers.
    • Drives or contributes to cross-team projects to ensure that the fixes and improvements resulting from incidents are put in place to prevent recurrence (e.g. implementation of monitoring tools, alerts, infrastructure changes, application changes, etc.).
  • Develop & document the workflow / process / life cycle of incident escalation, incident management, incident response, and incident resolution. Continue to inspect and adapt the process.
  • Lead the effort in developing playbooks.
  • Identify opportunities and take ownership for automation and/or continuous improvement of Incident Management processes and best practices.
  • Develop & implement rotational on-call schedules for the SRE team.
  • Participate in an on-call rotation.
  • Regularly review, update, improve, and clean up the incident response platform (PagerDuty) setup to ensure that the appropriate people and levels are paged properly.
  • Review the existing training and onboarding process that prepares other SRE engineers to support the system and handle critical incidents. Continue to inspect it, identify areas for improvement, and develop & execute improvement plans covering the processes, tools, and techniques used by new and existing SRE engineers, to ensure high fluency in production escalation and incident response.
  • Design and put in place a plan for a "Follow the Sun" global shift rotation.
  • Responsible for 4-6 direct reports. We operate in a matrix system, so you will be required to work with other managers / leads to set goals / objectives for your direct reports, hold regular 1:1s, help remove roadblocks for your direct reports, and manage & develop talent.


Qualifications

  • Must be technically skilled and able to articulate technical issues in a meaningful way to both engineers and management.
  • Strong verbal and written communication skills; organized, attentive to detail, analytical, and strong at problem-solving.
  • Effectively & fluently communicate and coordinate production incidents across multiple organizations.
  • Proficiency in documenting processes and monitoring performance metrics.
  • Crisis management skills: able to set priorities, pursue multiple threads at the same time, accurately reflect current state and drive towards desired state.
  • Ability to work effectively under pressure and to collaborate confidently with varied partners.
  • Ability to maintain calm during stressful situations; demonstrated leadership skills under fast-paced, highly dynamic situations.
  • Strong design, scripting, problem solving and debugging skills.
  • Experience managing complex projects spanning multiple teams and organizations.
  • Knowledge of Microsoft Azure, AWS, GCP or similar cloud computing platforms.
  • 5-7 years' experience in incident management or equivalent work experience.
  • Minimum of 3 years' experience in team management.


Education:

  • B.S. in Electrical or Computer Engineering, Computer Science or relevant work experience


Nuance offers a compelling and rewarding work environment. We offer market competitive salaries, bonus, equity, benefits, meaningful growth and development opportunities and a casual yet technically challenging work environment. Join our dynamic, entrepreneurial team and become part of our continuing success.
Nuance celebrates diversity and is proud to be an equal employment opportunity and affirmative action workplace. We consider all qualified applicants without regard to race, color, religion, sex (including pregnancy), sexual orientation, gender identity or expression, national origin, military and veteran status, disability, genetics, or any other category protected by law or Nuance policy. If you need an accommodation because of a disability for any part of the employment process, please call 781-565-5086 and let us know.


Location

Our headquarters is in Burlington, 30 minutes from downtown Boston, right off 128 and across the street from Wayside Commons (hello, shopping!).
