Site Reliability Engineering Manager at Nuance
Nuance is the pioneer and leader in conversational artificial intelligence (AI) innovations that bring intelligence to everyday work and life. We deliver solutions that understand, analyze, and respond to people, amplifying human intelligence to increase productivity and improve security. With decades of both domain and AI expertise, we work with thousands of organizations across a wide range of industries.
Join our team! At Nuance, we are constantly reinventing how people connect with technology and with each other. Our AI-powered solutions empower organizations to transform “business as usual.” For decades, the world’s leading financial, healthcare, telecommunications, retail, and government organizations have trusted Nuance to bring them award-winning solutions that deliver more meaningful outcomes and empower a smarter, more connected world. From clinical speech recognition technologies that free physicians to spend more time caring for patients to real-time intelligence that powers billions of customer interactions, we’re deeply committed to helping organizations push the boundaries of what’s possible.
The Site Reliability Engineering (SRE) Manager role is a discipline that combines software and systems engineering to help build, run, and support cloud-scale, distributed, fault-tolerant systems. Our team ensures that Nuance services have the reliability and uptime to meet the needs of our ever-growing customer base in a mission-critical industry: hosted healthcare products. Practices such as event response, major incident management, minimizing operational work, deep post-mortem exercises, and prevention of potential outages all factor into the iterative improvement work that SRE focuses on.
In this role you will spend the majority of your time serving as the central point of contact between Site Reliability Engineering and a line of business, through direct phone-based support as well as through ticketing systems. You will also spend a portion of each day working with other team members on a variety of tasks: monitoring, incident management, capacity- and deployment-based service planning, defining plans and procedures to train and upskill the team to support the hosted system, and playing a pivotal role in our major incident response team should an incident impact the availability or reliability of one of the Healthcare products. Because of this breadth, the SRE Manager is in a unique position to see the entire division and interact across all teams. As team lead, you will own overall process improvement, progress, and communication, and keep your manager up to date on current and ongoing issues. Candidates need to demonstrate excellent communication skills and have experience leading in a matrix management organization.
Minimum years of work experience: 7+ years of experience in large, complex information systems and/or cloud environments.
- Responsible for monitoring an organization’s servers, networks, and computer systems for irregularities and performance issues.
- Assess system data and error logs, along with user reports, to determine areas for improvement or repair. In this aspect of the role, the SRE Manager may also determine when systems or servers are due for upgrades.
- Monitor environments, technical assets and/or services for behavior or performance outside of standards or SLAs. Identify potential causes and evaluate impact on infrastructure, delivery or services. Determine appropriate next steps (e.g., closer monitoring, further review or immediate action). Alert the appropriate team (per process) when a threshold has been reached or a change/failure has occurred. Provide advice and guidance to others in monitoring and analysis of assets, systems and services.
- Provide oversight, technical direction, and expertise to the other SRE teams on data analysis, monitoring tools and processes, and event detection.
- Responsible for major IT systems incident management from initiation until an acceptable workaround is in place or the incident is resolved.
- Responsible for training team members and putting processes and procedures in place to support the system and handle critical incidents.
- Coordinate appropriate resources to resolve critical incidents in accordance with service level agreements and operational level agreements.
- Own all communication during a major system outage, ensuring IT management and the businesses are kept updated until the incident is resolved.
- With a thorough understanding of technology assets/environments/services, business needs and SLAs, lead the creation, revision and implementation of monitoring tools, processes and reports.
- Regularly review and identify process improvement opportunities and implement changes in collaboration with process owner and other technology functions. Champion and provide oversight to ensure adherence to established processes, tools and methodologies.
- Help establish availability, reliability and maintainability requirements for environments, technical assets and services.
- Review availability information and identify developing issues and opportunities for improvement. Ensure effective hand-offs with appropriate technology function(s). Provide input into and drive availability improvement plans.
- Document concerns and findings, collecting all pertinent data (including a comparison of exception data and normal data). Ensure incident/event tracking tools are current (per established guidelines and procedures). Review, improve and champion the accuracy and maintenance of knowledge base content and the known error database.
- Develop, implement and participate in rotational on-call schedules for the SRE team.
- Broad experience in troubleshooting large-scale distributed systems covering application, OS, networking and storage areas.
- Self-motivated and proactive, with demonstrated creative and critical thinking capabilities
- Ability to manage / lead 5+ team members
- Strategic relationship and partnership building skills
- Excellent time management, organizational, and communication skills
- Familiarity with cloud support engineering practices.
- Well versed in Azure cloud environments and management including direct work with customer support for maintenance and repair requests.
- Ability to fail over and handle datacenter region outages.
- Hands-on experience with any of these technologies: MSSQL, Grafana, Sumo Logic, Nagios, SaltStack, Zenoss, HP OpenView, Remedyforce, Confluence, Jira, PagerDuty
- Working experience in Linux- and Windows-based production environments and strong knowledge of fundamentals and internals: file systems, memory management, threads and processes
- Strong understanding of networking protocols, IP packets, DNS, OSI layers and load balancing.
- Experience with system monitoring and alerting for availability, reliability and performance.
- Excellent analytical and problem-solving skills.
- Ability to solve operations-related challenges through automation or process improvements
- Ability to develop and plan longer-term projects that strengthen the relationship between SRE and the Line of Business (LOB) and improve our understanding of, and ability to support, the related products.
- B.S. in Electrical or Computer Engineering, Computer Science, or relevant work experience
Nuance offers a compelling and rewarding work environment. We offer market-competitive salaries, bonuses, equity, benefits, meaningful growth and development opportunities, and a casual yet technically challenging atmosphere. Join our dynamic, entrepreneurial team and become part of our continuing success.
Nuance celebrates diversity and is proud to be an equal employment opportunity and affirmative action workplace. We consider all qualified applicants without regard to race, color, religion, sex (including pregnancy), sexual orientation, gender identity or expression, national origin, military and veteran status, disability, genetics, or any other category protected by law or Nuance policy. If you need an accommodation because of a disability for any part of the employment process, please call 781-565-5086 and let us know.