Manager, Site Reliability
About us:
Agero is powering the next generation of software-enabled driver safety services and technology, pushing the limits of big data to transform the entire driving experience. The majority of leading vehicle manufacturers and insurance providers use Agero’s roadside assistance, accident management, dispatch, consumer affairs and telematics innovations to strengthen their businesses and create stronger, lasting connections with their customers. Together, we’re making driving smarter and safer for everyone.
About the Role:
This position leads the Site Reliability team and oversees the processes and procedures for maintaining the reliability of our products. The SRE team has oversight across product and technology for implementing the standards for testing, monitoring, and release stability of our applications. The SRE manager collaborates with product and engineering teams to ensure that solutions adhere to our standards, and works with them to improve those standards. The manager works closely with these teams to develop standards for runbooks, incident response, and blameless post-mortems, and plays a pivotal role on the major incident response team should an incident impact the availability or reliability of one of the products.
Key Outcomes:
- Build and invest in relationships with key partners while learning the business and its support model.
- Implement AIOps machine learning solutions to automate the detection, consolidation, and remediation of alerts, events, and metrics in our platforms.
- Modernize processes to enable automation for change control, runbooks, documentation publishing, and monitoring solutions.
- Drive adoption of unified processes for Monitoring, Alerting, Incident Response and cross-product visibility as the enterprise product portfolios evolve.
The Day to Day:
- Monitor the organization’s servers, networks, and computer systems for irregularities and performance issues.
- Assess system data and error logs, along with user reports, to determine areas for improvement or repair. In this aspect of the role, the manager may also determine when systems or servers are due for upgrades.
- Monitor environments, technical assets and/or services for behavior or performance outside of standards or SLAs. Identify potential causes and evaluate the impact on infrastructure, delivery or services. Determine appropriate next steps (e.g. closer monitoring, further review or immediate action). Alert the appropriate team (per process) when a threshold has been reached or a change/failure has occurred. Provide advice and guidance to others in monitoring and analysis of assets, systems and services.
- Provide oversight, technical direction, and expertise to other teams on data analysis, monitoring tools and processes, and event detection
- Set standards for L1 & L2 support processes, runbooks, response, and incident management
- Recommend stack design improvements to facilitate automated remediation of production events
- Research, develop and introduce tools and methodologies to increase application uptime
- Provide strategies for improving application platforms with a focus on reliability, stability, performance and total cost of ownership
- Maintain an understanding of industry best practices and leading-edge technologies, and adopt them as appropriate
- Reduce inefficiencies and increase cost savings in operational workflows across all platforms
- Manage major IT systems incidents from initiation until an acceptable workaround is in place or the incident is resolved.
- Train team members and put processes and procedures in place to support the systems and handle critical incidents.
- Coordinate appropriate resources to resolve critical incidents in accordance with service level agreements and operational level agreements.
- Own all communication during a major system outage, ensuring IT management and the businesses are kept updated until the incident is resolved.
- With a thorough understanding of technology assets/environments/services, business needs and SLAs/SLOs, lead the creation, revision and implementation of monitoring tools, processes and reports.
- Regularly review and identify process improvement opportunities and implement changes in collaboration with the process owner and other technology functions. Champion and provide oversight to ensure adherence to established processes, tools and methodologies.
- Help establish availability, reliability and maintainability requirements for environments, technical assets and services.
- Review availability information and identify developing issues and opportunities for improvement. Ensure effective hand-offs with appropriate technology function(s). Provide input into and drive availability improvement plans.
- Document concerns and findings, collecting all pertinent data (including a comparison of exception data and normal data). Ensure incident/event tracking tools are current (per established guidelines and procedures). Review, improve and champion the accuracy and maintenance of knowledge base content and the known error database.
Skills, Experiences and Education:
- B.S. in Electrical or Computer Engineering or Computer Science, or relevant work experience
- 7+ years of experience in large, complex information systems and/or cloud environments
- Broad experience in troubleshooting large-scale distributed systems covering application, cloud, OS, networking, and storage areas
- Self-motivated and proactive, with demonstrated creative and critical thinking capabilities