Site Reliability Engineer/DevOps at Interactions
Headquartered in the Boston area, Interactions, LLC is the world’s largest independent AI company. We operate at the intersection of customer experience and AI – two of today’s most innovative and dynamic industries. Leading global brands in a variety of industries rely on Interactions’ conversational AI technology to communicate with their customers every day.
At Interactions we are committed to transforming customer experience and passionate about the professional and personal development of our talented and enthusiastic team. We endeavor to create opportunities that advance the skills, interests, careers and lives of our employees. Come join our growing team!
In this role, you will create tools to proactively monitor and improve end-to-end system performance and to identify deficiencies and potential failures throughout our infrastructure. You will build deep, end-to-end knowledge of the complexity of our platform and continuously create improvements and automation to enhance the durability, performance, and supportability of the platform.
Essential Job Functions:
- Lead development of the processes and software necessary to maintain services post-deployment through data collection and monitoring, ensuring the overall health of the services provided.
- Address service and infrastructure monitoring alerts.
- Develop new metrics and monitoring dashboards as additional coverage becomes necessary.
- Monitor and continuously improve the availability and performance of infrastructure, systems and applications.
- Create and maintain documentation for processes, supported infrastructure resources and services.
- Drive supportability improvements by improving automation, automatic alerting, self-healing architectures, etc.
- Create new alerts, find anomalies, fix things, and ask why something broke.
- Manage, monitor, and troubleshoot daily processes and make improvements to current processes related to production operations.
- Capture and analyze data on system availability, MTBF, and MTTR across all digital channels; identify patterns and drive changes to both systems and processes to provide sustained improvements.
- As a technology subject matter expert (SME), you will mentor engineers to stretch their knowledge and perspective.
- Troubleshoot and debug software delivered by various development teams and ensure that more junior members of the team are capable of the same; coach team members in this practice.
- Document automation and the interaction of software and systems as necessary to enable others; ensure that other members of the team meet the same high standard of documentation.
Preparation, Knowledge, Skills and Abilities:
- BS degree in Computer Science, or equivalent experience or technical degree
- 5+ years of experience supporting a SaaS environment at scale
- 5+ years of experience in scripting and software development
- Ability to triage multiple issues simultaneously and work well under pressure
- Solid understanding of and experience with networking, virtualization, storage, and monitoring
- Experience with Linux systems administration
- Knowledge of and experience with the network stack, protocols, and network management and monitoring tools (Nagios, Check_MK, Splunk, Grafana)
- Solid project management and time management skills – ability to adjust to shifting priorities
- Driven to learn and try new things
- Experience supporting applications with a 24x7 SLA, and providing on-call support as part of a group rotation.
- Experience working in a NOC environment
- Experience in a technical supervisory role
- Excellent interpersonal and communication (oral, listening and writing) skills are required, especially with non-technical and senior leadership audiences.
- Strong collaboration skills and ability to communicate all aspects of the requirements, including the creation of formal documentation.
- Ability to present to a variety of audiences
- Ability to demonstrate Interactions Values of:
  - Being passionate about customer service
  - Obsessing with our customer’s success
  - Respecting each other
  - Creating opportunity
  - Embracing disruption
  - Doing what we say we will do
- Experience working with cloud providers
- Knowledge of SIP / VoIP
- Experience with scripting (Python or other scripting languages)
- Hands-on experience in an agile environment such as Scrum, Scrumban or XP.