Senior Site Reliability Engineer

Humana Studio_h

Sorry, this job was removed at 5:57 p.m. (EST) on Monday, October 28, 2019

Description

Site Reliability Engineers are software development experts responsible for improving the application lifecycle, evolving software systems to increase their reliability, monitoring application performance, and ensuring overall system health: high availability, low latency, high performance and efficiency, effective change management, continuous monitoring and alarming, emergency response, and capacity planning. They act as a bridge between development and operations teams by applying a software engineering mindset to system administration topics.

Responsibilities

Job Description Overview:

Building software to help operations and support teams: SRE teams are in charge of proactively building and implementing services to make IT and support better at their jobs. This can be anything from adjustments to monitoring and alerting to code changes in production. A site reliability engineer can be tasked with building a homegrown tool from scratch to help with weaknesses in software delivery or incident management.

Fixing support escalation issues: A site reliability engineer can expect to spend time fixing support escalation cases. Because an SRE team touches so many different parts of the engineering and IT organization, they can be a great source of knowledge and can be helpful for routing issues to the right people and teams.

Optimizing on-call rotations and processes: Site reliability engineers will need to take on-call responsibilities. The SRE role will have a lot of say in how the team can improve system reliability through the optimization of on-call processes. SRE teams will help add automation and context to alerts – leading to better real-time collaborative response from on-call responders. Additionally, site reliability engineers can update runbooks, tools and documentation to help prepare on-call teams for future incidents.

Documenting “tribal” knowledge: SRE teams gain exposure to systems in both staging and production, as well as all technical teams. They take part in work with software development, support, IT operations and on-call duties – meaning they build up a great amount of historical knowledge over time. Instead of siloing this knowledge into the mind of one team or one person, site reliability engineers can be tasked with documenting much of what they know. Constant upkeep of documentation and runbooks can ensure that teams get the information they need right when they need it.

Conducting post-incident reviews: SRE teams need to keep teams honest and ensure that everyone – software developers and IT professionals alike – is conducting post-incident reviews, documenting their findings and taking action on what they learn. Site reliability engineers are then often tasked with action items for building or optimizing some part of the SDLC or incident lifecycle to bolster the reliability of their service.

Responsibilities (representative examples):

Capacity planning and management – create, use, maintain a capacity model for cloud based implementations.
Perform continuous integration and delivery; implement, test, and monitor new microservices; and troubleshoot related deployment issues on Linux systems.
Collect and maintain a complete inventory of all systems. Identify and retire unused systems to recycle resources and reduce maintenance costs.
Create and maintain documentation of systems and processes for existing and new systems; configure and maintain Puppet manifests, Ansible playbooks, and Chef cookbooks for all deployed environments.
Deploy and monitor instances and services in cloud-based environments; identify and correct the root cause of system alarms, and recommend changes to avoid their recurrence.
Provide systems support through a rotational on-call schedule, executing emergency recovery, maintenance, and upgrades during weekend and evening hours when required.
Serve as an escalation point for other Systems Administrators, Engineers, and other technology teams in the resolution of server and system problems.
Lead and contribute to the proof-of-concept, implementation, and maintenance of automation tools used in the management of our infrastructure.
Plan, schedule, test and perform software installation and upgrades.
Build, administer, and troubleshoot all mission-critical environments (Production, Stage, Dev, Test, QA).
Leverage automation tools, especially Bash, PowerShell, and Puppet, to decrease end-to-end deployment times, reduce downtime, and increase reliability.
Implement and maintain monitoring solutions at the server and application level, using Nagios and ELK/Splunk, to increase visibility into day-to-day operations and issues.
Lead initiatives to transition critical software services into the Cloud, and provide training for other employees on the Cloud transition process for other portions of the product/organization.
Generate well-defined and documented standard processes for the enterprise.
Provide solutions for performance management, disaster recovery, monitoring, and access management.
Work with and support business users to understand issues, develop root cause analyses, and work with the team to develop enhancements and fixes.
Provide engineering design across different workloads, including incident and problem management, change management, security, and compliance.
Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in team development and growth.

Required Skills:

5+ years of industry (post-graduation) experience designing, developing, testing, and supporting a highly scalable, highly available online service.
5+ years of industry (post-graduation) experience working with cloud-based environments (AWS and/or Google and/or Azure).
5+ years of industry (post-graduation) experience working with the Linux and Windows operating systems.
2+ years of industry (post-graduation) experience with configuration management frameworks, using tools such as Puppet, Ansible, and Chef.
2+ years of industry (post-graduation) experience with distributed processing frameworks like Spark and orchestration frameworks like Kubernetes and Docker Swarm for microservices.
2+ years of industry (post-graduation) experience with scripting languages (Bash, Python, and PowerShell).
Working knowledge of TCP/IP, TCP/UDP as well as working knowledge of routers, switches, firewalls/VPNs and higher-level protocols like HTTP and DNS.
Working knowledge of monitoring and alarming tools like Nagios and ELK/Splunk.
Working knowledge of relational and non-relational databases: MS SQL, MySQL, Postgres, Oracle & Mongo
Ability to troubleshoot runtime service issues (memory leaks, race conditions, etc.) with appropriate tools (Dynatrace, JMeter, etc.).
Ability to define, document, and explain the technical architecture of complex and highly scalable products.

Required Education:

Bachelor of Science in an engineering discipline (Preferred: Computer Science, Computer Engineering, Computer Technology, Software Engineering, etc.) or equivalent experience

Desired Certifications:

Linux: Linux Foundation Certified System Administrator (LFCS) and/or Linux Foundation Certified Engineer (LFCE); Red Hat Certified System Administrator (RHCSA) and/or Red Hat Certified Engineer (RHCE) and/or Red Hat Certified Technician (RHCT)
Windows: Microsoft Certified Systems Administrator (MCSA) and/or Microsoft Certified Systems Engineer (MCSE)
Cloud: AWS Certified SysOps Administrator and/or AWS DevOps Engineer
Cloud: Azure Solution Architect; Azure DevOps Engineer; Azure Administrator Associate; Azure Developer Associate; Azure Security Engineer Associate
Network: Cisco Certified Network Associate or Professional (CCNA/CCNP); MCITP Server
CompTIA Server+ and/or CompTIA Cloud+
