McAfee

Senior Site Reliability Engineer

Posted 11 Days Ago

In-Office

Frisco, TX

Senior level

Responsible for maintaining service levels, ensuring application availability, and collaborating with teams for operational improvements in a hybrid SRE role.

The summary above was generated by AI

Role Overview:

As an SRE, you will be accountable for maintaining the service levels (availability, latency, and reliability) our customers need and for reducing the friction of managing change. You will engage with DevOps, Engineering, and other teams to understand and support business needs and initiatives. Every SRE is responsible for the availability, scalability, security, performance, cost, and compliance requirements of our services. You will ensure that applications onboarded to SRE are instrumented for full-stack observability and continuous testing, introduce continuous improvement, integrate with IT Service Operations, and share support responsibilities for critical customer journeys, business flows, and applications.
This is a Hybrid position located in Frisco, TX. You will be required to be onsite on an as-needed basis, typically 1 to 6 times a month. We are only considering candidates within a commutable distance to one of the two locations and are not offering relocation assistance at this time.

About the Role

Proactively monitor the mission-critical production environment and respond quickly to trend breaches or issues.
Troubleshoot, debug, and escalate issues with proper analysis to the relevant teams to ensure maximum availability.
Troubleshoot problems in real time, interacting with DevOps/Engineering and internal support representatives to deliver maximum customer satisfaction.
Detect and triage all operational incidents and requests.
Work extensively to reduce Mean Time to Restore (MTTR) and Mean Time to Detect (MTTD).
Work across Engineering and Support teams to ensure we meet our goals for service reliability, availability, and efficiency.
Ensure security events and alerts are addressed in a timely manner.
Own the availability and performance of mission-critical services, building automation to prevent problem recurrence and to respond to all non-exceptional service conditions.
Help maintain and improve service operations by following established processes and procedures and periodically updating SOPs and documentation in Confluence.
Create and manage day-to-day processes, including Change Management, Incident Management, and Problem Management.
Support automation initiatives to improve MTTR and MTTD.
Help track Key Performance Indicators (KPIs) to support operational performance and service reliability.
Participate in incident retrospectives and assist in managing the incident lifecycle.
Plan and deploy patches and product enhancements to our environments.
Engage in readiness reviews before changes or deployments into production environments.
Support product engineering teams on SRE-related activities to establish optimal SLAs for all pre-defined activities and provide a high-quality customer experience.
Provide detailed summaries of all high-priority issues to stakeholders, ensuring the quality of the data provided.
Participate early in the SDLC to ensure reliability is built in from the beginning, and create plans for successful implementations, launches, and smooth transitions to the SRE team.
Produce accurate root cause analyses of production issues and help provide long-term solutions to fix them.
Continually evaluate and adopt the latest industry technologies to optimize costs and streamline processes.
Communicate effectively and present team progress to leadership.
Lead by example technically and establish credibility with quality technical execution.
Mentor and coach other SRE team members.

About You

4 to 5+ years of software development and/or technical operations experience, and experience running large-scale applications.
Prior experience in SRE / DevOps, Infrastructure Engineering, and Systems Engineering required.
Experience defining and running monitoring for highly resilient and reliable applications.
Experience maintaining and operating production systems (> 99.95% SLA) in the cloud.
Able to monitor, debug, and perform root cause analysis (RCA) for any service failure.
Exceptional communication skills that cross both team and geographical boundaries.
Advanced knowledge and skills within a specific technical or professional discipline with understanding of the impact of work on other areas of the organization.
Enjoy working with a large variety of services and technologies.
Experience with monitoring, logging, APM, and other tools: Grafana, CloudWatch, etc.
Experience with CI/CD tools: Git, Jenkins, Harness, etc.
Experience with container technologies: Kubernetes, Docker
Experience with both Windows and Linux Operating Systems
Strong knowledge of AWS cloud service offerings covering serverless and containerized workloads
Good to have: ITIL, HDI, AWS, or other cloud certifications.
Experience working effectively in a fast-paced, high-growth environment.
You are consistently learning and looking for opportunities to improve efficiency with technology and tools.
Ability to work some non-standard hours to support a global team and initiatives.

Company Overview

McAfee is a leader in personal security for consumers. Focused on protecting people, not just devices, McAfee consumer solutions adapt to users’ needs in an always-online world, empowering them to live securely through integrated, intuitive solutions that protect their families and communities with the right security at the right moment.

Company Benefits and Perks:

We work hard to embrace diversity and inclusion and encourage everyone at McAfee to bring their authentic selves to work every day. We’re proud to be Great Place to Work® Certified in 10 countries, a reflection of the supportive, empowering environment we’ve built where people feel seen, valued, and energized to reach their full potential and thrive.

We offer a variety of social programs, flexible work hours and family-friendly benefits to all of our employees.

Bonus Program
Pension and Retirement Plans
Medical, Dental and Vision Coverage
Paid Time Off
Paid Parental Leave
Support for Community Involvement

We're serious about our commitment to diversity, which is why McAfee prohibits discrimination based on race, color, religion, gender, national origin, age, disability, veteran status, marital status, pregnancy, gender expression or identity, sexual orientation, or any other legally protected status.

Top Skills

AWS

CloudWatch

Docker

Git

Grafana

Harness

Jenkins

Kubernetes
