Senior Site Reliability Engineer - GPU Clusters

Sorry, this job was removed at 08:12 p.m. (EST) on Friday, May 30, 2025

In-Office

4 Locations

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions from artificial intelligence to autonomous cars. NVIDIA is looking for phenomenal people like you to help us accelerate the next wave of artificial intelligence.

We are seeking a highly skilled and experienced Staff Software Engineer to lead the design, deployment, and management of our large-scale GPU clusters. These clusters will power AI workloads across multiple teams and projects, making a significant impact on the future of machine learning and artificial intelligence at NVIDIA. Join our engineering team and collaborate with researchers, AI engineers, and infrastructure teams to ensure our GPU clusters perform efficiently, scale well, and remain reliable.

The ideal candidate has a passion for operational excellence, automation, and working in a multi-cloud environment. You will collaborate with a diverse and experienced team, constantly improving infrastructure provisioning and resiliency to ensure a high level of service availability.

What you will be doing:

Design, deploy and support large-scale, distributed GPU clusters to run high-performance AI and machine learning workloads.
Continuously improve infrastructure provisioning, management, and monitoring through automation.
Ensure the highest level of uptime and quality of service (QoS) through operational excellence, proactive monitoring, and incident resolution.
Support globally distributed cloud environments such as AWS, GCP, Azure, or OCI, as well as on-premises infrastructure.
Define and implement service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure infrastructure quality.
Write high-quality Root Cause Analysis (RCA) reports for production-level incidents and work towards preventing future occurrences.
Participate in the team's on-call rotation to support critical infrastructure.
Drive the evaluation and integration of new GPUs, such as GB200, and new cloud technologies to improve system performance.

What we need to see:

Minimum BS degree in Computer Science (or equivalent experience), with 7+ years of software engineering experience, including at least 3 years managing GPU clusters or similar high-performance computing environments.
Expertise in designing, deploying, and running production-level cloud services.
Proficiency with orchestration and containerization tools like Kubernetes, Docker, or similar.
Experience coding/scripting in at least two high-level programming languages (e.g., Python, Go, Ruby).
Strong proficiency with Linux operating systems and TCP/IP fundamentals.
Proficient in modern CI/CD techniques, GitOps, and Infrastructure as Code (IaC) using tools such as Terraform or Ansible.
Diligent, with strong communication and documentation skills.

Ways to stand out from the crowd:

Experience managing large-scale Slurm and/or BCM deployments in production environments.
Expertise in modern container networking and storage architectures.
Proven track record of defining and driving operational excellence in highly distributed, high-performance environments.

The base salary range is 184,000 USD - 356,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

What you need to know about the Boston Tech Scene

Boston is a powerhouse for technology innovation thanks to world-class research universities like MIT and Harvard and a robust pipeline of venture capital investment. Host to the first telephone call and one of the first general-purpose computers ever put into use, Boston is now a hub for biotechnology, robotics and artificial intelligence — though it’s also home to several B2B software giants. So it’s no surprise that the city consistently ranks among the greatest startup ecosystems in the world.

Key Facts About Boston Tech

Number of Tech Workers: 269,000; 9.4% of overall workforce (2024 CompTIA survey)
Major Tech Employers: Thermo Fisher Scientific, Toast, Klaviyo, HubSpot, DraftKings
Key Industries: Artificial intelligence, biotechnology, robotics, software, aerospace
Funding Landscape: $15.7 billion in venture capital funding in 2024 (Pitchbook)
Notable Investors: Summit Partners, Volition Capital, Bain Capital Ventures, MassVentures, Highland Capital Partners
Research Centers and Universities: MIT, Harvard University, Boston College, Tufts University, Boston University, Northeastern University, Smithsonian Astrophysical Observatory, National Bureau of Economic Research, Broad Institute, Lowell Center for Space Science & Technology, National Emerging Infectious Diseases Laboratories