NVIDIA

HPC Operations Manager – Hardware Engineering

Reposted 22 Days Ago
In-Office
4 Locations
272K-426K
Senior level
The HPC Operations Manager will lead a team, ensure HPC cluster reliability, oversee infrastructure improvements, and communicate statuses to senior management.
The summary above was generated by AI

Widely considered to be one of the technology world’s most desirable employers, NVIDIA is an industry leader with groundbreaking developments in High-Performance Computing, Artificial Intelligence, and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. Our work opens up new universes to explore, enables outstanding creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars.

We are now looking for a highly motivated HPC Operations Manager to join this multifaceted and innovative infrastructure team to craft global and dynamic HPC clusters used by NVIDIA’s hardware design teams. We are looking for leaders to help us grow and evolve a reliable computing environment that enables our hardware designers to build the next generation of GPUs and SoCs.

What You'll Be Doing:

  • A huge part of the day-to-day job is collaborating with partners to develop programs around storage, networking, and compute in our growing fleet of data centers.

  • Lead, cultivate, and mentor a multinational team of sysadmins and DevOps engineers in support of the chip design teams.

  • Ensure the highest reliability of HPC clusters. Develop critical metrics and program schedules to measure program health, predictability, and achievements.

  • Identify failures, lead retrospective analysis, and help to develop improvement action plans. Build standard methodologies that cut through complexity, can be used across NVIDIA, and influence other partners toward continuous improvement.

  • Evaluate the latest technologies (hardware and cloud computing) and recommend the future evolution of the infrastructure. Plan deployments and refreshes of hardware (compute, storage, network equipment) and the associated software stack (e.g., OS).

  • Work multi-functionally with hardware engineering leaders to support their future chip design needs, understand their workflow characteristics, and engineer an efficient HPC environment. Work with IT and engineering infrastructure teams on the different subsystems that comprise the computing environment.

  • Lead all aspects of the HPC scheduler (LSF), set/adjust policy, ensure delivery of forecasted compute demand to each hardware division, and drive high utilization.

  • Track software licensing servers and drive efficient license utilization

  • Develop and manage program schedules, milestones and deliverables. Adjust in the face of a highly fluid customer product roadmap.

  • Regularly communicate program status and key issues to senior management at NVIDIA’s headquarters. Accurately represent the importance of issues and call them out appropriately. Be the evangelist of data-driven project management.

What We Need to See:

  • B.S. or M.S. in Computer Science, Computer Engineering, Information Science (or equivalent experience)

  • 15+ years of overall experience

  • 5+ years managing IT infrastructure teams of 10+ people

  • 10+ years of experience running Linux servers, NFS storage, and Ethernet networks

  • Knowledge of HPC schedulers (IBM LSF preferred)

  • Knowledge of hardware design workflows (EDA tools and methodology)

  • Experience using project management and capacity planning software

  • Datacenter operations (rack and stack, maintenance)

Ways to stand out from the crowd:

  • HPC storage (e.g., NetApp, Pure Storage, Lustre, ZFS, Isilon)

  • InfiniBand (operations, debugging, performance tuning)

  • Software development, especially in a DevOps context

  • Knowledge of relational databases, data lakes, metrics/visualization/analytics platforms

  • Deploying and maintaining FlexLM-based software license servers

  • Established relationships with enterprise-level equipment suppliers

The base salary range is 272,000 USD - 425,500 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Top Skills

EDA Tools
Ethernet
FlexLM
HPC
IBM LSF
InfiniBand
Isilon
Linux
Lustre
NetApp
NFS
Pure Storage
ZFS
