NVIDIA

Senior Product Manager - Observability and Resilience

Posted 19 Days Ago

In-Office or Remote

2 Locations

208K-328K

Senior level

The Senior Product Manager will lead the development of tools for resiliency and observability in AI applications, coordinating across teams and driving innovation in reliability tooling.

The summary above was generated by AI

NVIDIA has become the platform upon which every new AI-powered application is built. From healthcare research applications to autonomous vehicles and voice-recognition systems, there is a need to simplify AI applications and workflows and make them predictable, and NVIDIA is at the center of this revolution. Resiliency and observability are key to delivering customer value and an exhilarating customer experience. This product manager will lead the development of foundational tools dedicated to ensuring the resiliency and observability of large-scale accelerated computing platforms. By creating essential tools for system diagnostics, performance monitoring, and automated recovery, they will empower customers to confidently operate both complex AI training and demanding inference workloads with maximum uptime and efficiency.

What you will be doing:

Be a subject‑matter expert on resiliency and observability. Deeply understand failure modes across the GPU hardware, network, and software stack, the telemetry signals that reveal them, and how they correlate to workload health and SLOs. Master modern reliability architectures. Keep up to date with industry trends.
Build for everyone who wants to use these tools. Drive joint project planning. Define concrete milestones, tasks, and deliverables for resiliency and observability initiatives with external partners.
Fuel innovation in reliability tooling. Lead ideation sessions to propose novel approaches and shape new proof‑of‑concepts.
Bridge development, SRE, and partner teams. Facilitate clear communication, triage emergent issues rapidly, and ensure feedback loops between engineering and customer operations remain tight.
Coordinate execution across different functions. Work with engineering, design, operations, sales, and marketing to embed resiliency and observability requirements into every product launch, capacity expansion, and lifecycle transition.

What we need to see:

BS or MS in Computer Science, Computer Engineering, or a related field (or equivalent experience) and 12+ years of product‑management experience in enterprise technology.
Experience with GPU observability (DCGM, NVML, etc.) and integration into large‑scale telemetry systems.
Deep knowledge of AI/ML infrastructure, high‑performance computing (HPC), networking, and cloud technologies (IaaS, PaaS) including containerization, Kubernetes, and automation tools.
Familiarity with modern observability stacks: metrics, logs, traces, OpenTelemetry, Prometheus/Grafana, ELK/OpenSearch.
Experience building, and preferably a deep understanding of, secure, compliance‑focused telemetry pipelines (SOC2, FedRAMP).
Ability to articulate trade‑offs among latency, throughput, cost, and reliability to both engineering and executive audiences.
Data-driven approach: defines SLIs/SLOs, manages error budgets, and develops value models.
Strong cross‑functional execution: writes clear specs and PRDs, produces GTM collateral, and leads agile processes.

Ways to stand out from the crowd:

Master's/PhD or expertise in distributed systems, performance modeling, or fault‑tolerant computing.
Experience with MLOps and LLMOps ecosystems and integrating with enterprise platforms; deployments at modern data‑center scale; delivered ML/AI observability solutions for LLMOps, predictive incident detection, or anomaly classification.
Startup or 0 -> 1 experience building cloud‑native observability or resilience tools; proven success bringing open‑source observability products to market and shaping GTM strategy.
Familiarity with MLOps toolchains and integrations with monitoring platforms such as Splunk, Datadog, and Grafana Cloud.
Expertise with containerization technologies like Docker and Kubernetes, plus virtualization. Proficiency in network architecture and high‑performance interconnects (InfiniBand, Ethernet, RoCE).

We have some of the most forward-thinking and hardworking people in the world working for us and, due to outstanding growth, our elite engineering teams are growing fast. NVIDIA is widely considered to be one of the industry's most desirable employers. NVIDIA is at the center of Deep Learning, Artificial Intelligence, and Autonomous Vehicles. If you're looking for a challenge, thrive in an ambiguous environment, and share our passion for technology, we want to hear from you. We are looking for great people to help us accelerate the next wave of artificial intelligence.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 208,000 USD - 327,750 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until August 21, 2025.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Top Skills

AI/ML

Datadog

DCGM

Docker

ELK

GPU

Grafana

HPC

IaaS

Kubernetes

NVML

OpenSearch

OpenTelemetry

PaaS

Prometheus
