Lambda

HPC Support Engineer - Named Accounts

Reposted 13 Days Ago

Remote

Hiring Remotely in USA

137K-206K Annually

Senior level

As a Super Intelligence HPC Support Engineer, you'll manage incidents for hyperscale GPU clusters, ensuring reliability and performance, and collaborating with engineering teams.

The summary above was generated by AI

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU.

If you'd like to build the world's best AI cloud, join us.

About this role:

As a Super Intelligence HPC Support Engineer, you’ll be part of a specialized team dedicated to Lambda’s most strategic and complex customers — organizations operating hyperscale GPU clusters and pushing the boundaries of AI/ML at unprecedented scale.

You’ll serve as a technical expert and trusted partner, ensuring their environments remain reliable, performant, and ready for mission-critical workloads. This role requires deep expertise in HPC and cluster orchestration, the ability to navigate complex incidents with precision, and the judgment to know when and how to engage engineering, data center, and product teams.

This is a customer-facing engineering role where the stakes are high: downtime has real business impact, and your expertise directly influences trust with some of the largest AI companies in the world.

What You’ll Do

Act as the primary technical point of escalation for Super Intelligence customers running hyperscale GPU clusters.
Lead incident response for complex issues, ensuring rapid triage, clear communication, and timely resolution.
Proactively identify risks in large environments (firmware, performance bottlenecks, orchestration issues) and drive preventative improvements.
Partner closely with Lambda Engineering and Product teams to influence roadmap decisions based on real customer needs.
Contribute to runbooks, best practices, and operational guides tailored for hyperscale environments.
Train and mentor other support engineers, raising the bar across Lambda’s support organization.
Participate in a rotating on-call schedule, owning critical incidents and high-priority alerts for Super Intelligence (SI) customers.

You

7+ years of experience in HPC or cloud support engineering, with customer-facing responsibilities.
Proven experience managing large-scale Linux clusters and distributed HPC/AI workloads.
Deep expertise in orchestration tools such as Kubernetes and/or Slurm.
Strong knowledge of GPU technologies (CUDA, NCCL, MIG, NVLink, GPUDirect RDMA).
Skilled in high-throughput networking (InfiniBand, RoCE) and cluster storage solutions.
Familiarity with monitoring/logging platforms (Prometheus, Grafana, Datadog).
Experience leading incident management and communicating directly with enterprise or hyperscale customers.
Ability to balance deep technical troubleshooting with clear, concise communication to executives and stakeholders.

Nice to Have

Python automation experience (venv, conda, pyenv).
Certifications in NVIDIA or InfiniBand technologies.
Familiarity with infrastructure-as-code tools (Terraform, Ansible, Puppet, Chef).
Hands-on experience with storage providers and technologies (VAST, Ceph, Lustre, Weka, DDN).
Experience operating in high-availability, 24/7 environments.

Salary Range Information

This is a salaried non-exempt role, eligible for overtime. The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

About Lambda

Founded in 2012, with 500+ employees, and growing fast
Our investors notably include TWG Global, US Innovative Technology Fund (USIT), Andra Capital, SGW, Andrej Karpathy, ARK Invest, Fincadia Advisors, G Squared, In-Q-Tel (IQT), KHK & Partners, NVIDIA, Pegatron, Supermicro, Wistron, Wiwynn, Gradient Ventures, Mercato Partners, SVB, 1517, and Crescent Cove
We have research papers accepted at top machine learning and graphics conferences, including NeurIPS, ICCV, SIGGRAPH, and TOG
Our values are publicly available: https://lambda.ai/careers
We offer generous cash & equity compensation
Health, dental, and vision coverage for you and your dependents
Wellness and commuter stipends for select roles
401(k) plan with 2% company match (USA employees)
Flexible paid time off plan that we all actually use

A Final Note:

You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

Equal Opportunity Employer

Lambda is an Equal Opportunity employer. Applicants are considered without regard to race, color, religion, creed, national origin, age, sex, gender, marital status, sexual orientation and identity, genetic information, veteran status, citizenship, or any other factors prohibited by local, state, or federal law.

Top Skills

Ansible, Ceph, Chef, CUDA, Datadog, DDN, GPU, GPUDirect RDMA, Grafana, HPC, InfiniBand, Kubernetes, Lustre, MIG, NCCL, NVLink, Prometheus, Puppet, Python, RoCE, Slurm, Terraform, VAST, Weka
