Hyphen Connect Limited
LLM Pre-training & Distributed Engineer (AI Infrastructure)
Design, orchestrate, and optimize large-scale LLM pre-training runs across 1,000+ GPUs using PyTorch/DeepSpeed/Megatron-LM. Improve networking (InfiniBand/RDMA) and memory management, implement 3D parallelism, and automate checkpointing and failure recovery for month-long training jobs on SLURM or Kubernetes GPU clusters.
We are seeking a highly skilled LLM Pre-training & Distributed Systems Engineer. This role is essential for orchestrating large-scale machine learning training runs and optimizing distributed infrastructure. The ideal candidate will have a deep understanding of GPU clusters and extensive systems engineering experience to keep training runs efficient and reliable.
Responsibilities:
- Orchestrate distributed training runs across 1,000+ GPUs using PyTorch, DeepSpeed, or Megatron-LM.
- Optimize networking (InfiniBand/RDMA) and memory management to prevent out-of-memory errors.
- Automate checkpointing and failure recovery during month-long training runs.
Required Skills:
- Deep expertise in 3D parallelism (Data, Tensor, Pipeline).
- Experience managing SLURM or Kubernetes-based GPU clusters.
- Strong systems engineering background (C++, CUDA, Python).

