
AllCloud

GPU Engineer

Posted 11 Days Ago
Remote
Hiring Remotely in United States
Senior level
The GPU Engineer will design and optimize GPU-based infrastructure for LLM training, focusing on performance enhancements and resource management in a cloud environment.
The summary above was generated by AI
Description

GPU Engineer

Location: US / Canada (Eastern Time) - Home based

Job Type: Full-time, Permanent 

About AllCloud

AllCloud is a global professional services company providing organizations with cloud enablement and transformation tools. As an AWS Premier Consulting Partner and audited MSP, a Salesforce Platinum Partner, and a Snowflake Premier Partner, AllCloud helps clients connect their front and back offices by building a new operating model to harness the benefits of cloud technology and data and analytics.

Job Summary

We are seeking an experienced GPU Engineer to join our innovative AI team at AllCloud. This role will be responsible for designing, implementing, and optimizing GPU-based infrastructure for large-scale LLM training and inference. The ideal candidate will have deep expertise in GPU architecture, parallel computing, and performance optimization for machine learning workloads. You'll work closely with our LLM Architects and ML Engineers to build and maintain the high-performance computing environment required for training our custom transformer-based language models.

Responsibilities

  • Design and implement scalable GPU clusters on AWS infrastructure for distributed LLM training
  • Optimize GPU memory usage, computational throughput, and inter-node communication for transformer model training
  • Configure and tune GPU acceleration libraries (CUDA, cuDNN, NCCL) for maximum performance
  • Implement mixed precision training and other optimization techniques to improve training efficiency
  • Architect and deploy GPU-based inference solutions that balance latency, throughput, and cost
  • Create benchmarking tools to measure and improve model training and inference performance
  • Establish monitoring and management systems for GPU resources to maximize utilization and reliability
  • Collaborate with LLM Architects to implement parallelization strategies (model, data, pipeline parallelism)
  • Troubleshoot hardware and software issues affecting GPU performance
  • Keep current with advancements in GPU technology and AI accelerator hardware


Requirements

Summary of Key Requirements

  • 5+ years of experience optimizing GPU infrastructure for machine learning workloads
  • Advanced knowledge of NVIDIA GPU architecture and CUDA programming
  • Strong understanding of high-performance computing (HPC), AI network architecture, and physical layer management
  • Experience with AWS GPU instances (e.g., P4d, P5, G5) and AWS Batch for ML workloads
  • Strong background in distributed computing and parallel processing techniques
  • Familiarity with transformer architecture and deep learning frameworks like PyTorch or TensorFlow
  • Expertise in performance profiling and bottleneck identification in GPU workloads
  • Experience with containerization (Docker) and orchestration (Kubernetes)
  • Understanding of memory optimization techniques for large language models
  • Bachelor's degree in Computer Science, Electrical Engineering, or related field (Master's preferred)

Certifications

  • AWS Certified Solutions Architect - Professional (Strongly Preferred)
  • NVIDIA-Certified Professional: Accelerated Data Science (Preferred)
  • NVIDIA-Certified Professional: AI Infrastructure (NCP-AII) or AI Networking (NCP-AIN) (Preferred)

Why work for us? 

Our team inspires progress in each other and in our customers through our relentless pursuit of excellence; you will work with leaders who promote learning and personal development.


AllCloud is an Equal Opportunity Employer and considers applicants for employment without regard to race, color, religion, sex, orientation, national origin, age, disability, genetics or any other basis forbidden under federal, provincial, or local law.


Top Skills

AWS
CUDA
cuDNN
Docker
GPU Clusters
Kubernetes
NCCL
PyTorch
TensorFlow


