fal

Sr Linux System Administrator

Posted 8 Days Ago
Remote
Hiring Remotely in USA
Senior level
Responsible for the lifecycle management of GPU servers, including provisioning, automation, security hardening, and performance tuning for AI workloads.

You are an expert Linux systems operator who keeps fleets of servers healthy, secure, and performant at scale. At fal, you will be responsible for the bare-metal and OS-level foundation that our entire GPU cloud runs on. From provisioning and imaging thousands of GPU nodes to kernel tuning, storage management, and security hardening, you will ensure every machine in our fleet is production-ready and running at peak efficiency. You are deeply comfortable in a terminal, you think in terms of uptime and automation, and you take pride in infrastructure that just works.

Key Responsibilities
  • Own the full lifecycle of our bare-metal GPU server fleet: provisioning, imaging, configuration management, patching, and decommissioning across multiple data centers and providers.
  • Build and maintain our server automation stack using Ansible, Terraform, and custom tooling to manage OS configuration, kernel parameters, driver versions, and firmware updates at scale.
  • Tune Linux systems for AI workloads: kernel parameters, NUMA topology, CPU pinning, hugepages, I/O schedulers, and GPU driver stack optimization (NVIDIA drivers, CUDA, container runtimes).
  • Manage and optimize distributed and local storage systems supporting model weights, checkpoints, and ephemeral scratch: NVMe arrays, NFS, parallel file systems, and object storage.
  • Implement and enforce OS-level security: hardening baselines, SELinux/AppArmor policies, SSH key management, vulnerability scanning, and compliance automation.
  • Own system observability: deploy and maintain node-level metrics collection, log aggregation, and alerting using Prometheus, node_exporter, Loki, and Grafana.
  • Collaborate with the Compute platform team to ensure smooth integration between our infrastructure layer (K8s, Nomad, FluxCD) and the underlying Linux hosts.
Requirements
  • 8+ years of experience administering Linux systems at scale, ideally in GPU cloud, HPC, or large bare-metal environments.
  • Deep expertise in Linux internals: systemd, kernel tuning (sysctl, cgroups, namespaces), boot process, package management, and performance profiling (perf, bpftrace, sar).
  • Strong experience with configuration management and infrastructure-as-code: Ansible, Terraform, cloud-init, PXE/iPXE, and custom imaging pipelines.
  • Solid understanding of storage technologies: LVM, RAID, NVMe, NFS, Lustre or GPFS, and Linux I/O stack tuning.
  • Familiarity with the NVIDIA GPU software stack: drivers, CUDA toolkit, nvidia-smi, MIG, and container runtimes (nvidia-container-toolkit).
  • Proficiency in Python and Bash scripting for automation, monitoring, and fleet management tooling.
  • Excellent communication and a self-starter mindset: you take ownership and constantly seek improvement.
Nice to Have
  • Experience operating Kubernetes on bare metal (kubeadm, Kubespray) and managing GPU scheduling in K8s (device plugins, MIG slicing).
  • Hands-on experience with BMC/IPMI/Redfish for out-of-band server management and firmware lifecycle automation.
  • Familiarity with fleet-scale observability: Prometheus federation, Thanos, or VictoriaMetrics for multi-cluster monitoring.
  • Contributions to open-source infrastructure tooling or Linux distributions.
  • Experience with compliance frameworks relevant to cloud providers (SOC 2, ISO 27001).
What we offer at fal
  • Interesting and challenging work
  • Competitive salary and equity
  • Many learning and growth opportunities
  • Visa sponsorship and help relocating to San Francisco
  • Health, dental, and vision insurance (US)
  • Regular team events and offsites
Location
  • Remote

Top Skills

Ansible
AppArmor
Bash
CUDA
GPU
Grafana
Kubernetes
Linux
NFS
NVMe
Prometheus
Python
RAID
SELinux
Terraform


