Design and operate large-scale networks for GPU workloads, automate network infrastructure, ensure network performance, and troubleshoot issues.
You are a seasoned networking engineer who has designed, deployed, and operated high-performance networks at scale. At fal, our platform orchestrates AI inference workloads across thousands of GPUs spread over multiple data centers and cloud providers. You will own the network layer that ties it all together—ensuring that model traffic, storage I/O, and control-plane communication are fast, reliable, and secure. You think in terms of packets per second, tail latency, and fabric utilization, and you automate everything you touch.
Key Responsibilities
- Design, build, and operate the network fabric that interconnects our GPU fleet, including spine-leaf architectures, RDMA/RoCEv2 networks for distributed inference, and overlay networks for tenant isolation.
- Own L2/L3 network design across bare-metal and cloud environments, including BGP peering, ECMP, VXLAN/EVPN, and high-bandwidth interconnects between data centers.
- Develop and maintain network automation using Ansible, Terraform, and custom tooling to provision, configure, and validate switches, routers, DPUs, and SmartNICs at scale.
- Instrument deep network observability—build dashboards, alerting, and anomaly detection across our fabric using Prometheus, Grafana, and packet-level telemetry.
- Partner with the Compute and ML Performance teams to tune network paths for AI workloads, minimizing latency for model serving and maximizing throughput for large tensor transfers.
- Drive incident response and root-cause analysis for network-related production issues and build automation to prevent recurrence.
- Evaluate and qualify new networking hardware and software—NICs, switches, DPUs, SONiC, Cumulus, and similar—as we scale to next-generation GPU clusters.
Qualifications
- 8+ years of experience building and operating large-scale networks, ideally in GPU cloud, HPC, or hyperscale environments.
- Deep expertise in Linux networking internals: kernel networking stack, iptables/nftables, tc, eBPF, network namespaces, bonding/teaming, and SR-IOV.
- Strong command of routing and switching protocols: BGP, OSPF, ECMP, VXLAN, EVPN, MPLS, and segment routing.
- Hands-on experience with high-performance networking for AI/ML: RDMA, RoCEv2, InfiniBand, GPUDirect, and NCCL tuning.
- Proficiency automating network infrastructure with Ansible, Python, Go, and Git.
- Experience with network-as-code workflows.
- Familiarity with modern network operating systems such as SONiC, Cumulus Linux, Arista EOS, or Nokia SR Linux.
- Experience with network observability stacks: Prometheus, Grafana, sFlow/NetFlow, and packet capture tools.
- Experience with DPU/SmartNIC programming (NVIDIA BlueField, AMD Pensando) and SDN/NFV architectures.
- Contributions to open-source networking projects (SONiC, FRR, DPDK, eBPF/XDP).
- Experience operating networks that support Kubernetes and container-native workloads (Calico, Cilium, MetalLB).
- Familiarity with data center physical-layer design, optics, and cabling at scale.
Benefits
- Interesting and challenging work
- Competitive salary and equity
- Ample opportunities for learning and growth
- We offer visa sponsorship and will help you relocate to San Francisco.
- Health, dental, and vision insurance (US)
- Regular team events and offsites
Remote
Top Skills
Ansible
Arista EOS
Cumulus Linux
eBPF
EVPN
Git
Go
GPUDirect
Grafana
InfiniBand
Linux
MPLS
Network Observability
Prometheus
Python
RDMA
RoCEv2
SONiC
Terraform
VXLAN