Rhino Federated Computing

Senior SRE

Posted 6 Days Ago
In-Office
Boston, MA
Senior level
The Senior SRE will enhance reliability of the Rhino FCP, promote automation, and manage incidents while ensuring system scalability and performance.
The summary above was generated by AI
About the role

The Senior Site Reliability Engineer will be responsible for engineering the reliability and resilience of Rhino’s Federated Computing Platform (Rhino FCP). This distributed infrastructure supports cutting-edge AI/ML research and development across highly regulated industries, including healthcare, finance, and life sciences, by enabling secure, privacy-preserving data collaboration around the world.

You will apply software engineering discipline to operations, focusing on the production environment across a fleet of installations deployed behind the firewalls of partner organizations and our centralized cloud orchestration layer. You'll help define and monitor the Service Level Objectives (SLOs) for the platform’s core services, and you'll collaborate closely with backend and DevOps engineers to integrate reliability directly into the FCP’s architecture.

This role involves proactive ownership of production risk: from defining reliability metrics and reducing operational toil to designing failure-resilient systems and leading blameless incident response. It’s ideal for someone who thrives on solving complex distributed systems problems, views automation as a primary engineering function, and is excited to drive guaranteed stability in secure, distributed AI platforms.


Key Responsibilities
  • Help Define and Monitor Reliability Standards: Define, monitor, and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for core platform services.
  • Toil Reduction and Automation: Systematically identify, prioritize, and automate repetitive operational work to eliminate manual tasks and improve system predictability and consistency.
  • Technical Infrastructure and Installation Support: Support internal stakeholders and work with customers on technical issues related to platform installation and platform infrastructure, leveraging learnings from support issues to drive further improvements in processes and automation.
  • Capacity and Performance Optimization: Conduct load testing and capacity planning to predict and address scaling bottlenecks, specifically optimizing the performance and resource utilization of the federated AI/ML computing clusters.
  • Production Readiness Engineering: Partner with development teams from the design phase (shifting left) to ensure new infrastructure and core components are inherently scalable, observable, and failure-resilient before they reach production.
  • Incident and Emergency Response: Be the escalation point for critical incidents, lead the swift restoration of service, and drive blameless retrospectives and corrective action follow-ups.

About the candidate

Candidates should have 5+ years of professional experience, including a mix of the following:

  • 5+ years of experience in SRE roles utilizing cloud platforms (AWS, GCP, and/or Azure).
  • 5+ years of experience with Linux.
  • 5+ years of experience with Bash/Python.
  • 3+ years of experience working with IaC and CM tools (Terragrunt, Ansible).
  • 3+ years of experience designing and developing infrastructure components with Kubernetes.
  • 5+ years of experience implementing and maintaining observability solutions (Prometheus/VictoriaMetrics, Grafana, etc.) and reporting on SLIs and SLOs.
  • Deep understanding of networking, particularly in complex, high-security contexts: mTLS, gRPC, network policies within Kubernetes, overlay networks (e.g., WireGuard/VPNs) for distributed client deployments, and isolation in confidential computing environments.
  • Experience working in a startup environment.
  • Advantage: experience with AI/ML-based products or platforms.
  • Advantage: experience with distributed systems.
  • Advantage: experience with products focused on data security and privacy (e.g., PII data protection).
  • The role is open to candidates who are based in Boston, MA (hybrid work environment).

Top Skills

Ansible
AWS
Azure
Bash
GCP
Grafana
gRPC
Kubernetes
Linux
mTLS
Prometheus
Python
Terragrunt
VictoriaMetrics
VPNs
HQ

Rhino Federated Computing Boston, Massachusetts, USA Office

22 Boston Wharf Rd, Boston, MA 02210, United States
