Fabric Health

Staff Site Reliability Engineer

Posted Yesterday

Remote

Hiring Remotely in USA

140K-170K Annually

Senior level

The Staff Site Reliability Engineer will architect and manage AWS and Kubernetes infrastructure, focusing on automation, observability, and compliance in healthcare operations.

About Fabric Health

At Fabric Health, we are powering boundless care by solving healthcare’s biggest challenge: clinical capacity. We aren’t here to disrupt healthcare; we’re here to fix it. We unify the care journey from intake to treatment, using intelligent automation to remove administrative burdens and make care delivery 2-10x more efficient. Our technology empowers clinicians to move faster and focus on what matters most: the patient.

We are a mission-driven team of brilliant minds trusted by leading organizations including Intermountain Health, OSF HealthCare, SSM Health, and MUSC Health. Our vision is backed by premier investors such as Thrive Capital, GV (Google Ventures), General Catalyst, and Salesforce Ventures. We move quickly for good reason, listen deeply to solve big challenges, and build products with the same care and quality we’d want for our own loved ones.

About the Role

As a Staff Site Reliability Engineer, you will own and evolve the infrastructure powering healthcare experiences for millions of patients. This role bridges the gap between traditional infrastructure excellence and the future of AI-driven operations. You will act as a primary architect for our AWS and Kubernetes (EKS) environment, ensuring the platform is resilient, scalable, and compliant while exploring how agentic workflows can modernize SRE practices.

What You'll Do

As a Staff Site Reliability Engineer, you will be a steward of Fabric’s production integrity, leading the strategy for infrastructure automation, observability, and system resilience. Your primary responsibilities include:

Infrastructure & Kubernetes Orchestration
- Designing, deploying, and maintaining production Kubernetes (EKS) clusters to ensure enterprise-grade availability for our users.
- Eliminating manual configuration by building and managing a scalable infrastructure state entirely through Terraform.
- Optimizing the AWS footprint (specifically EC2, RDS, and S3) to balance high performance with cost-efficiency and reliability.

AI-Assisted Operations & Automation
- Exploring and deploying agentic workflows for AI-assisted runbooks that automate complex operational decisions and repetitive tasks.
- Building and evolving deployment pipelines using GitHub Actions or Semaphore to ensure delivery is both rapid and safe.
- Focusing on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems.

Observability & Incident Management
- Driving the evolution of the observability stack in Datadog by implementing the sophisticated metrics, traces, and logs needed to meet SLOs.
- Leading incident response efforts and facilitating the blameless postmortems that help systematically reduce mean time to recovery (MTTR).
- Defining and monitoring the SLIs and SLOs that ensure the platform consistently meets rigorous healthcare performance standards.

Compliance & Collaboration
- Ensuring every piece of infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements.
- Mentoring engineers across the company on reliability best practices and contributing a clinical-safety perspective to cross-functional design reviews.

Why You Might Be a Good Fit

- You are a deeply proficient engineer who excels at the intersection of cloud infrastructure, automation, and system design.
- You possess a meticulous approach to observability and a passion for finding the "root cause" rather than just applying a patch.
- You enjoy exploring the "next frontier" of SRE, including how AI and agentic tools can make operations more efficient.
- You thrive in fast-paced environments where technical rigor is balanced with pragmatism and clinical-grade safety.

This Might Not Be The Right Fit If...

- You prefer working on static infrastructure rather than evolving systems through code and automation.
- You are uncomfortable with the "agile" pace of tech-driven platform development or with integrating AI tools into your daily workflow.
- You prefer a siloed role that does not involve active participation in incident response or collaborative postmortems.

Your Qualifications

- 8+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale.
- Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management.
- Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems.
- Deep proficiency in coding and scripting with Python, Bash, Ruby, or Go.
- Preferred: experience building agentic workflows or AI-assisted tooling to drive operational efficiency.
- A "rigor-first" mindset with a dedication to HIPAA-compliant, high-availability architecture.

The national pay range for this role is $140,000.00 – $170,000.00 per year. Actual compensation will be determined by factors such as the candidate's geographic market, experience, skills, and qualifications. Certain roles may also be eligible for additional compensation, including a comprehensive benefits package (medical, dental, vision, unlimited PTO, and a 401(k) plan), stock options, and bonuses. If your compensation requirement is greater than our posted range, please still consider applying; a determination can be made based on unique qualifications. Expected compensation ranges for this role may change over time.

At Fabric, we believe that a diverse workforce is essential to our success. We are an equal opportunity employer and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, color, religion, sex, national origin, age, disability, veteran status, or any other legally protected characteristic. We actively encourage individuals from all backgrounds to apply.

Recruitment Fraud Alert: Protect Yourself

Fabric Health is aware of scammers attempting to impersonate employers. To ensure that any recruiting contact you receive is legitimate, please adhere to the following:

Verify the Domain: Official recruitment emails will only come from addresses ending in @fabrichealth.com or @gem.com. No other domain names are legitimate.
Official Interview Tools: We use Gem for our recruitment process and Google Meet for all video interviews. Google Meet is always the platform used for your first interview; you will never be sent a Zoom link to set up or conduct an initial interview. All interviews are conducted via video unless specifically stated by our team as an audio call. We never conduct interviews via chat, social media, Skype, or WhatsApp.
Zoom Usage: Zoom is utilized only for specific meetings set directly by our team for purposes outside of the standard interview process (e.g., coordination or onboarding discussions). It is never the first link you will receive from us.
Authorized Contact & Texting: Fabric will only contact you if you have submitted an application or if you are connected to a current employee who shared your information with us. We will only send text messages if you have provided explicit authorization and consent, either through your application or while communicating directly with our team. If you have not explicitly authorized us to reach out, treat any SMS or unsolicited outreach as fraudulent and do not respond.
Sensitive Data: We will never ask you for sensitive personal or financial documents (ID, banking info, SSN) during the application, interview, or candidacy stages. All sensitive data is handled through secure internal systems post-offer.
Verify the Team: You can reference LinkedIn to verify members of our recruiting team; however, please remain vigilant as scammers may create fraudulent profiles. Always cross-reference the sender's email domain with our official @fabrichealth.com address.

If you question the validity of a contact or receive a suspicious message, do not click any links. Report the issue immediately to [email protected].

Please note: The security inbox is for reporting fraudulent activity only. Do not email this address for application status updates or to share application materials, as these will not be reviewed. Applications are only accepted and reviewed if submitted through our official application portal, and no application status information will be provided via the security email.

Top Skills

AWS, Bash, CI/CD Systems, Datadog, EKS, GitHub Actions, Kubernetes, Python, Ruby, Semaphore, Terraform
