Staff Site Reliability Engineer

StackBlitz

Posted Yesterday

Remote

Hiring Remotely in USA

Senior level

Lead reliability practices across teams: embed early in projects, define SLIs/SLOs, build multi-cloud paved roads with Terraform, run on-call, drive org-wide incident maturity and tooling.

The summary above was generated by AI

🚀 About Us

We’re Bolt.new by StackBlitz!

We’re the team that brought you WebContainers, the first-of-its-kind technology that made it possible to run Node.js right inside your browser. That breakthrough kicked off our journey in 2019, and it’s what powers the blazing-fast online IDE used by over 1 million developers every month.

But we didn’t stop there. We doubled down on everything we learned and built Bolt.new — the fastest way to go from idea to production without writing traditional code. It’s a next-gen, AI-powered app builder that helps you create, edit, and deploy full-stack web and mobile apps instantly, right in your browser. No installs. No setup. Just smart automation and instant dev environments that let you move at the speed of thought.

We’re a fully remote team, globally distributed, deeply collaborative, and seriously passionate about building the future of software development.

This is your chance to join a small team with a big vision. If you love shipping fast, solving real problems, and pushing the boundaries of what’s possible, we’d love to meet you.

✨ About This Opportunity

As a Staff Site Reliability Engineer, you'll be the reliability conscience of our engineering organization, embedding with product and platform teams from the earliest stages of a project, shaping designs, and making sure what we build is observable, scalable, and operable long before it reaches production. The heart of this role is making the pager ring less over time, but the pager is real. Every SRE here shares our on-call rotation, and sometimes the work genuinely is rolling up your sleeves and digging into a live incident.

You'll set technical direction, define the standards other engineers build against, and drive initiatives that span multiple teams. This is a high-influence individual-contributor role: you won't manage people, but you will change how the whole organization thinks about reliability. You'll respond to incidents and share the on-call rotation alongside the rest of the team, but your lasting impact is the incidents that never happen because reliability was designed in from the start, at the scale of millions of developers building real products on Bolt.new every day.

🛠️ How You'll Contribute

Embed With Teams Early: Partner with development teams throughout the project lifecycle, from design and architecture reviews through launch readiness. Bring an SRE perspective before code is written, not after it breaks, and shepherd projects to completion with reliability designed in.
Define Production-Readiness Standards: Establish and evolve the design reviews, launch checklists, and operational acceptance criteria that projects pass through, and own how teams adopt them across the org.
Make Reliability Measurable: Define meaningful SLIs, SLOs, and error budgets in collaboration with product and engineering, and help teams use them to make real prioritization decisions.
Build the Paved Roads: Create the frameworks, tooling, and golden paths across AWS, GCP, and Azure, with Terraform as the common backbone, that make the reliable way the easy way for every engineer.
Cross-Team Leadership: Partner across engineering, product, and design to align reliability work with business objectives. Influence roadmaps, resolve technical disagreements, identify process and technical debt across the organization, and propose solutions that accelerate velocity for multiple teams. Mentor senior and mid-level engineers, raising the bar for operational excellence everywhere.
Mature Our Incident Practice: Lead by influence on incident management and blameless postmortems, turning failure modes and operational signals into systematic, durable improvements.
Represent Us Externally: Build relationships with our cloud and infrastructure provider teams to influence roadmaps and unlock early access to new capabilities, and represent StackBlitz in customer trust conversations and the broader reliability community.
On-Call Rotation: Every SRE shares our on-call rotation, currently one week per month.

💡 Qualifications

Multi-Cloud Fluency: General fluency across AWS, GCP, and Azure matters more to us than deep specialization in any one; we run across all three. Terraform is our common infrastructure-as-code layer everywhere.
Our Stack: Comfort supporting and contributing to TypeScript (frontend and backend) and Ruby on Rails (backend) services. We're opinionated about our stack, and you'll work alongside it daily.
SRE / Production Engineering Experience: Significant experience as an SRE, production/platform engineer, or software engineer with a deep reliability focus, including time operating at scale.
Software Engineering Excellence: Strong software engineering fundamentals; you write production-quality code and can go deep with the teams you partner with, balancing immediate needs against long-term maintainability.
Technical Leadership & Influence: A track record of changing how teams work, not just how systems run, leading across team boundaries without formal authority.
Strategic Execution: Ability to take ambiguous, high-scope problems and drive them to completion with minimal oversight.
Systems Thinking: Ability to identify process, communication, and technical debt across the organization and propose solutions that accelerate velocity for multiple teams.
Data-Driven Leadership: Experience building measurement and evaluation frameworks, identifying patterns in operational data, and translating findings into organizational improvements.
Strong verbal and written English communication skills are required, as this role involves frequent collaboration with team members, stakeholders, customers, and external audiences where English is the primary working language.

🎯 Bonus Points

Experience standing up or maturing an SRE practice at a growth-stage company.
Background working as an embedded SRE or partnering closely with product teams.
Experience designing chaos/resilience testing or progressive delivery practices.

📌 A Few Notes

You do not need a college degree to apply.
You do not need to be located in the U.S.; we’re remote-friendly.
You do not need to meet every qualification listed above.
