Blitzy

Senior Site Reliability Engineer

Posted 3 Hours Ago

In-Office

Cambridge, MA, USA

160K-200K Annually

Senior level

The Senior Site Reliability Engineer at Blitzy will ensure system reliability, scalability, and operational excellence for an AI software platform through infrastructure design, CI/CD pipeline management, and observability tooling.

About Blitzy

Blitzy is a Cambridge, MA-based AI software development platform on a mission to revolutionize the software development life cycle by autonomously building custom software to unlock the next industrial revolution. We're transforming how enterprises build software, turning enterprise requirements into production-ready code with an agentic software development platform that can autonomously execute 80% of software development work. We're backed by multiple tier 1 investors, and our founders have a proven record of success at previous start-ups.

Location: Cambridge, MA — Kendall Square HQ (In-Office)

Salary: $160,000 - $200,000

The Role

As a Senior Site Reliability Engineer at Blitzy's Kendall Square headquarters, you will be a foundational force behind the reliability, scalability, and operational excellence of our AI-powered software development platform. Sitting at the intersection of software engineering and infrastructure, you'll ensure that the systems enabling enterprise customers to autonomously build production-ready software remain performant, resilient, and always available. This is a high-ownership, high-impact role for an engineer who operates with urgency, thinks in systems, and takes pride in building infrastructure that doesn't break.

What Success Looks Like

Blitzy's platform maintains industry-leading uptime — incidents are rare, and when they occur, they are resolved quickly with clear root cause analysis and lasting fixes.
SLOs and error budgets are defined for every critical service and actively used to drive engineering decisions, not just tracked passively.
Observability is a first-class capability — engineers across the company have the dashboards, traces, and alerts they need to understand system behavior without asking SRE.
Deployment pipelines are fast, safe, and reliable — releases go out with confidence and rollbacks are automated when something goes wrong.
Infrastructure is entirely codified — no manual provisioning, no configuration drift, every environment reproducible from source.
Engineering teams are more productive because of your work — platform friction is low, developer tooling is sharp, and SRE is seen as an accelerant, not a gatekeeper.
You are a trusted technical leader at HQ, influencing how Blitzy thinks about reliability as we scale our platform and our team.

Areas of Ownership

Design, build, and operate highly available, fault-tolerant infrastructure across cloud environments supporting Blitzy's AI platform and enterprise customers.
Define and own SLOs, SLAs, and error budgets for critical services; lead blameless postmortems and drive systemic improvements that prevent recurrence.
Build and maintain robust CI/CD pipelines, release automation, and deployment infrastructure that empower engineers to ship with speed and safety.
Own the full observability stack — logging, metrics, distributed tracing, and alerting (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).
Manage Kubernetes clusters and container infrastructure supporting AI agent workloads and production application services.
Drive infrastructure-as-code practices using Terraform; ensure all provisioning is automated, auditable, and version-controlled.
Partner with engineering teams at HQ to embed reliability and operational best practices early in the development lifecycle.
Lead capacity planning, performance benchmarking, and cloud cost optimization as the platform scales.

Required Experience

5–8 years of experience in Site Reliability Engineering, DevOps, or Platform Engineering.
Deep expertise in Kubernetes — cluster management, workload deployment, scaling strategies, and troubleshooting in production.
Strong proficiency with at least one major cloud platform (AWS preferred); experience designing and operating distributed, high-availability systems.
Hands-on Terraform experience for infrastructure-as-code provisioning and management.
Proven ability to define and operationalize SLOs, SLAs, and incident response processes.
Strong scripting and automation skills in Python, Go, or Bash.
Experience designing and maintaining comprehensive observability systems across complex, multi-service environments.
Excellent cross-functional communication skills — able to partner with software engineers, product teams, and leadership equally well.

What Makes You Stand Out

Experience operating infrastructure for AI or ML workloads, including GPU scheduling or model serving infrastructure.
Familiarity with MLOps tooling (MLflow, Kubeflow, or similar) and the operational challenges unique to AI-driven services.
Knowledge of service mesh technologies (Istio, Linkerd) and advanced networking patterns.
CKA (Certified Kubernetes Administrator) certification or equivalent demonstrated expertise.
Prior experience at a high-growth startup where you built reliability foundations from the ground up.
A track record of influencing engineering culture — not just fixing infrastructure, but raising the bar for how teams think about reliability.

What Makes This Role Different

Most SRE roles have you defending the status quo. At Blitzy, you're building reliability infrastructure for a platform that is actively rewriting how enterprises create software — there is no playbook, and that's the point. You'll be based at our Kendall Square headquarters, working daily alongside our co-founders and core engineering team, with direct influence over how we architect and operate systems at the frontier of AI. You'll receive meaningful equity, giving you real ownership in a company that is defining a new category. If you want to do the most consequential infrastructure work of your career, this is the role.

Our Culture

Who we are:

Led by two pioneering co-founders, we are one of the fastest-growing companies in the U.S., creating our own category of enterprise autonomous software development. We automate thousands of hours of software development for our customers, including many in the Fortune 500.

How we work:

We move Blitzy Fast: Time is both our company's and our clients' most precious asset. We move quickly and decisively to innovate internally and deliver exceptional software externally.

Championship Mindset: We operate like a professional sports team. We win as a team by holding ourselves and each other to high standards, collaborating in-person, and remaining focused on the mission.

Passion for Invention: We're pushing the frontier of what's possible, requiring constant innovation and iteration.

We Work for the Customer: We focus on delivering outsized value to the customers we work with and expanding those relationships into deep, meaningful partnerships.

We believe in being 'everyday athletes'—taking care of ourselves so we can bring our best minds to work. We promote great sleep, movement, and restorative activities for optimal mental performance. It makes for a happier and more productive team.

Blitzy is an equal opportunity employer committed to building a diverse and inclusive team. We believe different perspectives make us stronger.

Boston, Massachusetts, United States, 02215

1 Kendall Sq, Floor 2, Cambridge, Massachusetts, United States, 02139
