Senior Site Reliability Engineer

Blitzy

Reposted 21 Days Ago

In-Office

Cambridge, MA, USA

160K-180K Annually

Senior level

Lead design, build, and operation of scalable, fault-tolerant cloud infrastructure. Define SLOs/SLAs, improve observability and incident response, own CI/CD and deployment automation, partner with engineering teams on reliability, capacity planning, performance benchmarking, cost optimization, and security for an AI platform.

The summary above was generated by AI

About Blitzy

Blitzy is a Cambridge, MA-based AI software development platform on a mission to revolutionize the software development life cycle by autonomously building custom software to unlock the next industrial revolution. We're transforming how enterprises build software, turning enterprise requirements into production-ready code with an agentic software development platform that can autonomously execute 80% of software development work. We're backed by multiple tier 1 investors, and our founders have proven success at previous start-ups.

Location: Cambridge, MA (In-Office)

Compensation: $160,000 - $180,000 + equity eligibility based on performance

The Role

As a Senior Site Reliability Engineer at Blitzy's Cambridge headquarters, you will be the backbone of our platform's reliability, scalability, and operational excellence. You'll work at the intersection of software engineering and infrastructure, ensuring our AI-powered development platform remains highly available and performant as we scale rapidly. This is a high-impact, hands-on role for an engineer who thrives in a fast-moving environment and takes deep ownership of the systems they build.

What Success Looks Like

In 30 days: You have a deep understanding of Blitzy's infrastructure architecture, have identified key reliability risks, and are actively contributing to on-call rotations.
In 90 days: You have shipped meaningful improvements to observability, incident response workflows, and deployment pipelines that measurably reduce MTTR and increase system uptime.
In 6 months: You have driven at least one major reliability initiative from inception to production, established SLO/SLA frameworks for critical services, and are a trusted technical voice shaping our infrastructure roadmap.

Areas of Ownership

Design, build, and operate scalable, fault-tolerant infrastructure across cloud environments (AWS, GCP, or Azure).
Define and enforce SLOs, SLAs, and error budgets; lead blameless postmortems and drive systemic improvements.
Build and maintain robust CI/CD pipelines, release automation, and deployment infrastructure.
Own observability: design and maintain logging, metrics, tracing, and alerting stacks (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).
Partner closely with software engineering teams to embed reliability practices into the development lifecycle.
Drive capacity planning, performance benchmarking, and cost optimization across our infrastructure.
Champion security best practices within the infrastructure and deployment layers.

Required Experience

5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
Strong proficiency in at least one major cloud platform (AWS preferred); experience with Kubernetes and container orchestration at scale.
Hands-on experience with infrastructure-as-code tools (Terraform, Pulumi, or equivalent).
Proven track record designing and maintaining high-availability, distributed systems.
Deep expertise in observability tooling, incident management, and on-call practices.
Strong scripting and automation skills (Python, Go, Bash, or similar).
Excellent communication skills with the ability to collaborate across engineering teams and present technical findings to leadership.

What Makes You Stand Out

Experience supporting AI/ML workloads or GPU-accelerated infrastructure.
Prior experience in a high-growth startup environment where you wore multiple hats.
Familiarity with eBPF, service mesh technologies (Istio, Linkerd), or advanced networking.
Contributions to open-source SRE/DevOps tooling or communities.
Experience building global, multi-region infrastructure with strict latency and availability requirements.

What Makes This Role Different

You won't be maintaining legacy systems or fighting fires in a sprawling monolith. At Blitzy, you're building reliability into a greenfield AI platform that is redefining how the world creates software. You'll have direct influence over architectural decisions, work side-by-side with world-class engineers, and see the tangible impact of your work as we scale to serve Fortune 500 customers. As a founding member of the SRE team, you'll help shape the culture and technical standards of a team that will grow with the company.

Our Culture

Who we are:

Led by two pioneering co-founders, we are one of the fastest-growing companies in the U.S., creating our own category of enterprise autonomous software development. We automate thousands of hours of software development for a customer base that includes strong representation within the Fortune 500.

How we work:

We move Blitzy Fast: Time is both our company's and our clients' most precious asset. We move quickly and decisively to innovate internally and deliver exceptional software externally.

Championship Mindset: We operate like a professional sports team. We win as a team by holding ourselves and each other to high standards, collaborating in-person, and remaining focused on the mission.

Passion for Invention: We're pushing the frontier of what's possible, requiring constant innovation and iteration.

We Work for the Customer: We focus on delivering outsized value to the customers we work with and expanding those relationships into deep, meaningful partnerships.

We believe in being 'everyday athletes'—taking care of ourselves so we can bring our best minds to work. We promote great sleep, movement, and restorative activities for optimal mental performance. It makes for a happier and more productive team.

Blitzy is an equal opportunity employer committed to building a diverse and inclusive team. We believe different perspectives make us stronger.

Similar Jobs

Tulip

Senior Site Reliability Engineer

10 Days Ago

Hybrid

Somerville, MA, USA

160K-200K Annually

Senior level

Enterprise Web • Hardware • Internet of Things • Software

Lead observability and reliability efforts: mentor teams on SLIs/SLOs, maintain triage/remediation workflows, perform incident response, debug production systems, and design core infrastructure and tooling for engineering teams.

Top Skills: Alloy, Claude Skills, Gemini Gems, Go, Grafana, Kubernetes, Loki, Mimir, MongoDB, OpenTelemetry, Postgres, Prometheus, PromQL, Tempo, TypeScript

DFIN

Senior Site Reliability Engineer

10 Days Ago

Remote or Hybrid

United States

Senior level

Fintech • Software

Lead SRE efforts for DFIN SaaS: ensure availability, performance, scalability, and automation. Implement monitoring, CI/CD, IaC, container orchestration, AI-enhanced observability, incident response, RCA, and runbook automation while collaborating across engineering teams.

Top Skills: .NET, AIOps, AKS, Ansible, AppDynamics, AWS, Azure, Azure DevOps, Bash, C#, CI/CD, Cloud AI Services, Containers, Cosmos, Datadog, Dynatrace, EKS, Firewall, Harness, Idera SQL Diagnostic Manager, Infrastructure as Code (IaC), Java, Jenkins, Kubernetes, Linux, Load Balancing, New Relic, PowerShell, Python, Redgate SQL Monitor, SolarWinds Database Performance Analyzer, SQL, Terraform, Windows

DraftKings

Senior Site Reliability Engineer

4 Days Ago

Hybrid

Boston, MA, USA

128K-160K Annually

Senior level

Digital Media • Gaming • Information Technology • Software • Sports • Esports • Big Data Analytics

Lead the design, automation, and scaling of global compute infrastructure across data centers, cloud, and on-prem. Operate GitOps with Rancher Fleet/Flux/Helm, build self-healing tooling, own cluster autoscaling and capacity strategy, define SLOs using Datadog, and participate in on-call rotation while mentoring peers.

Top Skills: AWS, containerd, Datadog, Docker, Flux, GCP, GitOps, Go, Helm, HPA, Infrastructure as Code (IaC), Karpenter, KEDA, Kubernetes, Linux, Nutanix, Python, Rancher Fleet, vSphere

What you need to know about the Boston Tech Scene

Boston is a powerhouse for technology innovation thanks to world-class research universities like MIT and Harvard and a robust pipeline of venture capital investment. Host to the first telephone call and one of the first general-purpose computers ever put into use, Boston is now a hub for biotechnology, robotics and artificial intelligence — though it’s also home to several B2B software giants. So it’s no surprise that the city consistently ranks among the greatest startup ecosystems in the world.

Key Facts About Boston Tech

Number of Tech Workers: 269,000; 9.4% of overall workforce (2024 CompTIA survey)
Major Tech Employers: Thermo Fisher Scientific, Toast, Klaviyo, HubSpot, DraftKings
Key Industries: Artificial intelligence, biotechnology, robotics, software, aerospace
Funding Landscape: $15.7 billion in venture capital funding in 2024 (Pitchbook)
Notable Investors: Summit Partners, Volition Capital, Bain Capital Ventures, MassVentures, Highland Capital Partners
Research Centers and Universities: MIT, Harvard University, Boston College, Tufts University, Boston University, Northeastern University, Smithsonian Astrophysical Observatory, National Bureau of Economic Research, Broad Institute, Lowell Center for Space Science & Technology, National Emerging Infectious Diseases Laboratories