Lytx

Sr. Staff SRE

Posted Yesterday

In-Office or Remote

Hiring Remotely in San Diego, CA

207K-261K Annually

Senior level

Lead technical strategy for observability, operational intelligence, and reliability. Architect telemetry and automation platforms, drive AIOps and large-scale IaC, lead incident response, mentor senior engineers, and standardize SLO/SLI and reliability practices across AWS cloud-native environments.

The summary above was generated by AI

Why Lytx:

At Lytx, our engineering culture is built around being hungry, low-ego, and highly capable. We are pragmatic engineers who take ownership, collaborate openly, and focus on delivering measurable operational impact. Our mission is to design, operate, and continuously evolve the cloud infrastructure and operational platforms that power mission-critical SaaS and IoT services at global scale.

We are growing rapidly and expanding the use of AI across our platform and engineering operations. As our systems scale in complexity and business criticality, we are investing in next-generation observability, intelligent automation, and AIOps capabilities to enable proactive, insight-driven operations.

The Site Reliability Engineering (SRE) organization is responsible for the availability, reliability, observability, and resilience of our cloud-native environments. This includes building the operational platforms, telemetry strategy, and automation frameworks that allow engineering teams to operate confidently and efficiently.

This role sits at the center of operational intelligence for the company. As a Sr. Staff SRE, you will define the technical vision for observability and operational automation, influence architecture across the organization, and lead initiatives that reduce operational risk, improve system insight, and enable predictive, automated response at scale.

If you enjoy building foundational platforms, shaping engineering standards, and driving the evolution toward AI-enabled operations, this role provides an opportunity to have broad organizational impact.

Responsibilities / You’ll Get To:

Strategic Technical Leadership: Define and drive the long-term strategy for observability, operational intelligence, and reliability engineering across the organization, aligning technical direction with business growth, customer experience, and service-level objectives.

Operational Intelligence & AIOps: Lead the evolution toward intelligent operations by designing capabilities such as event correlation, anomaly detection, alert noise reduction, predictive signal detection, and automated remediation to improve MTTD, MTTR, and operational efficiency.

Observability Platform Architecture: Architect and lead the end-to-end observability platform across metrics, logs, traces, and events. Establish scalable telemetry standards, instrumentation patterns, and onboarding models that enable consistent visibility across AWS and cloud-native services.

Automation at Scale: Drive large-scale automation initiatives that reduce operational toil, including self-service infrastructure workflows, policy-as-code guardrails, reliability automation, and automated response for common failure scenarios.

Reliability & Resilience Engineering: Partner with product, platform, and data teams to embed reliability, performance, cost efficiency, and fault tolerance into system design. Lead capacity modeling, resilience planning, and architecture improvements for multi-AZ and multi-region environments.

Incident Leadership & Continuous Learning: Provide technical leadership during high-severity incidents and guide blameless postmortems that identify systemic issues and drive long-term reliability improvements.

Organizational Standards & Governance: Define and standardize SLO/SLI frameworks, error budget practices, telemetry conventions, and infrastructure patterns to ensure consistent operational excellence across teams.

Innovation & Technology Evaluation: Evaluate and introduce emerging AWS-native, cloud-native, and AI-enabled observability and automation technologies. Lead proofs-of-concept and guide organization-wide adoption.

Mentorship & Influence: Mentor Staff and Senior SREs, raising the bar for system design, operational rigor, and engineering judgment while fostering a culture of ownership, learning, and continuous improvement.

Cross-Organizational Influence: Act as a senior technical authority for reliability and observability, shaping engineering roadmaps and influencing architectural decisions across product and platform domains.

Requirements / You’ll Need:

8–10+ years of experience in SRE, platform engineering, or cloud infrastructure roles supporting large-scale production environments.

Demonstrated experience leading architecture, reliability strategy, or operational platforms across multiple teams or organizational domains.

Proven track record operating in 24/7 production environments, including incident leadership, postmortem practices, and proactive reliability management.

Cloud & Architecture

Deep expertise designing and operating large-scale AWS environments, including services such as VPC, EC2, EKS/ECS, RDS/DynamoDB, S3, ALB/NLB, IAM, KMS, Route 53, and multi-account architectures.

Experience designing resilient, fault-tolerant systems using multi-AZ/multi-region patterns, graceful degradation, rate limiting, and capacity management.

Observability & Operational Intelligence

Senior-level experience with observability platforms (metrics, logs, traces, events) such as New Relic, Datadog, Prometheus/Grafana, OpenTelemetry, or similar.

Experience defining telemetry standards, instrumentation strategies, centralized dashboards, and low-noise alerting practices.

Experience improving operational signal quality through correlation, noise reduction, or advanced analytics.

AIOps / Intelligent Automation (Preferred)

Experience implementing or evaluating AIOps capabilities such as anomaly detection, event correlation, predictive alerting, automated remediation, or AI-assisted incident analysis.

Familiarity with applying machine learning or AI techniques to operational data, incident trends, or reliability workflows.

Automation & Infrastructure as Code

Expert-level experience with Infrastructure-as-Code using Terraform and/or CloudFormation, including reusable modules, GitOps workflows, and policy-as-code guardrails.

Strong scripting or programming skills (Python, Go, Bash, or similar) for automation and operational tooling.

Systems & Platform Expertise

Expert understanding of Linux systems, networking (TCP/IP, DNS, TLS), and distributed system behavior.

Expertise with Kubernetes and cloud-native architecture patterns.

Leadership & Impact

Demonstrated ability to influence technical direction without direct authority.

Experience mentoring senior engineers and setting organization-wide engineering standards.

Ability to operate effectively in complex, high-impact environments and drive initiatives from concept through adoption.

Benefits:

Medical, dental and vision insurance
Health Savings Account
Flexible Spending Accounts
Telehealth
401(k) and 401(k) match
Life and AD&D insurance
Short-Term and Long-Term Disability
FTO or PTO
Employee Well-Being program
11 paid holidays plus 1 inclusive holiday per year
Volunteer Time Off
Employee Referral program
Education Reimbursement Program
Employee Recognition and Appreciation program
Additional perk and voluntary benefit programs

Salary is based on a number of factors including market location and may vary depending on job-related knowledge, skills, and experience. This position is also eligible for an incentive compensation plan. The expected hiring salary for this position is:

$207,000.00 - $261,000.00

Innovation Lives Here

You go all in no matter what you do, and so do we. At Lytx, we’re powered by cutting-edge technology and Happy People. You want your work to make a positive impact in the world, and that’s what we do. Join our diverse team of hungry, humble and capable people united to make a difference.

Together, we help save lives on our roadways!

Lytx, Inc. is proud to be an equal opportunity employer. We’re committed to building a diverse and inclusive workforce and do not discriminate based on race, color, religion, sex, sexual orientation, gender identity or expression, gender, genetic information, uniformed service, national origin, age, veteran status, disability, pregnancy, or any other status protected by federal or state law. We are committed to providing reasonable accommodation for candidates with disabilities who need assistance during the hiring process. To request a reasonable accommodation, please email [email protected]. Lytx conducts background checks on applicants who receive a conditional offer of employment in accordance with applicable local, state, federal and regional laws. Qualified applicants with arrest or conviction records will be considered. Background check results may potentially result in the withdrawal of a conditional offer of employment and will be made in accordance with all applicable local, state, federal and regional laws.

Top Skills

ALB
AWS
Bash
CloudFormation
Datadog
DNS
DynamoDB
EC2
ECS
EKS
GitOps
Grafana
IAM
KMS
Kubernetes
Linux
Multi-account architectures
New Relic
NLB
OpenTelemetry
Policy-as-code
Prometheus
Python
RDS
Route 53
TCP/IP
Terraform
TLS
VPC

492 Old Connecticut Path, 601, Framingham, MA, United States, 01701

What you need to know about the Boston Tech Scene

Boston is a powerhouse for technology innovation thanks to world-class research universities like MIT and Harvard and a robust pipeline of venture capital investment. Host to the first telephone call and one of the first general-purpose computers ever put into use, Boston is now a hub for biotechnology, robotics and artificial intelligence — though it’s also home to several B2B software giants. So it’s no surprise that the city consistently ranks among the greatest startup ecosystems in the world.

Key Facts About Boston Tech

Number of Tech Workers: 269,000; 9.4% of overall workforce (2024 CompTIA survey)
Major Tech Employers: Thermo Fisher Scientific, Toast, Klaviyo, HubSpot, DraftKings
Key Industries: Artificial intelligence, biotechnology, robotics, software, aerospace
Funding Landscape: $15.7 billion in venture capital funding in 2024 (Pitchbook)
Notable Investors: Summit Partners, Volition Capital, Bain Capital Ventures, MassVentures, Highland Capital Partners
Research Centers and Universities: MIT, Harvard University, Boston College, Tufts University, Boston University, Northeastern University, Smithsonian Astrophysical Observatory, National Bureau of Economic Research, Broad Institute, Lowell Center for Space Science & Technology, National Emerging Infectious Diseases Laboratories
