GRAIL is seeking a Staff Site Reliability / DevOps Engineer to lead the reliability, scalability, and security of our cloud-native platform. This role operates at the intersection of infrastructure engineering, platform strategy, and organizational leadership, supporting systems that power large-scale data processing and cutting-edge cancer detection technologies.
You will define and drive infrastructure standards across teams, represent reliability and performance in architecture decisions, and build systems that scale well beyond your direct ownership. This is a highly technical, high-impact role combining hands-on engineering with cross-functional influence and mentorship.
Flexible – MPK or RTP (3 days in office)
This is a hybrid role based in either Menlo Park, CA (moving to Sunnyvale, CA in Fall 2026) or Durham, NC. Our current flexible work arrangement policy requires that a minimum of 60%, or 24 hours, of your total work week be on-site. Your specific schedule, determined in collaboration with your manager, will align with team and business needs and could exceed the 60% requirement for the site.
Responsibilities
- Design, build, and operate highly available, fault-tolerant cloud infrastructure across AWS, GCP, and/or Azure
- Architect and maintain scalable CI/CD pipelines and deployment frameworks for enterprise-grade software delivery
- Lead infrastructure-as-code adoption and maturity using tools such as Terraform, CloudFormation, and Ansible
- Own Kubernetes reliability across multi-cluster environments, including upgrades, scaling, and workload lifecycle management
- Establish and evolve observability platforms (metrics, logs, traces) and define SLO/SLI frameworks across teams
- Lead incident response for critical outages, drive root cause analysis, and implement preventative improvements
- Optimize infrastructure for cost, performance, and scalability, partnering closely with engineering and finance stakeholders
- Define and enforce DevOps, reliability, and security best practices across the organization
- Partner cross-functionally with engineering, data, QA, security, and IT teams to design resilient systems
- Mentor engineers and contribute to technical leadership through design reviews, standards, and knowledge sharing
What Success Looks Like in Your First Year
- Conduct a comprehensive assessment of the current infrastructure, drive infrastructure-as-code adoption to 95%+ across critical systems, and establish clear health and reliability baselines for the Kubernetes platform
- Standardize observability using modern tooling and implement an SLO/SLI framework adopted across multiple product teams, including defined SLAs for critical data systems
- Strengthen security and compliance posture across cloud environments by implementing consistent baselines, launching a compliance-as-code framework, and reducing mean time to resolution (MTTR) for production incidents
- Define, document, and drive adoption of engineering standards, best practices, and operational guidelines across platform and product teams
- Develop and align stakeholders on a forward-looking platform reliability and infrastructure roadmap
- Demonstrate measurable mentorship and technical leadership impact across the engineering organization
- Evaluate and provide recommendations on emerging infrastructure needs, including support for AI/ML and advanced data workloads
The responsibilities above summarize the role’s primary duties and are not an exhaustive list. They may change at the company’s discretion.
Required Qualifications
- BS in Computer Science, Engineering, or related field, or equivalent experience
- 8+ years of experience in Site Reliability Engineering, DevOps, or platform engineering
- Strong hands-on experience with at least one major cloud platform (AWS, GCP, or Azure)
- Experience implementing infrastructure-as-code solutions (Terraform, CloudFormation, or similar)
- Experience designing and operating CI/CD pipelines (e.g., GitLab CI, GitHub Actions, Jenkins)
- Hands-on experience with Kubernetes and containerized systems in production environments
- Proficiency in scripting or programming for automation (e.g., Python, Go, Bash, or PowerShell)
- Experience with observability and monitoring tools (e.g., Prometheus, Grafana, OpenTelemetry, Datadog)
- Strong understanding of networking, security, and distributed systems fundamentals
- Experience working in regulated environments and familiarity with frameworks such as ISO 27001, NIST, SOC 2, or HIPAA
Preferred Qualifications
- 10+ years of experience in SRE, DevOps, or infrastructure engineering
- Experience operating multi-cluster Kubernetes environments (e.g., EKS, GKE) at scale
- Familiarity with GitOps practices (e.g., ArgoCD, Flux)
- Experience with data platforms and pipelines (e.g., Kafka, Airflow, Spark, Snowflake, BigQuery)
- Experience implementing SLO/SLI frameworks and reliability practices across multiple teams
- Strong background in cloud security, including IAM, zero-trust architecture, and secrets management
- Experience with compliance-as-code and security tooling (e.g., OPA, Snyk, Checkov)
- Exposure to AI/ML or large-scale data infrastructure workloads
- Experience in healthcare, biotech, or other regulated industries
- Relevant cloud or Kubernetes certifications (e.g., AWS DevOps, CKA/CKS, GCP DevOps)
Physical Demands and Working Environment
- Standard office environment with hybrid flexibility
- Participation in on-call rotation and after-hours support for critical systems may be required
- Frequent collaboration with cross-functional and senior stakeholders
- Fast-paced, dynamic environment with emphasis on reliability, scalability, and innovation
Adaptability and Growth Expectation
- Taking on additional technical or leadership responsibilities
- Participating in cross-functional initiatives and strategic projects
- Adapting to new technologies, tools, and methodologies
- Supporting other teams during periods of high demand

