The Site Reliability Engineer manages AWS Kubernetes infrastructure, ensuring operational excellence, security, and scalability, while implementing reliability improvements and best practices.
We're seeking an experienced Site Reliability Engineer to take ownership of our AWS-based Kubernetes infrastructure. You'll be responsible for the operational excellence, security, and scalability of our development and production systems, which support our Enterprise Solid Server (ESS) technology for enterprise clients. You'll have significant autonomy to establish best practices, implement reliability improvements, and build the foundation for our growing infrastructure needs.
Inrupt is headquartered in Boston, MA. This role is ideally based in Boston. Our team operates on a hybrid schedule, working from the office two days a week and enjoying remote flexibility on the remaining days.
Key Responsibilities
- Manage day-to-day operations of AWS EKS clusters across development, staging, and production environments
- Monitor system health, triage alerts, and respond to incidents within a 15-minute SLO
- Perform regular patching, upgrades, and maintenance of infrastructure components
- Maintain and optimize our technology stack: EKS, MSK, RDS, ArgoCD, Traefik, Sysdig, Mezmo, Terraform
- Manage AWS services, including VPC, RDS, MSK (Kafka), S3, and networking infrastructure
- Implement and maintain comprehensive monitoring dashboards, alerting, and centralized logging
- Maintain Terraform-based infrastructure automation and practice GitOps principles
- Manage data infrastructure lifecycle: RDS databases, Kafka clusters, Redis caching, S3 buckets
- Implement security baselines, manage RBAC, and conduct vulnerability scanning and remediation
- Design and test disaster recovery strategies with defined RTO/RPO
- Support ArgoCD deployments and troubleshoot application deployment issues
- Create and maintain documentation and troubleshooting guides
- Provide architectural reviews and capacity planning aligned with business objectives
- Optimize infrastructure costs while maintaining performance and reliability
- Establish on-call rotation and incident response procedures with post-mortem analysis
- Work closely with engineers to ensure operational requirements are built into our products
- Work closely with engineers to ensure that the proposed architecture, design, and development choices meet non-functional requirements
About You
Required:
- Experience managing production Kubernetes clusters, preferably AWS EKS
- Deep knowledge of cloud platform services (e.g., EC2, EKS, VPC, RDS, S3, IAM, CloudWatch)
- Strong Terraform experience for infrastructure automation
- Experience with monitoring platforms (Sysdig, Datadog, or similar) and logging systems
- Hands-on experience with ArgoCD or similar tools
- Strong understanding of networking: VPCs, security groups, load balancers, DNS
- Database administration experience (PostgreSQL), including backups and performance tuning
- Experience with message queue systems (Kafka/MSK preferred)
- Proficiency in Python, Bash, or Go for automation
- Excellent communication skills with the ability to explain complex technical concepts clearly
- Ownership mindset with strong problem-solving and analytical skills
- Experience with security best practices and compliance frameworks (SOC 2, GDPR)
Preferred:
- Service mesh experience (Istio, Linkerd, Consul)
- FinOps practices and cost optimization experience
- Chaos engineering and resilience testing
- Multi-region infrastructure experience
- AWS certifications (Solutions Architect, DevOps Engineer, or Security)
- CKA (Certified Kubernetes Administrator) certification
- Experience supporting government or highly regulated industries
Top Skills
ArgoCD
AWS
Bash
Datadog
EKS
Go
Kafka
Kubernetes
PostgreSQL
Python
Sysdig
Terraform
Inrupt Boston, Massachusetts, USA Office


