Site Reliability Engineer

Chess.com

Reposted 8 Hours Ago

Remote

Hiring Remotely in USA

Senior level

The Site Reliability Engineer will manage infrastructure stability and scalability, lead cloud migrations, and optimize performance across systems while mentoring team members.

The summary above was generated by AI

About Us

Chess.com is one of the largest gaming sites in the world and the #1 platform for playing, learning, and enjoying chess.

We are a team of 600+ fully remote people in 60+ countries working hard to serve the global chess community. We support 250M+ chess players worldwide with the best possible product, content, and tools.

We are a tech company. A gaming company. A content company. And we do it all with passion and commitment to the game. Above all, we prize our mission-driven, flat, life-celebrating, no-corporate culture, and we look forward to meeting you and learning more about what you can bring to the team.

About the role

The Site Reliability Engineer will play a critical role in ensuring the stability, performance, and scalability of our global gaming platform infrastructure. This position exists to bridge the gap between development and operations, maintaining high availability for millions of concurrent users while supporting rapid feature development and deployment. The SRE will be instrumental in building resilient systems that can handle massive scale across multiple regions, directly impacting user experience and platform reliability.

As our platform continues to grow and serve a global community, this role will drive the technical infrastructure decisions that enable seamless gaming experiences. The position requires both deep technical expertise and collaborative leadership to work across engineering teams, ensuring our systems can scale efficiently while maintaining the performance standards our users expect.

What you'll do

Design and implement multi-regional resilient infrastructure capable of handling millions of concurrent sessions and transactions daily across global data centers
Lead the hybrid cloud migration strategy, integrating bare-metal datacenter resources with cloud services for optimal performance and cost efficiency
Own the on-call rotation and incident response procedures, ensuring rapid resolution of critical system issues and maintaining high-availability SLAs
Architect monitoring and alerting systems using industry-standard tools to proactively identify and resolve performance bottlenecks before they impact users
Collaborate with development teams to implement infrastructure-as-code practices and establish deployment pipelines that support continuous integration and delivery
Optimize system performance through capacity planning, load testing, and resource allocation across distributed computing environments
Establish and maintain security protocols and risk assessment procedures for infrastructure components and data protection
Partner with engineering teams to design scalable solutions for high-traffic applications and real-time processing requirements
Drive automation initiatives to reduce manual operational overhead and improve system reliability through scripting and configuration management
Mentor team members on SRE best practices and contribute to the development of infrastructure standards and documentation

Preferred Skills

Bachelor's degree in Computer Science, Engineering, or related technical field, or equivalent practical experience
5+ years of experience in site reliability engineering, DevOps, or infrastructure engineering roles
Experience managing bare-metal server infrastructure and datacenter operations
Strong proficiency with UNIX/Linux operating systems and command-line administration
Experience with cloud platforms (GCP, AWS, or Azure) and infrastructure-as-code tools (Terraform, CloudFormation, or similar)
Hands-on experience with configuration management systems (Ansible, Chef, Puppet, or similar)
Solid understanding of networking fundamentals, protocols (TCP/IP, HTTP/HTTPS, DNS), and network troubleshooting
Experience with containerization and orchestration technologies (Docker, Kubernetes, or similar)
Proficiency with monitoring and observability tools (Datadog, Prometheus, Grafana, ELK stack, or similar)
Experience with relational and NoSQL databases, including performance optimization and scaling strategies
Strong collaboration and communication skills for working effectively in a distributed team environment
Demonstrated sense of ownership and accountability for system reliability and performance

Nice to have

Advanced knowledge of content delivery networks (CDNs) and edge computing
Experience with server-side automation and scripting languages (Python, Go, Bash, or similar)
Background in high-availability architectures and disaster recovery planning
Familiarity with security frameworks and compliance requirements
Experience with game server infrastructure or real-time application hosting
Knowledge of database administration and optimization for high-concurrency applications
Understanding of CI/CD pipelines and deployment automation
Experience with capacity planning and performance testing tools
Previous experience in a fully remote, distributed work environment
Continuous learning mindset with interest in emerging infrastructure technologies

About the Opportunity

This is a full-time opportunity
We are 100% remote (work from anywhere!)

---

You can learn more about us here:

https://www.chess.com/article/view/how-chess-com-virtual-team-works-together
https://www.chess.com/about
