Upwork

Principal Site Reliability Engineer

Posted 2 Days Ago
Remote or Hybrid
Hiring Remotely in USA
Expert/Leader
This role leads modern SRE practices, focusing on multi-cluster Kubernetes management, observability, cloud-native scalability, and incident management, while mentoring engineers and driving infrastructure automation.
The summary above was generated by AI

Upwork ($UPWK) is the world’s work marketplace. We serve everyone from one-person startups to over 30% of the Fortune 100 with a powerful, trust-driven platform that enables companies and talent to work together in new ways that unlock their potential. Last year, more than $3.8 billion of work was done through Upwork by skilled professionals who are gaining more control by finding work they are passionate about and innovating their careers.
This is an engagement through Upwork’s Hybrid Workforce Solutions (HWS) Team. Our Hybrid Workforce Solutions Team is a global group of professionals who support Upwork’s business. Our HWS team members are located all over the world. This is an opportunity to work with a major revenue-producing website with millions of users. In addition to making sure everything works, you are also expected to contribute to the continuous improvement of our environment. This is a full-time position (~40 hours per week, Monday-Friday). This role will participate in our production on-call rotation during your daytime hours and on some weekends (once every 2-3 weeks).

Work/Project Scope:
  • Serve as a technical leader in modern SRE practices with a focus on zero-trust infrastructure, platform observability, and cloud-native scalability.
  • Guide the architectural evolution of reliability systems, including multi-cluster Kubernetes environments, GitOps workflows, and service mesh integration.
  • Champion SLO-driven engineering across teams and establish frameworks for defining, tracking, and enforcing reliability standards.
  • Partner with platform and security teams to enable service-to-service authentication, policy enforcement, and resilient control planes.
  • Develop AI-assisted tools and workflows (e.g., for incident triage, RCA generation, auto-remediation) to reduce operational burden and accelerate resolution.
  • Define and maintain end-to-end observability strategies including distributed tracing, metrics pipelines, and log enrichment.
  • Drive infrastructure automation efforts using IaC best practices, with an emphasis on policy-as-code, workload identity, and platform governance.
  • Lead post-incident reviews and reliability audits to surface systemic gaps and drive continuous improvement.
  • Mentor engineers across infrastructure and application teams on designing and operating reliable, scalable systems.

Must Haves (Required Skills):
  • 10+ years in SRE, DevOps, or production engineering roles, including experience operating large-scale distributed systems in production
  • Deep expertise in Kubernetes operations, including multi-cluster orchestration, service mesh (Istio or equivalent), and workload policy management (e.g., OPA, Kyverno)
  • Proven experience building and maintaining GitOps pipelines using tools like ArgoCD or Flux
  • Strong fluency in observability tooling (e.g., Prometheus, OpenTelemetry, Grafana, or Datadog), with a focus on SLO-based alerting and incident detection
  • Familiarity with reliability-as-code practices and automation using scripting languages (Python, Go, or Bash) and AI-enhanced workflows (e.g., Cursor, incident bots, PR-generating agents)
  • Experience designing and enforcing zero-trust service-to-service authentication, workload identity, and mTLS policies
  • Track record of leading incident review programs, standardizing postmortems, and driving systemic reliability improvements
  • Ability to work cross-functionally with platform, security, and developer enablement teams to embed resilience across the SDLC

Upwork is proudly committed to fostering a diverse and inclusive workforce. We never discriminate based on race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical condition), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics.

To learn more about how Upwork processes and protects your personal information as part of the application process, please review our Global Job Applicant Privacy Notice.

Top Skills

ArgoCD
Bash
Datadog
Flux
Go
Grafana
Istio
Kubernetes
OpenTelemetry
Prometheus
Python
