Site Reliability Engineer - Operations

Fitbit

Sorry, this job was removed at 3:58 p.m. (EST) on Monday, June 10, 2019

At Fitbit, our mission is to help people lead healthier, more active lives by empowering them with data, inspiration and guidance to reach their goals.

We started our journey in 2007 as a team of two with one big idea. Since then, we’ve grown to over 1,500 employees, sold over 60 million devices, and built a health and fitness community across the globe. In fact, the Fitbit Community has taken enough steps to walk from the Sun to Pluto! Offering award-winning products, a top-rated mobile app and an easy-to-use online dashboard, Fitbit provides personalized experiences that help our users reach their goals. With a reenergized focus on innovative devices, interactive experiences, and enterprise health, we are transforming the way consumers and businesses see health & fitness.

From your first steps as a Fitbitter, you will be at the forefront of developing new products. Our culture combines the spirit of startup with the perks of being public. We offer a competitive benefits package and amazing perks like unlimited snacks, Friday happy hours, onsite workout classes, and a strong focus on a healthy work-life balance. As part of our team, you’ll have the opportunity to grow your career, contribute your ideas to life-changing products and services, and—above all—have fun doing it.

Fitbit’s HQ campus is located in the heart of San Francisco with office locations in Boston, San Diego and around the world. Think you’ve found your fit?

The role

Site Reliability Engineers are responsible for the pulse of the software ecosystem. We monitor and improve the system and suggest improvements for implementation by others. The name of the game is automating our job, because hiring linearly with our traffic growth is unsustainable. We are involved in incident and change management. We also act as consultants for engineers when new code and services are getting ready to launch.

Responsibilities

Detective: SREs handle problems in live production systems, both on their own and in collaboration with systems and application engineers.
Ambassador: Keep the company informed about the status of Fitbit services, the impact of known issues, and the progress of ongoing investigations.
Developer: Design and refactor parts of the Fitbit backend system for stability and performance, and write tools and scripts to automate maintenance and monitoring tasks.
Coach: Meet with other teams and attend architecture reviews, and offer advice on how to implement features that are efficient, highly available, and fault-tolerant.

What do we look for?

We want people who:

Write code in Python and perhaps Java, and not just for classes.
Dig into the details of how a system, library, or tool works instead of just blindly using it.
Are willing and eager to wear many hats, as illustrated by the roles described above.
Dive into things that “aren’t their problem.”
Are willing to teach and lead others.

Requirements

You have 3+ years of experience as a systems/operations engineer or system administrator
You are comfortable with the Python programming language and ecosystem
You are very comfortable using and administering Linux servers
You can work independently with limited supervision
You can communicate effectively with peers and tailor your communication to your audience
You are willing to dive in and assist coworkers when incidents arise
You're willing to participate in the team’s production on-call rotation

Nice-to-haves

Experience working with high-traffic, scalable web applications and services
Experience building, deploying, and operating your own web service
Knowledge of the administration and/or performance tuning of MySQL or Cassandra
Prior experience being part of an on-call rotation and responding to production incidents
Experience with cloud computing platforms like AWS or Google Cloud Platform
Familiarity with configuration management tools like Puppet, Chef or Ansible (we use Puppet and Ansible)
Experience developing and shepherding processes around change and incident management
Some familiarity with Java and its ecosystem
Experience with one or more of the technologies in our stack (or similar technologies):

Frameworks: Hibernate, Spring, Finagle, Finatra, Thrift
Messaging: Kafka
Caching: Memcached, Redis
Logging and Monitoring: Prometheus, Graphite, StatsD, Nagios, Logstash, Kibana
Other: Aurora/Mesos, Tomcat, Elasticsearch, Terraform

Fitbit is proud to be an equal opportunity employer. We recruit, hire, train, promote, pay, and administer all personnel actions without regard to race, color, ancestry, national origin, citizenship, religion, age, sex (including pregnancy, childbirth, and medical conditions related to pregnancy, childbirth, or breastfeeding), sex stereotyping (including assumptions about a person’s appearance or behavior, gender roles, gender expression, or gender identity), sexual orientation, gender, gender identity, gender expression, marital status, medical condition, mental or physical disability, military or veteran status, genetic information or other statuses protected by law. We interpret these protected statuses broadly to include both the actual status and any perceptions and assumptions made regarding these statuses.

San Francisco applicants: Pursuant to the San Francisco Fair Chance Ordinance, Fitbit will consider for employment qualified applicants with arrest and conviction records.
