Remote Senior SRE
What you'll do:
- Keep people safe and businesses running.
- Own operational availability, security, scalability, efficiency, monitoring, instrumentation, and overall service reliability of Everbridge's solutions.
- Collaborate across Agile teams with Architects, Developers, Quality, Data, Security, and other Operations engineers on designing and implementing highly reliable solutions.
- Embrace Site Reliability Engineering principles of proactivity, automation, cross-functional collaboration, data-driven decision making, and failing fast and safely to continually improve our technology and culture.
- Enhance our infrastructure, tooling, and processes to extend operability as a self-service function for other groups in the engineering value stream.
- Participate in a rotating on-call schedule to troubleshoot and resolve production escalations from our 24x7x365 NOC.
- Have fun while we work hard to make a difference.
What you'll bring:
- Previous experience contributing in a production Site Reliability, DevOps, SaaS/Technical Operations, or NOC environment
- Commitment to technical excellence and quality customer service
- Ability to write code in at least one programming language (e.g. Python, Perl, Java, Ruby, Go)
- Comfort using Git for practical configuration data and code management
- Expertise with cloud compute IaaS/abstracted PaaS solutions (AWS Solutions Architect or equivalent) and hybrid/on-premises private compute environments (VMware Certified Professional or equivalent)
- Deep knowledge in one of these disciplines forms the central pillar of your T-shaped skill set:
  - Network architecture and operation with an emphasis on application load balancing at local and global scales (ALB/ELB/Route 53), IPv4 routing and dynamic routing protocols (OSPF, BGP), VPN, and network security best practices
  - Automation framework orchestration, configuration management, and software-defined infrastructure management techniques (SaltStack preferred; others such as Puppet, Chef, or Ansible also acceptable)
  - Large-scale production UNIX/Linux operating system, application, and security maintenance in an online service provider environment (Ubuntu and Debian GNU/Linux preferred)
- US Citizenship
Bonus if you have experience with:
- Infrastructure/application monitoring and alerting solutions (Datadog, Elastic BELK/X-Pack, Prometheus, Nagios, Cacti, Graphite/Grafana, InfluxDB, OpenTSDB, Splunk, Graylog, etc.)
- Application virtualization, containerization, and service-oriented-architecture technologies (Nomad & rest of HashiCorp suite, Docker, Kubernetes, Mesos, CoreOS/rkt)
- Email transport software and deliverability management concepts (Postfix/Sendmail and derivative commercial MTAs, SPF, DomainKeys/DKIM, DMARC, IP reputation)
- VoIP (FreeSWITCH or Asterisk w/ SIP) and/or TDM telephony infrastructure
- Cisco IOS/NX-OS, Juniper JUNOS, and related hardware device and virtual appliance families (Cisco Catalyst/Nexus/ISR/ASR, Juniper routing/switching/firewall platforms, Brocade Vyatta)
- RDBMS, NoSQL, and hybrid data tier platforms (MongoDB, Elasticsearch, Postgres, MySQL, Riak, Cassandra, HBase, etc.)
- SIEM, HIDS/NIDS, and related infrastructure tooling required to maintain positive control over security
- Practical knowledge of BGP traffic engineering, DDoS mitigation, and active threat defense techniques
- Continuous integration and deployment/delivery pipelines in a release engineering context
- Performance measurement and tuning methodology for capacity planning and bottleneck hunting