We want people who are passionate about designing & operating secure systems at scale
We are looking for an experienced, motivated, adaptable, empathetic engineer who is comfortable working remotely and has SRE skills. You will report to the Engineering Manager of the Availability team & contribute to the team's mission:
Mission: Improve customer happiness & retention by driving availability & process improvements across the company.
Primary Focuses:
- Reduce Incident Duration & Frequency: By taking an active role in incident management, trimming down bloated processes, & leading cross-team efforts, the Availability Team decreases downtime & improves reliability.
- Create Meaningful Metrics & Dashboards: Defining & refining relevant metrics, building informative dashboards, & establishing effective alerting thresholds.
- Automate Repetitive Tasks: Automating manual processes related to monitoring, reporting, & other tasks the Availability team handles.
DigitalOcean's Internal Culture & Tooling
DigitalOcean teams communicate primarily via Slack. The Availability team makes use of Jira & GSuite. We strive to make our work-life balance comfortable, & aim to scope work appropriately so that everyone works at a healthy pace. You might expect to be on-call periodically, potentially managing very high-priority incidents.
DigitalOcean's observability platform comprises VictoriaMetrics, Grafana, Alertmanager, & Elasticsearch. Knowing any of these tools is a bonus, because every service at DO is generally expected to use this platform.
The Availability team exists within the Resiliency division, an arm of the Infrastructure department. We aim to drive fast recovery & minimize impact to customer availability. The Resiliency division is made up of a diverse group of over 40 engineers located across the US, Canada, & Europe.
What You'll Be Doing:
As an engineer, you will spend your day-to-day on:
- Spending 1-2 days a week on-call, including shift work during set hours on those days
- Driving the mitigation & resolution of incidents, as well as handling incident reviews/postmortems
- Example Schedule:
  - Tue - Thu: Flexible hours doing availability work
  - Fri - Sat: Set hours doing operations / incident response
  - Sun - Mon: Off
- Improving toilsome availability-related processes
- You will tweak, rewrite, & introduce processes that have company-wide impact. You will need to be organized, patient, flexible, & empathetic.
- Identifying opportunities for improvement
- You will have a platform to suggest & drive improvements when it comes to monitoring, alerting, incident resolution, & other processes around the organization
- You should be comfortable providing feedback early & often
- Communicating incident status clearly to customers
- You will need to be capable of understanding complex, ongoing technical issues & writing clear, accurate reports intended for public consumption
- Embedding directly with service teams
- You'll have to dive into unknown codebases written in languages you're not familiar with - the ability to learn quickly & pick things up on the fly will be key.
- You will need to be comfortable meeting folks where they're at. We collaborate with teams with an aim to assist.
- Communicating internally with tons of lovely engineers
- This role is quite public - you will need to be comfortable speaking with a diverse set of engineers located around the globe.
- Responding to Slack messages & keeping up with various streams of conversation
- Our work can require a lot of context switching - you'll need to be comfortable hopping from one Slack conversation to another, many times per day.
What We'll Expect From You:
- On-Call Responsibility
- You will participate in an on-call rotation to respond to critical incidents & ensure the continuous availability of our services.
- Champion Reliability & Availability
- You'll be deeply invested in maintaining & improving the uptime & overall health of our cloud infrastructure & applications, helping to instrument, strive for, & exceed our service level objectives (SLOs).
- Automation First Mindset
- You'll identify & automate repetitive tasks, infrastructure provisioning, deployments, & monitoring processes to improve efficiency, reduce toil, & minimize human error.
- Scalability Design & Implementation
- You'll contribute to the design & implementation of scalable & resilient systems that can handle rapid growth & fluctuating demand.
- Experience using or administering Linux systems
- At DigitalOcean, we live & breathe Linux - our systems primarily run Ubuntu.
- Experience reading, writing, & debugging code (any language is fine)
- We primarily work with Python, Rust, & Golang, but adaptability is more important than any single language.
- Familiarity with incident management
- Our team is deeply involved with incidents - any prior experience at a NOC, doing triage, etc. would be very valuable.
- Familiarity with shell & git
- Familiarity with continuous integration systems & concepts
- Familiarity with GitHub Actions or Concourse is a plus
- Experience leveraging monitoring systems (e.g. Grafana, VictoriaMetrics, Looker, Elasticsearch) for data-driven outcomes
- Comfortable executing in an asynchronous remote environment
- The Availability team is spread across North America & Europe!
- Transparency, honesty, & openness to constructive feedback
- A desire to work with a respectful & inclusive team
If you don't meet all of the expectations above, that's completely okay! Submit an application anyway, & include a cover letter telling us why you'd be a good fit for our team.
Why You'll Like Working for DigitalOcean:
- We innovate with purpose. You'll be part of a cutting-edge technology company with an upward trajectory, one that is proud to simplify cloud & AI so builders can spend more time creating software that changes the world. As a member of the team, you will be a Shark who thinks big, bold, & scrappy, like an owner with a bias for action & a powerful sense of responsibility for customers, products, employees, & decisions.
- We prioritize career development. At DO, you'll do the best work of your career. You will work with some of the smartest & most interesting people in the industry. We are a high-performance organization that will always challenge you to think big. Our organizational development team will provide you with resources to ensure you keep growing. We provide employees with reimbursement for relevant conferences, training, & education. All employees have access to LinkedIn Learning's 10,000+ courses to support their continued growth & development.
- We care about your well-being. Regardless of your location, we will provide you with a competitive array of benefits to support your overall well-being, from a one-time work-from-home stipend to a wellness allowance to a flexible time-off policy, to name a few. While the philosophy around our benefits is the same worldwide, specific benefits may vary based on local regulations & preferences.
- We reward our employees. The salary range for this position is between $107,640.00 and $134,520.00, based on market data, relevant years of experience, & skills. You may qualify for a bonus in addition to base salary; bonus amounts are determined based on company & individual performance. We also provide equity compensation to eligible employees, including equity grants upon hire & the option to participate in our Employee Stock Purchase Program.
- We value diversity & inclusion. We are an equal-opportunity employer, & recognize that diversity of thought & background builds stronger teams & products to serve our customers. We approach diversity & inclusion seriously & thoughtfully. We do not discriminate on the basis of race, religion, color, ancestry, national origin, caste, sex, sexual orientation, gender, gender identity or expression, age, disability, medical condition, pregnancy, genetic makeup, marital status, or military service.
*This is a remote role
#LI-Remote