At Lyft, community is what we are & it's what we do. It's what makes us different. To create the best ride for all, we start in our own community by creating an open, inclusive, & diverse organization where all team members are recognized for what they bring.
Passengers rely on Lyft to get to work, to go to the doctor, or to get home safely when public transit has stopped running. Drivers use Lyft for income & flexibility. Building a stable & reliable application for our passengers & drivers is a responsibility we take very seriously, & we are building out a team of Software Engineers focused on reliability, to deliver a consistent & highly reliable user experience.
Every engineering team at Lyft is responsible for running & operating the software that they build. Reliability Engineers work towards standardizing & supporting all of the rapidly growing teams throughout our organization by assessing their architecture, helping them design scalable services, & fostering excellent operational practices. It's a mission-critical role: ensuring that our systems are always healthy, monitored, automated, & designed to scale.
What makes Reliability Engineering different at Lyft?
- It is engineering! We resolve problems with a mindset of ensuring they don't happen again. We are looking to automate ourselves out of our jobs.
- Our day-to-day work is driven by helping our product teams create robust software faster.
- We don't sit on the other side of a fence that work gets tossed over; we're first-class engineering citizens, embedded in specific development teams where we drive engineering improvements from the bottom up.
Examples of Reliability Engineering projects:
- We automated Kafka topic management by building a declarative service that prevents abuse before capacity changes are shipped.
- We built a rate limiting system for our Wavefront proxy.
- We rolled out Kubernetes as a core component of Lyft infrastructure.
- We built Horizon, a cubism-inspired system to visualize faults across our various services.
- We revamped our incident management process & tools. This created a safe culture for understanding outages & focusing on preventing future ones.
Responsibilities:
- Define roadmap & architecture based on technology & business needs.
- Build holistic visibility into SLIs, SLOs, SLAs, dependency graphs, & the past performance of software, networks, & systems to ensure that we can continue to scale without increasing operational burden or toil.
- Share your knowledge by giving brown bags, tech talks, & evangelizing appropriate tech & engineering best practices.
- Build infrastructure & drive projects that deliberately break things in order to improve the robustness of production systems.
- Use the core Site Reliability Engineering principles of change management, monitoring, emergency response, capacity planning, & production readiness reviews to run the platform.
- Step back to observe patterns & develop innovative tools & automation to minimize toil. Use those learnings to drive the best operational practices.
- Partner with the broader Lyft organization to build a culture of rigorously learning from incidents.
- Unblock, support, & effectively communicate across teams to achieve results.
Experience:
- 2+ years of software engineering experience
- Experience with high-level programming languages (Python, Go, Java, etc.)
- Experience designing, debugging, & running fault-tolerant, large-scale distributed systems
- Experience working with public cloud platforms (e.g., AWS, Google Cloud Platform, Microsoft Azure)
- Strong troubleshooting & debugging skills
- Experience bringing software to production at high scale
- Strong cross-team collaboration skills
- Good communication skills
The nature of the work is interdisciplinary, & our teammates come from varying backgrounds (e.g., Site Reliability Engineer (SRE), Systems Engineer, Software Engineer, DevOps Engineer, Infrastructure Engineer, Production Engineer). We urge you to apply even if you feel uncertain that you have the exact background.
Benefits:
- Great medical, dental, & vision insurance options.
- In addition to 11 observed holidays, salaried team members have unlimited paid time off, & hourly team members have 15 days of paid time off.
- 401(k) plan to help save for your future.
- 18 weeks of paid parental leave. Biological, adoptive, & foster parents are all eligible.
- Monthly commuter subsidy to cover your transit to work, plus 20% off all Lyft rides.
Lyft is an Equal Employment Opportunity employer that proudly pursues & hires a diverse workforce. Lyft does not make hiring or employment decisions on the basis of race, color, religion or religious belief, ethnic or national origin, nationality, sex, gender, gender identity, sexual orientation, disability, age, military or veteran status, or any other basis protected by applicable local, state, or federal laws or prohibited by Company policy. Lyft also strives for a healthy & safe workplace & strictly prohibits harassment of any kind. Pursuant to the San Francisco Fair Chance Ordinance, other similar state laws & local ordinances, & its internal policy, Lyft will also consider for employment qualified applicants with arrest & conviction records.