Databases are the beating heart of every business in the world.
Cockroach Labs is the team behind CockroachDB, an open source, distributed SQL database. We strive to build infrastructure that keeps pace with the world, so developers can focus on what matters most: creating the best products. Come & join us on our mission to Make Data Easy. Are you ready to aim high & build to last?
CockroachDB provides the backbone of how a business's data is stored on a global scale. The Site Reliability Engineer is responsible for managing the infrastructure for our cloud service offering. This is a high-impact role where you will be accountable for our production system & ensuring that our services span several cloud providers as part of our hosted offering. You will also spend roughly half of your time doing greenfield development work, with an emphasis on tool development & driving automation.
- You will manage the infrastructure for cloud services, including running internal production systems & hosting CockroachDB for our external customers.
- You will design, write, & deliver software & systems to increase product reliability & organizational efficiency.
- You will develop custom tools as necessary.
- You will keep a complex system running & solve problems relating to mission-critical services.
- You will design, implement, operate, & troubleshoot the automation & monitoring of production clusters to maximize performance & availability.
- You will drive the company through disaster recovery tests, where we manually turn down pieces of CockroachDB to test its overall resilience to failures.
- You will participate in a weekly on-call rotation for our production systems & hosted services.
In your first 30 days, you will take over the operation of our existing internal & customer-facing production systems. Working with product & engineering, you will assess our production operations & flesh out runbooks for the operation of different systems. We believe that it's essential for you to take this first month to become familiar with our technology & our company.
After 3 months, you'll be fully integrated into the team. You will take full ownership of reliability, automation, & other issues related to CockroachDB's stability. You will identify new opportunities for automating processes, streamlining delivery, deploying new core functionality, & building great tools. You will help make CockroachDB more friendly by bringing your expertise to our database.
- Expertise in analyzing, monitoring, & troubleshooting large-scale distributed systems.
- Experience in software development using one or more of the following: Go, C, C++, Python, Java.
- Proficiency working with algorithms, data structures, & production troubleshooting.
- Expertise in interacting with major cloud providers (AWS, Azure, GCP, etc.) & cloud APIs.
- Ability to debug & optimize code & to automate routine tasks.
- Working knowledge of web & network protocols & standards (HTTP, TLS, DNS, etc.)
- Previous on-call experience, with a sense of urgency.
Cockroach Labs is proud to be an Equal Opportunity Employer building a diverse & inclusive workforce. If you need additional accommodations to feel comfortable during your interview process, please email us at email@example.com.