Our mission at Vimeo is to help businesses drive impact through video. Thanks to our strong community of video professionals, the volume of video consumption, & uploaded content, data is a key ingredient for success & one of our economic moats.
We are looking for a Data Reliability Engineer to help us improve the reliability of our data platforms & pipelines, which serve billions of events & terabytes of data daily.
You'll work closely with different data engineering teams on their incident management process, post-mortems, root cause analysis, & prevention of recurring incidents.
If you are passionate about data reliability, scale, & automation, we should talk soon!
What You'll Do:
- You will collaborate with engineering teams to improve, maintain, performance tune, & capacity plan for Vimeo's data platforms & infrastructure.
- Design business continuity & disaster recovery plans & processes, & work with the engineering team to implement them.
- You will drive the incident management process for our data platform, working with our partner teams to perform incident post-mortems & root cause analysis & to prevent recurring incidents.
- You will lead the standard change & release management process, automate & promote related best practices across engineering teams, & help Vimeo meet & maintain legal compliance status.
- Build intelligent monitoring over data pipelines & infrastructure to achieve early & automated anomaly detection.
- You'll work closely with software developers to build an end-to-end automated testing framework & system-level testing environment.
- Participate in an on-call rotation.
Skills & knowledge you should possess:
- You have production experience with distributed datastores, e.g. HBase, ZooKeeper, Kafka (alternative experience such as RabbitMQ, Cassandra, Elasticsearch, etc. would also be relevant)
- Own, manage, monitor & optimize the reliability & overall health of our development & production environments
- Detailed problem-solving approach, coupled with a strong sense of ownership & drive
- A strong bias to action & a passion for delivering high-quality data solutions
- 3+ years of experience working in Linux environments, & proficiency with cloud environments (AWS, GCP)
- Experience coding in one or more of the following programming languages: Python, Java (mandatory), or Scala
- 3+ years of hands-on experience in Reliability Engineering for high-performance, scalable & distributed data systems with a focus on automation
- Experience with configuration management systems such as Chef, Puppet, Ansible, or Terraform.
- Deep understanding of CI/CD principles & familiarity with source control systems (Git)
- Work with peer SREs to roll out changes to our production environment & help mitigate data-related production incidents.
- Experience with a Change Data Capture system, such as Debezium, is a plus.
- Attention to detail & quality with excellent problem-solving & interpersonal skills
- A bonus - you have some experience in data warehousing & data engineering
Vimeo is the world's leading professional video platform & community. We empower over 175 million users, from creatives to entrepreneurs to the world's largest brands, to grow their business with video. Our products make it easy to create high-quality, impactful videos & to reach teams, audiences & customers anywhere.
Vimeo is powered by a growing team of over 600 passionate, dedicated humans. We're headquartered in New York City with offices around the world. We believe our impact is greatest when our workforce represents the diverse & global community that we serve, & we're proud to be an equal opportunity employer where diversity, equity & inclusion are prioritized in how we build our products, leaders & culture. Learn more at www.vimeo.com/jobs.