Our industry is starting to go through a transformational shift & we intend to lead it. As talent becomes the main differentiator between failure & success, organizations must attract, engage & develop their people more than ever. To do so, they need powerful & sophisticated tools, which take the pain out of HR management & empower employees & people leaders. That's where we come in.
Lifion by ADP is expanding our startup-style operation in NYC to accelerate technical innovation across UI, Search, Platform Technology, IaaS, Big Data, Social, etc. The concept & vision behind the strategy is "Innovate like a Startup", with the goal of delivering highly automated, intelligent & predictive solutions to the market. We aim to have specialized teams of superstars focused on these areas to keep pace with market trends & quickly incubate & deliver capabilities that dramatically increase the value of our solutions for clients.
As an Application Site Reliability Engineer, you have come up through the ranks as a full-stack engineer & are passionate about automating all the things. You are very opinionated about patterns & practices but pragmatic in your discourse & implementation. You have experience building fully automated, highly elastic, cloud-orchestrated platforms on IaaS providers like AWS, GCE, and/or Azure, or using on-premises solutions like OpenStack. You see containers as the future of software deployments & are familiar with how to run & orchestrate them with tools like Docker Engine, Mesos, and/or Kubernetes.
This is a combined technical leadership & hands-on development role that contributes to Lifion's success through expertise in large-scale distributed systems. You will leverage mature existing systems to help design & create the next-generation service architecture. Qualified individuals will have a solid background in the fundamentals of computer science, distributed computing, high availability, software development process & best practices.
- Solve problems relating to mission-critical services & create solutions to prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions.
- Bring a deep understanding of distributed systems & the ability to lead & teach engineers to design & deliver software that improves the reliability, scalability, latency, & efficiency of our services.
- Influence & create new designs, architectures, standards, & methods for large-scale distributed systems.
- Design & implement stability & reliability best practices & proactive solutions to potential issues by collaborating with global technology partners.
- Define, track, review & report on Service Level Objectives (SLOs), Service Level Indicators (SLIs), System Availability, & the progress & outcomes related to reliability initiatives.
- Make decisions & lead without oversight, & influence others without hierarchy (both upwards & sideways across parallel teams).
- Understand the operational complexity of a microservice architecture
- Perform periodic on-call duties (on an as-needed basis), ideally reducing the number of incidents that require on-call response.
- Work with Incident Commanders, SREs, & Platform engineers during & after the incident recovery life-cycle.
- Identify key priority initiatives to significantly improve reliability, both proactively & reactively.
- Follow up on incidents & publish After Action Reviews that are timely, clearly understood by technical & business personnel, & include accurate root causes & concrete follow-up items with clear owners.
- Increase efficiency by identifying & addressing performance bottlenecks
- Uphold a high standard of code quality through code reviews & extensive testing
- At least 5 years of experience in software engineering or development operations
- Strong understanding of containerization, container networking, & Kubernetes
- Strong knowledge of continuous integration & continuous deployment
- Strong production experience with cloud native services (AWS, Azure, GCP)
- Strong skills with Git SCM & one or more repository managers such as GitHub, GitLab, Stash, Bitbucket, or Gerrit.
- Experience with higher-level network protocols including HTTP & REST
- Experience with lightweight development methodologies such as Agile (Scrum and/or Kanban)