Senior Site Reliability Engineer - Product Area Focus
Location
Summary of role
Own availability, the most important product feature, by continually striving for sustained operational excellence of Sumo Logic's planet-scale observability & security products. Work alongside your global SRE team, executing on projects in your product-area-specific reliability roadmap, to optimize operations, increase efficiency in our use of cloud resources & our developers' time, harden security posture, & increase the feature velocity of our developers. Work closely with multiple teams to optimize the operations of their microservices & improve the lives of the engineers within your product area engineering teams.
Responsibilities
- Support the engineering teams within your product area by maintaining & executing a reliability roadmap of opportunities to improve reliability, maintainability, security, efficiency, & velocity, & by helping to realize those opportunities.
- Collaborate with development infrastructure, Global SRE, & your product area engineering teams to establish & continually refine your reliability roadmap.
- Participate in defining, evolving, & managing SLOs for several teams within your product area.
- Participate in on-call rotations within your product area to understand operations workload so you can continually work to improve the on-call experience & reduce operational workload for running microservices & related components.
- Complete projects to optimize & tune on-call experience for your engineering teams.
- Continually improve the lifecycle of microservices & architectural components from inception & design, through deployment, operation, & refinement.
- Write code & automation to reduce operational workload, increase efficiency, improve security posture, eliminate toil, & enable Sumo Logic's developers to deliver features more rapidly.
- Work closely with the developer infrastructure teams to expedite adoption of developer infrastructure tools that advance your reliability roadmap, by identifying the needs of your supported engineering teams & contributing back features & bug fixes when needed.
- Scale systems sustainably through mechanisms like automation, & evolve systems by pushing for changes that improve reliability & velocity.
- Facilitate blame-free root cause analysis meetings for incidents to learn & drive improvement.
- Participate in & continually improve our global IRC (incident response coordination) for all products.
- Drive root cause identification & issue resolution with the teams.
- Work inside of a fast-paced iterative environment.
Required Qualifications & Skills
- Cloud native application development experience leveraging best practices & design patterns
- Strong debugging & troubleshooting skills across the entire technology stack
- Deep understanding of AWS Networking, Compute, Storage, & managed services.
- Competency with modern CI/CD tooling like Kubernetes, Terraform, Ansible & Jenkins
- Experience with full life cycle support of services, from creation to production support
- Versed in Infrastructure as Code practices using technologies like Terraform or CloudFormation
- Ability to author production-ready code in at least one of the following: Java, Scala, or Go.
- Experience with Linux systems & comfort working on the command line
- Understand & apply modern approaches to cloud-native software security
- Experienced with agile frameworks, such as Scrum & Kanban, & how to operate within these frameworks to continually deliver value.
- Flexible & willing to step into new roles & responsibilities
- Willingness to learn & use Sumo Logic products for solving reliability & security issues
- Bachelor's or Master's degree in Computer Science, Electrical Engineering, or another scientific or technical discipline
- 6+ years of industry experience.
Desirable Skills
- Experience using Sumo Logic products or other observability products for reliability & security
- Experience with planet-scale product development
- Expert-level proficiency running & operating SaaS products on AWS
- Experience with streaming technologies like Kafka, Kafka Streams, or KSQL
- Expert-level experience in one or more of: Java, Go, Scala, or Python
- Expert-level experience in one or more of: Terraform, Jenkins, or Kubernetes
- Extensive experience running & tuning JVM workloads at scale
About Us
Sumo Logic empowers the people who power modern, digital business. Through its SaaS analytics platform, Sumo Logic enables customers to deliver reliable & secure cloud-native applications. The Sumo Logic Continuous Intelligence Platform helps practitioners & developers ensure application reliability, secure & protect against modern security threats, & gain insights into their cloud infrastructures. Customers around the world rely on Sumo Logic to get powerful real-time analytics & insights across observability & security solutions for their cloud-native applications. For more information, visit www.sumologic.com.