The world loves Postgres. If you work with developers or data scientists or anyone wrangling data, you'll probably see a sticker with the tusks and trunk of the Postgres elephant on the lid of a nearby laptop. EDB has a lot to do with that.
We've been major contributors to Postgres since the beginning, and we are proud to call thousands of boundary-pushing customers our partners. Proud though we are, we are not resting on our laurels. There's plenty of work to do. The good news is that everything we do will impact Postgres, which is to say that it will impact the world. No pressure.
Candidates, please note: this position can be located remotely in the US.
The Team Lead, Site Reliability Engineer will provide strategic direction to a global team of SREs. They will be responsible for advocating and reporting on the service to peers, executives and the broader organization. The successful candidate will have a calm head in a crisis, the ability to learn from mistakes and outages, a good procedural mindset and the ability to grow the SRE team in number, skills and tooling. They will value fire prevention over fire fighting, and bring well-articulated improvement requests into the development teams.
- Provide technical leadership & management to a global team of SREs working in a follow-the-sun pattern.
- Define the process & procedures the SRE team uses with development and support teams to solve production escalation cases.
- Lead post-escalation reviews to document learnings and take actions to improve existing processes continuously.
- Communicate with executives about service status, SLA/SLO objectives and running risks, and tie these into business metrics.
- Collaborate with and guide our Engineering teams to ensure our applications and infrastructure are stable, reliable, and available.
- Collaborate with Application teams to continuously monitor processes and thresholds, and to define SLOs with corresponding SLIs. Advocate for changes to systems to enable better observability.
- Ensure business continuity, security, observability, and compliance in accordance with corporate & external standards (NIST, SOC 2, etc.).
- Proactively identify critical metrics to monitor and alert on to surface actual user issues before users notice them.
- Apply data modeling and predictive analysis to anticipate issues.
- Lead the SRE team in creating tools that help them do their jobs more efficiently.
- Ensure that the SRE team documents solutions, architectural patterns, system architecture and best practices so that developers have guidance and insight as needed.
- Advocate for the service in all areas of the company.
- Help grow & train the SRE team.
- 5+ years of experience working as a Site Reliability Engineer, Systems Administrator, or Software Engineer in production environments
- Ability & confidence to deal with major outages in a calm and effective manner.
- Experience driving discussions with senior personnel regarding trade-offs, best practices, project management and risk mitigation
- Experience with Kubernetes administration
- Experience running mission-critical services on major Cloud providers (at least two of AWS, Azure, GCP)
- Experience with automating alerts and notifications to enable quick response and real-time collaboration between various teams.
- Experience with modern development, management & observability tools (Linux, shell scripting, Python, Golang, Git, Terraform, FluxCD, Prometheus, Grafana, Splunk)
- Worked in an on-call capacity
- Good written and spoken English language skills
- 3+ years of experience in a leadership role in incident management
- Familiarity with PostgreSQL
- Experience with chaos engineering, game days, etc.
- Experience with PagerDuty & its API
- Experience with vendor-supported container orchestration platforms based on Kubernetes, both public (such as EKS, AKS, and GKE) and private (such as Red Hat OpenShift or Rancher)
- Relevant certifications from CNCF such as Certified Kubernetes Administrator (CKA) and Certified Kubernetes Security Specialist (CKS)
- Familiarity with Agile methodologies, Lean thinking, and DevOps/DevSecOps cultures
- Direct contribution to Open Source projects in the Cloud Native, observability or Infrastructure as Code spaces
We know it takes a unique mix of people and skills to help us in our mission to supercharge Postgres, and we understand that not everyone will check every box. We'd love to hear from you and we want you to apply!
EDB is proud to be an equal opportunity workplace. We celebrate diversity and are committed to creating an inclusive environment for all employees. EDB was built on a commitment to trust and respect each other and to embrace an array of people and ideas. These values remain at the center of our culture and are key to our company's integrity.