Site Reliability Engineer III
ProNavigator
Software Engineering, IT
Dublin, Ireland
Job Description
About the Role
As a Site Reliability Engineer at Guidewire, you’ll join a passionate team dedicated to automating every process to ensure our systems run efficiently. Our Platform team is fully committed to developing and managing software that enhances the reliability of production systems, which serve hundreds of customers and support millions of transactions every day. You will play a key role in ensuring the stability of our flagship cloud platform products while building the tooling necessary for efficient operations and optimal availability of our SaaS multi-tenant, customer-focused systems. In close collaboration with our core product developers, you’ll help ensure our cloud products meet both functional and non-functional requirements, including availability, performance, observability, and maintainability.
If you thrive on teamwork, embrace responsibility, and have a passion for solving problems at scale with technologies like AWS, Kubernetes, and Aurora, then we’d love to hear from you. We’re looking for someone who lives by the mantra, "If you have to do something more than once, automate it," and who is eager to learn and master new tools and concepts. Bonus points if you have experience in production support for a SaaS platform and are comfortable working with cutting-edge, highly containerized, cloud-native environments in AWS.
What You’ll Do
- Drive Reliability & Automation:
Take a dedicated SRE approach to managing shared multi-tenant infrastructure for resilient SaaS microservice-based systems and customer-centric applications.
Oversee and continuously enhance our team’s presence in AWS by automating deployment and operational tasks.
- Innovate and Improve Core Systems:
Contribute to the development of our core infrastructure systems—adding features, fixing bugs, and implementing reliability enhancements.
Engineer and maintain a complex single sign-on (SSO) authentication platform based on SAML/OAuth to ensure secure, seamless access for our users.
- Enhance Observability & Incident Management:
Build and maintain comprehensive observability tooling, metrics, and dashboards to support our global platform infrastructure.
Improve our incident management lifecycle by identifying, mitigating, and learning from reliability risks, while helping to create a self-healing environment.
- Empower the Team:
Develop system documentation and training materials to educate and empower your teammates.
Collaborate with various engineering teams, providing valuable feedback and contributing code when needed to enhance our products.
At Guidewire, we foster a culture of curiosity, innovation, and responsible use of AI—empowering our teams to continuously leverage emerging technologies and data-driven insights to enhance productivity and outcomes.
Who You Are
- Technically Skilled:
You hold a Bachelor’s Degree in Computer Science or a related field.
You have proven software engineering and automation skills using Bash, Python, and/or Go.
You’re well-versed in agile development methodologies (Scrum, Kanban, etc.) and have a deep background in Linux systems.
- Cloud & DevOps Savvy:
You bring significant experience in automating and managing systems on Amazon Web Services (AWS) and supporting live production environments (Java/Apache/Tomcat).
You are proficient with Infrastructure as Code (IaC) tools such as Terraform, Terragrunt, or Terraspace, and have used DevOps/GitOps tools (Git, Bitbucket, Flux CD, TeamCity) for smooth code promotion.
You have hands-on experience with containerization (Docker, Helm, Kubernetes/EKS, CNI, and Ingress networking) and a strong understanding of single sign-on (SSO), SAML, and OAuth (bonus if you’ve worked with Okta).
- Observability & Database Knowledge:
You are experienced with observability and alerting tools (Datadog, CloudWatch, PagerDuty) and familiar with event streaming and messaging technologies such as Kafka or AWS SQS.
You have worked with relational databases such as Aurora PostgreSQL or Oracle on Amazon RDS, and have extensive exposure to application development, web UI design, JSON, and overall application architecture.
Exposure to Open Application Model systems like KubeVela or Crossplane is a plus.
- A Collaborative Problem Solver:
You prefer writing robust code over clicking through a GUI and enjoy mentoring others.
Your outstanding troubleshooting skills, analytical mindset, and process-driven approach enable you to solve complex problems effectively.
You are a proactive team player with excellent communication skills, capable of explaining complex technical concepts to a varied audience.
You champion a culture of reliability by promoting practices such as blameless postmortems, SLO tracking, and continuous learning from incidents.
You have demonstrated the ability to embrace AI, apply it to your role, and use data-driven insights to drive innovation, productivity, and continuous improvement.