Site Reliability Engineer II (Platform)
ProNavigator
Software Engineering
Kuala Lumpur, Malaysia
Job Description
The Opportunity
Site Reliability Engineering (SRE) brings together software and systems engineering to design and operate large-scale, highly distributed, and fault-tolerant systems. As an SRE at Guidewire, you will join a team focused on automating tasks to enhance system efficiency and reliability. This role involves ensuring the stable operation of the Guidewire Cloud Platform (GWCP) and InsuranceSuite products, collaborating with developers to meet various requirements. Your work will support numerous customers and transactions daily.
To learn more about GWCP and its tenancy model, see: https://medium.com/guidewire-engineering-blog/guidewire-cloud-why-hybrid-tenancy-is-the-right-choice-part-2-of-2-ba22c9888bb8
What You Will Do
Collaborate with engineering teams to provide feedback and contribute code where needed, enhancing product functionality and resilience.
Participate in on-call rotations to ensure 24x7 availability of services.
Design and develop tools to support 24x7 follow-the-sun operations for critical production systems.
Automate deployment tasks for core products and infrastructure, maintaining a robust automation framework.
Monitor and optimize the performance of applications on the Guidewire Cloud Platform, ensuring reliability and efficiency.
Develop and maintain observability tools, metrics, and dashboards, including self-healing mechanisms for increased reliability.
Foster a culture of reliability by promoting blameless postmortems, SLO tracking, and continuous learning from incidents.
Proactively identify and address infrastructure issues to minimize business impact.
Develop system documentation and training materials to empower and educate team members.
Who You Are / What We’re Looking For
Skilled in programming with Python or Go for building internal tools, CLIs, and APIs; familiarity with Java and Spring Boot is a plus.
Exceptional troubleshooting skills, with a proactive, critical approach to solving complex issues.
Proficient in containerization technologies, with hands-on expertise in Docker, Helm, Kubernetes (EKS), CNI, and Ingress networking.
Strong knowledge of Kubernetes concepts (pods, deployments, services, statefulsets, ingress, etc.) and the Operator pattern.
Experienced with Terraform, including developing and testing complex modules.
Advanced experience with AWS, including custom tool development using the AWS SDK.
Solid understanding of Single Sign-On (SSO), SAML, and OAuth protocols; experience with Okta is a bonus.
Skilled in using observability tools such as Prometheus, OpenTelemetry, or Datadog for proactive monitoring.
Background supporting production systems at scale in a heavily microservice-based environment.
Prior experience with CI/CD tools like TeamCity, GitHub Actions, or Jenkins.
Familiar with agile methodologies, including Scrum and Kanban, to enhance software development processes.
Excellent communication skills, with the ability to explain complex technical concepts to diverse audiences.
Bonus Points
Kubernetes or AWS certifications
Contributions to open source projects
Familiarity with cutting-edge tools like KubeVela (OAM) and Crossplane for Kubernetes-native infrastructure management
Other Requirements
Ability to read, write, and speak English
On-call – We provide 24x7 support to our customers, so expect to take turns with your teammates being on call for weekend production emergencies or providing rotating weekend operational support
Travel – Expect occasional travel (less than 5%) to other Guidewire offices for training and team meetings