Senior Site Reliability Engineer (Data Platform)
ProNavigator
Software Engineering
Kuala Lumpur, Malaysia
Job Description
What You Will Do
Design and implement self‑service automation and tooling (in Go, Python, or scripting languages) to standardize and streamline deployment, operations, and troubleshooting for data platform services.
Implement and improve CI/CD pipelines (e.g., TeamCity, GitHub Actions) to support safe, frequent deployments, including promotion gates and automated quality checks.
Use Infrastructure as Code (e.g., Terraform, AWS CloudFormation) to build, harden, and maintain repeatable cloud infrastructure for data and analytics workloads.
Operate and improve Kubernetes‑based environments (AWS EKS), including deployment, scaling, and lifecycle management of containerized data services (e.g., Docker‑packaged microservices, streaming jobs).
Apply progressive delivery strategies such as blue/green and canary deployments, and support chaos engineering experiments to validate resilience and recovery mechanisms.
Collaborate on capacity planning and cost‑aware design for cloud resources across compute, storage, and networking layers for data‑intensive systems.
Build and refine end‑to‑end observability for the data platform using monitoring and logging tools (e.g., Datadog, ELK), including metrics, traces, and logs.
Develop meaningful dashboards and alerts to provide clear visibility into data pipeline health, customer experience, and platform performance.
Analyze operational data to identify reliability risks and bottlenecks, feeding insights into the roadmap and reliability backlogs.
Partner with product engineering, data platform, security, and other SRE teams to define and implement improvements in service architecture and operational practices that support PDO’s AI, cloud, and data platform priorities.
Advocate for reliability, resilience, and operational excellence in design reviews, readiness assessments, and release planning.
Contribute to a positive, inclusive work environment based on accountability, continuous learning, and psychological safety, consistent with Guidewire’s culture of determination, collaboration, continuous improvement, and bravery.
What You Need to Succeed
Experience and Education
8+ years of relevant industry experience in Site Reliability Engineering, DevOps, Production Engineering, or similar roles supporting large‑scale distributed systems and data platforms.
BS/MS in Computer Science, Computer Engineering, Mathematics, or equivalent practical experience.
Technical Skills
Strong experience with continuous deployment and operation of cloud services on public cloud (AWS), including production support and on‑call.
Hands-on experience running data platforms using big data and streaming technologies such as Kafka, Hadoop, Spark, and Hive on the public cloud.
Proficiency in at least one of Java, Go, or Python, and solid skills with scripting languages to build tools, automation, and integrations.
Experience building and operating microservices, including REST APIs and/or gRPC services.
Solid experience with CI/CD tools (e.g., TeamCity, GitHub Actions) for automated builds, tests, and deployments, including promotion gates.
Strong experience with Infrastructure as Code tools such as Terraform and AWS CloudFormation for provisioning and managing cloud infrastructure. Familiarity with KubeVela/Crossplane is a plus.
Practical knowledge of Kubernetes (e.g., AWS EKS) and Docker, including deployment patterns, service discovery, and resource management.
Familiarity with AWS services relevant to data and distributed systems, such as RDS, EMR, Redshift, MSK (Managed Streaming for Kafka), ECS, SNS, and SQS.
Expertise with monitoring, logging, and observability tools (e.g., Datadog, ELK) to instrument services and build actionable alerts and dashboards.
Deep understanding of distributed systems fundamentals, networking, storage, operating systems, and how they interact in complex multi‑tier environments.
Knowledge of capacity planning, scalability, and resilience patterns (including blue/green and canary deployments, and chaos engineering concepts).
Operational and Problem‑Solving Skills
Demonstrated experience solving infrastructure and application problems using software engineering approaches rather than only manual operations.
Familiarity with agile methodologies like Scrum and Kanban.
Strong analytical and troubleshooting skills for complex, distributed, multi‑service environments.
Experience with on‑call, incident response (e.g., PagerDuty), and post‑incident review processes, with a bias for learning and continuous improvement.
Ways of Working
Ability to collaborate effectively with other engineering, data, and operations teams to understand their systems and help improve them.
A big‑picture perspective on systems, tools, and customer value, aligning technical decisions with PDO’s priorities around operational excellence, AI, cloud, and data platform adoption.
Comfort with agile development methodologies and iterative delivery in a highly collaborative environment.
Eagerness to learn, experiment, and grow, staying current with emerging technologies across cloud, data, and SRE practices and applying them thoughtfully where they add real value.
Bonus Points
Kubernetes/AWS certifications
Contributions to open source projects