Site Reliability Engineer III (Data Platform)

ProNavigator

Software Engineering

Kuala Lumpur, Malaysia

Posted on Jun 4, 2026

Job Description

What You Will Do

  • Collaborate with senior SREs to enhance incident response, reinforce critical data paths, and facilitate scalable, cost-efficient operations in support of PDO’s objectives for operational excellence and AI/cloud/data platform adoption.

  • Support the maintenance and enhancement of the production environment for data platform services to ensure high availability and performance for mission-critical workloads.

  • Manage and optimize big data and streaming platforms like Kafka, Hadoop, Spark, and Hive on AWS, focusing on configuration, tuning, and daily operations.

  • Assist in defining SLOs, error budgets, capacity plans, and scaling strategies for data platform components to support new AI and data products.

  • Participate in on-call rotations to maintain high availability of services, leveraging tools like PagerDuty to triage alerts and provide technical responses to incidents impacting data and analytics platforms.

  • Resolve production issues through cross-team collaboration and adherence to established incident management practices.

  • Contribute to blameless post-incident reviews to develop reliability improvements, runbooks, and automation.

  • Build and improve automation and tooling using Go, Python, or scripting to standardize deployment and troubleshooting for data services.

  • Support CI/CD pipelines using TeamCity or GitHub Actions to enable safe and frequent service deployments.

  • Utilize Infrastructure as Code, including Terraform and AWS CloudFormation, to maintain repeatable cloud infrastructure.

  • Operate Kubernetes-based environments on AWS EKS, managing the lifecycle and scaling of containerized data services.

  • Implement progressive delivery strategies, such as blue/green and canary deployments, to minimize release risks.

What You Need to Succeed

Experience and Education

  • 4–6 years of relevant industry experience in Site Reliability Engineering, DevOps, Production Engineering, or similar roles supporting large‑scale distributed systems and/or data platforms.

  • BS/MS in Computer Science, Computer Engineering, Mathematics, or a related technical field, or equivalent practical experience.

Technical Skills

  • Experience deploying and operating services on AWS or Azure, including on-call support.

  • Expertise with data and streaming platforms like Kafka, Hadoop, Spark, or Hive.

  • Proficiency in Go, Python, or Bash for automation; Java/Spring Boot knowledge is beneficial.

  • Skill in building tools and utilities using REST APIs or gRPC.

  • Proficiency with CI/CD tools such as TeamCity, GitHub Actions, or Jenkins.

  • Experience with Infrastructure as Code (Terraform or CloudFormation). Familiarity with KubeVela/Crossplane is a plus.

  • Working knowledge of Kubernetes (EKS) and Docker for deployment and resource management.

  • Familiarity with AWS services like RDS, EMR, Redshift, MSK, and ECS.

  • Experience with observability and logging tools like Datadog or ELK.

  • Understanding of distributed systems, networking, storage, and operating systems.

  • Familiarity with agile methodologies like Scrum and Kanban.

  • Ability to solve infrastructure problems with software engineering practices and to participate in incident response.

  • Effective collaboration skills aligned with business priorities like AI and cloud adoption.

Bonus Points

  • Kubernetes/AWS certifications

  • Contributions to open source projects
