Site Reliability Engineer III (Data Platform)
ProNavigator
Software Engineering
Kuala Lumpur, Malaysia
Job Description
What You Will Do
Collaborate with senior SREs to enhance incident response, reinforce critical data paths, and facilitate scalable, cost-efficient operations in support of PDO’s objectives for operational excellence and AI/cloud/data platform adoption.
Support the maintenance and enhancement of the production environment for data platform services to ensure high availability and performance for mission-critical workloads.
Manage and optimize big data and streaming platforms like Kafka, Hadoop, Spark, and Hive on AWS, focusing on configuration, tuning, and daily operations.
Assist in defining SLOs, error budgets, capacity plans, and scaling strategies for data platform components to support new AI and data products.
Participate in on-call rotations to maintain high availability of services, leveraging tools like PagerDuty to triage alerts and provide technical responses to incidents impacting data and analytics platforms.
Resolve production issues through cross-team collaboration and adherence to established incident management practices.
Contribute to blameless post-incident reviews to develop reliability improvements, runbooks, and automation.
Build and improve automation and tooling using Go, Python, or scripting to standardize deployment and troubleshooting for data services.
Support CI/CD pipelines using TeamCity or GitHub Actions to enable safe and frequent service deployments.
Utilize Infrastructure as Code, including Terraform and AWS CloudFormation, to maintain repeatable cloud infrastructure.
Operate Kubernetes-based environments on AWS EKS, managing the lifecycle and scaling of containerized data services.
Implement progressive delivery strategies, such as blue/green and canary deployments, to minimize release risks.
What You Need to Succeed
Experience and Education
4–6 years of relevant industry experience in Site Reliability Engineering, DevOps, Production Engineering, or similar roles supporting large‑scale distributed systems and/or data platforms.
BS/MS in Computer Science, Computer Engineering, Mathematics, or a related technical field, or equivalent practical experience.
Technical Skills
Experience deploying and operating services on AWS or Azure, including on-call support.
Expertise with data and streaming platforms like Kafka, Hadoop, Spark, or Hive.
Proficiency in Go, Python, or Bash for automation; Java/Spring Boot knowledge is beneficial.
Skill in building tools and utilities using REST APIs or gRPC.
Proficiency with CI/CD tools such as TeamCity, GitHub Actions, or Jenkins.
Experience with Infrastructure as Code (Terraform or CloudFormation). Familiarity with KubeVela or Crossplane is a plus.
Working knowledge of Kubernetes (EKS) and Docker for deployment and resource management.
Familiarity with AWS services like RDS, EMR, Redshift, MSK, and ECS.
Experience with observability and logging tools like Datadog or ELK.
Understanding of distributed systems, networking, storage, and operating systems.
Familiarity with agile methodologies like Scrum and Kanban.
Ability to solve infrastructure problems by applying software engineering practices, and to participate in incident response.
Effective collaboration skills aligned with business priorities like AI and cloud adoption.
Bonus Points
Kubernetes/AWS certifications
Contributions to open source projects