Site Reliability Engineering Manager, GWCP
ProNavigator
Software Engineering, Other Engineering
Curitiba, PR, Brazil
Job Description
What You'll Do
Technical Leadership & Execution
Provide technical direction and oversight for SRE initiatives, ensuring best practices in reliability, scalability, and performance.
Remain hands-on where needed, contributing to system design, automation, and incident resolution.
Guide the design and development of tools supporting 24x7 follow-the-sun operations.
Drive automation across infrastructure provisioning, deployments, and operational workflows.
Ensure effective observability strategies (metrics, logging, tracing) and promote self-healing systems.
Partner with engineering teams to influence system design for reliability and operability.
Reliability, Process Engineering & Continuous Improvement
Design, evolve, and simplify SRE processes (incident management, production readiness, capacity planning, change management) with a focus on effectiveness over overhead.
Apply process engineering principles so that processes stay lightweight and scalable and enable teams rather than slow them down.
Prioritize people over process: use processes as guardrails, not rigid workflows, and empower engineers to make sound decisions.
Proactively identify gaps, inefficiencies, and risks—and drive them through to resolution with a bias for action.
Establish and enforce SLOs, SLIs, and error budgets across services.
Lead major incident response and ensure blameless postmortems result in real, implemented improvements, not just documentation.
Continuously reduce operational toil through automation and simplification.
Ensure follow-the-sun operations are practical, sustainable, and optimized for real-world execution.
People Leadership
Hire, onboard, and develop site reliability engineers.
Lead your direct reports by setting clear expectations, providing guidance, and removing obstacles to execution.
Foster a culture of ownership, accountability, and service orientation.
Support engineers in making decisions and taking action, rather than relying on rigid processes or escalation.
Encourage critical thinking and problem-solving over checklist-driven execution.
Balance workload across the team, ensuring sustainable on-call participation and operational responsibilities.
Set clear priorities and ensure the team is focused on high-impact work that improves reliability and customer outcomes.
Cross-Team Collaboration & Service Mentality
Act as a key stakeholder across SRE Platform, Product Development, and Cloud Engineering teams.
Demonstrate a strong service mentality—ensuring platform capabilities meet the needs of internal teams and customers.
Balance platform standards with pragmatism, enabling teams while maintaining reliability and guardrails.
Partner with teams to solve problems collaboratively, rather than acting as a gatekeeper.
Drive adoption of best practices through influence, not enforcement alone.
Operational Strategy & Execution
Define and track metrics that reflect real outcomes (reliability, customer impact, team efficiency), not just process adherence.
Ensure work is prioritized toward meaningful improvements in reliability, scalability, and developer experience.
Continuously evaluate whether processes, tools, and practices are delivering value—and adjust when they are not.
Avoid unnecessary process overhead; focus on enabling teams to move faster without compromising safety.
Advocate for and drive investments in platform improvements and reliability initiatives.
Documentation & Knowledge Sharing
Ensure high-quality documentation, runbooks, and operational guidance.
Promote knowledge sharing across teams and regions.
Enable teams to operate independently through clear documentation and tooling.
Who You Are
Technical Expertise
Strong programming skills in Python or Go; experience with Java/Spring Boot is a plus.
Deep experience with Kubernetes (EKS), including networking, ingress, and operator patterns.
Expertise in Terraform and infrastructure as code at scale.
Advanced knowledge of AWS services and distributed systems architecture.
Strong background in observability tools such as Prometheus, OpenTelemetry, or Datadog.
Experience supporting production systems at scale in a microservices environment.
Familiarity with CI/CD systems such as TeamCity, GitHub Actions, or Jenkins.
Understanding of SSO, SAML, and OAuth; experience with Okta is a plus.
Leadership & Ownership
Proven experience directly managing engineers while remaining technically credible.
Demonstrated ability to build and evolve processes that serve people and outcomes, not bureaucracy.
Strong sense of ownership with a track record of driving issues through to resolution.
Demonstrated ability to identify problems, take initiative, and implement solutions without waiting for direction.
Ability to balance short-term operational needs with long-term improvements.
Comfortable making decisions and taking accountability in high-pressure situations.
Collaboration & Communication
Excellent communication skills with the ability to influence across teams.
Ability to translate complex technical concepts into clear, actionable insights.
Experience working in agile environments (Scrum, Kanban).
Mindset
Strong service-oriented mindset with a focus on enabling others to succeed.
Bias toward action and problem-solving over coordination and escalation.
Focus on outcomes, not process overhead.
Passion for reliability, automation, and continuous improvement.
Curiosity and willingness to explore emerging technologies, including AI, to improve productivity and outcomes.
Bonus Points
Kubernetes or AWS certifications.
Experience leading SRE or platform teams.
Contributions to open source projects.
Familiarity with tools like KubeVela (OAM) or Crossplane.
Experience implementing SLO/error budget frameworks at scale.