Job description for Site Reliability Engineer at PT Delta Omega Beta
You will own reliability for a product domain end-to-end — defining SLOs, building the observability stack, running
multi-cluster infrastructure, and leading high-severity incident response. You'll embed with engineering squads, mentor
junior SREs, and connect infrastructure decisions to business outcomes.
RESPONSIBILITIES
• Own SLO compliance and error budgets for a service domain; run quarterly capacity planning reviews and trigger
reliability sprints when budgets are at risk
• Build and maintain infrastructure on GCP, AWS, and Alibaba Cloud using Terraform (modules, remote state,
workspaces)
• Manage multi-cluster Kubernetes environments across staging and production — deployments, config maps, secrets,
resource tuning, and rollbacks
• Design and own CI/CD pipelines (GitLab CI + ArgoCD) and implement the full observability stack (Prometheus,
Grafana, structured logs, distributed traces) for owned services
• Identify and execute cloud cost optimisations: right-sizing, spot instances, reserved capacity, and idle resource
elimination
• Run fault-injection experiments to validate graceful degradation; serve as incident commander for P1/P2 incidents
and publish blameless post-mortems within 48 hours
• Embed with engineering squads to catch reliability risks before features are built; mentor junior SREs and deliver a
monthly reliability report to leadership
REQUIREMENTS
• 2–4 years in SRE, DevOps, or platform engineering in a production environment
• Hands-on Terraform experience at the module level (remote state, workspace separation) and multi-cluster Kubernetes management
• Experience building CI/CD pipelines (GitLab CI preferred) and GitOps delivery with ArgoCD
• Working knowledge of at least two of: GCP, AWS, Alibaba Cloud
• Practical, hands-on observability experience with Prometheus, Grafana, and PromQL alerting rules
• Experience owning or shadowing P1/P2 incident command; comfortable communicating with stakeholders under pressure
• Applied SLI/SLO/error budget experience on real services
• Automation scripting in Bash and/or Python
NICE TO HAVE
• Professional cloud certification (AWS Certified Solutions Architect - Professional, Google Cloud Professional Cloud Architect, or equivalent)
• FinOps / cloud billing governance experience
• Experience writing and facilitating blameless post-mortems
