Job description: Junior Site Reliability Engineer, TIX ID
Key Responsibilities:
- Ensure the availability and reliability of services in accordance with defined SLA/SLO/SLI by continuously monitoring production systems and applications, and implementing effective alerting mechanisms.
- Handle production incidents through participation in on-call rotations. Conduct thorough Root Cause Analysis (RCA) to prevent recurrence and improve incident response processes and documentation.
- Develop automation for deployment, scaling, failover, and monitoring, using tools such as Terraform, Ansible, Pulumi, Chef, or Puppet. Build and maintain CI/CD pipelines to support seamless software delivery.
- Monitor and proactively improve system performance. Work closely with development teams to ensure systems are scalable and resilient, and conduct regular benchmarking and stress testing.
- Plan future system capacity based on projected growth, and optimize cloud or data center resources to ensure cost efficiency.
- Implement and monitor operational security measures, including patching, access control, and audit logging. Collaborate with the security team to comply with industry standards such as ISO 27001 and SOC 2, and manage sensitive data securely.
- Collaborate with both Development and Operations teams by adopting a DevOps mindset. Facilitate a smooth release process and contribute to system design and architectural discussions.
- Build and maintain observability systems that include logging, metrics, and tracing using tools like Prometheus, Grafana, Loki, OpenTelemetry, or the ELK stack. Ensure observability is integrated into both development and operational pipelines.
- Maintain clear and thorough documentation for systems, operational procedures, and incident postmortems. Share knowledge regularly to improve team capabilities and foster a culture of continuous learning.
Requirements:
This job might be a good fit for you if you:
- Have 1–3 years of experience as an SRE, DevOps Engineer, or in a similar software engineering role.
- Possess strong programming skills and experience with scripting languages, especially Shell scripting.
- Have a solid understanding of Linux operating system internals, filesystems, disk/storage technologies, and networking.
- Have experience working with AWS and/or other cloud platforms (mandatory).
- Are proficient with CI/CD tools such as Jenkins or GitHub Actions (mandatory).
- Have hands-on experience with Infrastructure as Code tools like Terraform, Ansible, Pulumi, Chef, or Puppet.
- Have experience working with Docker and containerization technologies (mandatory).
- Are experienced in analyzing, monitoring, and troubleshooting large-scale, high-traffic distributed systems.
- Have good knowledge of distributed service architectures, including load balancing, service discovery, caching, and tracing.
- Are familiar with security tools such as Sonar, Trivy, and Aqua Security (considered a plus).