- Working closely with the Chief Software Architect to develop a holistic SRE roadmap that improves our reliability and performance
- Leading other SREs to develop and implement a comprehensive alerting and monitoring system that surfaces problems before they become major production incidents
- Maintaining and optimizing our deployment and release workflow, and its supporting tools, for 50 or more engineers
- Implementing a triage system for issues that arise in production and leading incident response as needed
- Partnering with the Chief Software Architect and other engineers on capacity planning, configuration management and secrets management for new and existing services
- Maintaining, testing and executing disaster recovery procedures as needed
- Composure: When production issues occur, SREs should be able to stay calm and systematically identify root causes.
- Good Communication Skills: This role cuts across many service teams and requires coordination with them.
- Infrastructure-as-Code: Glints uses Terraform to provision infrastructure, and Helm charts and Kubernetes files to provision services.
- Cloud Computing & Containers: Glints runs on cloud infrastructure, using a mix of AWS, GCP and DigitalOcean. We use Kubernetes, Docker and Linux extensively.
- Distributed Systems: Many of our services require high availability (HA), which in turn requires an understanding of the key characteristics of distributed systems.
- Monitoring, Logging and Alerting Tools: Glints uses the ELK stack and plans to deploy monitoring and alerting using Prometheus and Grafana (or similar).
- Deployment Automation: Glints uses GitLab CI/CD with shell scripts and Helm charts to deploy to target environments. Knowledge of TypeScript, Go and/or Python is a plus, as these are the main languages in use.
- Documentation: It’s important to record operational knowledge in an easily accessible location.
- Flexible work hours
- Remote work options
- Unlimited paid leave
- Medical coverage