Salary: R$ 2,000 to R$ 5,000 per month
Area: Other
Level: Junior
We are strengthening our platform team with a Mid-Level Site Observability Engineer to keep Kubernetes production services stable for AI research on Azure Stack. You will enhance observability, handle business-hours operational support, and work closely with engineering and research partners to improve reliability and processes. Apply now.
Responsibilities
- Develop, operate, and enhance observability capabilities, including dashboards and visualizations in Grafana or similar tools
- Establish and maintain metrics, SLIs, SLOs, and alerting approaches for production platforms
- Deliver business-hours operational support for Kubernetes-based environments through troubleshooting, log analysis, and metrics-driven investigations
- Assist with production operations for SQL-based systems by diagnosing issues and supporting performance investigations
- Investigate incidents and system behavior to identify root causes, participate in post-incident reviews, and propose improvements to monitoring and reliability practices
- Partner with engineering, platform, and research teams to raise observability standards, refine operational processes, and increase system reliability
- Create and maintain documentation, share knowledge across the team, and drive ongoing improvement activities
Requirements
- 2+ years of hands-on experience in Site Reliability Engineering, DevOps, or Production Support for live production systems
- Practical knowledge of observability and monitoring stacks such as Grafana, Prometheus, Elastic Stack, or Datadog
- Solid understanding of Linux systems with strong troubleshooting abilities and log analysis skills
- Background supporting Kubernetes-based production environments
- Working experience with SQL production support, including query troubleshooting and basic performance analysis
- Proficiency in automation scripting using Python, Bash, or similar languages
- Ability to assess incidents, determine root causes, and contribute to continuous improvement efforts
- Effective communication skills and comfort collaborating with distributed, cross-functional teams
- English proficiency at an intermediate to advanced level (B1–C1)
