
Middle SRE / Observability Engineer

Salary: R$ 2,000 to R$ 5,000 per month

Area: Other

Level: Junior

We are strengthening our platform team with a Middle SRE / Observability Engineer to keep Kubernetes production services stable for AI research on Azure Stack. You will enhance observability, handle business-hours operational support, and work closely with engineering and research partners to improve reliability and processes. Apply now.

Responsibilities

  • Develop, operate, and enhance observability capabilities, including dashboards and visualizations in Grafana or similar tools
  • Establish and maintain metrics, SLIs, SLOs, and alerting approaches for production platforms
  • Deliver business-hours operational support for Kubernetes-based environments through troubleshooting, log analysis, and metrics-driven investigations
  • Assist with production operations for SQL-based systems by diagnosing issues and supporting performance investigations
  • Investigate incidents and system behavior to identify root causes, participate in post-incident reviews, and propose improvements to monitoring and reliability practices
  • Partner with engineering, platform, and research teams to raise observability standards, refine operational processes, and increase system reliability
  • Create and maintain documentation, share knowledge across the team, and drive ongoing improvement activities

Requirements

  • 2+ years of hands-on experience in Site Reliability Engineering, DevOps, or Production Support for live production systems
  • Practical knowledge of observability and monitoring stacks such as Grafana, Prometheus, Elastic Stack, or Datadog
  • Solid understanding of Linux systems with strong troubleshooting abilities and log analysis skills
  • Background supporting Kubernetes-based production environments
  • Working experience with SQL production support, including query troubleshooting and basic performance analysis
  • Proficiency in automation scripting using Python, Bash, or similar languages
  • Ability to assess incidents, determine root causes, and contribute to continuous improvement efforts
  • Effective communication skills and comfort collaborating with distributed, cross-functional teams
  • English proficiency at an intermediate to advanced level (B1–C1)
