
Lead 3rd Line Support Specialist / Site Reliability Engineer

* Salary: R$ 2,000 to R$ 5,000 per month (estimated)

* The amount shown is an estimate based on public data and market benchmarks. We do not guarantee that this is the salary offered for this specific position.

Area: Other

Level: Manager

We are seeking a Lead 3rd Line Support Specialist / Site Reliability Engineer to provide advanced support and reliability solutions for critical cloud-based systems.


The role emphasizes reliability, performance, and observability within AWS environments, with a focus on Kubernetes, monitoring technologies, database proficiency, and distributed systems like Kafka.

Responsibilities

  • Ensure observability for AWS Cloud and Kubernetes workloads using Prometheus, Grafana, OpenTelemetry, Fluent Bit, OpenSearch, CloudWatch, CloudTrail, Athena, and other tools
  • Oversee and troubleshoot EKS, Aurora RDS (PostgreSQL), and other AWS infrastructure at an advanced level
  • Apply automated remediations and self-healing approaches
  • Engage in incident response, root-cause analysis, and postmortems
  • Integrate security measures to enhance cluster reliability (IAM, network policies, Config)
  • Maintain and improve existing AWS infrastructure
  • Collaborate with L3 teams to escalate, analyze, and resolve operational challenges

Requirements

  • 5+ years of professional experience in DevOps or Site Reliability Engineering
  • Proficiency in Grafana, Prometheus, OpenSearch, OpenTelemetry, Fluent Bit, CloudWatch, and CloudTrail
  • Understanding of distributed tracing, metrics pipelines, and log aggregation
  • Competency in troubleshooting and managing EKS (Kubernetes), RDS (PostgreSQL), MSK (Kafka), Network (VPC, SG, Route Tables), IAM (Roles and Policies), CloudWatch, and AWS observability tools
  • Familiarity with AWS networking, security, scaling, and reliability methodologies
  • Expertise in Kubernetes systems (operations, debugging, networking, scaling) and PostgreSQL performance tuning, monitoring, and diagnostics
  • Ability to lead incident response efforts, conduct RCA and postmortems, and manage SLAs
  • Skills in scripting using Bash or Python and automating cloud operations, observability setups, and incident recovery mechanisms
  • Excellent analytical problem-solving skills and the ability to communicate clearly with technical and non-technical stakeholders
  • Flexibility to adapt to a fast-moving Agile environment
  • English language proficiency at B2+ level

Nice to have

  • Knowledge of Azure tools including AKS (Kubernetes), Azure Monitor, Application Insights, Log Analytics
  • Background in Cosmos DB and PostgreSQL
  • Skills in Azure DevOps, Terraform, and ArgoCD
