* Salary: R$ 2,000 to R$ 5,000 per month (estimated)

* The amount shown is an estimate based on public data and market references. We do not guarantee that this is the salary offered for this specific position.

Area: Other

Level: Manager

We are seeking a Lead 3rd Line Support Specialist / Site Reliability Engineer to provide advanced support and reliability solutions for critical cloud-based systems.

The role emphasizes reliability, performance, and observability within AWS environments, with a focus on Kubernetes, monitoring technologies, database proficiency, and distributed systems like Kafka.

Responsibilities

Ensure observability for AWS Cloud and Kubernetes workloads using Prometheus, Grafana, OpenTelemetry, Fluent Bit, OpenSearch, CloudWatch, CloudTrail, Athena, and other tools
Oversee and troubleshoot EKS, Aurora RDS (PostgreSQL), and other AWS infrastructure at an advanced level
Apply automated remediations and self-healing approaches
Engage in incident response, root-cause analysis, and postmortems
Integrate security measures to enhance cluster reliability (IAM, network policies, Config)
Maintain and improve existing AWS infrastructure
Collaborate with L3 teams to escalate, analyze, and resolve operational challenges

Requirements

5+ years of professional experience in DevOps or Site Reliability Engineering
Proficiency in Grafana, Prometheus, OpenSearch, OpenTelemetry, Fluent Bit, CloudWatch, and CloudTrail
Understanding of distributed tracing, metrics pipelines, and log aggregation
Competency in troubleshooting and managing EKS (Kubernetes), RDS (PostgreSQL), MSK (Kafka), Network (VPC, SG, Route Tables), IAM (Roles and Policies), CloudWatch, and AWS observability tools
Familiarity with AWS networking, security, scaling, and reliability methodologies
Expertise in Kubernetes systems (operations, debugging, networking, scaling) and PostgreSQL performance tuning, monitoring, and diagnostics
Capability to lead incident response efforts, conduct RCA and postmortems, and manage SLAs
Skills in scripting using Bash or Python and automating cloud operations, observability setups, and incident recovery mechanisms
Excellent analytical problem-solving skills and clear communication with technical and non-technical stakeholders
Flexibility to adapt to a fast-moving Agile environment
English language proficiency at B2+ level

Nice to have

Knowledge of Azure tools including AKS (Kubernetes), Azure Monitor, Application Insights, Log Analytics
Background in Cosmos DB and PostgreSQL
Skills in Azure DevOps, Terraform, and ArgoCD