* Salary: R$ 2,000 to R$ 5,000 per month (estimated)
* The amount shown is an estimate calculated from public data and market benchmarks. We do not guarantee that this is the salary offered for this specific position.
Area: Other
Level: Manager
We are seeking a Lead 3rd Line Support Specialist / Site Reliability Engineer to provide advanced support and reliability solutions for critical cloud-based systems.
The role emphasizes reliability, performance, and observability within AWS environments, with a focus on Kubernetes, monitoring technologies, database proficiency, and distributed systems like Kafka.
Responsibilities
- Ensure observability for AWS Cloud and Kubernetes workloads using Prometheus, Grafana, OpenTelemetry, Fluent Bit, OpenSearch, CloudWatch, CloudTrail, Athena, and other tools
- Oversee and troubleshoot EKS, Aurora RDS (PostgreSQL), and other AWS infrastructure at an advanced level
- Apply automated remediations and self-healing approaches
- Engage in incident response, root-cause analysis, and postmortems
- Integrate security measures to enhance cluster reliability (IAM, network policies, Config)
- Maintain and improve existing AWS infrastructure
- Collaborate with L3 teams to escalate, analyze, and resolve operational challenges
Requirements
- 5+ years of professional experience in DevOps or Site Reliability Engineering
- Proficiency in Grafana, Prometheus, OpenSearch, OpenTelemetry, Fluent Bit, CloudWatch, and CloudTrail
- Understanding of distributed tracing, metrics pipelines, and log aggregation
- Competency in troubleshooting and managing EKS (Kubernetes), RDS (PostgreSQL), MSK (Kafka), Network (VPC, SG, Route Tables), IAM (Roles and Policies), CloudWatch, and AWS observability tools
- Familiarity with AWS networking, security, scaling, and reliability methodologies
- Expertise in Kubernetes systems (operations, debugging, networking, scaling) and PostgreSQL performance tuning, monitoring, and diagnostics
- Ability to lead incident response efforts, conduct RCA and postmortems, and manage SLAs
- Scripting skills in Bash or Python, and experience automating cloud operations, observability setups, and incident recovery mechanisms
- Excellent analytical, problem-solving, and interpersonal communication skills, with the ability to work with both technical and non-technical stakeholders
- Flexibility to adapt to a fast-moving Agile environment
- English language proficiency at B2+ level
Nice to have
- Knowledge of Azure tools including AKS (Kubernetes), Azure Monitor, Application Insights, Log Analytics
- Background in Cosmos DB and PostgreSQL
- Skills in Azure DevOps, Terraform, and ArgoCD
