Salary: R$ 11,000 to R$ 20,000 per month
Area: Information Technology
Level: Senior
We are delivering scalable Kubernetes and Linux compute foundations for GPU-heavy workloads, and a Senior DevOps Engineer will help keep them reliable and fast. You will manage Kubernetes and Volcano scheduling, enforce quotas, and automate workflows using Python and UNIX Shell scripting in a client-facing delivery setup. Apply now to join the team.
Responsibilities
- Build, configure, and operate GPU-enabled Kubernetes clusters and standalone Linux compute environments to maximize workload scheduling and performance
- Run Volcano scheduling end-to-end, including queue creation, Pod execution, GPU assignment, and namespace quota enforcement
- Manage Kubernetes environments comprehensively, including namespaces, RBAC, resource quotas, and workload isolation approaches
- Create and support automation scripts in Python and Shell to streamline job submission, provisioning, and reporting
- Partner with orchestration, optimization, and observability teams to improve scheduling efficiency, capacity utilization, and researcher workflows
- Track infrastructure health and resource utilization, and provide data to support optimization and reporting needs
- Recommend and drive enhancements to infrastructure, tooling, and automation workflows to improve performance, scalability, and usability
- Maintain operational processes that enable a seamless and efficient researcher experience across AI and computational workloads
Requirements
- Minimum 3 years of experience in DevOps or infrastructure engineering roles within complex, large-scale environments
- Deep expertise in Kubernetes administration and orchestration, including namespaces, Pod scheduling and distribution, PersistentVolumeClaims (PVCs), NFS storage, and resource quota management
- Practical experience using Volcano for GPU job execution, queue configuration, and workload prioritization integrated with Kubernetes
- Demonstrated experience running GPU cluster environments in Kubernetes and on standalone Linux compute nodes
- Advanced skills in Python scripting for infrastructure automation and strong UNIX Shell scripting such as Bash
- Strong Linux administration knowledge, including troubleshooting, performance tuning, and configuration management
- Good command of infrastructure automation and orchestration concepts and related tooling
- Fluent English communication skills (spoken and written) to work directly with clients
Nice to have
- Working knowledge of Helm for Kubernetes application packaging
- Experience with observability tooling such as Prometheus, Grafana, and Loki
- Exposure to Infrastructure as Code tooling, including Terraform
- Familiarity with multi-cloud Kubernetes options such as Amazon EKS and Google GKE
- Knowledge of Azure networking, including VPN, ExpressRoute, and network security
- Comfort with AI-assisted coding tools like GitHub Copilot, ChatGPT, and Claude
- Understanding of hybrid (cloud and on-premises) scheduling and resource optimization
