Company Overview:
Lean Tech is a rapidly expanding organization situated in Medellín, Colombia. We pride ourselves on possessing one of the most influential networks within software development and IT services for the entertainment, financial, and logistics sectors. Our corporate projections offer a multitude of opportunities for professionals to elevate their careers and experience substantial growth. Joining our team means engaging with expansive engineering teams across Latin America and the United States, contributing to cutting-edge developments in multiple industries. Currently, we are seeking a Site Reliability Engineer (SRE) to join our team. Here are the challenges that our next warrior will face and the requirements we look for:

Position Title: Site Reliability Engineer (SRE)
Location: Remote (Colombia)

What you will be doing:
This senior-level position is focused on the design, implementation, and maintenance of robust, scalable, and high-performing infrastructure. The primary purpose of this role is to collaborate closely with development teams to ensure system stability and scalability through advanced automation and monitoring improvements. Key responsibilities include architecting, deploying, and maintaining systems on AWS, managing Kubernetes clusters, and developing CI/CD pipelines. This position requires an advanced understanding of AWS, Kubernetes, Prometheus, and Grafana, as well as proficiency in scripting with Python, Bash, or Go. The role is integral to the company's broader mission, emphasizing streamlined integration and deployment within a collaborative work environment.

- Architect and maintain scalable, reliable systems on AWS, utilizing advanced AWS best practices.
- Oversee Kubernetes clusters to ensure optimal performance and availability in production environments.
- Develop and implement comprehensive monitoring and visualization strategies utilizing Prometheus and Grafana.
- Define, measure, and report on SLOs, SLIs, and SLAs to continuously enhance system reliability and performance.
- Drive automation of operational tasks through Infrastructure as Code tools like Terraform and CloudFormation.
- Create robust CI/CD pipelines to facilitate seamless and efficient software deployments.
- Perform in-depth root cause analyses on production issues and implement comprehensive solutions to prevent recurrence.
- Design, update, and manage detailed runbooks and escalation processes to improve incident management efficiency.
- Collaborate closely with development and DevOps teams to ensure effective integration and deployment processes.
- Document systems, configurations, and processes with precision to support operational continuity and knowledge sharing.

Required Skills & Experience:
- Advanced proficiency in AWS services for architecting, deploying, and maintaining scalable and reliable systems.
- Advanced expertise in managing Kubernetes in production environments to ensure high availability and performance.
- Strong proficiency in Prometheus for monitoring and Grafana for visualization.
- Intermediate understanding and use of CI/CD tools such as GitHub Actions, Jenkins, GitLab CI/CD, or CircleCI.
- Intermediate proficiency with Infrastructure as Code tools like Terraform or CloudFormation.
- Experience with configuration management tools including Ansible, Chef, or Puppet.
- Proficient in scripting languages such as Python, Bash, or Go.
- Solid understanding of Linux/Unix systems and networking concepts.
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
- Minimum of 3 years in a Site Reliability Engineer or DevOps role.

Nice to Haves:
- Experience with log aggregation tools such as ELK Stack or Fluentd for efficient log management.
- Knowledge of database systems, both SQL and NoSQL, to support diverse data storage needs.
- Familiarity with service meshes like Traefik, Istio, or Linkerd to enhance microservices communication.
- Experience with cloud-native application development and serverless architectures.
- Excellent problem-solving skills with a focus on improving system efficiency and performance.
- Strong communication and collaboration abilities for effective team interaction.

Soft Skills:
- Excellent problem-solving and analytical skills.
- Strong communication and collaboration abilities, with the capacity to work effectively across different time zones.

Why you will love Lean Tech:
- Join a powerful tech workforce and help us change the world through technology.
- Professional development opportunities with international customers.
- Collaborative work environment.
- Career path and mentorship programs that will lead to new levels.

Join Lean Tech and contribute to shaping the data landscape within a dynamic and growing organization. Your skills will be honed, and your contributions will play a vital role in our continued success.

Lean Tech is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.