Site Reliability Lead | Boston, MA | Hybrid at Boston, Massachusetts, USA |
Email: [email protected] |
From: Vinod Katkam, Agile enterprises Solutions [email protected]
Reply to: [email protected]

Implementation partner: Cigniti
Client: No information
Location: Boston, MA (For locals, it is hybrid; non-locals will have to relocate to Boston, MA, and from there they can work hybrid.)

Please fill in the skill matrix below.

SNO  Skills  Years of experience  Ratings out of 10
1  SRE
2  Azure
3  Prometheus, Grafana, ELK Stack, or Splunk
4  PowerShell
5  Terraform
6  ARM Templates
7  MySQL, PostgreSQL, or MongoDB
8  Docker
9  Infrastructure-as-code
10  AKS
11  AWS
12  GCP
13  Python
14  DataDog

1. SRE role

Job Description: Site Reliability Lead (SRL) - DataDog, Cloud, Python, PowerShell, Ansible (10+ years experience)

Summary:
We are looking for an experienced Site Reliability Engineer (SRE) with expertise in cloud technologies, Python programming, PowerShell, and Ansible. As an SRE, you will be responsible for ensuring the reliability, availability, and performance of our systems and infrastructure. You will collaborate with cross-functional teams to design and implement automation, monitor system health, and proactively identify and resolve issues.

Responsibilities:
1. Design, build, and maintain highly available and scalable infrastructure on cloud platforms such as AWS, Azure, or GCP.
2. Develop and maintain automation scripts and tools using Python, PowerShell, and Ansible for deployment, configuration management, and system monitoring.
3. Collaborate with development teams to ensure the deployment of reliable and efficient applications and services.
4. Implement and improve monitoring and alerting systems to identify and address performance bottlenecks, availability issues, and capacity constraints.
5. Troubleshoot and resolve complex infrastructure issues, including performance optimization, network connectivity, and security concerns.
6. Perform regular system performance analysis and capacity planning to ensure scalability and efficiency of the infrastructure.
7. Design and implement disaster recovery strategies and ensure business continuity.
8. Collaborate with security teams to ensure compliance with security policies and industry best practices.
9. Continuously evaluate and adopt new technologies and tools to improve system reliability, performance, and operational efficiency.
10. Participate in on-call rotations and respond to incidents to minimize downtime and impact on system availability.
11. Document system configurations, processes, and troubleshooting procedures.
12. Mentor and provide guidance to junior members of the team.

Requirements:
1. Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
2. 7-10 years of experience working as a Site Reliability Engineer or in a similar role.
3. Strong experience with cloud platforms such as AWS, Azure, or GCP, including infrastructure provisioning, networking, and security.
4. Proficiency in programming languages such as Python and PowerShell for automation, scripting, and infrastructure management.
5. Extensive experience with configuration management tools like Ansible for provisioning and managing infrastructure as code.
6. Solid understanding of DevOps principles and practices, including CI/CD pipelines and version control systems.
7. Strong knowledge of containerization technologies like Docker and container orchestration platforms like Kubernetes.
8. Experience with monitoring and log aggregation tools such as Prometheus, Grafana, ELK Stack, or Splunk.
9. Deep understanding of networking concepts, including TCP/IP, DNS, load balancing, and firewalls.
10. Familiarity with database technologies like MySQL, PostgreSQL, or MongoDB.
11. Strong problem-solving skills and the ability to troubleshoot complex issues in a distributed, large-scale production environment.
12. Excellent communication and collaboration skills, with the ability to work effectively in cross-functional teams.
13. Experience with infrastructure-as-code tools like Terraform is a plus.
14. Relevant certifications such as AWS Certified DevOps Engineer, Azure Administrator, or Certified Kubernetes Administrator (CKA) are a plus.

Keywords: continuous integration, continuous deployment, information technology, Massachusetts |
Fri Oct 06 00:20:00 UTC 2023 |