Job Details

Site Reliability Lead | Boston, MA | Hybrid at Boston, Massachusetts, USA

Email: [email protected]

From:

Vinod Katkam,

Agile enterprises Solutions

vinod_katkam@aesincus.com

Reply to: vinod_katkam@aesincus.com

Implementation partner: Cigniti

Client: No information

Location:
Boston, MA (hybrid for local candidates; non-local candidates will have to relocate to Boston, MA, and can then work hybrid)

Please find the skill matrix below.

SNO | Skills | Years of experience | Ratings out of 10
1 | SRE | |
2 | Azure | |
3 | Prometheus, Grafana, ELK Stack, or Splunk | |
4 | PowerShell | |
5 | Terraform | |
6 | ARM Templates | |
7 | MySQL, PostgreSQL, or MongoDB | |
8 | Docker | |
9 | Infrastructure-as-code | |
10 | AKS | |
11 | AWS | |
12 | GCP | |
13 | Python | |
14 | DataDog | |

1. SRE role

Job Description: Site Reliability Lead (SRL) - DataDog, Cloud, Python, PowerShell, Ansible (10+ years experience)

Summary:

We are looking for an experienced Site Reliability Engineer (SRE) with expertise in cloud technologies, Python programming, PowerShell, and Ansible. As an SRE, you will be responsible for ensuring the reliability, availability, and performance of our systems and infrastructure. You will collaborate with cross-functional teams to design and implement automation, monitor system health, and proactively identify and resolve issues.

Responsibilities:

1. Design, build, and maintain highly available and scalable infrastructure on cloud platforms such as AWS, Azure, or GCP.

2. Develop and maintain automation scripts and tools using Python, PowerShell, and Ansible for deployment, configuration management, and system monitoring.

3. Collaborate with development teams to ensure the deployment of reliable and efficient applications and services.

4. Implement and improve monitoring and alerting systems to identify and address performance bottlenecks, availability issues, and capacity constraints.

5. Troubleshoot and resolve complex infrastructure issues, including performance optimization, network connectivity, and security concerns.

6. Perform regular system performance analysis and capacity planning to ensure scalability and efficiency of the infrastructure.

7. Design and implement disaster recovery strategies and ensure business continuity.

8. Collaborate with security teams to ensure compliance with security policies and industry best practices.

9. Continuously evaluate and adopt new technologies and tools to improve system reliability, performance, and operational efficiency.

10. Participate in on-call rotations and respond to incidents to minimize downtime and impact on system availability.

11. Document system configurations, processes, and troubleshooting procedures.

12. Mentor and provide guidance to junior members of the team.

Requirements:

1. Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

2. 7-10 years of experience working as a Site Reliability Engineer or in a similar role.

3. Strong experience with cloud platforms such as AWS, Azure, or GCP, including infrastructure provisioning, networking, and security.

4. Proficiency in programming languages such as Python and PowerShell for automation, scripting, and infrastructure management.

5. Extensive experience with configuration management tools like Ansible for provisioning and managing infrastructure as code.

6. Solid understanding of DevOps principles and practices, including CI/CD pipelines and version control systems.

7. Strong knowledge of containerization technologies like Docker and container orchestration platforms like Kubernetes.

8. Experience with monitoring and log aggregation tools such as Prometheus, Grafana, ELK Stack, or Splunk.

9. Deep understanding of networking concepts, including TCP/IP, DNS, load balancing, and firewalls.

10. Familiarity with database technologies like MySQL, PostgreSQL, or MongoDB.

11. Strong problem-solving skills and the ability to troubleshoot complex issues in a distributed, large-scale production environment.

12. Excellent communication and collaboration skills, with the ability to work effectively in cross-functional teams.

13. Experience with infrastructure-as-code tools like Terraform is a plus.

14. Relevant certifications such as AWS Certified DevOps Engineer, Azure Administrator, or Certified Kubernetes Administrator (CKA) are a plus.

Keywords: continuous integration, continuous deployment, information technology, Massachusetts

Fri Oct 06 00:20:00 UTC 2023

Location: Boston, Massachusetts