
Urgent Req: Site Reliability Engineer (SRE) Lead || Remote || Contract at Remote, USA
Email: [email protected]
Reply only to [email protected] to have your profile reviewed.

Share Lead profiles only.

Hi Professional

I hope you're doing well!

My name is Abhay and I'm an IT Recruiter at Diverse Lynx.

I have an urgent opening for the following role. If interested, please share your resume at [email protected]

Site Reliability Engineer (SRE) Lead - Public Sector Core Framework team

Remote

About the Role

The client is seeking a Lead Software Engineer to join its Public
Sector Core Framework platform team and play a critical role as a Site
Reliability Engineer (SRE) within its Azure/Kubernetes ecosystem. In this role,
you will be responsible for ensuring the stability, scalability, and
performance of the platform, contributing significantly to the continued
success of the client.

Key Responsibilities

Champion SRE Practices: Lead the team in strengthening SRE practices,
including defining service level indicators (SLIs), service level objectives (SLOs),
error budgets, thresholds, alerting, and error management systems.

Site Planning and Optimization: Collaborate with development and testing
teams to plan changes for production and other environments. Optimize
planned outages, streamlining DevOps activities and minimizing downtime.

Toil Reduction: Identify repetitive tasks (toil) and develop solutions to
improve efficiency and reduce manual workload.

Automation Advocacy: Leverage automation wherever possible to enhance stability,
functionality, and overall platform management.

Alert Management: Strengthen alerting systems by establishing goals,
criteria, and processes for alert recalls, resets, enabling/disabling
alerts, and revising error budgets based on team toil.

Outage Prevention and Response: Proactively address non-critical alerts and
collaborate with development and testing teams to prevent outages.

Performance Verification: Work closely with Load and Performance teams to
redefine parameters like load and concurrent user capacity.

Incident Management: Lead and facilitate meetings with development and
operations teams during incidents to ensure effective resolution.

Post-Incident Reviews: Lead post-incident reviews with teams to perform root
cause analysis (RCA), develop long-term solutions (code changes, configuration
adjustments, architectural modifications, or capacity planning), and
implement learnings to prevent future issues.

Reliability Reporting: Generate reports using defined reliability metrics,
including availability, Mean Time to Restore (MTTR), Mean Time Between
Repairs (MTBR), and Probability of Failure.

Continuous Improvement: Develop and maintain a backlog of opportunities for SRE
improvements.

Security Clearance: With company sponsorship, obtain and maintain a U.S.
Federal Government "Public Trust" suitability clearance (required).
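
For candidates who want a concrete picture of the error budget and reliability reporting items above, here is a minimal Python sketch of how such figures are commonly calculated. The function names, the 99.9% target, the 30-day window, and the sample outage durations are illustrative assumptions, not the client's actual tooling or targets.

```python
from typing import Sequence


def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over a window, in minutes."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def availability(uptime_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the observation window during which the service was up."""
    total = uptime_minutes + downtime_minutes
    if total <= 0:
        raise ValueError("observation window must be positive")
    return uptime_minutes / total


def mean_time_to_restore(outage_minutes: Sequence[float]) -> float:
    """Average duration of the recorded outages, in minutes."""
    if not outage_minutes:
        raise ValueError("no outages recorded")
    return sum(outage_minutes) / len(outage_minutes)


if __name__ == "__main__":
    # Hypothetical month: a 99.9% SLO and three outages (minutes).
    outages = [12.0, 8.0, 10.0]
    window = 30 * 24 * 60
    budget = error_budget_minutes(0.999, window_days=30)
    used = sum(outages)
    print(f"Error budget:  {budget:.1f} min")
    print(f"Budget used:   {used:.1f} min")
    print(f"Availability:  {availability(window - used, used):.5f}")
    print(f"MTTR:          {mean_time_to_restore(outages):.1f} min")
```

In practice these inputs would come from monitoring data (for example Prometheus or AKS Monitoring, both named in this posting) rather than hard-coded lists.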

Requirements

Proven experience and expertise within the Site Reliability Engineering (SRE) discipline.

In-depth knowledge and experience administering Azure systems.

Proficiency with Kubernetes systems and familiarity with Podman/Docker and Helm
Charts.

Strong programming skills in Python.

Experience using GitHub for version control.

Understanding of resiliency and reliability design patterns.

Bonus Points (a strong plus)

Experience with Prometheus, AKS Monitoring, Grafana, and automation tools.

Benefits:

Opportunity to work with cutting-edge technologies

Work in a collaborative and fast-paced environment

We are an equal opportunity employer and value diversity at our
company. We do not discriminate on the basis of race, religion, color, national
origin, gender, sexual orientation, age, marital status, veteran status, or
disability status.

Best Regards,

Abhay Singh

IT Recruiter

Diverse Lynx LLC.

Email: [email protected]

URL: http://www.diverselynx.com

LinkedIn ID: https://www.linkedin.com/in/abhaysingh-chauhan/

Diverse Lynx LLC | 300 Alexander Park | Suite #200 | Princeton, NJ 08540

--

Keywords: information technology wtwo Idaho New Jersey
Thu Apr 11 22:42:00 UTC 2024
