Urgent Req - Site Reliability Engineer (SRE) Lead (12+ years) || Remote || Contract at Remote, USA
Email: [email protected]
ONLY REPLY TO [email protected] TO REVIEW PROFILE.

Hi Professional

I hope you're doing well!

My name is Abhay and I'm an IT Recruiter at Diverse Lynx.

I have an urgent opening for the following role. If interested, please share
your resume at [email protected] or call me at 732-452-1006, Ext. 618.

Site Reliability Engineer (SRE) Lead - Public Sector Core Framework team

Remote

About the Role

The client is seeking a Lead Software Engineer to join its Public Sector Core
Framework platform team and play a critical role as a Site Reliability
Engineer (SRE) within its Azure/Kubernetes ecosystem. In this role, you will
be responsible for ensuring the stability, scalability, and performance of the
platform, contributing significantly to the continued success of the client.

Key Responsibilities

Champion SRE Practices: Lead the team in strengthening SRE practices,
including defining service level indicators (SLIs), service level objectives
(SLOs), error budgets, thresholds, alerting, and error management systems.

Site Planning and Optimization: Collaborate with development and testing
teams to plan changes for production and other environments. Optimize
planned outages to streamline DevOps activities and minimize downtime.

Toil Reduction: Identify repetitive tasks (toil) and develop solutions to
improve efficiency and reduce manual workload.

Automation Advocacy: Leverage automation wherever possible to enhance
stability, functionality, and overall platform management.

Alert Management: Strengthen alerting systems by establishing goals,
criteria, and processes for alert recalls, resets, enabling/disabling
alerts, and revising error budgets based on team toil.

Outage Prevention and Response: Proactively address non-critical alerts and
collaborate with development and testing teams to prevent outages.

Performance Verification: Work closely with Load and Performance teams to
redefine parameters like load and concurrent user capacity.

Incident Management: Lead and facilitate meetings with development and
operations teams during incidents to ensure effective resolution.

Post-Incident Reviews: Lead post-incident reviews with teams to perform root
cause analysis (RCA), develop long-term solutions (code changes,
configuration adjustments, architectural modifications, or capacity
planning), and implement learnings to prevent future issues.

Reliability Reporting: Generate reports using defined reliability metrics,
including availability, Mean Time to Restore (MTTR), Mean Time Between
Repairs (MTBR), and Probability of Failure.

Continuous Improvement: Develop and maintain a backlog of opportunities for
SRE improvements.

Security Clearance: With company sponsorship, obtain and maintain a U.S.
Federal Government "Public Trust" suitability clearance (required).

Requirements

Proven experience and expertise within the Site Reliability Engineering (SRE)
discipline.

In-depth knowledge and experience administering Azure systems.

Proficiency with Kubernetes systems and familiarity with Podman/Docker and
Helm Charts.

Strong programming skills in Python.

Experience using GitHub for version control.

Understanding of resiliency and reliability design patterns.

Bonus Points (will be a strong plus)

Experience with Prometheus, AKS Monitoring, Grafana, and automation tools.

Benefits:

Opportunity to work with cutting-edge technologies

Work in a collaborative and fast-paced environment

We are an equal opportunity employer and value diversity at our company. We
do not discriminate on the basis of race, religion, color, national origin,
gender, sexual orientation, age, marital status, veteran status, or
disability status.

Best Regards,

Abhay Singh

IT Recruiter

Diverse Lynx LLC.

Email: [email protected] | URL: http://www.diverselynx.com

LinkedIn ID: https://www.linkedin.com/in/abhaysingh-chauhan/

Diverse Lynx LLC | 300 Alexander Park | Suite #200 | Princeton, NJ 08540

--

Keywords: information technology wtwo Idaho New Jersey
Fri Apr 19 00:31:00 UTC 2024
