Urgent Req - Site Reliability Engineer (SRE) Lead (12+ years) || Remote || Contract at Remote, USA |
Email: [email protected] |
ONLY REPLY TO [email protected] TO REVIEW PROFILE.

Hi Professional,

I hope you're doing well! My name is Abhay and I'm an IT Recruiter at Diverse Lynx. I have an urgent position for the following role. If interested, please share your resume at [email protected] or call me at 732-452-1006 Ext 618.

Site Reliability Engineer (SRE) Lead - Public Sector Core Framework team
Remote

About the Role
The client is seeking a Lead Software Engineer to join its Public Sector Core Framework platform team and play a critical role as a Site Reliability Engineer (SRE) within its Azure/Kubernetes ecosystem. In this role, you will be responsible for ensuring the stability, scalability, and performance of the platform, contributing significantly to the continued success of the client.

Key Responsibilities
Champion SRE Practices: Lead the team in strengthening SRE practices, including defining service level indicators (SLIs), service level objectives (SLOs), error budgets, thresholds, alerting, and error management systems.
Site Planning and Optimization: Collaborate with development and testing teams to plan changes for production and other environments. Optimize planned outages, streamlining DevOps activities and minimizing downtime.
Toil Reduction: Identify repetitive tasks (toil) and develop solutions to improve efficiency and reduce manual workload.
Automation Advocacy: Leverage automation wherever possible to enhance stability, functionality, and overall platform management.
Alert Management: Strengthen alerting systems by establishing goals, criteria, and processes for alert recalls, resets, enabling/disabling alerts, and revising error budgets based on team toil.
Outage Prevention and Response: Proactively address non-critical alerts and collaborate with development and testing teams to prevent outages.
Performance Verification: Work closely with Load and Performance teams to redefine parameters such as load and concurrent user capacity.
Incident Management: Lead and facilitate meetings with development and operations teams during incidents to ensure effective resolution.
Post-Incident Reviews: Lead post-incident reviews with teams to perform root cause analyses (RCAs), develop long-term solutions (code changes, configuration adjustments, architectural modifications, or capacity planning), and implement learnings to prevent future issues.
Reliability Reporting: Generate reports using defined reliability metrics, including availability, Mean Time to Restore (MTTR), Mean Time Between Repairs (MTBR), and Probability of Failure.
Continuous Improvement: Develop and maintain a backlog of opportunities for SRE improvements.
Security Clearance: With company sponsorship, obtain and maintain a U.S. Federal Government "Public Trust" suitability clearance (required).

Requirements
Proven experience and expertise within the Site Reliability Engineering (SRE) discipline.
In-depth knowledge and experience administering Azure systems.
Proficiency with Kubernetes systems and familiarity with Podman/Docker and Helm Charts.
Strong programming skills in Python.
Experience using GitHub for version control.
Understanding of resiliency and reliability design patterns.

Bonus Points (Will be a strong plus)
Experience with Prometheus, AKS Monitoring, Grafana, and automation tools.

Benefits:
Opportunity to work with cutting-edge technologies
Work in a collaborative and fast-paced environment

We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

Best Regards,
Abhay Singh
IT Recruiter
Diverse Lynx LLC.
Email: [email protected] | URL: http://www.diverselynx.com
LinkedIn ID: https://www.linkedin.com/in/abhaysingh-chauhan/
Diverse Lynx LLC|300 Alexander Park|Suite #200|Princeton, NJ 08540 |
Fri Apr 19 00:31:00 UTC 2024 |