Job title: Site Reliability Engineer. Location: Local to Atlanta, GA (Hybrid). Duration: 12 months, at Atlanta, Georgia, USA |
Email: [email protected] |
From: Suresh, VYZE INC [email protected]
Reply to: [email protected]

Hi,

Hope you are doing great. Please go through the job description below and send me your consultant's updated resume along with visa status and current location. Please include education details in each candidate's resume.

Job title: Site Reliability Engineer
Location: Local to Atlanta, GA (Hybrid)
Duration: 12+ months
Visa: Any
MOI: Skype
Client: Delta Airlines

LINKEDIN IS A MUST.

MUST HAVE:

Qualifications:
Bachelor's degree in design, computer science, or a related technical field.
Strong debugging, troubleshooting, and problem-solving skills.
Proficient in Node.js; familiarity with other scripting languages is a plus: JavaScript, Python, Maven, Ansible, Bash, etc.
Experience with monitoring and alerting systems like Dynatrace, Prometheus, Grafana.
Experience with log and metrics analytics platforms like Sumologic, Splunk.
Experience setting SLOs / SLIs / error budgets and managing reliability for infrastructure and applications using Kubernetes, AWS Native components, CloudWatch, Dynatrace.
Experience handling large numbers of diverse systems with configuration management systems like Puppet, Chef, Ansible.
Proven history of leveraging automation.
Experience using tools like PagerDuty for managing incidents.
Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI Model, Subnetting and Load Balancing strategies.
Experience in Serverless Application Framework.
Experience in containerized workloads and management platforms such as Docker or Kubernetes.
Familiarity with distributed systems, including Microservices, is a plus.
Experience in Infrastructure automation tools such as CDK.
Understanding of CI/CD processes and experience with deployment automation tools such as Code Pipeline, Code Deploy, Jenkins, Bamboo.
Effective communication, collaboration and negotiation skills, with the ability to interface with various business units and vendors.
Experience liaising with developers, operations engineers, and third-party resources.
Experience consuming APIs.

Soft Skills:
Ability to work in a team and independently.
Excellent verbal and written communication skills.
Multitasking.
Time management.

Responsibilities:
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
Responsible for improvements to end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence.
Partner with business and technical product owners to set SLOs / SLIs / error budgets to manage reliability of infrastructure and applications.
Scale and optimize existing infrastructure and services sustainably through mechanisms, including automation, and evolve them by improving reliability and efficiency.
Manage end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence.
Maintain infrastructure (infrastructure as code) and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages, and security threats in Development, UAT, Staging and Production environments.
Practice sustainable incident response and blameless postmortems.
Build infrastructure and drive projects that break things with the aim of improving the robustness of production systems.
Preserve operational visibility and response capabilities by fixing and improving our dashboards, alerts, and automation.
Maintain operational uptime and reliability by participating in triage and issue support calls for mission-critical systems.

Monitoring service-level indicators (SLIs). An SLI could be the number of successful requests out of total requests. In this case, a high SLI value would be the target. SREs track other metrics such as availability, uptime performance, latency, error count and throughput.
Regularly monitoring systems is essential to ensure proper resource utilization of containers and to avoid out-of-memory (OOM) errors.

Setting SLOs and SLAs and determining error budgets. Once you have determined baseline system performance, you can set service-level objectives (SLOs). These are typically internal targets like 99.99% availability. While SREs typically oversee functional metrics, some teams set goals for non-functional metrics as well. SREs help determine service-level agreements (SLAs), which are more legally binding and typically partner-facing.

Responding to incidents. On-call SREs will be tasked with finding the root cause of issues as they arise. When triaging an incident, it's helpful to have all the necessary logs and tools immediately at hand. This is one area where automation can assist by pulling relevant details to instantly build a case, said Curtis.

Writing postmortems. After an incident has been dealt with, it's important to learn from it. Postmortems are common in cybersecurity practice and often fall under the responsibility of an SRE. These reviews seek to answer set criteria to get to the heart of an incident and identify the root cause(s) of an issue to prevent it from happening again.

Thanks and Regards,
Suresh Nayak
Technical Recruiter
Vyze INC
Email: [email protected]
25179 Methley Plum Place, Aldie, VA 20105
www.vyzeinc.com

Disclaimer: This communication, along with any documents, files or attachments, is intended only for the use of the addressee and may contain confidential information. If you are not the intended recipient, you are hereby notified that any dissemination, distribution or copying of any information contained in or attached to this communication is strictly prohibited. To remove your email address permanently from future mailings, please send REMOVE to [email protected]

Keywords: continuous integration, continuous deployment, information technology, golang, Georgia, Virginia |
Thu Aug 31 00:23:00 UTC 2023 |