
Job title: Site Reliability Engineer; Location: Local to Atlanta, GA, Hybrid; Duration: 12 months; at Atlanta, Georgia, USA
Email: [email protected]
From:

Suresh,

VYZE INC

[email protected]

Reply to:   [email protected]

Hi, I hope you are doing great. Please go through the job description below and send me your consultant's updated resume along with their visa status and current location.

Please include education details in each candidate's resume.

Job title: Site Reliability Engineer

Location: Local to Atlanta, GA Hybrid

Duration: 12+ months

Visa: Any

MOI: Skype

Client: Delta Airlines

LINKEDIN IS A MUST.

MUST HAVE:

Qualifications:

Bachelor's degree in design, computer science, or a related technical field

Strong debugging, troubleshooting, and problem-solving skills

Proficient in Node.js; familiarity with other scripting languages and tools is a plus: JavaScript, Python, Maven, Ansible, Bash, etc.

Experience with monitoring and alerting systems like Dynatrace, Prometheus, Grafana.

Experience with log and metric analytics platforms like Sumo Logic, Splunk

Experience setting SLOs / SLIs / error budgets and managing reliability for infrastructure and applications using Kubernetes, AWS Native components, CloudWatch, Dynatrace.

Experience handling large numbers of diverse systems with configuration management systems like Puppet, Chef, Ansible

Proven history of leveraging automation

Experience using tools like PagerDuty for managing incidents.

Understanding of standard networking protocols and components such as HTTP, DNS, ECMP, TCP/IP, ICMP, the OSI Model, Subnetting and Load Balancing strategies

Experience in Serverless Application Framework

Experience in containerized workloads and management platforms such as Docker or Kubernetes

Familiarity with distributed systems, including microservices, is a plus.

Experience in Infrastructure automation tools such as CDK

Understanding of CI/CD processes and experience with deployment automation tools such as CodePipeline, CodeDeploy, Jenkins, Bamboo

Effective communication, collaboration & negotiation skills with the ability to interface with various business units and vendors.

Experience liaising with developers, operations engineers, and third-party resources.

Experience consuming APIs.

Soft Skills:

Ability to work in a team and independently.

Excellent verbal and written communication skills

Multitasking

Time management

Responsibilities:

Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.

Improve end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence.

Partner with business and technical product owners to set SLOs / SLIs / error budgets to manage reliability of infrastructure and applications

Scale and optimize existing infrastructure and services sustainably through mechanisms, including automation, and evolve them by improving reliability and efficiency.

Manage end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence

Maintain infrastructure (infrastructure as code) and services by measuring and monitoring system metrics to proactively identify operational efficiencies, potential outages, and security threats in Development, UAT, Staging and Production environments.

Practice sustainable incident response and blameless postmortems

Build infrastructure and drive projects that break things with the aim of improving the robustness of production systems

Preserve operational visibility and response capabilities by fixing and improving our dashboards, alerts, and automation.

Maintain operational uptime and reliability by participating in triage and issue support calls for mission-critical systems.

Monitoring service-level indicators (SLIs). An SLI could be the number of successful requests out of total requests. Having a high SLI, in this case, would be a target. SREs track other metrics such as availability, uptime, performance, latency, error count and throughput. Regularly monitoring systems is essential to ensure proper resource utilization of containers and to avoid out-of-memory (OOM) errors.
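As a rough illustration of the request-based SLI described above, the ratio of successful to total requests can be computed as in the following Python sketch. The function name and the sample counts are hypothetical, and treating a window with no traffic as fully healthy is an assumption rather than a fixed convention.

```python
def success_ratio_sli(successful_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that succeeded, between 0.0 and 1.0."""
    if successful_requests < 0 or total_requests < 0:
        raise ValueError("request counts must be non-negative")
    if successful_requests > total_requests:
        raise ValueError("successful requests cannot exceed total requests")
    if total_requests == 0:
        # Assumption: no traffic in the window counts as healthy.
        # Some teams prefer to report "no data" instead.
        return 1.0
    return successful_requests / total_requests


# Hypothetical counts for one measurement window.
sli = success_ratio_sli(successful_requests=99_950, total_requests=100_000)
print(f"Success-ratio SLI: {sli:.4%}")
```

A higher value is better here; the result is what would later be compared against an SLO target.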

Setting SLOs and SLAs and determining error budgets. Once you have determined baseline system performance, you can set service-level objectives (SLOs). These are typically internal targets like 99.99% availability. While SREs typically oversee functional metrics, some teams set goals for non-functional metrics, as well. SREs help determine service-level agreements (SLAs), which are more legally binding and typically partner-facing.
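To make the link between an SLO and its error budget concrete, here is a minimal Python sketch. It assumes a time-based availability SLO measured over a rolling 30-day window; the window length, function names, and downtime figure are illustrative assumptions, not something specified in this posting.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    window_minutes = window_days * 24 * 60
    # The error budget is the share of the window the SLO allows to be "bad".
    return (1.0 - slo_target) * window_minutes


def budget_remaining_minutes(slo_target: float, window_days: int,
                             downtime_minutes: float) -> float:
    """Error budget left after observed downtime; negative means the SLO is breached."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


# A 99.99% availability target over an assumed 30-day window.
budget = error_budget_minutes(0.9999, window_days=30)
left = budget_remaining_minutes(0.9999, 30, downtime_minutes=1.0)
print(f"Budget: {budget:.2f} min, remaining after 1 min of downtime: {left:.2f} min")
```

At 99.99% over 30 days the budget comes to roughly 4.3 minutes, which shows how little downtime such a target tolerates.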

Responding to incidents. On-call SREs will be tasked with finding the root cause of issues as they arise. When triaging an incident, it's helpful to have all the necessary logs and tools immediately at hand. This is one area where automation can assist by pulling relevant details to instantly build a case, said Curtis.

Writing postmortems. After an incident has been dealt with, it's important to learn from it. Postmortems are common in cybersecurity practice and often fall under the responsibility of an SRE. These reviews seek to answer set criteria to get to the heart of an incident and identify the root cause(s) of an issue to prevent it from happening again.

Thanks and Regards

Suresh Nayak

Technical Recruiter

Vyze INC

Email:

[email protected]

25179 Methley Plum Place, Aldie, VA 20105

www.vyzeinc.com

Disclaimer:
This communication, along with any documents, files or attachments, is intended only for the use of the addressee and may contain confidential information. If you are not the intended recipient, you are hereby notified that any dissemination, distribution or copying of any information contained in or attached to this communication is strictly prohibited. To remove your email address permanently from future mailings, please send REMOVE to
[email protected]

Keywords: continuous integration continuous deployment information technology golang Georgia Virginia
[email protected]
Thu Aug 31 00:23:00 UTC 2023
