Hybrid: Site Reliability Engineer, No H1B, at Remote, USA

Email: [email protected]

Client: Walmart

Title: Site Reliability Engineer role with Azure and Splunk

Location: Sunnyvale CA

Duration: 6+ Months

Visa: No H1B

MOI: Skype

A LinkedIn profile with a profile picture is required. 2 candidates only.

This is a Site Reliability Engineer role for the Sam's Cash Application team.

Role and Responsibilities include:

Production Ticket Handling and Troubleshooting:
Requires knowledge of: strong analytical and problem-solving skills; root
cause analysis (RCA); root cause corrective action (RCCA). Guides team
members in RCA and RCCA to identify the origins of, and prevent, defects
and performance gaps. Analyzes complex problems involving multiple parties,
networks, hardware, software, and cloud computing technologies.

Assesses immediate restoration versus root cause analysis based on
consequences and resource requirements. Analyzes issues and plans a series
of steps to enhance an application's availability and reliability,
potentially including reconfiguration, integration, removal, or the
addition of application components. Analyzes trends to proactively prevent
incidents and provides historical summary reports.

Disaster Recovery Planning:
Requires knowledge of: disaster recovery procedures and processes;
enterprise disaster recovery systems. Coordinates partial and full tests of
contingency and disaster recovery plans. Creates and maintains data center
contingency documents and action plans. Defines and documents contingency
and disaster recovery procedures. Leads the identification of critical
functions for the assigned area of responsibility. Creates and tests plans
for operating in a remote backup environment. Coordinates the day-to-day
activities of control measures used in recovery plans.

Monitoring and Alerting:
Requires knowledge of: monitoring and alerting tools (Splunk, Prometheus,
Grafana); monitoring metrics and key performance indicators (for example,
availability, MTBF, MTTR); SLIs and SLOs (for example, request latency,
availability, error rates, saturation); distributed tracing; alerting logic.
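
For candidates unfamiliar with the KPIs named above, the following is a minimal sketch of how availability, MTBF, and MTTR might be derived from a list of outage records. The `Incident` class and `reliability_metrics` function are hypothetical names used only for illustration, and the sketch assumes the common definitions (MTTR as downtime per incident, MTBF as uptime per incident, availability as uptime over the whole window). It does not describe the client's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical outage record; not taken from any real ticketing system."""
    start: datetime  # when the outage began
    end: datetime    # when service was restored


def reliability_metrics(incidents, window_start, window_end):
    """Return (availability, MTBF, MTTR) for one observation window.

    Assumes incidents do not overlap and fall entirely inside the window.
    """
    window = window_end - window_start
    downtime = sum((i.end - i.start for i in incidents), timedelta())
    uptime = window - downtime
    count = len(incidents)
    if count == 0:
        # No failures observed: MTBF and MTTR are undefined for this window.
        return 1.0, None, None
    mttr = downtime / count        # mean time to repair
    mtbf = uptime / count          # mean time between failures
    availability = uptime / window  # fraction of the window spent up
    return availability, mtbf, mttr
```

With two incidents of 1 hour and 3 hours in a 10-day window, this gives an MTTR of 2 hours and an MTBF of 118 hours.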

Establishes metrics to monitor network, software, or system performance.
Establishes SLOs/SLAs to determine availability goals of systems/services.
Sets alerting priorities by identifying the most important systems based on
criticality. Oversees daily system monitoring, including verifying the
integrity and availability of all hardware and services, reviewing system
and application logs, and verifying the completion of scheduled jobs.
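
To make the link between an SLO and an availability goal concrete, here is a minimal sketch of an error-budget calculation: the downtime an availability target permits over a period, and the share of that budget still unspent. The function names `error_budget` and `budget_remaining` are hypothetical, and the 99.9% target and 30-day window in the usage note are example values, not figures from this posting.

```python
from datetime import timedelta


def error_budget(slo_target, window):
    """Downtime allowed by an availability SLO over the given window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target, window, downtime_so_far):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, window)
    return 1.0 - downtime_so_far / budget
```

For example, `error_budget(0.999, timedelta(days=30))` comes to roughly 43.2 minutes of allowed downtime, which is one way to rank alerts by how quickly an outage on a given system would consume its budget.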

Leads end-to-end audits of monitors and alarms based on subsystem
knowledge. Provides proactive updates to executive leadership on potential
customer-impacting issues. Analyzes systems and makes recommendations to
prevent possible incidents using knowledge of complex, company-wide
systems.

Data Reporting and Metrics:

Advanced SQL skills to pull complex data reports from multiple sources;
familiarity with Databricks or GCP BigQuery; ability to write advanced
Splunk queries that join multiple indexes to stitch data together; use of a
data-driven decision-making process to analyze the impact of production
issues and prioritize them.

Top 3 Skills Needed or Required

Strong technical, analytical, and problem-solving skills; experience
triaging and troubleshooting production issues.

Monitoring and alerting skills (Splunk, Prometheus, Grafana).

Data reporting and metrics skills (SQL, Python, PySpark, Databricks).


Mon Sep 09 22:59:00 UTC 2024



Location: Sunnyvale, California