Onsite - Platform monitoring Engineer or Cloud DevOps Engineer local to Virginia - NO H1b at Reston, Virginia, USA

Role - Platform monitoring Engineer or Cloud DevOps Engineer

Location - Reston, Virginia (Onsite)

Job Description:

The client is seeking an experienced monitoring tools and OpenTelemetry Subject Matter Expert (SME) who will be responsible for designing, implementing, and optimizing monitoring solutions and leveraging OpenTelemetry to enhance observability within the Enterprise Command Center (ECC).

The SME should collaborate with the Incident Management team to troubleshoot and resolve incidents.

Key Job Functions:

Lead the design and implementation of monitoring solutions using industry-standard tools such as Splunk and others.

Customize monitoring configurations to align with organizational requirements.

Implement and integrate OpenTelemetry across various applications and services for enhanced observability.

Optimize monitoring solutions for efficiency and accuracy, ensuring minimal impact on system performance.

Design and implement application and infrastructure performance monitoring in an AWS Cloud environment.

Create monitors and dashboards to track application and infrastructure performance.

Perform deep statistical analysis of performance data to help identify capacity and performance bottlenecks.

Configure alerting mechanisms within monitoring tools to proactively identify and address potential issues.

Develop comprehensive documentation for monitoring tool configurations, OpenTelemetry implementations, and best practices.

Provide training to incident management teams on using monitoring tools and interpreting OpenTelemetry data effectively.

Set up monitoring dashboards for incident detection and alerting.

Perform end-to-end analysis of transactions in an observability environment.

Troubleshoot incidents and identify root causes quickly using wire data analytics, application performance management, and event correlation monitoring tools.

Diagnose and resolve incidents by providing factual data from the various monitoring and instrumentation systems.

Job Requirements:

A good understanding of IT Cloud infrastructure, including AWS Cloud, middleware, database, storage, and/or network infrastructure.

Strong understanding of IT infrastructure, networking, security concepts, and application architecture.

Hands-on experience with OpenTelemetry instrumentation and telemetry data collection.

Proven experience as a Splunk SME with in-depth knowledge of Splunk architecture and components.

Excellent troubleshooting and problem-solving skills.

Strong documentation skills and attention to detail.

Proactive monitoring of hardware, software, and environmental alerts or malfunctions.

Analyze dashboards and monitoring tools to look for trends and patterns in application/infrastructure health and performance.

Monitor applications and infrastructure using tools like Splunk, Dynatrace, Catchpoint, Moogsoft, xMatters, SignalFx, SolarWinds, ExtraHop, etc.

Expert understanding of microservice-based applications deployed in the Cloud using Lambdas, ECS Fargate, etc.

Proficiency in AWS services like IAM, Roles, Security groups, EC2, S3, Lambda, ALB, ECS, etc.

Experience working with AWS tools like ELB, RDS, Redshift, DynamoDB, Aurora, Route53, Lambda, S3, Batch, CloudWatch, CloudTrail, WAF, etc.

Hands-on experience with transaction-level monitoring using Dynatrace and Splunk.

Create Splunk search queries and dashboards.

Be the SME in helping recognize and onboard new data sources into Splunk and other tools, analyze the data for anomalies and trends, and build dashboards highlighting the key trends of the data.

Implement best-in-class engineering strategies to support a distributed clustered Splunk environment consisting of Search Heads, Indexers, Forwarders, and the Splunk Enterprise Security (ES) app, spanning security, performance, engineering, and operational roles.

Use the open-source observability framework OpenTelemetry for instrumenting, generating, collecting, and exporting telemetry data such as traces, metrics, and logs to help analyze application performance and behavior.

Use distributed tracing in an end-to-end visibility environment that consists of microservices, containers, serverless, and Lambda.

Work closely with application teams and business stakeholders to perform troubleshooting and aid in incident triage.

Influence other technical teams on incident calls and articulate troubleshooting steps effectively.

Follow up on items that could negatively impact production operations, assist with postmortem-related activities, and support various efforts related to operational improvements.

Strong relationship management skills and aptitude to multi-task and work well in a high-stress environment, both within teams and independently.

Preferred Qualifications:

Familiarity with distributed tracing and logging solutions.

Knowledge of Cloud Platforms (AWS, Azure) and their integration with monitoring tools.

AWS Solution Architect Associate or higher certification.

Exposure to working in an incident management environment.

Triage incidents to resolution in a 24/7/365 environment, effectively guide incident triage calls from a technical perspective, share technical details obtained from monitoring tools and dashboards to aid troubleshooting, outline details of resolution activities, provide timely status updates to stakeholders, assist with postmortem-related activities, and support various efforts related to operational improvements.

Ability to report incident details and metrics to senior leadership.

Perform analysis of data, evaluating multiple application protocols including web, database, storage, and supporting infrastructure such as UNIX, DNS, LDAP, SSL, SMTP, and FTP.

Proficient in scripting: UNIX/Linux shell scripting and Python. Working knowledge of JavaScript, Perl, etc. for customizing monitoring configurations.

Certification in relevant monitoring tools or OpenTelemetry is a plus.

Wed Feb 07 20:27:00 UTC 2024
