Site Reliability Engineer at Remote, Remote, USA
From: Krishna, Orchids

Site Reliability Engineer

Houston, TX/Remote (17675-1)

Work Location:

On-site in Houston (client preferred); remote is a possibility for the right candidate.

This is a 6+ month project with the potential for multiple extensions.

Our client needs a Site Reliability Engineer (SRE) to join its growing Digital IT team focused on the Integrated Production Surveillance & Optimization (IPS&O) function. The SRE will support the reliability of critical Digital IT/OT applications. This transformative role involves automating IT infrastructure tasks and driving SRE best practices, tools, and processes. The ideal candidate should exhibit a growth mindset and proactively monitor and respond to incidents to ensure an optimal user experience.

The candidate must have senior-level experience deploying and supporting applications on OpenShift/Kubernetes container platforms.

The successful candidate will possess a strong developer background as well as the interpersonal skills needed to communicate design requirements and objectives while providing thought leadership to peers and leadership.

Candidates should be self-motivated and collaborative IT professionals with a strong background in software development, systems administration and IT automation.

Responsibilities:

* Maintain the survivability and reliability of IT/OT critical resources.

* Write and build CI/CD pipelines and build/release processes for IT/OT workflow applications.

* Mentor the IT/OT DevOps team in best practices for CI/CD deployments using ADO and Git.

* Perform periodic load and scalability testing to establish baselines, drift, and capacity planning.

* Conduct weekly operational state reviews covering performance trends, anomalies, errors, and other availability events with SREs, product owners, and development teams.

* Participate in quarterly business and operational reviews aligning on roadmaps, development velocity, efficiency, growth trends, etc.

* Plan and execute periodic Disaster Recovery exercises including both tabletop and simulated failures (fault injection).

Required Qualifications

* Candidates must have a bachelor's degree and 8 years of IT experience.

* Senior-level experience with OCP and Kubernetes.

* Familiarity with continuous integration/deployment processes and tools such as IDEs (Eclipse), source code management (Git/Stash), ADO Pipelines, Maven, Nexus artifacts, etc.

* Strong understanding of SRE practices: incident response, change/release management, capacity planning, infrastructure automation, elastic environments, chaos engineering and blameless postmortems.

* Expertise in application performance monitoring, observability, and proactive alert correlation, including monitoring containers and failure-based alerting.

* Scripting experience in languages such as Python and Bash.

* Experience deploying applications on OCP in both public and private clouds.

* Excellent written and oral communication skills.

* Demonstrated ability to communicate technical issues to a nontechnical audience.

* Demonstrated ability to communicate on a technical level to a technical audience.

* Strong interpersonal skills, adaptable and able to learn quickly.

* Requires limited supervision and has excellent time management skills.

* Self-motivated self-starter.

* Ability to work and interact with others in a structured/team environment.

Technology Stack:

Experience with at least one technology in each of the tech stack categories below:

* Monitoring and Logging Tools: AppDynamics, Splunk, ELK Stack, DataDog, Prometheus, AWS CloudWatch/X-Ray, Grafana

* Programming: C# .NET, PowerShell, Python, YAML

* Containers: Docker, Helm charts

* OS: Linux RHEL, Ubuntu, CentOS

* Code Repos: Azure Repos, GitHub

* Infrastructure as code: Terraform, Ansible

* Automation Tools: Jenkins, Chef, Puppet

* Agile: JIRA, SAFe

Desired Qualifications:

* Experience with cloud/virtualization technologies and their management (VMware, AWS, Azure, etc.).

* Knowledge, skills, and abilities to support web server technologies (Apache, Nginx, IIS).

* Knowledge, skills, and abilities to automate the creation of Platform as a Service (PaaS) infrastructure using industry-standard tools such as Ansible and Chef.

* Familiarity with Industrial Control System (ICS) security architecture (Purdue model).

Keywords: csharp continuous integration continuous deployment information technology Texas
Fri Oct 06 23:13:00 UTC 2023
