Site Reliability Engineer at Remote, Remote, USA |
Email: [email protected] |
From: Krishna, Orchids [email protected]
Reply to: [email protected]

Site Reliability Engineer
Houston, TX/Remote (17675-1)

Work Location: On-site in Houston (client preferred); remote is a possibility for the right candidate.
Duration: 6+ month project with the potential for multiple extensions.

Our client needs a Site Reliability Engineer (SRE) to join its growing Digital IT team focused on the Integrated Production Surveillance & Optimization (IPS&O) function. The SRE will support the reliability of critical Digital IT/OT applications. This transformative role involves automating IT infrastructure tasks and driving SRE best practices, tools, and processes. The ideal candidate exhibits a growth mindset and proactively monitors and responds to incidents to keep the user experience optimal.

The candidate must have senior-level experience deploying and supporting applications on OpenShift/Kubernetes container platforms. The successful candidate will have a strong developer background as well as the interpersonal skills needed to communicate design requirements and objectives while providing thought leadership to peers and leadership. Candidates should be self-motivated, collaborative IT professionals with a strong background in software development, systems administration, and IT automation.

Responsibilities:
* Maintain survivability and reliability of critical IT/OT resources.
* Write and build CI/CD pipelines and build/release processes for IT/OT workflow applications.
* Mentor the IT/OT DevOps team in best practices for CI/CD deployments using ADO and Git.
* Perform periodic load and scalability testing to establish baselines, detect drift, and inform capacity planning.
* Conduct weekly operational state reviews covering performance trends, anomalies, errors, and other availability events with SREs, product owners, and development teams.
* Participate in quarterly business and operational reviews aligning on roadmaps, development velocity, efficiency, growth trends, etc.
* Plan and execute periodic Disaster Recovery exercises, including both tabletop exercises and simulated failures (fault injection).

Required Qualifications:
* Bachelor's degree and 8 years of IT experience.
* Senior-level experience with OCP and Kubernetes.
* Familiarity with continuous integration/deployment processes and tools such as IDEs (Eclipse), source code management (Git/Stash), ADO Pipelines, Maven, Nexus artifacts, etc.
* Strong understanding of SRE practices: incident response, change/release management, capacity planning, infrastructure automation, elastic environments, chaos engineering, and blameless postmortems.
* Expertise in application performance monitoring, observability, and proactive alert correlation, including monitoring containers and failure-based alerting.
* Scripting experience in languages such as Python and Bash.
* Experience deploying applications on OCP in both public and private cloud.
* Excellent written and oral communication skills.
* Demonstrated ability to communicate technical issues to a nontechnical audience.
* Demonstrated ability to communicate on a technical level with a technical audience.
* Strong interpersonal skills; adaptable and able to learn quickly.
* Requires limited supervision and has excellent time management skills.
* Self-motivated self-starter.
* Ability to work and interact with others in a structured team environment.

Technology Stack:
Experience with at least one technology in each of the categories below:
* Monitoring and Logging Tool(s): AppDynamics, Splunk, ELK Stack, DataDog, Prometheus, AWS CloudWatch/X-Ray, Grafana
* Programming: C# .NET, PowerShell, Python, YAML
* Containers: Docker, Helm Chart
* OS: Linux RHEL, Ubuntu, CentOS
* Code Repos: Azure Repos, GitHub
* Infrastructure as Code: Terraform, Ansible
* Automation Tools: Jenkins, Chef, Puppet
* Agile: JIRA, SAFe

Desired Qualifications:
* Experience with cloud/virtual technologies and their management (VMware, AWS, Azure, etc.).
* Knowledge, skills, and abilities to support web server technologies (Apache, Nginx, IIS).
* Knowledge, skills, and abilities to automate the creation of Platform as a Service (PaaS) infrastructure using industry-standard tools such as Ansible and Chef.
* Familiarity with Industrial Control System (ICS) security architecture (Purdue model).

Keywords: csharp continuous integration continuous deployment information technology Texas |
Fri Oct 06 23:13:00 UTC 2023 |