SRE Engineer - Remote (Site Reliability Engineer) at Remote, USA |
Email: [email protected] |
Role: SRE Engineer
Location: Remote
Job Type: Contract
Experience: 6+ Years
Client: E&Y

Requirement:
- 5 years minimum experience
- Experience in Azure
- Testing, tuning, and monitoring, specifically performance testing
- Monitoring tools: Splunk, Dynatrace
- Experience in Kubernetes
- Non-functional testing
- Experience with observability tools

Job Description:
The Site Reliability Engineer (5+ years of experience) is responsible for supporting reliability-driven development and operations through enablement and enhancement of tools, processes, best practices, frameworks, training, and technology. SREs will be part of the development lifecycle, with more focus on shift-left phases including requirements, design/architecture, coding, and testing, to make sure performance, resiliency, reliability, scalability, and availability requirements are collected, factored into the design, developed, and tested before the application goes into production, and to provide support and guidance for post-production operations.

- Facilitate Non-Functional Requirement and SLO collection and documentation
- Conduct architecture and design reviews for reliability and performance, and identify gaps to be added to the backlog
- Conduct single-user profiling to optimize resource (CPU, memory, IO, network) usage
- Create an automation framework/dashboard for faster root cause analysis of issues arising from non-functional testing
- Be the escalation point for issues that the performance testing team cannot resolve or needs more support on
- Identify and configure observability tools for the given application
- Develop dashboards for SLOs/SLIs through the configured observability tools
- Create alerts with thresholds for application/infrastructure issues through the configured observability tools
- Enable proactive and predictive monitoring and alerting
- Cross-train developers in the use of observability tools (Splunk, Dynatrace, Grafana) for issue resolution and application tracing
- Develop an automated framework (dashboards and reports) for observability, including automated reports for SLOs and error budgets
- Develop automated dashboards/alerts for new functionality/modules
- Develop or support self-healing solutions/frameworks for repeated production issues
- Facilitate blameless RCA with the team for production issues to develop recommendations for improvement
- Serve as the escalation point for complex issues encountered in production (escalation path)
- Establish a list of common stability/resilience recommendations each team should have in place for all deployments, assist teams with adopting them for existing applications, and ensure new development includes the recommendations
- In liaison with Technical Architects, develop an automated framework for a production readiness checklist to ensure deployments are configured for stability best practices that reduce risk, and that error budgets are in a state to accept the risk of upcoming changes
- Facilitate integrating non-functional testing into the CI/CD pipeline
- Recommend the best deployment strategy and tools
- Take ownership, proactively identify issues and opportunities to improve the systems they are involved in, and act accordingly by doing the work and/or providing a plan to be executed

Qualifications:
- Proficient in SRE principles and practices
- Hands-on SRE experience, with the influencing skills to provide solutions and make decisions
- Proficient in observability tools including Splunk, Dynatrace, Prometheus/Grafana
- Proficient in Kubernetes technology
- Proficient in problem identification and solving
- Experience in software, infrastructure, or platform engineering with a combination of the following:
  o Scalability, resiliency, and reliability analysis of application design/architecture
  o Performance testing/tuning/monitoring, maximizing system uptime and availability, ensuring functional and performance SLAs
  o Experience in Agile methodologies and processes
  o Experience representing technical viewpoints to diverse audiences and making prudent and timely technical risk decisions
- Experience developing automation
- Experience in understanding end-to-end application component architecture

--
Thanks and Regards,
Praveen J
Email Address - [email protected]
Talent Acquisition Specialist
http://adepttechservices.com
11340 Lakefield Dr., Suite 200, Johns Creek, GA 30097
Ph: (678)-785-3
--
Keywords: information technology Georgia |
Tue Sep 19 23:16:00 UTC 2023 |