Site Reliability Engineer | Remote, USA |
Email: [email protected] |
Hi,

I hope you are doing well today. Please find the requirement below; if you are comfortable with it, share your updated resume and contact details ASAP.

Job Title: Site Reliability Engineer
Location: Remote
Duration: 12+ Months
Visa: No H1B, OPT, or CPT
MOI: Zoom
Exp: 6+ Years

Requirement:
- 5 years minimum experience
- Experience in Azure
- Testing, tuning, and monitoring, specifically performance testing
- Monitoring tools: Splunk, Dynatrace
- Experience in Kubernetes
- Non-functional testing
- Experience with observability tools

Job Description:
The Site Reliability Engineer (5+ years of experience) is responsible for supporting reliability-driven development and operations through enablement and enhancement of tools, processes, best practices, frameworks, training, and technology.

SREs will be part of the development lifecycle, with more focus on shift-left phases including Requirements, Design/Architecture, Coding, and Testing. They make sure that performance, resiliency, reliability, scalability, and availability requirements are collected, factored into the design, developed, and tested before the application goes into production, and they provide support/guidance for post-production operations.

- Facilitate Non-Functional Requirement and SLO collection and documentation
- Conduct architecture and design reviews for reliability and performance; identify gaps to be added to the backlog
- Conduct single-user profiling for resource (CPU, memory, IO, network) usage optimization
- Create automation frameworks/dashboards for faster root cause analysis of issues arising from non-functional testing
- Be the escalation point for issues that the performance testing team cannot figure out or needs more support on
- Identify and configure observability tools for the given application
- Develop dashboards for SLOs/SLIs through configured observability tools
- Create alerts with thresholds for application/infra issues through configured observability tools
- Enable proactive and predictive monitoring and alerting
- Cross-train developers in the use of observability tools (Splunk, Dynatrace, Grafana) for issue resolution and application tracing
- Develop an automated framework (dashboards and reports) for observability, including automated reports for SLOs and error budgets
- Develop automated dashboards/alerts for new functionality/modules
- Develop or support self-healing solutions/frameworks for repeated production issues
- Facilitate blameless RCA with the team for production issues to develop recommendations for improvement
- Serve as the escalation point for complex issues encountered in production (escalation path)
- Establish a list of common stability/resilience recommendations each team should have in place for all deployments; assist teams with adopting them for existing applications and ensure new development includes the recommendations
- In liaison with Technical Architects, develop an automated framework for a production readiness checklist to ensure deployments are configured for stability best practices that reduce risk and that error budgets are in a state to accept the risk of upcoming changes
- Facilitate integrating non-functional testing into the CI/CD pipeline
- Recommend the best deployment strategy and tools
- Take ownership and proactively identify issues and opportunities to improve the systems they are involved in, acting accordingly by doing the work and/or providing a plan to be executed

Qualifications:
- Proficient in SRE principles and practices
- Hands-on SRE experience, with influencing skills in providing solutions and making decisions
- Proficient in observability tools including Splunk, Dynatrace, Prometheus/Grafana
- Proficient in Kubernetes technology
- Proficient in problem identification and solving
- Experience in software, infrastructure, or platform engineering with a combination of the following:
  o Scalability, resiliency, reliability analysis of application design/architecture
  o Performance testing/tuning/monitoring, maximizing system uptime and availability, ensuring functional and performance SLAs
- Experience in Agile methodologies and processes
- Experience representing technical viewpoints to diverse audiences and making prudent and timely technical risk decisions
- Experience developing automation
- Experience in understanding end-to-end application component architecture

Amit Vikal (AV)
Thoth IT LLC
--
Keywords: information technology |
Wed Sep 20 02:10:00 UTC 2023 |