Required||Site reliability engineering||Remote at Remote, Remote, USA |
Email: [email protected] |
From: Shivani, kpg99 [email protected]
Reply to: [email protected]

Hi,

Hope you are doing well.

My name is Shivani Saini and I'm an IT recruiter at KPG99. Kindly go through the JD below and let me know if you are interested. Please also share your updated resume with contact details.

Position: Site Reliability Engineering
Location: Remote
Duration: 6+ Months Contract

JOB DESCRIPTION

We are seeking a highly skilled and experienced Site Reliability Engineer (SRE) to join our dynamic team, focusing on the reliability, scalability, and robustness of our trading enclave products and service lines. The ideal candidate will possess a deep understanding of SRE principles, including incident management, DevOps practices, and software development, with specialized expertise in Dynatrace, Splunk, and Grafana. This role requires a strong background in root cause analysis, troubleshooting, and implementing resilient system designs such as circuit breakers, Kubernetes deployments, and various deployment strategies (blue/green, canary, etc.).

Key Responsibilities

Incident Management: Lead and manage incident response efforts, ensuring rapid recovery from operational issues and implementing preventive measures to reduce future incidents.
Monitoring and Observability: Utilize Dynatrace, Splunk, and Grafana to set up comprehensive monitoring and observability frameworks, enabling proactive detection and resolution of issues.
Performance Optimization: Analyze system performance, identify bottlenecks, and implement optimizations to improve reliability and efficiency.
Deployment Strategies: Design and implement resilient deployment strategies, including blue/green deployments, canary releases, and Kubernetes rollouts, to ensure zero-downtime updates and scalability.
Root Cause Analysis: Conduct thorough root cause analysis for incidents and issues, documenting findings and leading the implementation of corrective actions to prevent recurrence.
DevOps Practices: Champion DevOps practices across the organization, working closely with development and operations teams to streamline CI/CD pipelines and automate workflows.
Circuit Breakers Implementation: Specialize in designing and implementing circuit breaker patterns to prevent system failures and ensure high availability.

Qualifications

Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
5+ years of experience in an SRE or similar role, with a focus on trading systems or financial services.
Expertise in monitoring tools (Dynatrace, Splunk, Grafana) and Kubernetes.
Strong understanding of DevOps methodologies and tools.
Proven track record in incident management, root cause analysis, and implementing resilient system designs.
Experience with deployment strategies (blue/green, canary, etc.) and managing complex, distributed systems in a cloud environment.
Excellent problem-solving, communication, and teamwork skills.
Solid understanding of on-prem and hybrid cloud infrastructure (VMware, Linux, Windows, Azure) and container orchestration (Kubernetes, Docker).
Fairly good understanding of MongoDB, Kafka, and IBM mainframe DB2 (preferred).
Conversant with WebLogic and Java technology stacks, including Spring Boot (expert-level skill not required).
Excellent communication and leadership skills, capable of leading incident response initiatives and collaborating effectively across teams.
Certifications in relevant technologies (Dynatrace, Splunk) are a plus.

1. Dynatrace Basics:
   o Can you explain how Dynatrace OneAgent works and how it collects data from monitored applications?
2. Splunk Fundamentals:
   o How would you use Splunk to search and filter log data to identify errors or anomalies in an application?
3. Monitoring and Alerting:
   o Describe how you would set up monitoring and alerting for a critical service using Dynatrace.
4. Log Analysis with Splunk:
   o Can you provide an example of a Splunk search query you've used to troubleshoot a specific issue?
5. Performance Optimization:
   o How do you use Dynatrace to identify and address performance issues in a web application?
6. Incident Response:
   o Describe your role in an incident response scenario and how you used Dynatrace or Splunk to diagnose and resolve the issue.
7. Infrastructure Monitoring:
   o How would you monitor the health and performance of a Kubernetes cluster using Dynatrace?
8. Data Visualization:
   o How do you create dashboards in Splunk to visualize key metrics and trends for a service?
9. Collaboration with Development Teams:
   o How do you work with development teams to implement observability and monitoring solutions using Dynatrace and Splunk?
10. Continuous Improvement:
   o How do you use data from Dynatrace and Splunk to drive continuous improvement in application performance and reliability?

Thanks & Regards
Shivani Saini
Technical Recruiter
[email protected]
Direct: 609-662-6116
KPG99, INC
3240 E STATE ST EXT
Hamilton, NJ 08619
www.kpgtech.com
LinkedIn ID: https://www.linkedin.com/in/shivani-saini-1397311a2/
Mon Jul 15 19:41:00 UTC 2024 |