| Site Reliability Engineer :: Location : Sunnyvale, CA (Onsite) at Sunnyvale, California, USA |
| Email: [email protected] |
|
http://bit.ly/4ey8w48
https://jobs.nvoids.com/job_details.jsp?id=1932184&uid=

From: Lokesh, Cloud think Technologies [email protected]
Reply to: [email protected]

Need candidates local to California only.

Role: Site Reliability Engineer
Location: Sunnyvale, CA (Onsite)
Position type: Contract
Project Duration: Long Term
Visa: & only

Detailed JD:

Must have: experience handling tickets for the Walmart environment; Splunk, ServiceNow, Azure. Ideally, candidates come from an enterprise background. Candidates must take a glider test before being submitted to the client.

Job Description:
As a Senior Site Reliability Engineer, you will be responsible for designing, implementing, and maintaining the infrastructure and systems necessary to support our applications and services. You will work closely with cross-functional teams to drive operational excellence, automate processes, and continuously improve system reliability. You will be a specialist on complex technical and business matters. Your expertise in cloud technologies, automation, and performance optimization will be key to the success of our engineering and operations efforts. In this role you will collaborate heavily with the team, often multitasking, and consistently drive projects to completion. The position requires a good mix of steadfast persistence, innovative thinking, the ability to interpret performance data, and good people skills. If you can do all that while having fun, even better! Local area candidates are encouraged to apply.

Key Responsibilities:

Cloud Strategy:
Provide thought leadership, mentorship, and technical vision related to site reliability, DevOps, and a cloud-first culture. Analyze and implement cloud services to meet business goals, focusing on cost optimization, efficiency, and scalability. Drive orchestration efforts for cloud services, design self-service capabilities, and stay current with emerging cloud technologies.

Infrastructure Automation and Design:
Collaborate on designing, building, and maintaining scalable infrastructure across cloud and on-prem environments. Automate provisioning and configuration using tools like Terraform, Terragrunt, and Puppet. Develop automation scripts, maintain CI/CD pipelines, and plan for scalability and capacity, conducting load testing as needed.

Reliability and Performance Engineering:
Ensure system reliability, availability, and performance through monitoring, alerting, and incident response. Implement and manage SLOs/SLIs to meet reliability standards. Identify and address performance bottlenecks across the infrastructure and application stack. Build and maintain observability solutions (e.g., monitoring, logging, and tracing) and improve system health dashboards.

Security and Compliance:
Implement security measures for cloud-native applications and ensure compliance with industry standards (SOC2, PCI, etc.). Collaborate with security teams to audit and monitor systems, continuously updating security configurations and dashboards.

Incident Management and Root Cause Analysis:
Participate in on-call rotations to provide 24/7 support for the production environment. Lead incident response activities and perform root cause analysis to prevent recurring incidents. Conduct and document post-incident retrospectives (postmortems) to drive continuous improvement. Create and maintain runbooks and operational documentation. Proactively test system resilience through Chaos Engineering experiments and failure injection.

Disaster Recovery and Business Continuity:
Design and test disaster recovery (DR) and business continuity strategies, ensuring backup and failover mechanisms are effective.

Cost Management and Financial Optimization:
Monitor cloud usage and implement financial optimization practices (FinOps) to control infrastructure costs. Collaborate with stakeholders to drive financial efficiency.

Collaboration, Knowledge Sharing, and Communication:
Collaborate across teams to ensure alignment and effective project implementation. Communicate during incidents and changes, providing transparency to stakeholders. Mentor and share knowledge with team members to foster a collaborative and continuous learning environment. Maintain comprehensive documentation of system configurations, processes, and best practices.

Qualifications:
Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent experience.
8+ years of experience as a Site Reliability Engineer or in a similar role, working with highly available production environments.
Proficiency in AWS and containerization technologies like Kubernetes and Docker.
Strong experience with Infrastructure as Code (IaC) using Terraform, with automation scripting skills in Python, Bash/, or Go.
Deep knowledge of Linux/Unix systems and networking fundamentals (e.g., TCP/IP, DNS, HTTP, VPN).
Experience with monitoring and observability tools (e.g., Datadog, Prometheus, Grafana) and incident management.
Familiarity with CI/CD pipelines, preferably using tools like GitLab, and strong knowledge of DevOps practices.
Excellent troubleshooting skills, with experience in performance optimization and root cause analysis.
Strong communication and collaboration skills.

Bonus skills: experience with Rundeck, Java, Spring Framework, Terragrunt, Puppet, Vector, Loki, VictoriaMetrics, and additional cloud platforms (e.g., P, Azure), as well as relevant certifications such as AWS Solutions Architect or Certified Kubernetes Administrator (CKA).

Looking forward to working with you!

Thanks & Regards
Lokesh Yadav
Sr. Technical Recruiter
CloudThink Tech Inc

Keywords: continuous integration continuous deployment golang card California |
| 08:30 PM 14-Nov-24 |