Lead Site Reliability Engineer with hands-on Architect profile & strong Java, NodeJS and AWS app dev, Resiliency Architecture patterns experience, at Strong, Arkansas, USA |
Email: [email protected] |
From: Steve, 3mkllc [email protected]
Reply to: [email protected]

Hello,

Greetings for the day! Please review the role below and advise the best time to connect with you. If you are interested, you can call me at +1 801-701-7848 or reach me on linkedin.com/in/saumen-kumar-porel-97a0696b, and send resumes to [email protected]

Updated feedback from manager: We need people who have strong Java, NodeJS and AWS app dev skills, are familiar with resiliency architecture patterns, and have a good understanding of observability, performance and resiliency engineering.

Hiring: Lead Site Reliability Engineer with hands-on Architect profile & strong Java, NodeJS and AWS app dev, Resiliency Architecture patterns experience
Loc: Malvern, PA (Remote Now)
Start Date: Immediate
Contract: Long Term

*Additional Info*
Looking for lead expert-level software engineers or hands-on architect profiles.

Description
Are you an engineer who loves to solve impactful, complex operational problems? Are you passionate about finding opportunities to improve system performance and efficiency, scalability, fault tolerance, and self-healing capabilities? Are you excited about Chaos Engineering? Do you want to apply these principles and creatively experiment with our systems to discover hidden weaknesses? Are you obsessed with understanding systems' inner state, interactions between systems, or observability-driven development? If the above holds, then the Lead Site Reliability Engineer opportunity at the client is for you!

A successful candidate will likely have experience as a Full Stack Engineer who has supported their applications operationally. You will be solving reliability problems across product families and continuously seeking opportunities to improve our systems' "-ilities". You will also help define, maintain, and carry out subdivisional reliability engineering standards, contribute to enterprise-wide libraries for reliability, and train product SRE and product family SRE leads within the subdivision.

Core Responsibilities/Qualifications
Minimum of eight years of related work experience, with at least three years of development experience.
Undergraduate degree or equivalent combination of training and experience. Graduate degree preferred.
Full-stack development, JDK 8+ preferred, with Spring Boot, REST APIs, multithreaded and multiprocessing applications, and GraphQL.
Experience with UI development (familiarity with Angular, TypeScript, NodeJS, etc.) is a plus.
Ability to diagnose and resolve problems in high-throughput applications.
Experience with one or more observability frameworks or tools, such as OpenTelemetry (Java, JS, etc.), CloudWatch, Grafana, and Splunk.
Exposure to *nix environments, including some shell script development and basic command execution.
Strong understanding of database principles and working knowledge of distributed storage and infrastructure solutions.
Experience with container management and microservices architectures, such as Docker, in cloud and on-premises infrastructure.
Working knowledge of AWS network foundations, application networking, edge, and network security.
Excellent communication and documentation skills.

In this role you will:
1. Instrument, enhance, and advocate for system observability. Identify and develop solutions to bridge system observability gaps.
2. Collaborate with internal teams to evaluate the health, stability, and reliability of systems and platforms. Look for opportunities to improve system performance, efficiency, and resiliency.
3. Develop and communicate new standards and newly available tools and frameworks across subdivisions. Enforce reliability standards. Design and develop new automated solutions for reliability.
4. Provide technical leadership, consultancy, and coaching on designing and implementing both traditional and serverless architectures in AWS, with an emphasis on repeatability, scaling options, resilience, reliability, telemetry, networking, etc., including design patterns for resilient systems.
5. Lead failure modes analysis spanning product families when new features and architecture patterns are introduced. Facilitate post-incident reviews for any high-severity, client-impacting events local to the product family.
6. Lead cross-product or cross-subdivision chaos experimentation.
7. Design, review, and coach others on performance tests using appropriate components (e.g., requests per minute, number of threads, the construction of a request with headers and cookies).
8. Consult on, review, coach on, and influence architectural decisions, including non-functional aspects, proposing potential technical solutions and enhancements and explaining convincingly which is better and why.
9. Contribute to or lead Reliability Engineering and Resilience communities of practice. Remain informed about site reliability engineering activities happening within the subdivision.
10. Work with product owners to set subdivision goals for higher availability and SRE impact, and track progress toward achieving them.
11. Provide technical leadership, guidance, consulting, training, and governance on SRE to one or more product families in a subdivision.
12. Identify opportunities to automate away toil and develop solutions, monitor error budget exhaustion rates, configure auto scaling thresholds for the product, and incorporate resilience patterns, such as circuit breakers, into the application code. Develop complex deployment and/or routing strategies for high availability.
13. Maintain, and look for opportunities to improve, the centralized incident response playbook for the subdivision, which documents standards for managing communication and escalation during an incident.
14. Oversee blameless post-incident reviews for high-severity incidents involving multiple product families. |
[email protected] |
Tue Oct 18 00:00:00 UTC 2022 |