| Urgent: MONITORING and ALERTING ENGINEER, Skype + F2F at Fort Worth, Texas, USA |
| Email: [email protected] |
|
http://bit.ly/4ey8w48
https://jobs.nvoids.com/job_details.jsp?id=2085783&uid=
From: Ramashankar, vyzeinc [email protected]
Reply to: [email protected]

Job Description

Recruiter's Note: Reference #755484

Intake call notes
- Need Dynatrace and CloudWatch with AWS specifically; these are the most critical skills.
- Need a monitoring/alerting engineer.
- Does not need a Dynatrace user, but rather someone who can work with the application team (more of an engineering role).
- Identifying threshold issues, memory issues, user issues, etc.
- Change management and problem management experience are necessary.
- This is a forward-facing role.
- An application performance management engineer would be ideal.
- A minimum of 4-5 years of experience would be perfect (but open to less experience for the right candidate).
- After-hours/weekend work: the schedule may require Sat or Sun periodically, but for the most part is M-F, 7:45/8 to 4:45/5 or so.
- DDU and DEM license familiarity required.
- Needs to know what session replay is.
- A Zoom virtual interview will be required, then an onsite in-person interview will be required as well.

Title: MONITORING and ALERTING ENGINEER
Location: Fort Worth, TX, Hybrid (4 days onsite)
Duration: 6+ months
MOI: 1 Zoom, final onsite or candidates only
Reason for opening: backfill
- 4 days/wk onsite required.
- Some weekend availability may be required from time to time.
- 7 day/24 hour operation environment.
- After-hours support flexibility is REQUIRED.

MONITORING and ALERTING ENGINEER
Manager wants:
- Dynatrace (not just a user)
- CloudWatch
- Datadog or Splunk
- ITIL: change management or incident management experience is desirable

A Monitoring and Alerting Engineer is a specialized IT professional responsible for the design, implementation, and management of monitoring and alerting systems for an organization's IT infrastructure. Their primary goal is to ensure the continuous availability, reliability, and performance of critical systems and applications.
By leveraging various monitoring tools and technologies, they proactively identify and address potential issues before they impact business operations.

Key Responsibilities
- System Monitoring: Implement and maintain monitoring solutions to track the performance, health, and availability of IT systems, applications, and networks.
- Alert Management: Configure and manage alert mechanisms to ensure timely notifications of any anomalies, failures, or performance degradations.
- Incident Response: Collaborate with support and operations teams to analyze, resolve, and lead event resolution processes during incidents and outages.
- Root Cause Analysis: Conduct thorough investigations to determine the root cause of incidents and implement corrective actions to prevent recurrence.
- Optimization: Identify opportunities for system optimization and performance improvements through data analysis and trend identification.
- Tool Evaluation and Integration: Evaluate, recommend, and integrate new monitoring and alerting tools and technologies to enhance the organization's monitoring capabilities.
- Documentation and Reporting: Develop and maintain comprehensive documentation, including monitoring configurations, incident reports, and performance metrics.
- Collaboration and Communication: Work closely with various IT teams, including application, infrastructure, and DevOps teams, to ensure seamless operations and effective communication during incidents.

Skills and Qualifications
- Proficiency in monitoring and alerting tools (e.g., Dynatrace, Datadog, CloudWatch, Splunk).
- Strong understanding of IT infrastructure, including servers, networks, databases, and cloud environments.
- Some experience with incident, problem, and change management processes a plus.
- Ability to analyze complex systems and identify performance bottlenecks.
- Excellent troubleshooting and problem-solving skills.
- Effective communication and collaboration skills.
- Familiarity with ITIL best practices and service management frameworks.

Performance of Duties
- Operate in a 7-day/24-hour environment with after-hours support flexibility.
- Collaborate with internal teams and suppliers to resolve and lead event resolution across all mission-critical IT and Telecom service levels.
- Protect business system availability through integrated incident, problem, and change management.
- Monitor systems for faults and optimization opportunities.
- Assist the major incident response team and escalate critical events.
- Evaluate and improve monitoring/alerting tools and processes.
- Conduct technical root cause analysis and engage with management teams for internal issues.
- Identify potential business-impacting events and manage incident processes.
- Provide expert guidance during reviews and debriefs.
- Analyze problem trends and monitor tools to identify chronic activity.
- Communicate effectively with senior management.

Qualifications
- Experience with Dynatrace, AppMon, Zabbix, SCOM, Datadog, CloudWatch, X-Ray, and Splunk.
- Self-motivated and able to work in a 7x24 environment.
- Experience managing critical system outages and interacting at all organizational levels.
- On-call support availability.

Preferred Qualifications
- B.S. degree in Computer Science, Information Systems, or Engineering.
- Technical expertise in distributed systems/administration and general scripting/programming (Python, Node.js, Ruby, Perl, Bash/sh).
- Excellent writing and communication skills.
- ServiceNow experience.

Keywords: javascript information technology card Texas |
| 08:09 PM 16-Jan-25 |