
Grafana SRE Architect at Remote, Remote, USA
Email: ryan@nityainc.com
https://jobs.nvoids.com/job_details.jsp?id=2204813&uid=
From: Nitya, Nitya software solutions (ryan@nityainc.com)

Reply to: ryan@nityainc.com

Grafana SRE Architect

Location: Basking Ridge, NJ (Onsite)

C2C

Job Summary

The Grafana SRE Architect will lead the design, implementation, and management of scalable, reliable, and performant Grafana-based observability solutions. This role bridges Site Reliability Engineering (SRE) practices with Grafana's ecosystem (Loki, Mimir, Tempo, etc.) to ensure robust monitoring, logging, tracing, and alerting for mission-critical systems. You will collaborate with DevOps, engineering, and infrastructure teams to align technical strategies with business objectives, driving automation, resilience, and cost efficiency across cloud and on-premises environments.

Key Responsibilities

Architecture & Design

Design end-to-end Grafana solutions for metrics, logs, traces, and dashboards, ensuring scalability, security, and compliance.

Architect integrations with Prometheus, Loki, Mimir, Tempo, and third-party tools (e.g., AWS CloudWatch, Datadog).

Define best practices for Grafana deployment (self-managed vs. Grafana Cloud) and optimize data storage/retention strategies.

SRE Leadership

Implement SRE principles: SLAs/SLOs/SLIs, error budgets, and blameless post-mortems.

Build automated monitoring/alerting systems to preemptively identify system bottlenecks and failures.

Lead incident response, root cause analysis, and remediation for observability-related outages.

Collaboration & Integration

Partner with DevOps teams to embed Grafana into CI/CD pipelines and automate provisioning via IaC (Terraform, Ansible).

Work with developers to instrument applications for observability (OpenTelemetry, custom exporters).

Advise stakeholders on cost-effective monitoring strategies and resource optimization.

Performance Optimization

Tune Grafana dashboards, queries, and data sources for high-performance environments.

Optimize PromQL and LogQL (Loki) queries and manage large-scale time-series databases (Mimir).

Conduct capacity planning and disaster recovery testing for Grafana ecosystems.

Governance & Security

Ensure compliance with security policies (RBAC, SSO, encryption) and audit requirements.

Monitor Grafana stack health, perform upgrades, and enforce version control.

Mentorship & Innovation

Mentor SRE/engineering teams on Grafana best practices and SRE culture.

Stay ahead of Grafana and observability trends and pilot new tools (e.g., AI-driven anomaly detection).

Education & Experience

Bachelor's/Master's degree in Computer Science, Engineering, or a related field.

10+ years in SRE/DevOps roles, with 5+ years hands-on Grafana experience.

Proven track record in designing large-scale observability solutions.

Experience managing offshore teams.

Willingness to work hours that overlap with offshore teams.

Technical Skills

Expertise in Grafana: Dashboards, plugins, alerting, and integrations (Prometheus, Loki, Mimir, Tempo).

Cloud Platforms: AWS/GCP/Azure, Kubernetes, and serverless architectures.

Automation: Terraform, Ansible, Python/Go scripting.

Monitoring Tools: Thanos, Cortex, Jaeger, OpenTelemetry.

Database Optimization: Time-series data (Mimir), log management (Loki).

Certifications (Preferred)

Grafana Certified: Observability Engineer/Administrator.

AWS/GCP/Azure Architect or DevOps certifications.

Soft Skills

Leadership in cross-functional teams and crisis management.

Strong communication for technical and non-technical audiences.

Analytical problem-solving and strategic thinking.

Preferred Qualifications

Contributions to Grafana/Prometheus open-source projects.

Experience with AI/ML model monitoring.

Knowledge of regulatory frameworks (GDPR, HIPAA).

10:25 PM 25-Feb-25


