greytHR
Reduction in Observability Support Tickets
Leveraging Aiden AI Agent via natural language queries
Highlights
50% reduction in MTTD & MTTR
90% reduction in O11Y support tickets
65% reduction in manual effort for incident remediation
Background
greytHR is a full-suite HRMS platform designed to automate and simplify complex, recurring, and critical HR and payroll functions, ensuring compliance and security. With over 50 tools, greytHR offers ‘Hire-to-Retire’ solutions for People Operations, including advanced modules for recruiting, onboarding, engaging, paying, appraising, retaining, and retiring employees.
The platform also leverages AI-driven analytics and recommendations to enhance employee engagement throughout the entire employee lifecycle. Trusted by CFOs and loved by CHROs, greytHR serves businesses of various sizes and is adaptable across industries like manufacturing, SaaS, healthcare, hospitality, education, and retail.
As India’s leading HRMS and payroll provider, greytHR is rapidly expanding in the MEA and SEA regions, offering world-class Made-in-India software solutions to emerging markets. The company proudly serves over 30,000 clients, managing 3 million+ employees across 25+ countries.
Challenges
As greytHR scaled its services and customer base, observability became increasingly difficult to manage.
Across Clusters:
Each Kubernetes cluster had its own set of dashboards. Engineers had to switch between multiple dashboards to understand system health, making it difficult to get a unified view of the platform. This fragmentation significantly increased Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
Metrics, Logs, and Traces:
Metrics, logs, and traces existed in silos. During incidents, teams manually stitched together data from different tools to identify root causes. The lack of correlation led to longer troubleshooting cycles and delayed incident resolution.
With Service Growth:
As the number of microservices increased, so did customer-reported issues and internal alerts. The operations team spent more of its time reacting to incidents than proactively improving system reliability.
With Open-Source Tools:
While open-source observability tools worked initially, maintaining and scaling them across multiple environments became operationally expensive. Upgrades, tuning, and managing integrations added ongoing overhead to the platform team.
Reporting:
Weekly incident and anomaly reports were created manually by aggregating data from multiple sources. This process was error-prone, time-consuming, and diverted engineering effort away from higher-value work.
StackGen Solution
To overcome these challenges, greytHR adopted StackGen Observability with Aiden, StackGen’s AI-powered observability assistant, to unify monitoring, accelerate incident response, and reduce operational overhead.
Across Clusters:
StackGen provided a centralized observability layer that consolidated metrics, logs, and traces across AWS and GCP environments. Engineers could now view platform health from a single pane of glass instead of managing cluster-specific dashboards.
Correlated Metrics, Logs, and Traces:
By automatically correlating telemetry data, StackGen enabled faster root-cause analysis. Engineers could move seamlessly from a high-level metric anomaly to the exact logs and traces responsible for the issue, dramatically reducing investigation time.
Aiden for Self-Service Insights:
One of the biggest differentiators for greytHR was Aiden, StackGen’s AI-powered observability chatbot.
Before adopting Aiden, engineers had to rely on SREs to write complex LogQL, PromQL, and TraceQL queries to extract insights from observability data. This resulted in a large volume of support tickets raised just to answer questions such as:
"Why did latency spike for this service yesterday?"
"Which services were impacted during the last payroll run?"
"Show errors correlated with this deployment."
With Aiden, engineering teams can now ask these questions in natural language and instantly receive insights powered by correlated metrics, logs, and traces without needing to understand query languages.
This shift enabled true self-service observability, dramatically reducing dependency on the SRE team while empowering engineers to diagnose issues independently.
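To make the earlier dependency on SREs concrete, the sketch below shows roughly what hand-written PromQL, LogQL, and TraceQL queries for a question like "Why did latency spike for this service yesterday?" might look like. It is illustrative only: the metric name `http_request_duration_seconds_bucket`, the `service` and `app` labels, and the `payroll` service name are assumptions, not greytHR's actual schema, and the helper functions are hypothetical rather than part of StackGen or Aiden.

```python
# Hypothetical helpers that build the three kinds of queries named above.
# Metric names, label names, and the service name are assumed for illustration.


def latency_p95_promql(service: str, window: str = "5m") -> str:
    """PromQL: 95th-percentile request latency for one service."""
    return (
        "histogram_quantile(0.95, sum by (le) ("
        f'rate(http_request_duration_seconds_bucket{{service="{service}"}}[{window}])))'
    )


def error_count_logql(service: str, window: str = "5m") -> str:
    """LogQL: number of log lines containing "error" for one service."""
    return f'sum(count_over_time({{app="{service}"}} |= "error" [{window}]))'


def failed_traces_traceql(service: str) -> str:
    """TraceQL: spans from one service that ended in an error status."""
    return f'{{ resource.service.name = "{service}" && status = error }}'


if __name__ == "__main__":
    for query in (
        latency_p95_promql("payroll"),
        error_count_logql("payroll"),
        failed_traces_traceql("payroll"),
    ):
        print(query)
```

Each signal needs its own syntax and its own label conventions, and the results still have to be lined up by hand across the same time window. That is the work the case study describes being replaced by a single natural-language question.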
Observability:
By moving away from heavily customized open-source setups, greytHR reduced the operational burden of maintaining observability tooling. StackGen scaled effortlessly as new services and clusters were added, without increasing maintenance complexity.
Anomaly Reporting:
StackGen eliminated the need for manual weekly reports. Incident summaries, trends, and anomaly insights were generated automatically by Aiden, providing leadership and operations teams with consistent and reliable visibility into platform stability.
Results
Reduction in MTTD and MTTR:
With unified dashboards and correlated telemetry, greytHR reduced detection times by 45-55% and resolution times by 55-65%. Engineers could identify issues faster and resolve them with greater confidence.
Operational Efficiency:
Automation and AI-assisted troubleshooting reduced manual incident response effort by 60-70%. The platform team reclaimed 15-20 engineering hours per week, enabling them to focus on reliability improvements and feature delivery.
and Leadership:
Automated reports and AI-generated summaries reduced manual reporting effort by 85-95%, providing clear insights into system health, incident trends, and recurring problem areas without manual data collection.
Future-Ready Observability at Scale:
With StackGen and Aiden, greytHR now has a scalable, intelligent observability foundation that grows with the platform, supporting increasing service complexity without added operational burden.