Gartner predicts that over 40% of agentic AI projects will be cancelled by the end of 2027 due to escalating costs, unclear business value, or inadequate risk controls.
According to McKinsey, many organizations are ramping up their efforts to mitigate gen-AI-related risks. Respondents say their organizations are actively managing risks related to inaccuracy, cybersecurity, and intellectual property infringement - three of the gen-AI-related risks that respondents most commonly say have caused negative consequences for their organizations.
As a Product Management leader, one of my first priorities is to identify use cases that drive real business value. Once the team begins implementation, I advocate for an Evaluation-Driven Development (EDD) approach right from the start. Rigorous evaluation from the initial phases is crucial to building enterprise-grade, world-class AI applications that deliver meaningful results and earn user trust. This approach helps us develop and iterate faster while reducing cost and risk.
1. The Stakes: Why AI Quality Evaluation Matters
Deploying an LLM-based application is not just a technical milestone—it’s a business-critical decision. Poorly evaluated AI can:
Damage Brand Reputation: Inaccurate, biased, or unsafe outputs can erode user trust.
Cause Regulatory & Legal Issues: Mishandling data or producing harmful content can lead to compliance violations.
Waste Resources: Shipping a subpar AI app leads to rework, increased support costs, and lost opportunities.
On the other hand, robust evaluation ensures your AI is an asset, not a liability.
2. Unique Challenges in Evaluating LLM Applications
LLMs are not like traditional software. Key challenges include:
Non-determinism: LLMs can generate different outputs for the same input.
Hallucination: LLMs can “hallucinate,” producing plausible but false or misleading content.
Open-endedness: There may be many valid responses, making evaluation less clear-cut.
Context Sensitivity: Outputs can vary depending on context, prompt wording, and prior conversation.
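To make the first of these challenges concrete, the short sketch below sends the same prompt to a model twice; with a non-zero sampling temperature the two completions will often differ. It assumes the openai Python package, an OPENAI_API_KEY environment variable, and an illustrative model name.

```python
# Minimal sketch: the same prompt can yield different outputs on repeated calls.
# Assumes the openai package and an OPENAI_API_KEY; the model name is illustrative.
from openai import OpenAI

client = OpenAI()
prompt = "Summarize the benefits of retrieval augmented generation in one sentence."

for attempt in range(2):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,  # non-zero temperature makes sampling non-deterministic
    )
    print(f"Attempt {attempt + 1}: {response.choices[0].message.content}")
```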
3. Understanding LLM-Powered Applications: RAG and Agentic AI
Of the various types of LLM-powered applications, two are very popular—and are personal favourites of mine:
a) Retrieval Augmented Generation (RAG) Applications
Enterprises are choosing Retrieval Augmented Generation (RAG) for 30-60% of their use cases. RAG comes into play whenever the use case demands high accuracy, transparency, and reliable outputs — particularly when the enterprise wants to use its own or custom data. Its ability to reduce hallucinations, provide explainability and transparency, and preserve the security and privacy of enterprise data has made RAG an emerging standard.
Now, a more advanced version—Agentic RAG—is gaining attention. Unlike traditional RAG, it uses intelligent agents to make decisions, learn from interactions, and provide more accurate answers.
b) Agentic AI Applications
Meanwhile, agentic applications—where AI agents autonomously perform tasks and interact with other systems—are gaining significant traction. According to Capgemini research, only 10% of organizations currently employ AI agents, but a large majority (82%) intend to integrate them within 1–3 years. Enterprises are quickly adapting and weaving these AI agents into their workflows to drive efficiency and innovation. RAG can be one component of an agentic AI system.
Given the growing importance of both RAG and agentic applications in enterprise settings, this blog focuses on the evaluation of NLP (Natural Language Processing) applications built with RAG and Agentic AI.
4. Defining Quality in RAG and Agentic AI Applications
Metrics for RAG applications
a) Context Relevance - The initial stage in a RAG application is retrieval. To ensure effective retrieval, it’s important to verify that every context chunk is relevant to the input query. A good context relevance score means the RAG system is starting with relevant information, which is a prerequisite for generating high-quality, accurate, and helpful responses.
b) Faithfulness - Faithfulness means that the answer must be grounded in the provided context. This is essential to prevent hallucinations and to ensure that the retrieved context can support or justify the generated response. RAG systems are frequently used in scenarios where it is critical for the generated text to remain factually consistent with the original sources, such as the legal and medical domains.
c) Answer Relevance - The generated answer should address the actual question that was asked. A low answer relevance score can indicate problems with the RAG system's generation component, such as generating irrelevant information or failing to address the core of the question.
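To make these three metrics concrete, here is a minimal, hand-rolled sketch that scores them with simple “LLM as a judge” prompts (a pattern covered in section 5). This is not any particular framework’s API; open-source libraries such as RAGAS ship production-grade versions of these metrics. The sketch assumes the openai package, an OPENAI_API_KEY, and an illustrative model name.

```python
# Minimal, hand-rolled sketch of judge-based RAG metrics (not a specific framework's API).
# Assumes the openai package, an OPENAI_API_KEY, and an illustrative model name.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # illustrative

def judge_score(instruction: str) -> float:
    """Ask the judge model for a 0-1 score; parsing is deliberately naive for illustration."""
    reply = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": instruction + "\nAnswer with a number between 0 and 1 only."}],
        temperature=0,
    )
    return float(reply.choices[0].message.content.strip())

def context_relevance(question: str, contexts: list[str]) -> float:
    """Average relevance of each retrieved chunk to the question."""
    scores = [
        judge_score(f"How relevant is this context to the question?\nQuestion: {question}\nContext: {c}")
        for c in contexts
    ]
    return sum(scores) / len(scores)

def faithfulness(answer: str, contexts: list[str]) -> float:
    """Is every claim in the answer supported by the retrieved context?"""
    joined = "\n".join(contexts)
    return judge_score(
        f"Rate how well every claim in the answer is supported by the context.\nContext: {joined}\nAnswer: {answer}"
    )

def answer_relevance(question: str, answer: str) -> float:
    """Does the answer address the question that was actually asked?"""
    return judge_score(
        f"Rate how directly this answer addresses the question.\nQuestion: {question}\nAnswer: {answer}"
    )
```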
Metrics for Agentic AI applications
a) Agent Planning
Evaluates the validity of the agent's plan and the feasibility of its tasks. Agentic AI emphasizes autonomy and goal-directed behaviour and relies heavily on an agent's ability to plan and execute complex tasks, so assessing how well an agent plans is essential to judging its overall effectiveness and capabilities.
b) Tool Call Accuracy
Measures how accurately the AI system calls tools and handles their responses. Is the agent using tools correctly?
c) Intent Resolution
Evaluates the system’s ability to understand and resolve user intentions. Did the agent understand the user’s goal?
d) Task Adherence
Checks how closely the system follows the defined tasks and expected outcomes.
e) Final Response Evaluation
Evaluates the final output of the agent (whether or not the agent achieved its goal).
By focusing on these metrics, we can pave the way for safer, more reliable, and more effective AI systems.
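As a concrete example of one of these metrics, tool call accuracy can be approximated by comparing the tool calls recorded in an agent’s execution trace against the calls an expert would expect for the task. The sketch below uses a hypothetical trace format; real agent frameworks expose traces in their own ways.

```python
# Sketch: tool call accuracy as the fraction of expected tool calls the agent actually made
# with the right arguments. The trace format here is hypothetical.

def tool_call_accuracy(expected_calls: list[dict], actual_calls: list[dict]) -> float:
    """Both lists contain records shaped like {"tool": str, "args": dict}."""
    if not expected_calls:
        return 1.0
    matched = 0
    remaining = list(actual_calls)
    for expected in expected_calls:
        for call in remaining:
            if call["tool"] == expected["tool"] and call["args"] == expected["args"]:
                matched += 1
                remaining.remove(call)
                break
    return matched / len(expected_calls)

# Example: the agent should have looked up the order before issuing a refund.
expected = [
    {"tool": "lookup_order", "args": {"order_id": "A123"}},
    {"tool": "issue_refund", "args": {"order_id": "A123", "amount": 25.0}},
]
actual = [
    {"tool": "lookup_order", "args": {"order_id": "A123"}},
    {"tool": "send_email", "args": {"to": "customer@example.com"}},
]
print(tool_call_accuracy(expected, actual))  # 0.5: only the lookup matched
```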
5. Evaluation Strategies: Making Your AI Assessments Scalable and Meaningful
When evaluating LLM-powered applications, it’s helpful to consider your strategies along two key dimensions: Scalability and Meaningfulness. Systematic evaluation helps to track and improve crucial factors such as relevance and task accuracy.
a) Golden Datasets
Evaluating an LLM application against golden datasets (high-quality, annotated data) ensures that the application performs well in well-understood scenarios. At the earliest stages of development, start with ground-truth evaluations. These are typically carried out by domain experts and/or developers who are deeply familiar with the core use cases your app is designed to address. While this approach provides rich, in-depth insights into performance, it doesn’t scale well as your application grows.
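In practice, a golden dataset can be as simple as a version-controlled JSONL file of questions paired with expert-approved facts the answer must contain. The sketch below assumes that file format and a hypothetical ask_app() wrapper around your application; the pass criterion (required keywords present) is deliberately simple.

```python
# Sketch: running an application against a small golden dataset.
# golden_set.jsonl holds one JSON object per line, e.g.:
#   {"question": "What is our refund window?", "required_facts": ["30 days"]}
# ask_app() is a hypothetical wrapper around your RAG/agent application.
import json

def ask_app(question: str) -> str:
    raise NotImplementedError("call your application here")

def run_golden_eval(path: str = "golden_set.jsonl") -> float:
    passed, total = 0, 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            answer = ask_app(case["question"]).lower()
            # Pass if every expert-annotated fact appears in the answer.
            if all(fact.lower() in answer for fact in case["required_facts"]):
                passed += 1
            total += 1
    return passed / total if total else 0.0
```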
b) “LLM as a Judge” Evaluations
LLMs themselves can be powerful tools for evaluating AI applications. By prompting an LLM to review and score outputs, you can receive feedback that frequently aligns closely with human judgment. This method, often referred to as “LLM as a judge,” allows one LLM to evaluate another’s outputs and provide clear explanations—making it a scalable solution, especially when human-labeled data is scarce or expensive.
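One practical refinement is to ask the judge model for a structured verdict, a score plus a short explanation, so failures are easy to triage. The sketch below assumes the openai package and an illustrative model name; the rubric is only an example.

```python
# Sketch: LLM-as-a-judge returning a structured verdict (score + explanation).
# Assumes the openai package and an illustrative model name; the rubric is an example.
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score the RESPONSE to the QUESTION from 1 (unusable) to 5 (excellent), "
    "considering correctness, completeness, and tone. "
    'Reply with JSON: {"score": <int>, "explanation": "<one sentence>"}'
)

def judge(question: str, response: str) -> dict:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"QUESTION: {question}\nRESPONSE: {response}"},
        ],
        temperature=0,
        response_format={"type": "json_object"},  # request strict JSON output
    )
    return json.loads(reply.choices[0].message.content)
```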
c) Small Language Model Evaluations
Smaller models are cost-effective to run at scale and can provide nuanced, domain-specific feedback—especially when fine-tuned for your particular application.
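A common pattern is to use a small natural language inference (NLI) model as a cheap faithfulness checker: treat the retrieved context as the premise and the generated answer as the hypothesis, and flag answers the model does not entail. The sketch below assumes the Hugging Face transformers library and a public MNLI checkpoint; a model fine-tuned on your own domain would usually do better.

```python
# Sketch: using a small NLI model as an inexpensive faithfulness check.
# Assumes the transformers library; the model name is a public MNLI checkpoint,
# but a domain-fine-tuned model would usually be a better fit.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def is_supported(context: str, answer: str, threshold: float = 0.7) -> bool:
    """True if the context entails the answer with enough confidence."""
    result = nli({"text": context, "text_pair": answer})
    if isinstance(result, list):  # some versions return a one-element list
        result = result[0]
    return result["label"] == "ENTAILMENT" and result["score"] >= threshold
```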
d) Code-Based Evaluations
Another cost-effective option is code-based evaluation. Here, you use automated scripts or tests to assess your LLM’s output against key criteria: correct format, inclusion of required data, and passing structured, automated checks. Since this approach doesn’t require additional token usage or introduce latency, it’s a practical choice for ongoing quality assurance.
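For example, if your application is expected to return machine-readable JSON, a handful of plain assertions can catch a large class of failures. The field names and limits in the sketch below are illustrative.

```python
# Sketch: deterministic, code-based checks on an LLM output that should be JSON.
# Field names and limits are illustrative; adapt them to your output contract.
import json

REQUIRED_FIELDS = {"summary", "sentiment", "ticket_id"}
ALLOWED_SENTIMENTS = {"positive", "neutral", "negative"}

def check_output(raw_output: str) -> list[str]:
    """Return a list of failed checks (an empty list means the output passed)."""
    failures = []
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        failures.append(f"missing fields: {sorted(missing)}")
    if data.get("sentiment") not in ALLOWED_SENTIMENTS:
        failures.append("sentiment is not one of the allowed values")
    if len(data.get("summary", "")) > 500:
        failures.append("summary exceeds 500 characters")
    return failures
```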
e) User Feedback Evaluations
Once you’ve established a baseline of quality through expert review, gathering feedback from real users becomes essential. This often takes the form of simple, binary feedback (like thumbs up/down) on outputs. User feedback offers greater scalability than ground-truth evaluations, but it can be inconsistent and costly to collect at scale.
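Even simple thumbs up/down signals become actionable once they are logged alongside the prompt or model version that produced each response, so a regression shows up as a drop in approval rate. The record shape in the sketch below is hypothetical.

```python
# Sketch: aggregating thumbs up/down feedback per prompt version.
# The record shape is hypothetical; in practice these rows come from your feedback store.
from collections import defaultdict

feedback = [
    {"prompt_version": "v3", "thumbs_up": True},
    {"prompt_version": "v3", "thumbs_up": False},
    {"prompt_version": "v4", "thumbs_up": True},
]

def approval_rate_by_version(records: list[dict]) -> dict[str, float]:
    ups, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["prompt_version"]] += 1
        ups[r["prompt_version"]] += int(r["thumbs_up"])
    return {v: ups[v] / totals[v] for v in totals}

print(approval_rate_by_version(feedback))  # {'v3': 0.5, 'v4': 1.0}
```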
By combining these evaluation strategies, you can confidently measure and enhance the quality of your LLM-powered applications—moving beyond simple output checks to robust, scalable, and meaningful assessments at every stage of development.
Conclusion
The leap from promising prototype to production-ready LLM application is not trivial. Unlike traditional software, AI agents bring new risks, new failure modes, and new expectations. Rigorous evaluation—across correctness, safety, robustness, and usability—is the only way to earn user trust and deliver business value.
The future belongs to those who can harness the power of LLMs responsibly. Make your AI an asset, not a liability—evaluate thoroughly, and ship with confidence.