DeepEval

Python framework for unit testing and evaluating LLM applications with metrics like G-Eval.

Open Source, Self Hosted, Offline Capable

About

DeepEval brings a Pytest-style workflow to testing LLM applications. The open-source Python framework from Confident AI provides more than 30 evaluation metrics, including G-Eval, DAG, answer relevancy, faithfulness, contextual precision, hallucination, bias, and agent-focused checks such as task completion and tool correctness, with evaluation running locally through LLM-as-a-judge and NLP models. It covers end-to-end and component-level testing, multi-turn conversation metrics, multimodal evaluation, synthetic test dataset generation, and tracing for full observability of a pipeline. Integrations span OpenAI, Anthropic, LangChain, LangGraph, LlamaIndex, CrewAI, and Pydantic AI, so it slots into most RAG pipelines, chatbots, and agent stacks. The framework requires Python 3.9 or later and is released under the Apache 2.0 license, with an optional Confident AI cloud platform for dataset management and reporting. AI engineers use it to catch regressions before shipping prompt or model changes.
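
To make the Pytest-style workflow concrete, here is a minimal, standard-library-only sketch of the underlying pattern: score an output, compare the score to a threshold, and fail the test when it falls short. It does not use DeepEval's API. `EvalCase`, `token_overlap_score`, and `assert_passes` are hypothetical names invented for illustration, and the token-overlap score is a toy stand-in for the LLM-as-a-judge metrics the framework provides.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One input/output pair to evaluate (hypothetical type, not DeepEval's)."""
    question: str
    actual_output: str
    expected_output: str


def token_overlap_score(case: EvalCase) -> float:
    """Toy stand-in for an LLM judge: fraction of expected tokens found in the output."""
    expected = set(case.expected_output.lower().split())
    actual = set(case.actual_output.lower().split())
    if not expected:
        return 0.0
    return len(expected & actual) / len(expected)


def assert_passes(case: EvalCase, threshold: float = 0.5) -> float:
    """Fail like a unit test when the score falls below the threshold."""
    score = token_overlap_score(case)
    assert score >= threshold, f"score {score:.2f} below threshold {threshold:.2f}"
    return score


def test_refund_answer():
    # A pytest-collectable test: a prompt or model change that degrades
    # the answer would push the score under the threshold and fail the run.
    case = EvalCase(
        question="What is the refund window?",
        actual_output="Refunds are accepted within 30 days of purchase",
        expected_output="refunds within 30 days",
    )
    assert_passes(case, threshold=0.75)
```

Saved as a `test_*.py` file, this would be collected by plain `pytest`. With DeepEval itself, the scoring function would be replaced by one of the framework's metrics; the project documentation covers the actual classes and the test runner.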