LLM Evaluation: Model-Based, Labeling & User Feedback
Evaluation is a critical aspect of developing and deploying LLM applications. Teams typically combine several evaluation methods to score the performance of their AI application, depending on the use case and the stage of the development process.
Why are LLM Evals Important?
LLM evaluation is crucial for improving the accuracy and robustness of language models, ultimately enhancing the user experience and trust in your AI application. It helps detect hallucinations and measure performance across diverse tasks. A structured evaluation in production is vital for continuously improving your application.
Langfuse provides a flexible scoring system to capture all your evaluations in one place, make them actionable, and plot the results in the Langfuse Dashboard.
Evaluation Methods
1. Model-based Evaluation (LLM-as-a-Judge)
Model-based evaluations (LLM-as-a-judge) are a powerful tool to automatically assess LLM applications integrated with Langfuse. With this approach, an LLM scores a particular session, trace, or LLM call in Langfuse based on criteria such as accuracy, toxicity, or hallucinations.
There are two ways to run model-based evaluations in Langfuse: via evaluators configured directly in the Langfuse UI, or via an external evaluation pipeline that writes scores back through the SDKs/API.
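As an illustration of the second option, here is a minimal sketch of an external evaluation pipeline: it asks a judge model to grade a single question/answer pair for groundedness and writes the result back to the trace as a score. It assumes the Python SDK's `langfuse.score()` method (SDK v2; newer versions expose `create_score`), an OpenAI judge model, and a hypothetical `evaluate_trace` helper.

```python
from langfuse import Langfuse
from openai import OpenAI

langfuse = Langfuse()  # reads LANGFUSE_* environment variables
judge = OpenAI()       # reads OPENAI_API_KEY

def evaluate_trace(trace_id: str, question: str, answer: str) -> None:
    # Ask a judge model for a 0-1 groundedness rating of the answer.
    response = judge.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable model works
        messages=[{
            "role": "user",
            "content": (
                "Rate from 0 (fully hallucinated) to 1 (fully grounded) how well "
                f"the answer addresses the question.\nQuestion: {question}\n"
                f"Answer: {answer}\nReply with only the number."
            ),
        }],
    )
    # Note: a production pipeline would validate/parse the judge output more robustly.
    value = float(response.choices[0].message.content.strip())

    # Attach the judge's verdict to the trace as a numeric score.
    # (`langfuse.score` is the SDK v2 method name; adapt to your SDK version.)
    langfuse.score(
        trace_id=trace_id,
        name="groundedness",
        value=value,
        comment="LLM-as-a-judge, external evaluation pipeline",
    )
```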
2. Manual Annotation / Data Labeling (in UI)
With manual annotations, you can annotate a subset of traces and observations by hand. This allows you to collaborate with your team and add scores via the Langfuse UI. Annotations can be used to establish a baseline for your evaluation metrics and to compare them with automated evaluations.
3. User Feedback
User feedback captured in your AI application can be a valuable evaluation signal. You can add explicit (e.g., thumbs up/down, 1-5 star rating) or implicit (e.g., time spent on a page, click-through rate, accepting/rejecting a model-generated output, human-in-the-loop) user feedback to your LLM traces in Langfuse.
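For example, explicit thumbs up/down feedback can be attached to the underlying trace as a binary score. The sketch below assumes the Python SDK v2 `langfuse.score()` method and a hypothetical `record_thumb_feedback` helper wired to your application's UI.

```python
from langfuse import Langfuse

langfuse = Langfuse()

def record_thumb_feedback(trace_id: str, thumbs_up: bool) -> None:
    # Map explicit thumbs up/down from the UI to a 0/1 score on the trace.
    langfuse.score(
        trace_id=trace_id,
        name="user-feedback",
        value=1 if thumbs_up else 0,
        comment="explicit thumbs up/down",
    )
```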
4. Custom Evaluation via SDKs/API
Langfuse gives you full flexibility to ingest custom scores via the Langfuse SDKs or API. The scoring workflow allows you to run custom quality checks (e.g. valid structured output format) on the output of your workflows at runtime, or to run custom external evaluation workflows.
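As a concrete example of a runtime quality check, the sketch below verifies that a model output is valid JSON and ingests the result as a score. It uses the Python SDK v2 `langfuse.score()` method; the check name `valid-json-output` and the `check_structured_output` helper are illustrative, not part of the library.

```python
import json
from langfuse import Langfuse

langfuse = Langfuse()

def check_structured_output(trace_id: str, raw_output: str) -> None:
    # Runtime quality check: does the model output parse as valid JSON?
    try:
        json.loads(raw_output)
        valid = 1
    except json.JSONDecodeError:
        valid = 0

    # Ingest the custom check result as a score on the trace.
    langfuse.score(
        trace_id=trace_id,
        name="valid-json-output",
        value=valid,
    )
```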
Getting Started
Learn how to configure and use scores in Langfuse to assess quality, accuracy, style, and security metrics in your LLM applications.