Langfuse
Langfuse is an open-source LLM engineering platform (GitHub) that helps teams collaboratively debug, analyze, and iterate on their LLM applications. All platform features are natively integrated to accelerate the development workflow.
Why Langfuse?
- Most used open-source LLMOps platform (blog post)
- Model and framework agnostic
- Built for production
- Incrementally adoptable: start with one feature and expand to the full platform over time
- API-first: all features are available via API for custom integrations
- Optionally, Langfuse can be easily self-hosted
Learn more about why teams choose Langfuse.
Challenges of building LLM applications
When implementing popular LLM use cases – such as retrieval-augmented generation, agents using internal tools & APIs, or background extraction/classification jobs – developers face a set of challenges that differs from traditional software engineering:
Tracing & Control Flow: Many valuable LLM apps rely on complex, repeated, chained or agentic calls to a foundation model. This makes debugging these applications hard as it is difficult to pinpoint the root cause of an issue in an extended control flow.
With Langfuse, it is simple to capture the full context of an LLM application. Our client SDKs and integrations are model and framework agnostic and able to capture the full context of an execution. Users commonly track LLM inference, embedding retrieval, API usage and any other interaction with internal systems that helps pinpoint problems. Users of frameworks such as LangChain benefit from automated instrumentation, otherwise the SDKs offer an ergonomic way to define the steps to be tracked by Langfuse.
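For illustration, here is a minimal sketch of decorator-based instrumentation with the Python SDK (assuming the v2-style `@observe` decorator from `langfuse.decorators`; import paths and names differ between SDK versions, and `call_llm` and the retrieval step are hypothetical placeholders):

```python
from langfuse.decorators import observe

@observe(as_type="generation")  # tracked as an LLM generation within the trace
def call_llm(prompt: str) -> str:
    # hypothetical placeholder for a model call (OpenAI, Anthropic, a local model, ...)
    return f"Answer based on: {prompt[:60]}"

@observe()  # nested functions become spans within the trace
def retrieve_context(question: str) -> str:
    # placeholder for an embedding lookup or vector-store query
    return "Langfuse is an open-source LLM engineering platform."

@observe()  # the outermost decorated function becomes the trace
def answer(question: str) -> str:
    context = retrieve_context(question)
    return call_llm(f"Context: {context}\n\nQuestion: {question}")

answer("What is Langfuse?")
```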
Output quality: In traditional software engineering, developers are used to testing for the absence of exceptions and compliance with test cases. LLM-based applications are non-deterministic, and there rarely is a hard-and-fast standard to assess quality. Understanding the quality of an application, especially at scale, and what ‘good’ evaluation looks like is a central challenge. This problem is compounded by changes to hosted models that are outside of the user’s control.
With Langfuse, users can attach scores to production traces (or even to individual sub-steps of them) to move closer to measuring quality. Depending on the use case, these scores can be based on model-based evaluations, user feedback, manual labeling, or other signals such as implicit user behavior. These metrics can then be used to monitor quality over time, per user, and across versions/releases of the application to understand the impact of changes deployed to production.
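For example, a thumbs-up/down from your UI can be recorded as a score on the corresponding trace. A minimal sketch using the Python SDK (assuming the v2 low-level client and its score method; names may differ in other versions, and the trace id is a hypothetical value captured at request time):

```python
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from the environment

# Attach a user-feedback score to an existing trace (or to a single observation within it)
langfuse.score(
    trace_id="trace-id-captured-at-request-time",  # hypothetical id
    name="user-feedback",
    value=1,  # e.g. thumbs-up = 1, thumbs-down = 0
    comment="Answer was accurate and concise",
)
```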
Mixed intent: Many LLM apps do not tightly constrain user input. Conversational and agentic applications often contend with wildly varying inputs and user intent. This poses a challenge: teams build and test their app with their own mental model, but real-world users often have different goals, leading to many surprising and unexpected results.
With Langfuse, users can classify inputs as part of their application and ingest this additional context to later analyze their users’ behavior in depth.
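As a sketch, the classification result can be attached to the current trace as tags or metadata (assuming the v2 decorator SDK's langfuse_context helper; the classifier below is a hypothetical stand-in):

```python
from langfuse.decorators import observe, langfuse_context

def classify_intent(user_input: str) -> str:
    # hypothetical lightweight classifier; in practice this could be a model call
    return "billing" if "invoice" in user_input.lower() else "general"

@observe()
def handle_message(user_input: str) -> str:
    intent = classify_intent(user_input)
    # attach the classification so traces can later be filtered and analyzed by intent
    langfuse_context.update_current_trace(
        tags=[f"intent:{intent}"],
        metadata={"intent": intent},
    )
    return f"[{intent}] placeholder response"

handle_message("Where can I find my invoice?")
```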
Langfuse features along the development lifecycle
Langfuse is a set of tools that helps accelerate the development workflow for LLM applications. You can use the features individually or collectively to build, test, and iterate on your applications.
Simplified lifecycle from PoC to production:
Find an overview of all Langfuse features below. For details, refer to the individual documentation pages.
LLM Application Observability
Instrument your app and start ingesting traces to Langfuse to track LLM calls and other relevant logic such as retrieval, embedding, or agent actions. Inspect and debug complex logs and user sessions.
Learn more about tracing in Langfuse or play with the interactive demo.
Key benefits
- Full context: Capture the complete execution flow including API calls, context, prompts, parallelism and more
- Conversation/session view: In multi-turn conversations, group interactions into sessions
- User tracking: Add your own identifiers to inspect traces from specific users
- Cost tracking: Monitor model usage and costs across your application
- Quality insights: Collect user feedback and identify low-quality outputs
- Low overhead: Designed for production with minimal performance impact
- Best-in-class SDKs: SDKs for Python and JS/TS make integration straightforward
- Framework support: Integrated with popular frameworks like OpenAI SDK, LangChain, and LlamaIndex
- Multi-modal: Support for tracing text, images and other modalities
- Open source: Fully open source with public API for custom integrations
Traces allow you to track every LLM call and other relevant logic in your app.
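For example, the OpenAI integration is a drop-in replacement that traces calls automatically; user and session identifiers can be passed along so traces can be grouped and filtered (a sketch assuming the langfuse.openai wrapper and its Langfuse-specific keyword arguments; check the integration docs for the exact attributes supported by your SDK version):

```python
# Drop-in replacement for the OpenAI SDK: model, prompt, completion, token usage,
# and cost are captured automatically for every call.
from langfuse.openai import openai

completion = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
    # Langfuse-specific attributes (assumed keyword arguments)
    user_id="user_123",
    session_id="session_abc",
    tags=["support-bot"],
)
print(completion.choices[0].message.content)
```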
Prompt Management
Langfuse Prompt Management helps you centrally manage, version control, and collaboratively iterate on your prompts.
Key benefits
- Decoupled from code: Deploy new prompts without application redeployment
- Version control: Track changes and quickly rollback when needed
- Performance optimized: Client-side caching prevents latency or availability issues
- Multi-format support: Works with both text and chat prompts
- Flexible access: Edit via UI, SDKs, or API
- Non-technical friendly: Business users can update prompts via Console
Collaboratively version and edit prompts via UI, API, or SDKs.
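For example, a prompt managed in Langfuse can be fetched and filled with variables at runtime. A minimal sketch using the Python SDK (the prompt name and variable are hypothetical):

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch the latest production version of a prompt; it is cached client-side,
# so prompt retrieval does not add latency or an availability dependency.
prompt = langfuse.get_prompt("movie-critic")  # hypothetical prompt name

# Insert the prompt's variables at runtime
compiled_prompt = prompt.compile(movie="Dune 2")
print(compiled_prompt)
```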
Evaluations
Evaluations are the most important part of the LLM application development workflow. Langfuse adapts to your needs and supports:
- LLM-as-a-judge: Fully managed evaluators run on production or development traces within Langfuse
- User feedback: Collect feedback from your users and add it to traces in Langfuse
- Manual labeling: Annotate traces with human feedback in managed workflows
- Custom: Build your own evaluation pipelines via Langfuse APIs/SDKs for full flexibility (see the sketch below)
Plot evaluation results in the Langfuse Dashboard.
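As a sketch of the last option, an external pipeline can fetch recent traces and write scores back via the SDK (method names follow the v2 Python SDK and may differ in other versions; the metric is hypothetical):

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch recent production traces (assumed v2 method; paginate as needed)
traces = langfuse.fetch_traces(limit=50).data

for trace in traces:
    # hypothetical custom metric: flag responses that apologize instead of answering
    value = 0 if trace.output and "sorry" in str(trace.output).lower() else 1
    langfuse.score(
        trace_id=trace.id,
        name="no-apology",
        value=value,
    )
```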
Playground
The LLM Playground is a tool for testing and iterating on your prompts and model configurations, shortening the feedback loop and accelerating development.
Key features
- Integrated: Jump to the playground from prompt management and observability
- Supports variables: Use variables in your prompts to dynamically change the input
- Broad model support: OpenAI, Anthropic, Azure OpenAI, and Amazon Bedrock
- Custom model configurations: You can configure custom model endpoints and credentials
Datasets
With Langfuse Datasets you can create test sets and benchmarks to evaluate the performance of your LLM application:
- Continuous improvement: Create datasets from production edge cases to improve your application
- Pre-deployment testing: Benchmark new releases before deploying to production
- Structured testing: Run experiments on collections of inputs and expected outputs
- Flexible evaluation: Add custom evaluation metrics or use LLM-as-a-judge
- Integration ready: Works with popular frameworks like LangChain and LlamaIndex
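A minimal sketch of a dataset run using the Python SDK (method names follow the v2 SDK and may differ in other versions; the dataset name and application entry point are hypothetical):

```python
from langfuse import Langfuse
from langfuse.decorators import observe

langfuse = Langfuse()

@observe()  # traced so each run can be linked to its dataset item
def my_llm_app(question: str) -> str:
    # hypothetical application entry point
    return "An open-source LLM engineering platform."

# Create a dataset and add a test case (e.g. an edge case observed in production)
langfuse.create_dataset(name="qa-regression")  # hypothetical dataset name
langfuse.create_dataset_item(
    dataset_name="qa-regression",
    input={"question": "What is Langfuse?"},
    expected_output="An open-source LLM engineering platform.",
)

# Run an experiment: execute the app on each item and link the resulting trace
# to a named run so results can be compared across releases
dataset = langfuse.get_dataset("qa-regression")
for item in dataset.items:
    with item.observe(run_name="release-candidate"):
        my_llm_app(item.input["question"])
```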
Get started
Updates
Langfuse evolves quickly; check out the changelog for the latest updates. Subscribe to the mailing list to get notified about new major features:
Get in touch
We actively develop Langfuse in open source together with our community:
- Contribute and vote on the Langfuse roadmap.
- Ask questions on GitHub Discussions or private support channels.
- Report bugs via GitHub Issues.
- Chat with the Langfuse maintainers and community on Discord.