Eval Observability

| Category | Segment | Platform / Tool | Plan / License | Monthly Price USD | Pricing Model | Free Tier / OSS | Included Usage / Limits | Evaluation Capabilities | Observability / Tracing | Prompt / Dataset / Experiment Features | Integrations / Frameworks | Deployment / Hosting | Security / Privacy | Team / Governance | Best Fit | Main Limits / Caveats |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Eval Observability | LLM observability and eval platform | LangSmith | Developer | $0/seat/mo then pay-as-you-go | Per-seat plus trace/Fleet usage |  | 1 free seat; up to 5k base traces/month included; 50 Fleet runs/month; community support | Online/offline evals, annotation queues, prompt improvement and monitoring | Tracing for agent execution, run trees, monitoring and alerting | Prompt Hub, Playground, Canvas, datasets and human feedback workflows | LangChain, LangGraph and custom SDK/API integrations | LangSmith Cloud; self/hybrid hosting on Enterprise | Data retention and support are limited on Developer; enterprise has custom hosting/security | 1 seat only; community support | Solo developers debugging/evaluating LangChain or LangGraph apps | 5k traces/month and one-seat limit make it a prototype tier |
| Eval Observability | LLM observability and eval platform | LangSmith | Plus / Enterprise | $39/seat/mo Plus; Enterprise custom | Per-seat plus pay-as-you-go trace/Fleet usage | Developer free plan exists | Plus includes 10k base traces/month, one dev-sized agent deployment, 500 Fleet runs/month, unlimited seats and up to 3 workspaces; Enterprise custom | Full online/offline evals, annotation workflows, prompt improvement and monitoring | Production tracing, monitoring/alerting and agent deployment visibility | Prompt Hub, Playground, datasets, annotation queues and Fleet workflows | LangChain, LangGraph, SDK/API and enterprise deployments | Cloud plus Enterprise hybrid/self-hosting options | Enterprise SSO/RBAC, support SLA and alternative hosting | Unlimited seats on Plus; Enterprise custom workspaces/seats/security | Teams operating LangChain/LangGraph apps in production | Trace overage and retention choices can drive cost; advanced hosting requires Enterprise |
| Eval Observability | Open-source LLM observability platform | Langfuse | Hobby Cloud / OSS | $0 | Freemium units or free self-hosted OSS |  | Cloud Hobby includes 50k units/month, 30 days data access, 2 users and all platform features with limits; self-hosted full product is open source | Online/offline evaluation, datasets, experiments, scores, LLM-as-judge evaluators and human annotation | LLM/agent tracing, sessions, token/cost tracking, OpenTelemetry and proxy logging | Prompt versioning/fetching/release management, playground and prompt experiments | Python/JS SDKs, OpenTelemetry Java/Go/custom, LiteLLM proxy and framework integrations | Langfuse Cloud or self-hosted Docker/Kubernetes/cloud deployment | Data regions US/EU/JP; Hobby has limited users/support and 30-day data access | 2 users on Hobby; GitHub community support | Indie projects and teams wanting open-source LangSmith alternative | Unit-based pricing counts observations/scores too, not only top-level traces |
| Eval Observability | Open-source LLM observability platform | Langfuse | Core / Pro / Enterprise | $29/mo Core; $199/mo Pro; $2,499/mo Enterprise | Monthly subscription plus unit overage | Hobby free plan exists | Core/Pro include 100k units/month and $8/100k additional; Core has 90 days data access; Pro has 3 years; Enterprise includes Pro+Teams and enterprise controls | Evaluation datasets, experiments, scores, LLM-as-judge, human annotation queues and external evaluation pipelines | High-volume tracing, token/cost tracking, multimodal beta, proxy/OpenTelemetry logging | Prompt management, prompt experiments, release labels, playground and webhooks/Slack | SDKs, OpenTelemetry, LiteLLM, framework integrations and public API | Cloud or self-host; Enterprise custom volume/marketplace/invoice options | Pro has SOC2/ISO27001 reports and BAA available; Enterprise adds audit logs, SCIM, SLA and dedicated support | Unlimited users from Core; Teams add-on adds SSO/RBAC/support on Pro | Teams needing open-source-friendly observability with predictable cloud tiers | Core/Pro both start with 100k units; Teams add-on and overages can materially change price |
| Eval Observability | AI eval and observability platform | Braintrust | Starter | $0 platform fee | Usage-based free tier plus overage |  | 1 GB processed data/month then $4/GB; 10k scores/month then $2.50/1k; 14 days retention; $10/month Topics credit | Experiments, scorers, online scoring, eval datasets, human review scorers and sandbox evals by plan | Production logs, tracing, dashboards, topics and monitoring | Prompt playgrounds, datasets, experiments, exports and environment workflows | SDK/API, custom functions, AI provider gateway and app integrations | Braintrust Cloud; self-hosted customers can adjust some system limits | Starter has Google-only SSO, owner-only permission group and no SOC2/DPA/BAA | Unlimited users/projects/datasets in current Starter model, but limited advanced governance | Small teams starting evals without per-seat fees | Processed-data and score overages can appear once usage exceeds free allocation |
| Eval Observability | AI eval and observability platform | Braintrust | Pro / Enterprise | $249/mo Pro; Enterprise custom | Monthly platform plus usage overage | Starter free tier exists | Pro includes 5 GB processed data/month then $3/GB, 50k scores/month then $1.50/1k, 30 days retention and launch Topics credit; Enterprise custom | Advanced evals, custom charts, environments, dataset snapshots, playground annotations and sandbox evals | Production observability, topics, dashboards, logs and monitoring | Datasets, experiments, prompts, functions, environments and exports | SDK/API, gateway, provider integrations and custom functions | Cloud; enterprise/self-host options by contract | Enterprise adds SAML/OIDC SSO, custom permission groups, retention, exports, SOC2, BAA and custom legal terms | Pro has Owner/engineer/viewer permission groups; Enterprise custom | Growing production teams needing eval/observability plus gateway workflows | Retention is 30 days on Pro; custom retention/SAML/BAA require Enterprise |
| Eval Observability | Hosted observability and eval platform | Arize AX / Phoenix Cloud | AX Free / AX Pro | $0 Free; $50/mo Pro | Hosted SaaS with span/GB quotas and overages |  | AX Free: 25k spans/month, 1 GB/month, 15 days retention; AX Pro: 50k spans/month, 10 GB/month, 30 days retention, higher limits and email support | Online/offline evaluations, datasets, experiments, LLM-as-judge/code evals, session/agent path evals and labeling queues | Hosted tracing, product observability, custom metrics, monitors and Alyx agent assistance | Prompt management, prompt serving, prompt environment tags, replay and optimization | SDKs, OpenTelemetry and framework integrations | Hosted SaaS; Enterprise SaaS or self-hosted | AX Free/Pro regions US/EU/CA; Enterprise adds SOC2/HIPAA, SLA, dedicated support and self-host add-on | 1 organization on AX Free/Pro; Enterprise custom | Teams wanting hosted Phoenix with simple span/GB pricing | AX Pro span/GB overage is separate; Enterprise required for advanced governance |
| Eval Observability | LLM request observability | Helicone | Hobby | $0 | Free request/storage quota |  | 10,000 requests/month, 1 GB storage, 1 seat and 1 organization | Prompt/request analysis and regression-style evaluation workflows depending on feature use | Request logging, usage/cost tracking, metrics, caching, alerts/reporting on paid tiers | Prompts, experiments and query language stronger on Pro+ | OpenAI-compatible proxy style, provider integrations and app SDK/API workflows | Hosted Helicone; self-host/open-source options should be checked in docs/repo | Free plan has one seat/org and limited storage | 1 seat/1 org on Hobby | Indie apps tracking LLM request costs and latency quickly | Free quota is request-limited and storage-limited; team features start paid |
| Eval Observability | LLM request observability | Helicone | Pro / Team | $79/mo Pro; $799/mo Team | Monthly plan plus usage-based pricing | Hobby free plan exists | Pro and Team include 10k free requests plus usage-based pricing; Pro has unlimited seats, alerts/reports and HQL; Team adds 5 organizations and scaling-company features | Evaluation and prompt iteration workflows through request logs, reports and query language | LLM request tracing, usage, cost, latency, alerts, reports and storage | HQL query language, prompt/request analysis and team reporting | Provider proxy/integration workflows for LLM apps | Hosted Helicone | Enterprise adds unlimited orgs and custom terms; Team/Enterprise for broader governance | Unlimited seats on Pro/Team; org count rises from 1 to 5 on Team | Teams needing request-level observability with simple fixed starting price | Usage-based charges apply after included quota; Team price jumps sharply |
| Eval Observability | AI app and model tracking platform | Weights & Biases / Weave | Free | $0/mo | Free cloud plan for personal/small projects |  | Free plan includes AI application evaluations/tracing/scorers, experiment tracking, registry/lineage, CI/CD automations, Slack/email alerts, 5 GB storage and 1 GB/month Weave ingestion | AI application evaluations and scorers | Weave tracing for GenAI applications plus W&B experiment tracking | Datasets/registry/lineage and CI/CD automations | W&B SDK, Weave integrations, model tracking and app tracing workflows | Cloud-hosted Free or local personal server; corporate use rules differ for personal self-host | Free lacks enterprise security; academic research gets separate free program | Free is for personal development/small projects | Developers combining ML experiment tracking with LLM app tracing | Corporate/professional team use generally moves to Pro/Enterprise; ingestion/storage limits apply |
| Eval Observability | AI app and model tracking platform | Weights & Biases / Weave | Pro / Enterprise | Starts at $60/mo Pro; Enterprise custom | Monthly plan plus storage/Weave ingestion/inference usage | Free plan exists | Pro starts at $60/month, includes up to 10 model seats, 100 GB storage and 1.5 GB/month Weave ingestion; additional storage $0.03/GB and Weave ingestion $0.10/MB | AI app evaluations/scorers plus ML experiment and model evaluation workflows | Weave production/development tracing and W&B experiment observability | Registry, lineage, automations, datasets and team collaboration | W&B SDK, Weave, model/inference integrations | Cloud or enterprise/private-hosted deployment options | Enterprise adds SSO, SCIM, audit logs, HIPAA option, customer-managed encryption and enterprise support | Pro for teams under stated guidelines; Enterprise for compliance/security | Teams already using W&B for ML who need GenAI app tracing | Weave ingestion overage can be expensive at high trace payload volume |
| Eval Observability | Open-source GenAI observability/eval platform | Comet Opik | OSS / Free Cloud | $0 | Open source or free hosted cloud |  | OSS full feature set; Free Cloud up to 10 team members, 25k spans/month and 60-day retention | Test suites, assertions, agent testing, evaluations and prompt/trace analysis | Agent tracing, execution graphs, sessions, token/cost tracking and multimedia logging | Agent Playground, prompts/configuration, datasets/experiments and comments | Python/TypeScript SDKs, public API and MCP server | Self-host OSS or Comet-hosted cloud | Free Cloud has usage limits; OSS data stays self-hosted | Free Cloud up to 10 team members | Teams wanting a very generous free LLM observability/eval stack | Cloud span quota is lower than some competitors; self-hosting requires ops |
| Eval Observability | Open-source GenAI observability/eval platform | Comet Opik | Pro Cloud / Enterprise | $19/mo Pro; Enterprise custom | Monthly cloud plan or custom enterprise | OSS and Free Cloud exist | Pro Cloud includes up to 50 team members, 100k spans/month and 60-day retention; Enterprise custom usage and unlimited team members | Test suites/assertions, agent testing, playground and evaluation workflows | Tracing, execution graphs, sessions, token/cost tracking and error surfacing | Prompt/config management, datasets, experiments, annotations and export | SDKs, public API, MCP server and Comet ecosystem | Hosted cloud, self-host OSS or enterprise flexible deployments | Enterprise SSO, dedicated support/SLA and compliance reports | Up to 50 team members on Pro; Enterprise unlimited/custom | Small teams wanting low-cost hosted eval/observability | Pro retention is still 60 days; advanced compliance/deployment requires Enterprise |
| Eval Observability | LLM security and eval CLI | Promptfoo | Community | $0 | Open-source local/self-hosted tool |  | All LLM evaluation features, all model providers/integrations, red teaming up to 10k probes/month, custom app integration and vulnerability scanning | Prompt/model/RAG evaluations, red teaming, factuality, hallucination and vulnerability testing | Local reports and scans rather than hosted trace observability by default | YAML/config-driven test cases, assertions, model comparison and CI integration | All model providers, custom integrations, CI/CD, app targets and security plugins | Run locally or self-host on own infrastructure | Data stays local/self-hosted in Community; community support | Individual/small team use; no hosted team collaboration | Developers adding eval and red-team tests to CI without SaaS | 10k free red-team probes/month; team dashboards/API/cloud require Enterprise |
| Eval Observability | LLM security and eval platform | Promptfoo | Enterprise / On-Premise | Custom | Custom enterprise subscription/deployment | Community free plan exists | Custom red-team limits, team sharing, continuous monitoring, security/compliance dashboard, SSO, API access, managed cloud or on-prem deployment | Advanced LLM security testing, monitoring, red-teaming and evaluations at org scale | Continuous monitoring and centralized dashboards | Saved targets, attack profiles, API access and organization-specific configs | CI/CD, model providers, app integrations, Promptfoo API and managed/on-prem infrastructure | Managed cloud deployment or on-premise deployment with complete data isolation | SSO, granular permissions, compliance dashboard, support and SLA guarantees | Teams-based access controls and custom roles | Organizations needing formal AI security testing and red-team monitoring | Pricing is custom; advanced cloud/on-prem features unavailable in Community |
| Eval Observability | Open-source LLM unit testing | DeepEval | Open source framework | $0 software | Open-source local framework; provider API costs separate |  | Runs local evals/CI; most metrics are LLM-as-judge and default to OpenAI unless configured; can use Anthropic, Gemini, Ollama, Azure OpenAI or custom LLM | LLM unit tests, RAG/agent/multi-turn/safety/MCP metrics, synthetic data and benchmarks | Local testing reports; can integrate with Confident AI for hosted observability | Pytest-like CLI, evaluate(), metrics, datasets and CI/CD workflows | Python framework with provider/model integrations and Confident AI integration | Local/open-source; optional Confident AI cloud | Basic non-identifying telemetry by default can be opted out; cloud data stored in private AWS per FAQ | OSS used by developers/CI; no team governance locally | Engineering teams adding test assertions to LLM apps | Judge model calls can cost money; dependency/runtime compatibility matters in CI |
| Eval Observability | AI quality platform | Confident AI | Free / Starter / Premium | $0 Free; from $19.99/user/mo Starter; from $49.99/user/mo Premium | Per-user plus project and GB-month/eval-run overage |  | Free: 2 users, 1 project, 5 test runs/week, 1 GB-month trace spans and 1 week retention; Starter/Premium add paid users/projects, online eval metric runs and retention controls | LLM eval benchmark/testing reports, unit/regression tests, online evals and custom metrics | LLM tracing, monitoring, alerts and trace span storage by GB-month | Prompt versioning, cloud dataset annotation, no-code workflows and pre-commit evals on Premium | DeepEval, DeepTeam, OpenTelemetry, TypeScript SDK and APIs | Hosted Confident AI; Enterprise dedicated on-prem available | SOC2/HIPAA/GDPR listed; data stored in private AWS per docs | Free limited to 2 seats/1 project; paid per user/project; Team custom | Teams wanting hosted DeepEval-style eval workflows | Self-serve plan math includes user/project/GB/eval overages; Free test runs are capped |
| Eval Observability | RAG and LLM evaluation framework | Ragas | Open source | $0 software | Open-source Python framework; optional services/consulting separate |  | Library for systematic evaluation loops, metrics, experiments, datasets and testset generation; no hosted quota on docs page | RAG metrics such as context precision/recall, faithfulness, response relevancy plus agent/tool, SQL and general-purpose metrics | Integrates with observability tools including Arize and LangSmith; not a full tracing SaaS by itself | Experiments, evaluation datasets, metrics, prompt evaluation and test data generation | LangChain, LlamaIndex, Haystack, LangGraph, Gemini, Bedrock, Vertex AI and other integrations | Local Python library; can plug into external observability tools | Data handling depends on your runner/model providers; open-source code visible | No built-in team governance unless integrated with another platform | Teams evaluating RAG quality with standardized metrics | LLM-as-judge/testset generation can incur model costs; no hosted collaboration tier captured |
| Eval Observability | Open-source eval registry/framework | OpenAI Evals | Open source | $0 software | Open-source framework; model/API costs separate |  | Framework for evaluating LLMs and LLM systems plus open-source benchmark registry | Benchmark and custom eval workflows for LLM systems | Not a tracing/production observability platform | Eval registry, custom eval definitions and scripts | OpenAI API and Python-based eval workflows; can be adapted to other model calls | Local/open-source repo | Data sent to configured model/API providers; repo license governs source | No team governance; repo/CI handles collaboration | Developers creating repeatable model/system benchmarks | Older/evolving repo; may require adaptation for modern agent app evals |
| Eval Observability | Open-source model benchmark harness | lm-evaluation-harness | Open source | $0 software | Open-source benchmark harness |  | Few-shot evaluation harness for language models with many tasks/backends; run costs depend on model backend/API | Standardized model benchmark evaluation and task suites | No production tracing; focused on offline benchmark runs | Task configs, metrics, model adapters and result reporting | Local/HF/vLLM/API-style backends depending on harness support | Local/open-source | Self-managed data and model access | No SaaS governance | Researchers benchmarking base/instruct models | Best for model benchmarks, not app-level RAG/agent observability |
| Eval Observability | Open-source LLM benchmark platform | OpenCompass | Apache-2.0 / open source | $0 software | Open-source platform; API/model costs separate |  | Supports many models and over 100 datasets; can evaluate open-source and API models with CLI or Python scripts | General/scientific/reasoning benchmarks, LLM judge, math evaluation and long-context benchmarks | No production app tracing; offline/leaderboard-oriented evaluation | Dataset configs, model configs, summarizers and benchmark result workflows | HuggingFace, vLLM, LMDeploy, OpenAI/API, ModelScope and other backends | Local/open-source; leaderboard/community infra separate | Self-managed API keys/data; Apache-2.0 repo | Community/open-source governance | Model evaluation teams needing broad benchmark coverage | Setup/dataset prep can be heavy; not app observability |
| Eval Observability | Open-source model evaluation framework | EvalScope | Open source | $0 software | Open-source framework; model/API costs separate |  | Streamlined/customizable framework for efficient LLM, VLM and AIGC evaluation and performance benchmarking | Model and application benchmark/evaluation workflows | No hosted tracing platform by default | Benchmarks, reports and performance testing workflows | ModelScope ecosystem plus local/model backends depending on configuration | Local/open-source | Self-managed data/model/API usage | No SaaS governance by default | Teams evaluating LLM/VLM/AIGC model performance | Best for benchmark/performance eval, not production trace management |
| Eval Observability | Open-source LLM eval toolkit | Hugging Face LightEval | Open source | $0 software | Open-source toolkit; model/API costs separate |  | All-in-one toolkit for evaluating LLMs across multiple backends | Offline model benchmark/evaluation toolkit | No production observability/tracing | Task configuration and evaluation reporting | Hugging Face ecosystem and multiple model backends | Local/open-source | Self-managed data/model execution | No team governance unless combined with HF/CI workflows | Researchers and model builders using Hugging Face workflows | Model-centric eval rather than app/RAG/agent observability |
| Eval Observability | Open-source LLM experiment tracking/evaluation | TruLens | Open source | $0 software | Open-source framework; model/API costs separate |  | Evaluation and tracking for LLM experiments and AI agents | Feedback functions, RAG/agent app evaluation and experiment comparison | Tracking/tracing within local/app workflows; hosted governance depends on external platform | Experiment records, feedback functions, leaderboards and app-level eval workflows | Python ecosystem, LlamaIndex/LangChain style app integrations | Local/open-source | Self-managed data unless connected to external services | No built-in cloud team governance in OSS row | Developers evaluating RAG/agent apps with feedback functions | Requires custom setup and model/provider calls; not a turnkey SaaS dashboard alone |
| Eval Observability | LLM response comparator | LLM Comparator | Open source | $0 software | Open-source visualization tool |  | Interactive data visualization tool for evaluating and analyzing LLM responses side-by-side | Side-by-side response comparison and qualitative eval analysis | No request tracing or monitoring | Datasets/outputs visualization rather than prompt registry | Browser/data visualization workflow | Local/open-source | Self-managed datasets/output files | No SaaS governance | Teams comparing model outputs and evaluator disagreement visually | Narrower than full eval suites; needs prepared outputs |
| Eval Observability | AI coding agent observability | agenttrace | Open source | $0 software | Local-first TUI |  | Local-first TUI observability for AI coding agents; tracks cost, tokens, tool failures, anomalies, health and CI gates across agent exports per local resource | Evaluation/quality gates for agent sessions and CI evidence | Cost/token/tool-failure observability for Claude Code, Codex CLI, Gemini CLI, Aider and Cursor exports | Session exports and CI gates rather than LLM prompt datasets | Claude Code, Codex CLI, Gemini CLI, Aider and Cursor exports | Local/self-hosted CLI/TUI | Local-first; data stays in workspace unless exported | Governance through local repo/CI policies | Developers monitoring AI coding agent runs and failures | Not a general LLM app observability platform; project maturity depends on repo |
| Eval Observability | PR quality evaluation | PR Triage | MIT / BYOK | $0 software | Open-source BYOK web app |  | Open-source PR evaluation tool scoring pull requests on six quality dimensions with diff evidence; bring your own key per local resource | PR/code-quality evaluation and evidence-backed scoring | No runtime LLM app tracing; focused on code review evidence | Score reports over PR diffs rather than prompt/dataset management | Git/PR diff workflows and BYOK model access | Hosted demo/web app or self-host from source if available | BYOK; data exposure depends on where hosted/model provider | No enterprise governance captured | Developers wanting lightweight PR eval reports | Narrow code-review use case; not general model/app eval platform |
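
Several rows quote a base fee plus metered overage, so a rough monthly estimate is simple arithmetic. The sketch below is a minimal illustration, not any vendor's billing formula. The `Meter` class and the `overage_cost` and `monthly_cost` helpers are hypothetical names introduced here, and the plan figures are copied from the Langfuse Core and Braintrust Pro rows. It assumes overage is billed linearly (pro rata). A real invoice may round up to whole blocks or apply volume tiers, so treat the result as a lower-bound estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meter:
    """One metered dimension of a plan (units, GB, scores, ...)."""
    included: float        # quantity covered by the base fee
    overage_price: float   # USD per `overage_block` beyond `included`
    overage_block: float   # size of one priced block, e.g. 100_000 units


def overage_cost(meter: Meter, used: float) -> float:
    """Assumption: linear (pro-rata) overage with no block rounding or tiers."""
    extra = max(0.0, used - meter.included)
    return extra / meter.overage_block * meter.overage_price


def monthly_cost(base_fee: float, usage: list[tuple[Meter, float]]) -> float:
    """Base platform fee plus the overage of every metered dimension."""
    return base_fee + sum(overage_cost(meter, used) for meter, used in usage)


# Figures taken from the table rows; verify against current vendor pricing.
LANGFUSE_CORE_FEE = 29.0
LANGFUSE_UNITS = Meter(included=100_000, overage_price=8.0, overage_block=100_000)

BRAINTRUST_PRO_FEE = 249.0
BRAINTRUST_DATA_GB = Meter(included=5, overage_price=3.0, overage_block=1)
BRAINTRUST_SCORES = Meter(included=50_000, overage_price=1.50, overage_block=1_000)

if __name__ == "__main__":
    # Langfuse Core with 350k units: 29 + 2.5 * 8
    print(monthly_cost(LANGFUSE_CORE_FEE, [(LANGFUSE_UNITS, 350_000)]))  # 49.0
    # Braintrust Pro with 8 GB and 80k scores: 249 + 3 * 3 + 30 * 1.50
    print(monthly_cost(BRAINTRUST_PRO_FEE,
                       [(BRAINTRUST_DATA_GB, 8), (BRAINTRUST_SCORES, 80_000)]))  # 303.0
```

Usage at or below the included quota adds nothing to the base fee, which is why the free and entry tiers look cheap until trace volume grows. Per-seat plans such as LangSmith Plus would need an extra seat-count term that this sketch does not model.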