LLM Inference

Tool
Category
Segment
Platform / Tool
Plan / License
Monthly Price USD
Pricing Model
Free Tier / OSS
Included Usage / Limits
Model / Runtime Support
Serving API / Scaling
Acceleration / Quantization
Integrations / Frameworks
Deployment / Hosting
Security / Privacy
Team / Governance
Best Fit
Main Limits / Caveats
No tagline
LLM InferenceOpen-source serving enginevLLMApache-2.0 / open source$0 software; GPU/hosting costs separateSelf-hosted inference engine; optional managed hosting through third partiesOfficial docs cover offline inference, OpenAI-compatible online serving, distributed deployment, Kubernetes/Docker and integrationsGenerative, pooling, embedding, scoring, reward, multimodal and selected speech-to-text model paths through supported HF-style modelsOpenAI-compatible server, batch/offline inference, Ray serving examples, Kubernetes and Helm deployment pathsPagedAttention lineage, prefix caching, speculative decoding, quantization backends, tensor/data/expert parallel and disaggregated serving featuresHugging Face models, LangChain, LlamaIndex, Ray Serve, Kubernetes, Docker, Prometheus/Grafana and many app frameworksSelf-hosted local/GPU server, Kubernetes, cloud GPU VMs or managed partnersData stays in chosen infrastructure; usage stats and security docs should be reviewed before regulated deploymentNo SaaS team layer in OSS; ops governance through Kubernetes/cloud/IAMHigh-throughput production serving for open-weight LLMs and VLMsFast-moving engine; model, hardware and quantization support must be validated per release
No tagline
LLM InferenceLocal model runnerOllamaOpen source core / local app$0 software; optional third-party hosting/model costs separateLocal model runner with model library and local APIOfficial site presents Ollama as a way to get up and running with large language models locallyModel library includes many open LLM families packaged for local use; hardware limits determine practical model sizeLocal CLI and HTTP API; used by many desktop apps, coding agents and RAG toolsQuantized local model distribution, llama.cpp-backed runtime lineage and model pull/run workflowOpen WebUI, Continue, Cline, Aider, LangChain, LlamaIndex, desktop apps and local RAG stacksLocal macOS/Windows/Linux machines, workstations or self-hosted serversPrompts can stay local when using local models and local clientsNo hosted team governance in core local workflowDevelopers wanting a simple local model runner without manual GGUF/server setupNot a managed production autoscaling platform; model/version and hardware fit need testing
No tagline
LLM InferenceOpen-source serving engineSGLangApache-2.0 / open source$0 software; GPU/hosting costs separateSelf-hosted serving framework and runtimeDocs position SGLang as a fast serving framework for large language models and vision-language modelsLLMs and VLMs with structured generation, tool use and reasoning-oriented serving features depending model/backendRuntime/server for OpenAI-compatible APIs, data-parallel serving and production deployment examplesRadixAttention-style KV reuse, batching, constrained decoding, speculative decoding and multi-node/distributed serving featuresPython, Hugging Face models, OpenAI-compatible clients, Kubernetes/cloud GPU deployment recipesSelf-hosted GPU servers, containers, Kubernetes or cloud GPU infrastructureData path controlled by self-hosted deployment and selected model weightsNo SaaS governance by default; cluster/cloud controls applyTeams optimizing LLM/VLM serving latency and structured generation workloadsFeature maturity and model compatibility move quickly; requires benchmark validation
No tagline
LLM InferenceNVIDIA optimization libraryNVIDIA TensorRT-LLMApache-2.0 / open source$0 software; NVIDIA GPU/hosting costs separateSelf-hosted optimization/runtime libraryOfficial docs describe TensorRT-LLM as tooling to build and run optimized LLM inference on NVIDIA GPUsNVIDIA GPU-targeted LLM and VLM model families depending release and engine build supportRuntime libraries and deployment examples; often paired with Triton or NIM for servingTensorRT engines, quantization, kernel fusion, paged KV cache, inflight batching and multi-GPU parallelismNVIDIA GPUs, Triton Inference Server, NIM, NeMo, Kubernetes and cloud GPU stacksSelf-hosted NVIDIA GPU servers, containers or managed NVIDIA ecosystem platformsData stays in chosen GPU environment; enterprise security depends deployment stackGovernance through NVIDIA/cloud/Kubernetes controls rather than a standalone SaaS layerMaximizing performance on NVIDIA GPUs for production LLM inferenceHardware-specific and build-heavy; less portable than CPU/GPU-neutral runtimes
No tagline
LLM InferenceOpen-source deployment toolkitLMDeployApache-2.0 / open source$0 software; GPU/hosting costs separateSelf-hosted toolkit for compression, deployment and servingDocs describe LMDeploy as a toolkit for compressing, deploying and serving LLMs and VLMsLLMs and vision-language models, especially InternLM/OpenMMLab and HF-compatible model familiesTurboMind/PyTorch engines, serving APIs and deployment tutorialsKV cache management, quantization, tensor parallelism and inference acceleration pathsHugging Face, OpenMMLab/InternLM ecosystem, Docker, Kubernetes and OpenAI-compatible clientsSelf-hosted GPU servers, containers and cloud GPU infrastructureData stays in selected infrastructure; model/license governance remains with operatorNo SaaS governance by default; infra controls applyTeams in the InternLM/OpenMMLab ecosystem needing efficient LLM/VLM servingModel coverage and best performance may be ecosystem-specific; benchmark before standardizing
No tagline
LLM InferenceOpen-source serving engineHugging Face Text Generation InferenceApache-2.0 / open source$0 software; hardware or hosted endpoint costs separateSelf-hosted container/server; also used in Hugging Face managed infrastructureDocs describe TGI as a toolkit for deploying and serving large language modelsTransformer text-generation models from Hugging Face ecosystem; model support follows TGI backend/versionHTTP server for text generation with production serving features and containerized deploymentContinuous batching, tensor parallelism, quantization integrations and optimized transformer serving featuresHugging Face Hub, Inference Endpoints, Docker, Kubernetes and provider integrationsSelf-hosted GPU infrastructure or Hugging Face managed endpointsSelf-hosting keeps data in your infrastructure; managed endpoints follow Hugging Face terms and region controlsHF org/repo governance if using Hub or managed endpoints; otherwise infra controlsProduction text-generation serving close to the Hugging Face model ecosystemPrimarily text-generation oriented; compare against vLLM/SGLang for newest model/runtime features
No tagline
LLM InferenceDistributed serving frameworkRay ServeApache-2.0 / open source$0 software; cluster/cloud costs separateSelf-hosted scalable serving frameworkDocs describe Ray Serve as scalable and programmable serving, including LLM serving paths in the Ray ecosystemAny Python model/service, with LLM serving examples and integrations for vLLM/backendsAutoscaling deployments, replicas, request routing and Python-native service compositionCluster autoscaling, batching and backend integration rather than model-kernel optimization by itselfRay, vLLM, Kubernetes, Anyscale, FastAPI-style Python services and distributed appsSelf-hosted Ray clusters, Kubernetes or managed Anyscale/cloud deploymentsData stays in selected Ray/cloud deployment; security depends cluster/IAM setupCluster-level governance; managed Anyscale adds org/security features if usedTeams composing LLM inference with broader distributed Python servicesNot a low-level LLM engine alone; requires backend runtime and Ray operations expertise
No tagline
LLM InferenceLocal GPU inference libraryExLlamaV2MIT / open source$0 software; local GPU costs separateLocal inference library focused on quantized modelsGitHub describes ExLlamaV2 as a fast inference library for running LLMs locally on modern consumer-class GPUsQuantized local LLMs in EXL2/GPTQ-style ecosystem; model support follows project toolingLibrary and example server/UI integrations rather than a full managed platformEXL2 quantization, CUDA-focused kernels and consumer GPU memory/performance optimizationstext-generation-webui, local scripts, model quantization workflows and Python toolingLocal Windows/Linux GPU workstations or self-hosted GPU machinesLocal execution keeps prompts/models on the machine unless integrated with external servicesNo SaaS governance; local filesystem/GPU access controls applyPower users serving quantized models on consumer NVIDIA GPUsNarrower deployment surface than vLLM/TGI; mainly local enthusiast/workstation use
No tagline
LLM InferenceLightweight serving frameworkLitServeApache-2.0 / open source$0 software; cloud/compute costs separateOpen-source serving framework from Lightning AI ecosystemDocs list features for generative AI serving, including streaming and LLM serving patternsAny AI model served from Python, including LLMs through custom engines or LitGPT-related pathsFastAPI-style server, streaming responses, batching and self-hosting optionsFramework-level batching/streaming; model acceleration depends selected runtime/backendPyTorch, Lightning AI, LitGPT, Docker/cloud deployment and Python ML appsSelf-hosted machine/server or Lightning Studios/cloud workflowsSelf-hosting keeps data in selected environment; cloud features follow Lightning termsNo SaaS governance in OSS; cloud workspace controls if using Lightning platformSmall teams needing simple production APIs around custom AI/LLM modelsLess specialized than vLLM/SGLang for high-throughput LLM-only serving
No tagline
LLM InferenceNVIDIA model serving serverNVIDIA Triton Inference ServerBSD-3-Clause / open source$0 software; GPU/hosting costs separateSelf-hosted inference server for multiple frameworksOfficial docs cover Triton Inference Server for serving models across multiple backends and deployment targetsBroad ML models and LLM backends through TensorRT-LLM, Python, ONNX Runtime and other Triton backendsHTTP/gRPC inference APIs, model repository, dynamic batching, ensembles and Kubernetes deployment patternsDynamic batching, concurrent execution, TensorRT/TensorRT-LLM acceleration and GPU metrics integrationNVIDIA GPUs, TensorRT-LLM, ONNX Runtime, Kubernetes, Prometheus, cloud GPU platformsSelf-hosted GPU servers, Kubernetes or NVIDIA/cloud environmentsData remains in chosen server/cluster; enterprise hardening depends deploymentNo SaaS governance by itself; platform governance through Kubernetes/cloud/NVIDIA stackOrganizations already standardizing on NVIDIA serving infrastructure across model typesMore general-purpose than LLM-specialized; LLM performance depends backend configuration
No tagline
LLM InferenceManaged model serving platformBasetenHosted managed platform$0/month base, pay as you goPay-as-you-go dedicated deployments and Model APIs; Pro/Enterprise quoted/volumeNo durable free credits captured on pricing pagePricing page lists Basic at $0/month pay-as-you-go, dedicated deployments, model APIs and trainingCustom, fine-tuned and open-source models; Model APIs include pre-optimized models with per-token pricesDedicated deployments, autoscaling, model APIs and deployment options for Baseten/VPC/hybrid tiersBaseten Inference Stack, optimized model APIs, fast cold starts and dedicated compute optionsTruss/Baseten docs, custom containers, open-source models, fine-tuned models and cloud/VPC deploymentsHosted Baseten, customer VPC or hybrid depending planPricing page lists SOC 2 Type II and HIPAA compliant; Enterprise adds data residency and advanced security/complianceBasic support; Enterprise includes advanced RBAC with Teams and custom SLAsProduction teams that want managed custom/open-model inference without building platform opsDetailed GPU pricing may require current docs/quote; Enterprise/VPC features are not self-serve Basic
No tagline
LLM InferenceRust inference servermistral.rsMIT / open source$0 software; local/cloud hardware costs separateLocal/self-hosted Rust LLM runtime and serverGitHub describes mistral.rs as fast, flexible LLM inference with OpenAI and Anthropic compatible servingHugging Face models, GGUF/UQFF quantized models, multimodal models and embeddings depending releaseSingle binary CLI/server with OpenAI-compatible /v1 endpoints, Anthropic-compatible messages and built-in web UIHardware-aware tuning, quantization, CUDA performance paths, paged kernels and metricsRust/Python SDKs, OpenAI-compatible clients, local model files and Hugging Face modelsLocal desktop/server, containers or cloud GPU machinesCan run locally; server auth/network hardening is operator responsibilityNo SaaS governance; local/server controls applyDevelopers wanting a compact all-in-one local server with modern compatibility endpointsNewer ecosystem than llama.cpp/vLLM; validate stability and model support before production
No tagline
LLM InferenceCloud/multicloud serving orchestratorSkyPilot SkyServeApache-2.0 / open source$0 software; cloud/GPU costs separateOpen-source multicloud serving library for model endpointsDocs describe SkyServe as SkyPilot's model serving library with examples for vLLM and TGIModel servers such as vLLM, TGI or custom containers running across cloud/GPU providersManaged replicas, autoscaling, load balancing and endpoint management across cloudsInfrastructure cost/performance optimization through cloud/region/GPU placement; model acceleration comes from chosen serverAWS, GCP, Azure, Kubernetes, Lambda Labs, RunPod, vLLM, TGI and custom Docker images depending setupSelf-managed across cloud accounts, Kubernetes and GPU providersData and credentials flow through configured clouds; security depends account/IAM/network setupCloud IAM/project governance; no hosted SaaS team layer in OSSTeams wanting portable LLM serving across clouds and GPU availability poolsOperational complexity and cloud quota management remain with the operator
No tagline
LLM InferenceServerless GPU platformModalHosted serverless platform$0 base plan plus compute; Starter includes $30/month free creditsUsage-based serverless compute billed per second/resource; paid Team plan is $250/month plus computeYes, $30/month free compute credits on StarterPricing page lists GPU, CPU, memory and volume rates plus Starter/Team/Enterprise plan limitsRun custom LLM servers such as vLLM/TGI or arbitrary Python inference appsAutoscaling functions, web endpoints, containers, GPU concurrency and scheduled jobsAcceleration depends chosen GPU/runtime; Modal provides serverless scaling and fast cold-start platform featuresPython, containers, custom images, secrets, web functions, vLLM examples and cloud storage patternsHosted Modal serverless cloudSOC 2 listed; HIPAA compatibility, audit logs, RBAC and SSO are higher-tier features on pricing pageStarter includes 3 seats; Team has unlimited seats, higher concurrency and more governance featuresDevelopers shipping bursty LLM inference endpoints without managing KubernetesCosts can rise with high GPU seconds; some governance/features require paid Team/Enterprise
No tagline
LLM InferenceModel serving frameworkBentoMLApache-2.0 / open source$0 software; cloud/compute costs separateOpen-source model-serving framework with optional BentoCloudDocs position BentoML as a framework for building, packaging and deploying AI applications and model servicesAny Python model/service, including LLM inference services with custom runners/backendsHTTP APIs, service packaging, containerization and deployment workflowsDepends on chosen backend such as vLLM/TGI/transformers; BentoML focuses on packaging and serving architecturePython, Docker, Kubernetes, cloud deployment, model stores and BentoCloudSelf-hosted containers/Kubernetes or managed BentoCloudSelf-hosting keeps data under operator control; managed cloud follows BentoML cloud termsTeam governance depends self-hosted infra or BentoCloud org controlsPackaging custom LLM inference services with production API boundariesNot a specialized LLM kernel engine; performance depends on integrated backend
No tagline
LLM InferenceLocal distributed inference platformXinferenceApache-2.0 / open source$0 software; local/cloud hardware costs separateSelf-hosted model-serving platformDocs present Xinference as a versatile library to serve language, speech recognition and multimodal modelsLLMs, embeddings, rerankers, image/audio/speech and multimodal model families depending backendRESTful/API server, web UI, workers and distributed local/cluster servingBackend choices can include llama.cpp, vLLM, transformers and other runtime paths with quantized model supportLangChain, LlamaIndex, OpenAI-compatible clients, local model ecosystems and cluster workersLocal workstation, private server, Docker, Kubernetes or cloud GPU nodesCan run on local/private infrastructure; privacy depends chosen models/backendsNo hosted team governance by default; cluster/user controls are operator-managedUnified self-hosted model serving across LLM, embedding and multimodal workloadsBroader surface can mean more dependency/backend management than a single-purpose engine
No tagline
LLM InferencePortable compiler/runtimeMLC LLMApache-2.0 / open source$0 software; device/hosting costs separateCompiler/runtime for deploying LLMs across devicesOfficial docs present MLC LLM for compiling and deploying LLMs across platformsOpen LLMs compiled for CPUs, GPUs, mobile and browser/WebGPU-style targets depending supportLocal runtimes, CLI/app deployment and device-specific serving patternsTVM-based compilation, quantization and hardware-specific code generationPython, C++, JavaScript/WebGPU, mobile apps and local deployment workflowsLocal desktop, mobile, browser, edge and self-hosted devicesCan run on-device; privacy depends app wrapper and model distributionNo SaaS governance by defaultCross-platform/on-device LLM deployment where portability matters more than managed servingModel compilation and runtime support can be complex; not a turnkey cloud autoscaler
No tagline
LLM InferenceIntel runtime/libraryOpenVINO GenAIApache-2.0 / open source$0 software; Intel/client hardware costs separateRuntime library and pipelines for generative AI inferenceOfficial docs say OpenVINO GenAI extends OpenVINO Runtime with pipelines and methods for generative AI modelsLLMs, VLMs and text-to-image models supported by OpenVINO GenAI pipelines and model conversion flowApplication-level pipelines; model server/serving can be built around OpenVINO Runtime/Model ServerCPU, Intel GPU/NPU paths, compilation cache, speculative decoding and OpenVINO model optimizationOpenVINO Runtime, OpenVINO Model Server, notebooks, C++/Python/JS samples and Intel hardware stackLocal PC, edge, server, Intel GPU/NPU/CPU environments and containersOn-device/private inference possible when models run locallyNo SaaS governance by default; enterprise controls through deployment environmentIntel hardware users optimizing local/edge generative inferenceBest fit is Intel/OpenVINO stack; model conversion and accelerator support must be tested
No tagline
LLM InferencePackaged NVIDIA microserviceNVIDIA NIM for LLMsCommercial NVIDIA software / containersContact sales or NVIDIA entitlement; infrastructure costs separatePrebuilt inference microservices and containers for NVIDIA GPU environmentsSome trial/catalog access may be available; durable free tier not capturedNVIDIA docs describe NIM microservices for optimized model inference with OpenAI-compatible APIs for supported LLMsNVIDIA-optimized LLM containers and supported model profiles from NVIDIA catalogOpenAI-compatible endpoints, containerized deployment and enterprise operations patternsTensorRT-LLM optimization, prebuilt model profiles, GPU acceleration and NVIDIA runtime stackNVIDIA AI Enterprise, Kubernetes, cloud marketplaces, NGC catalog, Triton/TensorRT ecosystemSelf-hosted enterprise GPU clusters, cloud GPUs and NVIDIA-supported environmentsCan be deployed inside customer infrastructure; enterprise controls depend NVIDIA/cloud setupEnterprise licensing, support, RBAC/IAM through platform and NVIDIA subscriptionsEnterprises wanting supported, packaged NVIDIA LLM inference instead of assembling OSS runtime piecesLicensing/entitlement and supported model matrix must be checked before procurement
No tagline
LLM InferencePython binding and serverllama-cpp-pythonMIT / open source$0 software; local hardware costs separatePython package and optional OpenAI-compatible web serverGitHub README describes Python bindings for llama.cpp with high-level API and OpenAI-compatible web serverGGUF/local llama.cpp-supported models; text, function calling and vision paths depending build/modelPython API plus OpenAI-compatible local server for app integrationInherits llama.cpp backends including CPU, CUDA, Metal, ROCm, Vulkan, SYCL and quantized GGUF supportLangChain, LlamaIndex, Python apps, local Copilot replacements and llama.cpp ecosystemLocal Python app/server, containers or workstation deploymentsCan run fully local; Python app/server network exposure must be controlled by operatorNo SaaS governance; package/server controls are localPython developers embedding local LLM inference directly into appsInstall/build complexity and wheel availability vary by platform/backend
No tagline
LLM InferenceRuntime SDKONNX Runtime GenAIMIT / open source$0 software; hardware/hosting costs separateRuntime API for generative AI with ONNX modelsDocs describe a Generate API preview for generative AI scenarios on ONNX RuntimeONNX-converted LLMs/VLMs and generative models supported by ONNX Runtime GenAI APIsC/C++/C#/Python style runtime APIs; app developers build serving around the runtimeONNX graph/runtime optimizations, CPU/GPU/accelerator execution providers and model quantization workflowsONNX, Windows/DirectML, Azure/edge workflows and application runtimesEmbedded apps, local servers, edge devices or cloud VMsData stays in application environment; privacy depends host app/deploymentNo SaaS governance; controlled by app/cloud environmentTeams standardizing on ONNX for portable local or edge generative inferencePreview/API maturity and model conversion friction need validation
No tagline
LLM InferenceKubernetes inference platformKServeApache-2.0 / open source$0 software; Kubernetes/cloud costs separateKubernetes-native inference platformOfficial site describes KServe as a standardized distributed generative and predictive AI inference platform for KubernetesPredictive ML plus generative inference; LLM paths include vLLM and llm-d optimized backendsKubernetes CRDs, InferenceService, OpenAI-compatible protocol, traffic routing and autoscalingBackend-driven acceleration through vLLM/llm-d/Hugging Face servers; KServe adds routing/serverless platform layerKubernetes, Knative, Istio/Envoy, vLLM, Hugging Face runtime, Kubeflow ecosystemSelf-hosted or managed Kubernetes clustersData stays in cluster/cloud account; security controlled by Kubernetes/IAM/network policyKubernetes RBAC, namespaces, service accounts and platform governancePlatform teams standardizing model serving on KubernetesOperationally heavy for small teams; backend/model server still determines LLM speed
No tagline
LLM InferenceServerless GPU platformRunPod ServerlessHosted GPU/serverless platformNo fixed monthly fee captured; GPU workers billed by usage per RunPod pricingUsage-based GPU pods/serverless workers with per-GPU pricingNo durable free tier capturedDocs include a vLLM serverless worker guide; pricing page lists per-second/hour GPU rates by hardwareCustom containers and vLLM workers for open LLMs; model support follows container/runtimeServerless endpoints, worker templates, autoscaling and request queue patternsAcceleration depends selected GPU and runtime such as vLLM; platform provides serverless GPU orchestrationDocker, vLLM, Hugging Face models, RunPod templates, cloud storage and serverless APIsHosted RunPod GPU cloud/serverlessData path is hosted RunPod infrastructure; security controls depend account/network configurationAccount/team controls and enterprise options should be checked for production governanceCost-sensitive GPU inference where serverless workers fit bursty workloadsCold starts, queueing and GPU availability can affect latency; costs depend hardware/runtime choices