LLM Inference

Tool	Category	Segment	Platform / Tool	Plan / License	Monthly Price USD	Pricing Model	Free Tier / OSS	Included Usage / Limits	Model / Runtime Support	Serving API / Scaling	Acceleration / Quantization	Integrations / Frameworks	Deployment / Hosting	Security / Privacy	Team / Governance	Best Fit	Main Limits / Caveats
vLLM OSS No tagline	LLM Inference	Open-source serving engine	vLLM	Apache-2.0 / open source	$0 software; GPU/hosting costs separate	Self-hosted inference engine; optional managed hosting through third parties	✓	Official docs cover offline inference, OpenAI-compatible online serving, distributed deployment, Kubernetes/Docker and integrations	Generative, pooling, embedding, scoring, reward, multimodal and selected speech-to-text model paths through supported HF-style models	OpenAI-compatible server, batch/offline inference, Ray serving examples, Kubernetes and Helm deployment paths	PagedAttention lineage, prefix caching, speculative decoding, quantization backends, tensor/data/expert parallel and disaggregated serving features	Hugging Face models, LangChain, LlamaIndex, Ray Serve, Kubernetes, Docker, Prometheus/Grafana and many app frameworks	Self-hosted local/GPU server, Kubernetes, cloud GPU VMs or managed partners	Data stays in chosen infrastructure; usage stats and security docs should be reviewed before regulated deployment	No SaaS team layer in OSS; ops governance through Kubernetes/cloud/IAM	High-throughput production serving for open-weight LLMs and VLMs	Fast-moving engine; model, hardware and quantization support must be validated per release
Ollama Local No tagline	LLM Inference	Local model runner	Ollama	Open source core / local app	$0 software; optional third-party hosting/model costs separate	Local model runner with model library and local API	✓	Official site presents Ollama as a way to get up and running with large language models locally	Model library includes many open LLM families packaged for local use; hardware limits determine practical model size	Local CLI and HTTP API; used by many desktop apps, coding agents and RAG tools	Quantized local model distribution, llama.cpp-backed runtime lineage and model pull/run workflow	Open WebUI, Continue, Cline, Aider, LangChain, LlamaIndex, desktop apps and local RAG stacks	Local macOS/Windows/Linux machines, workstations or self-hosted servers	Prompts can stay local when using local models and local clients	No hosted team governance in core local workflow	Developers wanting a simple local model runner without manual GGUF/server setup	Not a managed production autoscaling platform; model/version and hardware fit need testing
SGLang OSS No tagline	LLM Inference	Open-source serving engine	SGLang	Apache-2.0 / open source	$0 software; GPU/hosting costs separate	Self-hosted serving framework and runtime	✓	Docs position SGLang as a fast serving framework for large language models and vision-language models	LLMs and VLMs with structured generation, tool use and reasoning-oriented serving features depending model/backend	Runtime/server for OpenAI-compatible APIs, data-parallel serving and production deployment examples	RadixAttention-style KV reuse, batching, constrained decoding, speculative decoding and multi-node/distributed serving features	Python, Hugging Face models, OpenAI-compatible clients, Kubernetes/cloud GPU deployment recipes	Self-hosted GPU servers, containers, Kubernetes or cloud GPU infrastructure	Data path controlled by self-hosted deployment and selected model weights	No SaaS governance by default; cluster/cloud controls apply	Teams optimizing LLM/VLM serving latency and structured generation workloads	Feature maturity and model compatibility move quickly; requires benchmark validation
TensorRT-LLM OSS No tagline	LLM Inference	NVIDIA optimization library	NVIDIA TensorRT-LLM	Apache-2.0 / open source	$0 software; NVIDIA GPU/hosting costs separate	Self-hosted optimization/runtime library	✓	Official docs describe TensorRT-LLM as tooling to build and run optimized LLM inference on NVIDIA GPUs	NVIDIA GPU-targeted LLM and VLM model families depending release and engine build support	Runtime libraries and deployment examples; often paired with Triton or NIM for serving	TensorRT engines, quantization, kernel fusion, paged KV cache, inflight batching and multi-GPU parallelism	NVIDIA GPUs, Triton Inference Server, NIM, NeMo, Kubernetes and cloud GPU stacks	Self-hosted NVIDIA GPU servers, containers or managed NVIDIA ecosystem platforms	Data stays in chosen GPU environment; enterprise security depends deployment stack	Governance through NVIDIA/cloud/Kubernetes controls rather than a standalone SaaS layer	Maximizing performance on NVIDIA GPUs for production LLM inference	Hardware-specific and build-heavy; less portable than CPU/GPU-neutral runtimes
LMDeploy OSS No tagline	LLM Inference	Open-source deployment toolkit	LMDeploy	Apache-2.0 / open source	$0 software; GPU/hosting costs separate	Self-hosted toolkit for compression, deployment and serving	✓	Docs describe LMDeploy as a toolkit for compressing, deploying and serving LLMs and VLMs	LLMs and vision-language models, especially InternLM/OpenMMLab and HF-compatible model families	TurboMind/PyTorch engines, serving APIs and deployment tutorials	KV cache management, quantization, tensor parallelism and inference acceleration paths	Hugging Face, OpenMMLab/InternLM ecosystem, Docker, Kubernetes and OpenAI-compatible clients	Self-hosted GPU servers, containers and cloud GPU infrastructure	Data stays in selected infrastructure; model/license governance remains with operator	No SaaS governance by default; infra controls apply	Teams in the InternLM/OpenMMLab ecosystem needing efficient LLM/VLM serving	Model coverage and best performance may be ecosystem-specific; benchmark before standardizing
TGI OSS No tagline	LLM Inference	Open-source serving engine	Hugging Face Text Generation Inference	Apache-2.0 / open source	$0 software; hardware or hosted endpoint costs separate	Self-hosted container/server; also used in Hugging Face managed infrastructure	✓	Docs describe TGI as a toolkit for deploying and serving large language models	Transformer text-generation models from Hugging Face ecosystem; model support follows TGI backend/version	HTTP server for text generation with production serving features and containerized deployment	Continuous batching, tensor parallelism, quantization integrations and optimized transformer serving features	Hugging Face Hub, Inference Endpoints, Docker, Kubernetes and provider integrations	Self-hosted GPU infrastructure or Hugging Face managed endpoints	Self-hosting keeps data in your infrastructure; managed endpoints follow Hugging Face terms and region controls	HF org/repo governance if using Hub or managed endpoints; otherwise infra controls	Production text-generation serving close to the Hugging Face model ecosystem	Primarily text-generation oriented; compare against vLLM/SGLang for newest model/runtime features
Ray Serve OSS No tagline	LLM Inference	Distributed serving framework	Ray Serve	Apache-2.0 / open source	$0 software; cluster/cloud costs separate	Self-hosted scalable serving framework	✓	Docs describe Ray Serve as scalable and programmable serving, including LLM serving paths in the Ray ecosystem	Any Python model/service, with LLM serving examples and integrations for vLLM/backends	Autoscaling deployments, replicas, request routing and Python-native service composition	Cluster autoscaling, batching and backend integration rather than model-kernel optimization by itself	Ray, vLLM, Kubernetes, Anyscale, FastAPI-style Python services and distributed apps	Self-hosted Ray clusters, Kubernetes or managed Anyscale/cloud deployments	Data stays in selected Ray/cloud deployment; security depends cluster/IAM setup	Cluster-level governance; managed Anyscale adds org/security features if used	Teams composing LLM inference with broader distributed Python services	Not a low-level LLM engine alone; requires backend runtime and Ray operations expertise
ExLlamaV2 OSS No tagline	LLM Inference	Local GPU inference library	ExLlamaV2	MIT / open source	$0 software; local GPU costs separate	Local inference library focused on quantized models	✓	GitHub describes ExLlamaV2 as a fast inference library for running LLMs locally on modern consumer-class GPUs	Quantized local LLMs in EXL2/GPTQ-style ecosystem; model support follows project tooling	Library and example server/UI integrations rather than a full managed platform	EXL2 quantization, CUDA-focused kernels and consumer GPU memory/performance optimizations	text-generation-webui, local scripts, model quantization workflows and Python tooling	Local Windows/Linux GPU workstations or self-hosted GPU machines	Local execution keeps prompts/models on the machine unless integrated with external services	No SaaS governance; local filesystem/GPU access controls apply	Power users serving quantized models on consumer NVIDIA GPUs	Narrower deployment surface than vLLM/TGI; mainly local enthusiast/workstation use
LitServe OSS No tagline	LLM Inference	Lightweight serving framework	LitServe	Apache-2.0 / open source	$0 software; cloud/compute costs separate	Open-source serving framework from Lightning AI ecosystem	✓	Docs list features for generative AI serving, including streaming and LLM serving patterns	Any AI model served from Python, including LLMs through custom engines or LitGPT-related paths	FastAPI-style server, streaming responses, batching and self-hosting options	Framework-level batching/streaming; model acceleration depends selected runtime/backend	PyTorch, Lightning AI, LitGPT, Docker/cloud deployment and Python ML apps	Self-hosted machine/server or Lightning Studios/cloud workflows	Self-hosting keeps data in selected environment; cloud features follow Lightning terms	No SaaS governance in OSS; cloud workspace controls if using Lightning platform	Small teams needing simple production APIs around custom AI/LLM models	Less specialized than vLLM/SGLang for high-throughput LLM-only serving
Triton Inference Server OSS No tagline	LLM Inference	NVIDIA model serving server	NVIDIA Triton Inference Server	BSD-3-Clause / open source	$0 software; GPU/hosting costs separate	Self-hosted inference server for multiple frameworks	✓	Official docs cover Triton Inference Server for serving models across multiple backends and deployment targets	Broad ML models and LLM backends through TensorRT-LLM, Python, ONNX Runtime and other Triton backends	HTTP/gRPC inference APIs, model repository, dynamic batching, ensembles and Kubernetes deployment patterns	Dynamic batching, concurrent execution, TensorRT/TensorRT-LLM acceleration and GPU metrics integration	NVIDIA GPUs, TensorRT-LLM, ONNX Runtime, Kubernetes, Prometheus, cloud GPU platforms	Self-hosted GPU servers, Kubernetes or NVIDIA/cloud environments	Data remains in chosen server/cluster; enterprise hardening depends deployment	No SaaS governance by itself; platform governance through Kubernetes/cloud/NVIDIA stack	Organizations already standardizing on NVIDIA serving infrastructure across model types	More general-purpose than LLM-specialized; LLM performance depends backend configuration
Baseten Basic No tagline	LLM Inference	Managed model serving platform	Baseten	Hosted managed platform	$0/month base, pay as you go	Pay-as-you-go dedicated deployments and Model APIs; Pro/Enterprise quoted/volume	No durable free credits captured on pricing page	Pricing page lists Basic at $0/month pay-as-you-go, dedicated deployments, model APIs and training	Custom, fine-tuned and open-source models; Model APIs include pre-optimized models with per-token prices	Dedicated deployments, autoscaling, model APIs and deployment options for Baseten/VPC/hybrid tiers	Baseten Inference Stack, optimized model APIs, fast cold starts and dedicated compute options	Truss/Baseten docs, custom containers, open-source models, fine-tuned models and cloud/VPC deployments	Hosted Baseten, customer VPC or hybrid depending plan	Pricing page lists SOC 2 Type II and HIPAA compliant; Enterprise adds data residency and advanced security/compliance	Basic support; Enterprise includes advanced RBAC with Teams and custom SLAs	Production teams that want managed custom/open-model inference without building platform ops	Detailed GPU pricing may require current docs/quote; Enterprise/VPC features are not self-serve Basic
mistral.rs OSS No tagline	LLM Inference	Rust inference server	mistral.rs	MIT / open source	$0 software; local/cloud hardware costs separate	Local/self-hosted Rust LLM runtime and server	✓	GitHub describes mistral.rs as fast, flexible LLM inference with OpenAI and Anthropic compatible serving	Hugging Face models, GGUF/UQFF quantized models, multimodal models and embeddings depending release	Single binary CLI/server with OpenAI-compatible /v1 endpoints, Anthropic-compatible messages and built-in web UI	Hardware-aware tuning, quantization, CUDA performance paths, paged kernels and metrics	Rust/Python SDKs, OpenAI-compatible clients, local model files and Hugging Face models	Local desktop/server, containers or cloud GPU machines	Can run locally; server auth/network hardening is operator responsibility	No SaaS governance; local/server controls apply	Developers wanting a compact all-in-one local server with modern compatibility endpoints	Newer ecosystem than llama.cpp/vLLM; validate stability and model support before production
SkyServe OSS No tagline	LLM Inference	Cloud/multicloud serving orchestrator	SkyPilot SkyServe	Apache-2.0 / open source	$0 software; cloud/GPU costs separate	Open-source multicloud serving library for model endpoints	✓	Docs describe SkyServe as SkyPilot's model serving library with examples for vLLM and TGI	Model servers such as vLLM, TGI or custom containers running across cloud/GPU providers	Managed replicas, autoscaling, load balancing and endpoint management across clouds	Infrastructure cost/performance optimization through cloud/region/GPU placement; model acceleration comes from chosen server	AWS, GCP, Azure, Kubernetes, Lambda Labs, RunPod, vLLM, TGI and custom Docker images depending setup	Self-managed across cloud accounts, Kubernetes and GPU providers	Data and credentials flow through configured clouds; security depends account/IAM/network setup	Cloud IAM/project governance; no hosted SaaS team layer in OSS	Teams wanting portable LLM serving across clouds and GPU availability pools	Operational complexity and cloud quota management remain with the operator
Modal Starter No tagline	LLM Inference	Serverless GPU platform	Modal	Hosted serverless platform	$0 base plan plus compute; Starter includes $30/month free credits	Usage-based serverless compute billed per second/resource; paid Team plan is $250/month plus compute	Yes, $30/month free compute credits on Starter	Pricing page lists GPU, CPU, memory and volume rates plus Starter/Team/Enterprise plan limits	Run custom LLM servers such as vLLM/TGI or arbitrary Python inference apps	Autoscaling functions, web endpoints, containers, GPU concurrency and scheduled jobs	Acceleration depends chosen GPU/runtime; Modal provides serverless scaling and fast cold-start platform features	Python, containers, custom images, secrets, web functions, vLLM examples and cloud storage patterns	Hosted Modal serverless cloud	SOC 2 listed; HIPAA compatibility, audit logs, RBAC and SSO are higher-tier features on pricing page	Starter includes 3 seats; Team has unlimited seats, higher concurrency and more governance features	Developers shipping bursty LLM inference endpoints without managing Kubernetes	Costs can rise with high GPU seconds; some governance/features require paid Team/Enterprise
BentoML OSS No tagline	LLM Inference	Model serving framework	BentoML	Apache-2.0 / open source	$0 software; cloud/compute costs separate	Open-source model-serving framework with optional BentoCloud	✓	Docs position BentoML as a framework for building, packaging and deploying AI applications and model services	Any Python model/service, including LLM inference services with custom runners/backends	HTTP APIs, service packaging, containerization and deployment workflows	Depends on chosen backend such as vLLM/TGI/transformers; BentoML focuses on packaging and serving architecture	Python, Docker, Kubernetes, cloud deployment, model stores and BentoCloud	Self-hosted containers/Kubernetes or managed BentoCloud	Self-hosting keeps data under operator control; managed cloud follows BentoML cloud terms	Team governance depends self-hosted infra or BentoCloud org controls	Packaging custom LLM inference services with production API boundaries	Not a specialized LLM kernel engine; performance depends on integrated backend
Xinference OSS No tagline	LLM Inference	Local distributed inference platform	Xinference	Apache-2.0 / open source	$0 software; local/cloud hardware costs separate	Self-hosted model-serving platform	✓	Docs present Xinference as a versatile library to serve language, speech recognition and multimodal models	LLMs, embeddings, rerankers, image/audio/speech and multimodal model families depending backend	RESTful/API server, web UI, workers and distributed local/cluster serving	Backend choices can include llama.cpp, vLLM, transformers and other runtime paths with quantized model support	LangChain, LlamaIndex, OpenAI-compatible clients, local model ecosystems and cluster workers	Local workstation, private server, Docker, Kubernetes or cloud GPU nodes	Can run on local/private infrastructure; privacy depends chosen models/backends	No hosted team governance by default; cluster/user controls are operator-managed	Unified self-hosted model serving across LLM, embedding and multimodal workloads	Broader surface can mean more dependency/backend management than a single-purpose engine
MLC LLM OSS No tagline	LLM Inference	Portable compiler/runtime	MLC LLM	Apache-2.0 / open source	$0 software; device/hosting costs separate	Compiler/runtime for deploying LLMs across devices	✓	Official docs present MLC LLM for compiling and deploying LLMs across platforms	Open LLMs compiled for CPUs, GPUs, mobile and browser/WebGPU-style targets depending support	Local runtimes, CLI/app deployment and device-specific serving patterns	TVM-based compilation, quantization and hardware-specific code generation	Python, C++, JavaScript/WebGPU, mobile apps and local deployment workflows	Local desktop, mobile, browser, edge and self-hosted devices	Can run on-device; privacy depends app wrapper and model distribution	No SaaS governance by default	Cross-platform/on-device LLM deployment where portability matters more than managed serving	Model compilation and runtime support can be complex; not a turnkey cloud autoscaler
OpenVINO GenAI No tagline	LLM Inference	Intel runtime/library	OpenVINO GenAI	Apache-2.0 / open source	$0 software; Intel/client hardware costs separate	Runtime library and pipelines for generative AI inference	✓	Official docs say OpenVINO GenAI extends OpenVINO Runtime with pipelines and methods for generative AI models	LLMs, VLMs and text-to-image models supported by OpenVINO GenAI pipelines and model conversion flow	Application-level pipelines; model server/serving can be built around OpenVINO Runtime/Model Server	CPU, Intel GPU/NPU paths, compilation cache, speculative decoding and OpenVINO model optimization	OpenVINO Runtime, OpenVINO Model Server, notebooks, C++/Python/JS samples and Intel hardware stack	Local PC, edge, server, Intel GPU/NPU/CPU environments and containers	On-device/private inference possible when models run locally	No SaaS governance by default; enterprise controls through deployment environment	Intel hardware users optimizing local/edge generative inference	Best fit is Intel/OpenVINO stack; model conversion and accelerator support must be tested
NVIDIA NIM LLM No tagline	LLM Inference	Packaged NVIDIA microservice	NVIDIA NIM for LLMs	Commercial NVIDIA software / containers	Contact sales or NVIDIA entitlement; infrastructure costs separate	Prebuilt inference microservices and containers for NVIDIA GPU environments	Some trial/catalog access may be available; durable free tier not captured	NVIDIA docs describe NIM microservices for optimized model inference with OpenAI-compatible APIs for supported LLMs	NVIDIA-optimized LLM containers and supported model profiles from NVIDIA catalog	OpenAI-compatible endpoints, containerized deployment and enterprise operations patterns	TensorRT-LLM optimization, prebuilt model profiles, GPU acceleration and NVIDIA runtime stack	NVIDIA AI Enterprise, Kubernetes, cloud marketplaces, NGC catalog, Triton/TensorRT ecosystem	Self-hosted enterprise GPU clusters, cloud GPUs and NVIDIA-supported environments	Can be deployed inside customer infrastructure; enterprise controls depend NVIDIA/cloud setup	Enterprise licensing, support, RBAC/IAM through platform and NVIDIA subscriptions	Enterprises wanting supported, packaged NVIDIA LLM inference instead of assembling OSS runtime pieces	Licensing/entitlement and supported model matrix must be checked before procurement
llama-cpp-python OSS No tagline	LLM Inference	Python binding and server	llama-cpp-python	MIT / open source	$0 software; local hardware costs separate	Python package and optional OpenAI-compatible web server	✓	GitHub README describes Python bindings for llama.cpp with high-level API and OpenAI-compatible web server	GGUF/local llama.cpp-supported models; text, function calling and vision paths depending build/model	Python API plus OpenAI-compatible local server for app integration	Inherits llama.cpp backends including CPU, CUDA, Metal, ROCm, Vulkan, SYCL and quantized GGUF support	LangChain, LlamaIndex, Python apps, local Copilot replacements and llama.cpp ecosystem	Local Python app/server, containers or workstation deployments	Can run fully local; Python app/server network exposure must be controlled by operator	No SaaS governance; package/server controls are local	Python developers embedding local LLM inference directly into apps	Install/build complexity and wheel availability vary by platform/backend
ONNX Runtime GenAI No tagline	LLM Inference	Runtime SDK	ONNX Runtime GenAI	MIT / open source	$0 software; hardware/hosting costs separate	Runtime API for generative AI with ONNX models	✓	Docs describe a Generate API preview for generative AI scenarios on ONNX Runtime	ONNX-converted LLMs/VLMs and generative models supported by ONNX Runtime GenAI APIs	C/C++/C#/Python style runtime APIs; app developers build serving around the runtime	ONNX graph/runtime optimizations, CPU/GPU/accelerator execution providers and model quantization workflows	ONNX, Windows/DirectML, Azure/edge workflows and application runtimes	Embedded apps, local servers, edge devices or cloud VMs	Data stays in application environment; privacy depends host app/deployment	No SaaS governance; controlled by app/cloud environment	Teams standardizing on ONNX for portable local or edge generative inference	Preview/API maturity and model conversion friction need validation
KServe OSS No tagline	LLM Inference	Kubernetes inference platform	KServe	Apache-2.0 / open source	$0 software; Kubernetes/cloud costs separate	Kubernetes-native inference platform	✓	Official site describes KServe as a standardized distributed generative and predictive AI inference platform for Kubernetes	Predictive ML plus generative inference; LLM paths include vLLM and llm-d optimized backends	Kubernetes CRDs, InferenceService, OpenAI-compatible protocol, traffic routing and autoscaling	Backend-driven acceleration through vLLM/llm-d/Hugging Face servers; KServe adds routing/serverless platform layer	Kubernetes, Knative, Istio/Envoy, vLLM, Hugging Face runtime, Kubeflow ecosystem	Self-hosted or managed Kubernetes clusters	Data stays in cluster/cloud account; security controlled by Kubernetes/IAM/network policy	Kubernetes RBAC, namespaces, service accounts and platform governance	Platform teams standardizing model serving on Kubernetes	Operationally heavy for small teams; backend/model server still determines LLM speed
RunPod Serverless vLLM No tagline	LLM Inference	Serverless GPU platform	RunPod Serverless	Hosted GPU/serverless platform	No fixed monthly fee captured; GPU workers billed by usage per RunPod pricing	Usage-based GPU pods/serverless workers with per-GPU pricing	No durable free tier captured	Docs include a vLLM serverless worker guide; pricing page lists per-second/hour GPU rates by hardware	Custom containers and vLLM workers for open LLMs; model support follows container/runtime	Serverless endpoints, worker templates, autoscaling and request queue patterns	Acceleration depends selected GPU and runtime such as vLLM; platform provides serverless GPU orchestration	Docker, vLLM, Hugging Face models, RunPod templates, cloud storage and serverless APIs	Hosted RunPod GPU cloud/serverless	Data path is hosted RunPod infrastructure; security controls depend account/network configuration	Account/team controls and enterprise options should be checked for production governance	Cost-sensitive GPU inference where serverless workers fit bursty workloads	Cold starts, queueing and GPU availability can affect latency; costs depend hardware/runtime choices