LLM Inference
Tool | Category | Segment | Platform / Tool | Plan / License | Monthly Price USD | Pricing Model | Free Tier / OSS | Included Usage / Limits | Model / Runtime Support | Serving API / Scaling | Acceleration / Quantization | Integrations / Frameworks | Deployment / Hosting | Security / Privacy | Team / Governance | Best Fit | Main Limits / Caveats |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
No tagline | LLM Inference | Open-source serving engine | vLLM | Apache-2.0 / open source | $0 software; GPU/hosting costs separate | Self-hosted inference engine; optional managed hosting through third parties | ✓ | Official docs cover offline inference, OpenAI-compatible online serving, distributed deployment, Kubernetes/Docker and integrations | Generative, pooling, embedding, scoring, reward, multimodal and selected speech-to-text model paths through supported HF-style models | OpenAI-compatible server, batch/offline inference, Ray serving examples, Kubernetes and Helm deployment paths | PagedAttention lineage, prefix caching, speculative decoding, quantization backends, tensor/data/expert parallel and disaggregated serving features | Hugging Face models, LangChain, LlamaIndex, Ray Serve, Kubernetes, Docker, Prometheus/Grafana and many app frameworks | Self-hosted local/GPU server, Kubernetes, cloud GPU VMs or managed partners | Data stays in chosen infrastructure; usage stats and security docs should be reviewed before regulated deployment | No SaaS team layer in OSS; ops governance through Kubernetes/cloud/IAM | High-throughput production serving for open-weight LLMs and VLMs | Fast-moving engine; model, hardware and quantization support must be validated per release |
No tagline | LLM Inference | Local model runner | Ollama | Open source core / local app | $0 software; optional third-party hosting/model costs separate | Local model runner with model library and local API | ✓ | Official site presents Ollama as a way to get up and running with large language models locally | Model library includes many open LLM families packaged for local use; hardware limits determine practical model size | Local CLI and HTTP API; used by many desktop apps, coding agents and RAG tools | Quantized local model distribution, llama.cpp-backed runtime lineage and model pull/run workflow | Open WebUI, Continue, Cline, Aider, LangChain, LlamaIndex, desktop apps and local RAG stacks | Local macOS/Windows/Linux machines, workstations or self-hosted servers | Prompts can stay local when using local models and local clients | No hosted team governance in core local workflow | Developers wanting a simple local model runner without manual GGUF/server setup | Not a managed production autoscaling platform; model/version and hardware fit need testing |
No tagline | LLM Inference | Open-source serving engine | SGLang | Apache-2.0 / open source | $0 software; GPU/hosting costs separate | Self-hosted serving framework and runtime | ✓ | Docs position SGLang as a fast serving framework for large language models and vision-language models | LLMs and VLMs with structured generation, tool use and reasoning-oriented serving features depending model/backend | Runtime/server for OpenAI-compatible APIs, data-parallel serving and production deployment examples | RadixAttention-style KV reuse, batching, constrained decoding, speculative decoding and multi-node/distributed serving features | Python, Hugging Face models, OpenAI-compatible clients, Kubernetes/cloud GPU deployment recipes | Self-hosted GPU servers, containers, Kubernetes or cloud GPU infrastructure | Data path controlled by self-hosted deployment and selected model weights | No SaaS governance by default; cluster/cloud controls apply | Teams optimizing LLM/VLM serving latency and structured generation workloads | Feature maturity and model compatibility move quickly; requires benchmark validation |
No tagline | LLM Inference | NVIDIA optimization library | NVIDIA TensorRT-LLM | Apache-2.0 / open source | $0 software; NVIDIA GPU/hosting costs separate | Self-hosted optimization/runtime library | ✓ | Official docs describe TensorRT-LLM as tooling to build and run optimized LLM inference on NVIDIA GPUs | NVIDIA GPU-targeted LLM and VLM model families depending release and engine build support | Runtime libraries and deployment examples; often paired with Triton or NIM for serving | TensorRT engines, quantization, kernel fusion, paged KV cache, inflight batching and multi-GPU parallelism | NVIDIA GPUs, Triton Inference Server, NIM, NeMo, Kubernetes and cloud GPU stacks | Self-hosted NVIDIA GPU servers, containers or managed NVIDIA ecosystem platforms | Data stays in chosen GPU environment; enterprise security depends deployment stack | Governance through NVIDIA/cloud/Kubernetes controls rather than a standalone SaaS layer | Maximizing performance on NVIDIA GPUs for production LLM inference | Hardware-specific and build-heavy; less portable than CPU/GPU-neutral runtimes |
No tagline | LLM Inference | Open-source deployment toolkit | LMDeploy | Apache-2.0 / open source | $0 software; GPU/hosting costs separate | Self-hosted toolkit for compression, deployment and serving | ✓ | Docs describe LMDeploy as a toolkit for compressing, deploying and serving LLMs and VLMs | LLMs and vision-language models, especially InternLM/OpenMMLab and HF-compatible model families | TurboMind/PyTorch engines, serving APIs and deployment tutorials | KV cache management, quantization, tensor parallelism and inference acceleration paths | Hugging Face, OpenMMLab/InternLM ecosystem, Docker, Kubernetes and OpenAI-compatible clients | Self-hosted GPU servers, containers and cloud GPU infrastructure | Data stays in selected infrastructure; model/license governance remains with operator | No SaaS governance by default; infra controls apply | Teams in the InternLM/OpenMMLab ecosystem needing efficient LLM/VLM serving | Model coverage and best performance may be ecosystem-specific; benchmark before standardizing |
No tagline | LLM Inference | Open-source serving engine | Hugging Face Text Generation Inference | Apache-2.0 / open source | $0 software; hardware or hosted endpoint costs separate | Self-hosted container/server; also used in Hugging Face managed infrastructure | ✓ | Docs describe TGI as a toolkit for deploying and serving large language models | Transformer text-generation models from Hugging Face ecosystem; model support follows TGI backend/version | HTTP server for text generation with production serving features and containerized deployment | Continuous batching, tensor parallelism, quantization integrations and optimized transformer serving features | Hugging Face Hub, Inference Endpoints, Docker, Kubernetes and provider integrations | Self-hosted GPU infrastructure or Hugging Face managed endpoints | Self-hosting keeps data in your infrastructure; managed endpoints follow Hugging Face terms and region controls | HF org/repo governance if using Hub or managed endpoints; otherwise infra controls | Production text-generation serving close to the Hugging Face model ecosystem | Primarily text-generation oriented; compare against vLLM/SGLang for newest model/runtime features |
No tagline | LLM Inference | Distributed serving framework | Ray Serve | Apache-2.0 / open source | $0 software; cluster/cloud costs separate | Self-hosted scalable serving framework | ✓ | Docs describe Ray Serve as scalable and programmable serving, including LLM serving paths in the Ray ecosystem | Any Python model/service, with LLM serving examples and integrations for vLLM/backends | Autoscaling deployments, replicas, request routing and Python-native service composition | Cluster autoscaling, batching and backend integration rather than model-kernel optimization by itself | Ray, vLLM, Kubernetes, Anyscale, FastAPI-style Python services and distributed apps | Self-hosted Ray clusters, Kubernetes or managed Anyscale/cloud deployments | Data stays in selected Ray/cloud deployment; security depends cluster/IAM setup | Cluster-level governance; managed Anyscale adds org/security features if used | Teams composing LLM inference with broader distributed Python services | Not a low-level LLM engine alone; requires backend runtime and Ray operations expertise |
No tagline | LLM Inference | Local GPU inference library | ExLlamaV2 | MIT / open source | $0 software; local GPU costs separate | Local inference library focused on quantized models | ✓ | GitHub describes ExLlamaV2 as a fast inference library for running LLMs locally on modern consumer-class GPUs | Quantized local LLMs in EXL2/GPTQ-style ecosystem; model support follows project tooling | Library and example server/UI integrations rather than a full managed platform | EXL2 quantization, CUDA-focused kernels and consumer GPU memory/performance optimizations | text-generation-webui, local scripts, model quantization workflows and Python tooling | Local Windows/Linux GPU workstations or self-hosted GPU machines | Local execution keeps prompts/models on the machine unless integrated with external services | No SaaS governance; local filesystem/GPU access controls apply | Power users serving quantized models on consumer NVIDIA GPUs | Narrower deployment surface than vLLM/TGI; mainly local enthusiast/workstation use |
No tagline | LLM Inference | Lightweight serving framework | LitServe | Apache-2.0 / open source | $0 software; cloud/compute costs separate | Open-source serving framework from Lightning AI ecosystem | ✓ | Docs list features for generative AI serving, including streaming and LLM serving patterns | Any AI model served from Python, including LLMs through custom engines or LitGPT-related paths | FastAPI-style server, streaming responses, batching and self-hosting options | Framework-level batching/streaming; model acceleration depends selected runtime/backend | PyTorch, Lightning AI, LitGPT, Docker/cloud deployment and Python ML apps | Self-hosted machine/server or Lightning Studios/cloud workflows | Self-hosting keeps data in selected environment; cloud features follow Lightning terms | No SaaS governance in OSS; cloud workspace controls if using Lightning platform | Small teams needing simple production APIs around custom AI/LLM models | Less specialized than vLLM/SGLang for high-throughput LLM-only serving |
No tagline | LLM Inference | NVIDIA model serving server | NVIDIA Triton Inference Server | BSD-3-Clause / open source | $0 software; GPU/hosting costs separate | Self-hosted inference server for multiple frameworks | ✓ | Official docs cover Triton Inference Server for serving models across multiple backends and deployment targets | Broad ML models and LLM backends through TensorRT-LLM, Python, ONNX Runtime and other Triton backends | HTTP/gRPC inference APIs, model repository, dynamic batching, ensembles and Kubernetes deployment patterns | Dynamic batching, concurrent execution, TensorRT/TensorRT-LLM acceleration and GPU metrics integration | NVIDIA GPUs, TensorRT-LLM, ONNX Runtime, Kubernetes, Prometheus, cloud GPU platforms | Self-hosted GPU servers, Kubernetes or NVIDIA/cloud environments | Data remains in chosen server/cluster; enterprise hardening depends deployment | No SaaS governance by itself; platform governance through Kubernetes/cloud/NVIDIA stack | Organizations already standardizing on NVIDIA serving infrastructure across model types | More general-purpose than LLM-specialized; LLM performance depends backend configuration |
No tagline | LLM Inference | Managed model serving platform | Baseten | Hosted managed platform | $0/month base, pay as you go | Pay-as-you-go dedicated deployments and Model APIs; Pro/Enterprise quoted/volume | No durable free credits captured on pricing page | Pricing page lists Basic at $0/month pay-as-you-go, dedicated deployments, model APIs and training | Custom, fine-tuned and open-source models; Model APIs include pre-optimized models with per-token prices | Dedicated deployments, autoscaling, model APIs and deployment options for Baseten/VPC/hybrid tiers | Baseten Inference Stack, optimized model APIs, fast cold starts and dedicated compute options | Truss/Baseten docs, custom containers, open-source models, fine-tuned models and cloud/VPC deployments | Hosted Baseten, customer VPC or hybrid depending plan | Pricing page lists SOC 2 Type II and HIPAA compliant; Enterprise adds data residency and advanced security/compliance | Basic support; Enterprise includes advanced RBAC with Teams and custom SLAs | Production teams that want managed custom/open-model inference without building platform ops | Detailed GPU pricing may require current docs/quote; Enterprise/VPC features are not self-serve Basic |
No tagline | LLM Inference | Rust inference server | mistral.rs | MIT / open source | $0 software; local/cloud hardware costs separate | Local/self-hosted Rust LLM runtime and server | ✓ | GitHub describes mistral.rs as fast, flexible LLM inference with OpenAI and Anthropic compatible serving | Hugging Face models, GGUF/UQFF quantized models, multimodal models and embeddings depending release | Single binary CLI/server with OpenAI-compatible /v1 endpoints, Anthropic-compatible messages and built-in web UI | Hardware-aware tuning, quantization, CUDA performance paths, paged kernels and metrics | Rust/Python SDKs, OpenAI-compatible clients, local model files and Hugging Face models | Local desktop/server, containers or cloud GPU machines | Can run locally; server auth/network hardening is operator responsibility | No SaaS governance; local/server controls apply | Developers wanting a compact all-in-one local server with modern compatibility endpoints | Newer ecosystem than llama.cpp/vLLM; validate stability and model support before production |
No tagline | LLM Inference | Cloud/multicloud serving orchestrator | SkyPilot SkyServe | Apache-2.0 / open source | $0 software; cloud/GPU costs separate | Open-source multicloud serving library for model endpoints | ✓ | Docs describe SkyServe as SkyPilot's model serving library with examples for vLLM and TGI | Model servers such as vLLM, TGI or custom containers running across cloud/GPU providers | Managed replicas, autoscaling, load balancing and endpoint management across clouds | Infrastructure cost/performance optimization through cloud/region/GPU placement; model acceleration comes from chosen server | AWS, GCP, Azure, Kubernetes, Lambda Labs, RunPod, vLLM, TGI and custom Docker images depending setup | Self-managed across cloud accounts, Kubernetes and GPU providers | Data and credentials flow through configured clouds; security depends account/IAM/network setup | Cloud IAM/project governance; no hosted SaaS team layer in OSS | Teams wanting portable LLM serving across clouds and GPU availability pools | Operational complexity and cloud quota management remain with the operator |
No tagline | LLM Inference | Serverless GPU platform | Modal | Hosted serverless platform | $0 base plan plus compute; Starter includes $30/month free credits | Usage-based serverless compute billed per second/resource; paid Team plan is $250/month plus compute | Yes, $30/month free compute credits on Starter | Pricing page lists GPU, CPU, memory and volume rates plus Starter/Team/Enterprise plan limits | Run custom LLM servers such as vLLM/TGI or arbitrary Python inference apps | Autoscaling functions, web endpoints, containers, GPU concurrency and scheduled jobs | Acceleration depends chosen GPU/runtime; Modal provides serverless scaling and fast cold-start platform features | Python, containers, custom images, secrets, web functions, vLLM examples and cloud storage patterns | Hosted Modal serverless cloud | SOC 2 listed; HIPAA compatibility, audit logs, RBAC and SSO are higher-tier features on pricing page | Starter includes 3 seats; Team has unlimited seats, higher concurrency and more governance features | Developers shipping bursty LLM inference endpoints without managing Kubernetes | Costs can rise with high GPU seconds; some governance/features require paid Team/Enterprise |
No tagline | LLM Inference | Model serving framework | BentoML | Apache-2.0 / open source | $0 software; cloud/compute costs separate | Open-source model-serving framework with optional BentoCloud | ✓ | Docs position BentoML as a framework for building, packaging and deploying AI applications and model services | Any Python model/service, including LLM inference services with custom runners/backends | HTTP APIs, service packaging, containerization and deployment workflows | Depends on chosen backend such as vLLM/TGI/transformers; BentoML focuses on packaging and serving architecture | Python, Docker, Kubernetes, cloud deployment, model stores and BentoCloud | Self-hosted containers/Kubernetes or managed BentoCloud | Self-hosting keeps data under operator control; managed cloud follows BentoML cloud terms | Team governance depends self-hosted infra or BentoCloud org controls | Packaging custom LLM inference services with production API boundaries | Not a specialized LLM kernel engine; performance depends on integrated backend |
No tagline | LLM Inference | Local distributed inference platform | Xinference | Apache-2.0 / open source | $0 software; local/cloud hardware costs separate | Self-hosted model-serving platform | ✓ | Docs present Xinference as a versatile library to serve language, speech recognition and multimodal models | LLMs, embeddings, rerankers, image/audio/speech and multimodal model families depending backend | RESTful/API server, web UI, workers and distributed local/cluster serving | Backend choices can include llama.cpp, vLLM, transformers and other runtime paths with quantized model support | LangChain, LlamaIndex, OpenAI-compatible clients, local model ecosystems and cluster workers | Local workstation, private server, Docker, Kubernetes or cloud GPU nodes | Can run on local/private infrastructure; privacy depends chosen models/backends | No hosted team governance by default; cluster/user controls are operator-managed | Unified self-hosted model serving across LLM, embedding and multimodal workloads | Broader surface can mean more dependency/backend management than a single-purpose engine |
No tagline | LLM Inference | Portable compiler/runtime | MLC LLM | Apache-2.0 / open source | $0 software; device/hosting costs separate | Compiler/runtime for deploying LLMs across devices | ✓ | Official docs present MLC LLM for compiling and deploying LLMs across platforms | Open LLMs compiled for CPUs, GPUs, mobile and browser/WebGPU-style targets depending support | Local runtimes, CLI/app deployment and device-specific serving patterns | TVM-based compilation, quantization and hardware-specific code generation | Python, C++, JavaScript/WebGPU, mobile apps and local deployment workflows | Local desktop, mobile, browser, edge and self-hosted devices | Can run on-device; privacy depends app wrapper and model distribution | No SaaS governance by default | Cross-platform/on-device LLM deployment where portability matters more than managed serving | Model compilation and runtime support can be complex; not a turnkey cloud autoscaler |
No tagline | LLM Inference | Intel runtime/library | OpenVINO GenAI | Apache-2.0 / open source | $0 software; Intel/client hardware costs separate | Runtime library and pipelines for generative AI inference | ✓ | Official docs say OpenVINO GenAI extends OpenVINO Runtime with pipelines and methods for generative AI models | LLMs, VLMs and text-to-image models supported by OpenVINO GenAI pipelines and model conversion flow | Application-level pipelines; model server/serving can be built around OpenVINO Runtime/Model Server | CPU, Intel GPU/NPU paths, compilation cache, speculative decoding and OpenVINO model optimization | OpenVINO Runtime, OpenVINO Model Server, notebooks, C++/Python/JS samples and Intel hardware stack | Local PC, edge, server, Intel GPU/NPU/CPU environments and containers | On-device/private inference possible when models run locally | No SaaS governance by default; enterprise controls through deployment environment | Intel hardware users optimizing local/edge generative inference | Best fit is Intel/OpenVINO stack; model conversion and accelerator support must be tested |
No tagline | LLM Inference | Packaged NVIDIA microservice | NVIDIA NIM for LLMs | Commercial NVIDIA software / containers | Contact sales or NVIDIA entitlement; infrastructure costs separate | Prebuilt inference microservices and containers for NVIDIA GPU environments | Some trial/catalog access may be available; durable free tier not captured | NVIDIA docs describe NIM microservices for optimized model inference with OpenAI-compatible APIs for supported LLMs | NVIDIA-optimized LLM containers and supported model profiles from NVIDIA catalog | OpenAI-compatible endpoints, containerized deployment and enterprise operations patterns | TensorRT-LLM optimization, prebuilt model profiles, GPU acceleration and NVIDIA runtime stack | NVIDIA AI Enterprise, Kubernetes, cloud marketplaces, NGC catalog, Triton/TensorRT ecosystem | Self-hosted enterprise GPU clusters, cloud GPUs and NVIDIA-supported environments | Can be deployed inside customer infrastructure; enterprise controls depend NVIDIA/cloud setup | Enterprise licensing, support, RBAC/IAM through platform and NVIDIA subscriptions | Enterprises wanting supported, packaged NVIDIA LLM inference instead of assembling OSS runtime pieces | Licensing/entitlement and supported model matrix must be checked before procurement |
No tagline | LLM Inference | Python binding and server | llama-cpp-python | MIT / open source | $0 software; local hardware costs separate | Python package and optional OpenAI-compatible web server | ✓ | GitHub README describes Python bindings for llama.cpp with high-level API and OpenAI-compatible web server | GGUF/local llama.cpp-supported models; text, function calling and vision paths depending build/model | Python API plus OpenAI-compatible local server for app integration | Inherits llama.cpp backends including CPU, CUDA, Metal, ROCm, Vulkan, SYCL and quantized GGUF support | LangChain, LlamaIndex, Python apps, local Copilot replacements and llama.cpp ecosystem | Local Python app/server, containers or workstation deployments | Can run fully local; Python app/server network exposure must be controlled by operator | No SaaS governance; package/server controls are local | Python developers embedding local LLM inference directly into apps | Install/build complexity and wheel availability vary by platform/backend |
No tagline | LLM Inference | Runtime SDK | ONNX Runtime GenAI | MIT / open source | $0 software; hardware/hosting costs separate | Runtime API for generative AI with ONNX models | ✓ | Docs describe a Generate API preview for generative AI scenarios on ONNX Runtime | ONNX-converted LLMs/VLMs and generative models supported by ONNX Runtime GenAI APIs | C/C++/C#/Python style runtime APIs; app developers build serving around the runtime | ONNX graph/runtime optimizations, CPU/GPU/accelerator execution providers and model quantization workflows | ONNX, Windows/DirectML, Azure/edge workflows and application runtimes | Embedded apps, local servers, edge devices or cloud VMs | Data stays in application environment; privacy depends host app/deployment | No SaaS governance; controlled by app/cloud environment | Teams standardizing on ONNX for portable local or edge generative inference | Preview/API maturity and model conversion friction need validation |
No tagline | LLM Inference | Kubernetes inference platform | KServe | Apache-2.0 / open source | $0 software; Kubernetes/cloud costs separate | Kubernetes-native inference platform | ✓ | Official site describes KServe as a standardized distributed generative and predictive AI inference platform for Kubernetes | Predictive ML plus generative inference; LLM paths include vLLM and llm-d optimized backends | Kubernetes CRDs, InferenceService, OpenAI-compatible protocol, traffic routing and autoscaling | Backend-driven acceleration through vLLM/llm-d/Hugging Face servers; KServe adds routing/serverless platform layer | Kubernetes, Knative, Istio/Envoy, vLLM, Hugging Face runtime, Kubeflow ecosystem | Self-hosted or managed Kubernetes clusters | Data stays in cluster/cloud account; security controlled by Kubernetes/IAM/network policy | Kubernetes RBAC, namespaces, service accounts and platform governance | Platform teams standardizing model serving on Kubernetes | Operationally heavy for small teams; backend/model server still determines LLM speed |
No tagline | LLM Inference | Serverless GPU platform | RunPod Serverless | Hosted GPU/serverless platform | No fixed monthly fee captured; GPU workers billed by usage per RunPod pricing | Usage-based GPU pods/serverless workers with per-GPU pricing | No durable free tier captured | Docs include a vLLM serverless worker guide; pricing page lists per-second/hour GPU rates by hardware | Custom containers and vLLM workers for open LLMs; model support follows container/runtime | Serverless endpoints, worker templates, autoscaling and request queue patterns | Acceleration depends selected GPU and runtime such as vLLM; platform provides serverless GPU orchestration | Docker, vLLM, Hugging Face models, RunPod templates, cloud storage and serverless APIs | Hosted RunPod GPU cloud/serverless | Data path is hosted RunPod infrastructure; security controls depend account/network configuration | Account/team controls and enterprise options should be checked for production governance | Cost-sensitive GPU inference where serverless workers fit bursty workloads | Cold starts, queueing and GPU availability can affect latency; costs depend hardware/runtime choices |