Learn how Vast Serverless operates and understand its major components.
The Vast.ai Serverless architecture is a multi-component system that manages GPU-backed workers to efficiently serve applications. It automatically scales up or down based on endpoint parameters, workergroup parameters, and measured load reported by workers.
An Endpoint is the highest-level construct in Vast Serverless. Endpoints are configured with endpoint-level parameters that control scaling behavior, capacity limits, and utilization targets. An endpoint consists of:
A named endpoint identifier
One or more Workergroups
Endpoint parameters such as max_workers, min_load, min_workers, cold_mult, min_cold_load, and target_util
Users typically create one endpoint per function (for example, text generation or image generation) and per environment (production, staging, development).
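To make these parameters concrete, the following is a minimal sketch of an endpoint configuration. The parameter names come from the list above; the dictionary layout, values, and comments are illustrative assumptions, not an exact SDK or API payload.

```python
# Illustrative sketch only: parameter names match the list above, but this
# dictionary shape is hypothetical, not the exact Vast Serverless API payload.
endpoint_config = {
    "endpoint_name": "text-generation-prod",  # one endpoint per function and environment
    "max_workers": 10,      # capacity limit: upper bound on workers for this endpoint
    "min_workers": 1,       # capacity floor: workers kept available even when idle
    "min_load": 1.0,        # scaling behavior: floor on the load estimate (interpretation approximate)
    "cold_mult": 2.0,       # scaling behavior: multiplier for cold capacity (interpretation approximate)
    "min_cold_load": 0.5,   # scaling behavior: minimum cold-capacity load (interpretation approximate)
    "target_util": 0.8,     # utilization target the engine scales toward
}
```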
A Workergroup defines how workers are recruited and created. Workergroups are configured with workergroup-level parameters and are responsible for selecting which GPU offers are eligible for worker creation. Each Workergroup includes:
A serverless-compatible template (referenced by template_id or template_hash)
Hardware and marketplace filters defined via search_params
Optional instance configuration overrides via launch_args
Hardware requirements such as gpu_ram
A set of GPU instances (workers) created from the template
Multiple Workergroups can exist within a single Endpoint, each with different configurations. This enables advanced use cases such as hardware comparison, gradual model rollout, or mixed-model serving. For many applications, a single Workergroup is sufficient.
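As a concrete illustration, a Workergroup definition might bundle these pieces as follows. The field names mirror the concepts above (template_hash, search_params, launch_args, gpu_ram), but the exact shape, the filter syntax, and the values shown are assumptions rather than the documented API.

```python
# Illustrative sketch only: field names mirror the concepts above; the shape,
# filter syntax, and values are hypothetical, not the documented API payload.
workergroup_config = {
    "endpoint_name": "text-generation-prod",
    "template_hash": "abc123",                  # serverless-compatible template (placeholder value)
    "gpu_ram": 24,                              # hardware requirement, e.g. minimum GPU RAM in GB
    "search_params": "gpu_name=RTX_4090 num_gpus=1 verified=true",  # marketplace/hardware filters (illustrative syntax)
    "launch_args": "--env MODEL_NAME=my-model", # optional instance configuration overrides (illustrative)
}
```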
Workers are individual GPU instances created and managed by the Serverless Engine. Each worker runs a PyWorker, a Python web server that loads the machine learning model and serves requests. Workers can exist in active or inactive states and are responsible for:
Loading the machine learning model and serving inference requests
Reporting operational and performance metrics back to the Serverless Engine
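The sketch below shows, in broad strokes, what a worker-side web server does: load a model at startup and serve inference requests over HTTP. It is a conceptual stand-in, not the actual PyWorker framework; the route name, port, and the use of aiohttp are assumptions.

```python
# Conceptual stand-in for a worker-side server; the real PyWorker framework
# has its own structure, so treat all names here as hypothetical.
from aiohttp import web

model = None  # stands in for a loaded ML model

async def load_model(app):
    global model
    model = object()  # placeholder: load model weights onto the GPU here

async def generate(request):
    payload = await request.json()
    # Run inference with the loaded model; echoed here as a placeholder.
    return web.json_response({"output": f"result for {payload.get('prompt', '')}"})

app = web.Application()
app.on_startup.append(load_model)      # load the model once, when the worker starts
app.router.add_post("/generate", generate)

if __name__ == "__main__":
    web.run_app(app, port=5000)
```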
The Serverless Engine is the decision-making service that manages workers across all endpoints and workergroups. Using configuration parameters and real-time metrics, it determines when to:
Recruit new workers
Activate inactive workers
Release or destroy workers
The engine continuously evaluates cost-performance tradeoffs using automated performance testing and measured load.
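The engine's exact formulas are internal, but a simplified sketch of the kind of per-cycle decision it makes, combining measured load, per-worker performance, and target_util, might look like the following. All function and variable names, and the rounding logic, are assumptions for illustration only.

```python
import math

def plan_scaling(measured_load, perf_per_worker, active_workers,
                 target_util, min_workers, max_workers):
    """Hypothetical sketch: return an (action, count) plan so that active
    workers run near target_util; not the real engine's logic."""
    # Workers needed so utilization stays around target_util.
    needed = math.ceil(measured_load / (perf_per_worker * target_util))
    desired = max(min_workers, min(max_workers, needed))
    if desired > active_workers:
        return ("activate_or_recruit", desired - active_workers)
    if desired < active_workers:
        return ("release", active_workers - desired)
    return ("hold", 0)

# Example: 90 req/s of load, each worker handles ~20 req/s, 80% target utilization.
print(plan_scaling(90, 20, active_workers=3, target_util=0.8,
                   min_workers=1, max_workers=10))
# -> ("activate_or_recruit", 3), since ceil(90 / 16) = 6 workers are desired
```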
The Serverless SDK is the primary interface for interacting with Vast Serverless. It is a Python pip package that abstracts low-level details and manages:
Authentication
Routing requests to appropriate workers
Request queuing, retries, and error handling
Asynchronous request management
Worker status and lifecycle information
While CLI and API access are available, the SDK is the recommended method for most applications.
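Because the SDK's exact interface is not covered in this section, the snippet below uses a hypothetical stand-in client purely to illustrate the responsibilities listed above: authenticate once, submit a request asynchronously, and let the client handle routing, queuing, retries, and errors.

```python
# Hypothetical usage sketch: DummyClient is a stand-in, not the real SDK class,
# included only so the example runs and the division of responsibilities is clear.
import asyncio

class DummyClient:
    """Stand-in for the SDK client; returns canned data so this sketch runs."""
    def __init__(self, api_key: str):
        self.api_key = api_key  # authentication handled once, up front

    async def request(self, endpoint: str, payload: dict) -> dict:
        # A real client would route to a worker, queue and retry on failure,
        # and return the inference result.
        return {"endpoint": endpoint, "echo": payload}

async def main():
    client = DummyClient(api_key="YOUR_VAST_API_KEY")
    result = await client.request("text-generation-prod", {"prompt": "Hello"})
    print(result)

asyncio.run(main())
```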
The client application sends a request using the Serverless SDK.
The Serverless system routes the request and returns a suitable worker address based on current load and capacity.
The client sends the request directly to the selected worker’s API endpoint, including the required authentication data.
The PyWorker running on the GPU instance forwards the request to the machine learning model and performs inference.
The inference result is returned to the client.
Independently and continuously, each PyWorker reports operational and performance metrics back to the Serverless Engine, which uses this data to make ongoing scaling decisions.
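Put together, the flow above could be sketched roughly as follows. The routing URL, payload keys, and worker route in this snippet are assumptions rather than the documented wire format; in practice the Serverless SDK performs these steps on the client's behalf.

```python
# End-to-end sketch of the request flow, written against assumed URLs and
# field names; not the documented wire format.
import requests

API_KEY = "YOUR_VAST_API_KEY"

# 1. Ask the Serverless system for a suitable worker (routing step).
route = requests.post(
    "https://run.vast.ai/route/",                      # assumed routing URL
    json={"endpoint": "text-generation-prod", "api_key": API_KEY},
    timeout=30,
).json()

worker_url = route["url"]        # assumed field: address of the selected worker
auth_data = route["auth_data"]   # assumed field: per-request authentication data

# 2. Call the selected worker's API endpoint directly, including the auth data.
result = requests.post(
    f"{worker_url}/generate",                          # assumed worker route
    json={"auth_data": auth_data, "payload": {"prompt": "Hello, world"}},
    timeout=120,
).json()

# 3. The PyWorker ran inference and returned the result.
print(result)
```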