Vast Serverless offers unmatched control over endpoint scaling behavior. The following parameters control the serverless engine and are configured at the endpoint level. Below is an explanation of what these values control and guidance on how to set them.

Max Workers (max_workers)

A hard upper limit on the total number of workers (active and inactive) that the endpoint can have at any given time. If not specified during endpoint creation, the default value is 16.

Minimum Load (min_load)

Vast Serverless uses load as a metric of the work performed by a worker, measured in performance (“perf”) per second. Load is an internally computed value derived from benchmark tests and is normalized across different work types (tokens for LLMs, images for image generation, etc.); it is used to make scaling and capacity decisions. During endpoint configuration, min_load sets the target minimum load for the endpoint, which determines the minimum number of active workers. This value can be edited on a live endpoint, and the serverless engine will adjust to match the new target.

Best practice for setting min_load

  • Start with min_load = 1 (the default), which guarantees at least one active worker
  • Run the benchmark test to determine measured performance
  • Update min_load using the following formula:
min_load = measured_performance × minimum_parallel_requests
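
To make the formula concrete, here is a minimal sketch of the calculation; the benchmark numbers are hypothetical placeholders, not real measurements.

  def compute_min_load(measured_performance, minimum_parallel_requests):
      # min_load = measured_performance * minimum_parallel_requests (formula above)
      return measured_performance * minimum_parallel_requests

  # Hypothetical benchmark result: ~75 perf/sec per request stream,
  # with capacity for at least 4 parallel requests desired at all times.
  print(compute_min_load(measured_performance=75.0, minimum_parallel_requests=4))  # 300.0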

Setting Minimum Inactive Workers

Because Vast Serverless can utilize multiple hardware types to achieve optimal cost efficiency, there are multiple methods for controlling the minimum number of inactive workers maintained by the serverless engine. For most applications, setting min_workers is sufficient—especially when endpoints target a single GPU type. For more advanced scaling behavior, cold_mult and min_cold_load provide finer-grained control. The serverless engine will maintain the largest inactive capacity specified by these three controls.

Minimum Inactive Workers (min_workers)

The minimum number of inactive workers (workers with the model loaded but not actively serving requests) that the serverless engine will maintain. If not specified during endpoint creation, the default value is 5.

Cold Multiplier (cold_mult)

While min_workers is fixed regardless of traffic patterns, cold_mult defines inactive capacity as a multiplier of the current active workload.

Example

For an active load of 100 and cold_mult = 2:
100 (active load) × 2 (cold_mult) = 200 total capacity
200 − 100 = 100 inactive load
If the active load increases to 150 with cold_mult = 2, the serverless engine will attempt to maintain 150 inactive load.

If not specified during endpoint creation, the default value is 3.
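
The relationship can be summarized in a short sketch. This is only an illustration of the arithmetic above, not the engine's actual implementation.

  def inactive_load_from_cold_mult(active_load, cold_mult):
      # Total capacity target is active_load * cold_mult; the remainder is kept inactive.
      total_capacity = active_load * cold_mult
      return total_capacity - active_load

  print(inactive_load_from_cold_mult(100, 2))  # 100 inactive load
  print(inactive_load_from_cold_mult(150, 2))  # 150 inactive load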

Minimum Cold Load (min_cold_load)

min_cold_load sets the total capacity target directly, independent of cold_mult.

Example

For an active load of 100 and min_cold_load = 300:
300 − 100 = 200 inactive load
If the active load increases to 150 with the same min_cold_load, the inactive load becomes 150.

If not specified during endpoint creation, the default value is 0.
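
The sketch below ties the three inactive-capacity controls together, following the rule that the engine maintains the largest target among them. It assumes a hypothetical per_worker_load value to express min_workers in load units so the targets can be compared; the engine's real conversion and internal logic may differ.

  def inactive_load_target(active_load, min_workers, cold_mult, min_cold_load, per_worker_load):
      # Largest inactive capacity implied by the three controls (illustrative only).
      from_min_workers = min_workers * per_worker_load          # assumed worker-to-load conversion
      from_cold_mult = active_load * cold_mult - active_load    # multiplier of current active load
      from_min_cold_load = max(min_cold_load - active_load, 0)  # fixed total-capacity target
      return max(from_min_workers, from_cold_mult, from_min_cold_load)

  # Example from the text: active load 100 and min_cold_load = 300 -> 200 inactive load
  print(inactive_load_target(active_load=100, min_workers=0, cold_mult=1,
                             min_cold_load=300, per_worker_load=50))  # 200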

Target Utilization (target_util)

Target Utilization defines the target ratio of anticipated load to active capacity and determines how much spare capacity (headroom) is reserved to handle short-term traffic spikes. For example, if anticipated load is 900 tokens/sec and target_util = 0.9, the serverless engine will maintain:
900 ÷ 0.9 = 1000 tokens/sec capacity

Spare capacity examples

  • target_util = 0.9 → 11.1% spare capacity
  • target_util = 0.8 → 25% spare capacity
  • target_util = 0.5 → 100% spare capacity
  • target_util = 0.4 → 150% spare capacity

If not specified during endpoint creation, the default value is 0.9.
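
As a quick check of the numbers above, here is a minimal sketch of the target_util arithmetic; the 900 tokens/sec figure is the example value from the text.

  def capacity_target(anticipated_load, target_util):
      # Capacity the engine maintains: anticipated_load / target_util.
      return anticipated_load / target_util

  def spare_capacity_pct(target_util):
      # Headroom above anticipated load, as a percentage.
      return (1 / target_util - 1) * 100

  print(capacity_target(900, 0.9))           # 1000.0 tokens/sec
  print(round(spare_capacity_pct(0.9), 1))   # 11.1
  print(spare_capacity_pct(0.5))             # 100.0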