Settings

Context length

Context length is the maximum amount of text (messages, replies, and system instructions combined) that the model can take into account at once.
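As a rough illustration of what a context limit implies, the sketch below drops the oldest messages until the system instructions plus the remaining messages fit the limit. It assumes the limit is counted in tokens. The helper names `approx_tokens` and `fit_to_context` are hypothetical, and the whitespace word count is only a stand-in for a real tokenizer; none of this comes from the application itself.

```python
from typing import Callable, List


def approx_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word.
    # A real tokenizer will usually produce a different count.
    return len(text.split())


def fit_to_context(
    system: str,
    messages: List[str],
    context_length: int,
    count: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Return the most recent messages that fit alongside the system text."""
    budget = context_length - count(system)
    kept: List[str] = []
    # Walk from newest to oldest so recent turns are kept first.
    for msg in reversed(messages):
        cost = count(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept
```

With a limit of 8 and a two-word system prompt, only the most recent messages totalling six words or fewer are kept; older ones are dropped.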


Quantization

The numeric precision of weights and activations. Lower precision uses less memory and may run faster, but can affect output quality and compatibility.

Precision: Pick one
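To make the memory effect concrete, here is a back-of-the-envelope estimate of weight storage at different precisions. It is a sketch under simplifying assumptions: it counts only the weights (no activations, cache, or quantization metadata such as scales), and the function name `weight_memory_gib` and the 7-billion-parameter example are illustrative, not taken from the application.

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB: params * bits / 8 bytes."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical model size
    for bits in (16, 8, 4):
        print(f"{bits}-bit: {weight_memory_gib(params, bits):.2f} GiB")
```

Halving the bits per weight halves the estimate, which is why lower precision can let a larger model fit on the same device.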

Device

Where inference runs. Options come from runtime capabilities.

Available devices: Pick one
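How a device list might be derived from runtime capabilities can be sketched as follows. This assumes a PyTorch-based runtime, which the page does not state; the function name `available_devices` is hypothetical, and the sketch falls back to CPU only when PyTorch is not installed.

```python
from typing import List


def available_devices() -> List[str]:
    """List device names the current runtime appears to support."""
    devices = ["cpu"]  # CPU is assumed to be always available
    try:
        import torch  # assumption: the runtime is PyTorch-based
    except ImportError:
        return devices
    if torch.cuda.is_available():
        # One entry per visible CUDA device.
        devices += [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    if torch.backends.mps.is_available():
        devices.append("mps")
    return devices
```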

Attention Implementation

Traditional attention is optimized for single requests; continuous batching is optimized for multiple parallel requests.

Traditional attention: Optimized for single requests
Continuous batching: Optimized for parallel requests
Save settings

Updating the settings will evict all currently loaded models.