
Commit fa10302

docs: updated Transformer parameters description (#2234)
updated Transformer parameters
1 parent 54faaa8 commit fa10302

File tree

1 file changed (+69, -3)


docs/content/docs/features/text-generation.md

Lines changed: 69 additions & 3 deletions
@@ -296,7 +296,7 @@ backend: transformers
 parameters:
   model: "facebook/opt-125m"
 type: AutoModelForCausalLM
-quantization: bnb_4bit # One of: bnb_8bit, bnb_4bit, xpu_4bit (optional)
+quantization: bnb_4bit # One of: bnb_8bit, bnb_4bit, xpu_4bit, xpu_8bit (optional)
 ```
 
 The backend will automatically download the required files in order to run the model.
@@ -307,19 +307,85 @@ The backend will automatically download the required files in order to run the m
 
 | Type | Description |
 | --- | --- |
-| `AutoModelForCausalLM` | `AutoModelForCausalLM` is a model that can be used to generate sequences. |
-| `OVModelForCausalLM` | for OpenVINO models |
+| `AutoModelForCausalLM` | `AutoModelForCausalLM` is a model that can be used to generate sequences. Use it for NVIDIA CUDA and for Intel GPUs with Intel Extension for PyTorch acceleration |
+| `OVModelForCausalLM` | for Intel CPU/GPU/NPU OpenVINO Text Generation models |
+| `OVModelForFeatureExtraction` | for Intel CPU/GPU/NPU OpenVINO Embedding acceleration |
 | N/A | Defaults to `AutoModel` |
 
+- `OVModelForCausalLM` requires OpenVINO IR [Text Generation](https://huggingface.co/models?library=openvino&pipeline_tag=text-generation) models from Hugging Face
+- `OVModelForFeatureExtraction` works with any Safetensors Transformer [Feature Extraction](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers,safetensors) model from Hugging Face (Embedding Model)
+
+Please note that streaming is currently not implemented in `AutoModelForCausalLM` for Intel GPU.
+AMD GPU support is not implemented.
+Although AMD CPUs are not officially supported by OpenVINO, there are reports that they work: YMMV.
+
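+For illustration, a minimal `OVModelForCausalLM` config might look like the sketch below. The model id is only an example; pick any OpenVINO IR Text Generation model from the Hugging Face link above:
+
+```
+name: openvino-llm
+backend: transformers
+parameters:
+  model: "OpenVINO/Phi-3-mini-128k-instruct-int4-ov" # illustrative OpenVINO IR model id
+type: OVModelForCausalLM
+```
+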
+##### Embeddings
+Use `embeddings: true` if the model is an embedding model.
+
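+For illustration, a minimal embedding model config might look like this sketch (the model id is an example of a Safetensors feature-extraction model):
+
+```
+name: my-embedder
+backend: transformers
+parameters:
+  model: "sentence-transformers/all-MiniLM-L6-v2" # example feature-extraction model
+embeddings: true
+type: OVModelForFeatureExtraction
+```
+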
+##### Inference device selection
+The Transformers backend tries to automatically select the best device for inference; you can override its decision manually with the `main_gpu` parameter.
+
+| Inference Engine | Applicable Values |
+| --- | --- |
+| CUDA | `cuda`, `cuda.X` where `X` is the GPU index as listed in `nvidia-smi -L` output |
+| OpenVINO | Any applicable value from [Inference Modes](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html) like `AUTO`, `CPU`, `GPU`, `NPU`, `MULTI`, `HETERO` |
+
+Example for CUDA:
+`main_gpu: cuda.0`
+
+Example for OpenVINO:
+`main_gpu: AUTO:-CPU`
+
+This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.
+
+##### Inference Precision
+The Transformers backend automatically selects the fastest applicable inference precision supported by the device.
+With the CUDA backend you can manually enable *bfloat16*, if your hardware supports it, with the following parameter:
+
+`f16: true`
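+
+For illustration, a sketch combining device selection and precision on CUDA (model id reused from the example at the top of this section):
+
+```
+name: transformers-cuda
+backend: transformers
+parameters:
+  model: "facebook/opt-125m"
+type: AutoModelForCausalLM
+main_gpu: cuda.0 # pin inference to the first CUDA device
+f16: true # enable bfloat16 when the hardware supports it
+```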
 
 ##### Quantization
 
 | Quantization | Description |
 | --- | --- |
 | `bnb_8bit` | 8-bit quantization |
 | `bnb_4bit` | 4-bit quantization |
+| `xpu_8bit` | 8-bit quantization for Intel XPUs |
 | `xpu_4bit` | 4-bit quantization for Intel XPUs |
 
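+For example, to quantize on an Intel XPU instead of using bitsandbytes, the quantization line in the config at the top of this section would read:
+
+```
+quantization: xpu_4bit
+```
+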
+##### Trust Remote Code
+Some models, like Microsoft Phi-3, require external code beyond what the transformers library provides.
+For security this is disabled by default.
+It can be manually enabled with:
+`trust_remote_code: true`
+
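+For illustration, a sketch for a model that ships custom code, assuming Microsoft's Phi-3 mini (the model id is an example):
+
+```
+name: phi-3
+backend: transformers
+parameters:
+  model: "microsoft/Phi-3-mini-4k-instruct" # example model that requires remote code
+type: AutoModelForCausalLM
+trust_remote_code: true
+```
+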
+##### Maximum Context Size
+The maximum context size in tokens can be specified with the `context_size` parameter. Do not use values higher than what your model supports.
+
+Usage example:
+`context_size: 8192`
+
+##### Auto Prompt Template
+Usually the chat template is defined by the model author in the `tokenizer_config.json` file.
+To enable it, use the `use_tokenizer_template: true` parameter in the `template` section.
+
+Usage example:
+```
+template:
+  use_tokenizer_template: true
+```
+
+##### Custom Stop Words
+Stopwords are usually defined in the `tokenizer_config.json` file.
+They can be overridden with the `stopwords` parameter when needed, as with the Llama 3 Instruct model.
+
+Usage example:
+```
+stopwords:
+- "<|eot_id|>"
+- "<|end_of_text|>"
+```
+
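+Putting the parameters above together, a sketch of a complete config for a Llama 3 Instruct style model (the name and model id are examples):
+
+```
+name: llama3-instruct
+backend: transformers
+parameters:
+  model: "meta-llama/Meta-Llama-3-8B-Instruct"
+type: AutoModelForCausalLM
+context_size: 8192
+template:
+  use_tokenizer_template: true
+stopwords:
+- "<|eot_id|>"
+- "<|end_of_text|>"
+```
+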
 #### Usage
 
 Use the `completions` endpoint by specifying the `transformers` model:
