```
backend: transformers
parameters:
  model: "facebook/opt-125m"
  type: AutoModelForCausalLM
  quantization: bnb_4bit # One of: bnb_8bit, bnb_4bit, xpu_4bit, xpu_8bit (optional)
```
The backend will automatically download the required files in order to run the model.

| Type | Description |
| --- | --- |
| `AutoModelForCausalLM` | A model that can be used to generate text sequences. Use it for NVIDIA CUDA and Intel GPUs with Intel Extension for PyTorch acceleration |
| `OVModelForCausalLM` | For Intel CPU/GPU/NPU OpenVINO Text Generation models |
| `OVModelForFeatureExtraction` | For Intel CPU/GPU/NPU OpenVINO Embedding acceleration |
| N/A | Defaults to `AutoModel` |

- `OVModelForCausalLM` requires OpenVINO IR [Text Generation](https://huggingface.co/models?library=openvino&pipeline_tag=text-generation) models from Hugging Face
- `OVModelForFeatureExtraction` works with any Safetensors Transformer [Feature Extraction](https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers,safetensors) model from Hugging Face (embedding models)

Please note that streaming is currently not implemented in `AutoModelForCausalLM` for Intel GPU.
AMD GPU support is not implemented.
Although AMD CPUs are not officially supported by OpenVINO, there are reports that they work: YMMV.
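
For example, an OpenVINO text-generation model could be configured as in the following minimal sketch, assuming an OpenVINO IR model from the Hugging Face hub (the `name` and `model` values are illustrative):

```
name: openvino-chat
backend: transformers
parameters:
  # illustrative OpenVINO IR model repository
  model: "OpenVINO/Phi-3-mini-4k-instruct-int4-ov"
  type: OVModelForCausalLM
```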
##### Embeddings
Use `embeddings: true` if the model is an embedding model.
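
A minimal sketch of an embedding configuration, assuming an illustrative Safetensors feature-extraction model:

```
name: my-embedder
backend: transformers
embeddings: true
parameters:
  # illustrative Hugging Face feature-extraction model
  model: "sentence-transformers/all-MiniLM-L6-v2"
  type: OVModelForFeatureExtraction
```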
##### Inference device selection
The Transformers backend tries to automatically select the best device for inference; you can override this choice manually with the `main_gpu` parameter.

| Inference Engine | Applicable Values |
| --- | --- |
| CUDA | `cuda`, or `cuda.X` where `X` is the GPU device number as listed in the `nvidia-smi -L` output |
| OpenVINO | Any applicable value from [Inference Modes](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html) such as `AUTO`, `CPU`, `GPU`, `NPU`, `MULTI`, `HETERO` |

Example for CUDA:

`main_gpu: cuda.0`

Example for OpenVINO:

`main_gpu: AUTO:-CPU`

This parameter applies to both Text Generation and Feature Extraction (i.e. Embeddings) models.
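
As a sketch, the parameter sits alongside the other options in the model configuration file (the `name` value is illustrative):

```
name: transformers-cuda
backend: transformers
# pin inference to the first CUDA GPU
main_gpu: cuda.0
parameters:
  model: "facebook/opt-125m"
  type: AutoModelForCausalLM
```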
##### Inference Precision
The Transformers backend automatically selects the fastest applicable inference precision according to device support.
On CUDA, you can manually enable *bfloat16*, if your hardware supports it, with the following parameter:

`f16: true`
##### Quantization
| Quantization | Description |
| --- | --- |
| `bnb_8bit` | 8-bit quantization |
| `bnb_4bit` | 4-bit quantization |
| `xpu_8bit` | 8-bit quantization for Intel XPUs |
| `xpu_4bit` | 4-bit quantization for Intel XPUs |
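
For instance, a minimal sketch enabling 4-bit quantization on an Intel XPU (the `name` value is illustrative):

```
name: opt-xpu
backend: transformers
parameters:
  model: "facebook/opt-125m"
  type: AutoModelForCausalLM
  # 4-bit quantization on an Intel XPU
  quantization: xpu_4bit
```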
##### Trust Remote Code
Some models, such as Microsoft Phi-3, require external code beyond what the transformers library provides.
By default this is disabled for security.
It can be manually enabled with:

`trust_remote_code: true`
##### Maximum Context Size
The maximum context size, in tokens, can be specified with the `context_size` parameter. Do not use values higher than what your model supports.

Usage example:

`context_size: 8192`
##### Auto Prompt Template
Usually the chat template is defined by the model author in the `tokenizer_config.json` file.
To use it, set the `use_tokenizer_template: true` parameter in the `template` section.

Usage example:

```
template:
  use_tokenizer_template: true
```
##### Custom Stop Words
Stop words are usually defined in the `tokenizer_config.json` file.
They can be overridden with the `stopwords` parameter when needed, as with the Llama-3-Instruct model.

Usage example:

```
stopwords:
- "<|eot_id|>"
- "<|end_of_text|>"
```
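
Putting several of these options together, a configuration could look like the following sketch (the values and the exact placement of each key are illustrative, not authoritative):

```
name: phi-3-mini
backend: transformers
parameters:
  model: "microsoft/Phi-3-mini-4k-instruct"
  type: AutoModelForCausalLM
# Phi-3 ships custom modeling code
trust_remote_code: true
context_size: 4096
template:
  use_tokenizer_template: true
stopwords:
- "<|end|>" # illustrative stop token
```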
#### Usage
Use the `completions` endpoint by specifying the `transformers` model: