# Deep REDundancy (DRED) with RDO-VAE

This is a rate-distortion-optimized variational autoencoder (RDO-VAE) designed
to code redundancy information. Pre-trained models are provided as C code
in the dnn/ directory, with the corresponding model files in the dnn/models/
directory (names starting with rdovae_). If you don't want to train a new DRED
model, you can skip straight to the Inference section.

## Data preparation

For data preparation you need to build Opus as detailed in the top-level README.
You will need to use the --enable-dred configure option.
The build will produce an executable named "dump_data".
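
A sketch of a typical build from a git checkout, assuming the usual autotools
workflow described in the top-level README, might look like this:
```
# Build Opus with DRED support; this also produces the dump_data tool
./autogen.sh
./configure --enable-dred
make
```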
To prepare the training data, run:
```
./dump_data -train in_speech.pcm out_features.f32 out_speech.pcm
```
where in_speech.pcm is a raw 16-bit PCM speech file sampled at 16 kHz.
The speech data used for training the model can be found at:
https://media.xiph.org/lpcnet/speech/tts_speech_negative_16k.sw
The out_speech.pcm file isn't needed for DRED, but it is needed to train
the FARGAN vocoder (see dnn/torch/fargan/ for details).
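
If your source material is not already in that format, a tool such as ffmpeg
(used here purely as an illustration; input.wav is a placeholder) can convert it:
```
# Convert to raw, mono, 16-bit, 16 kHz PCM suitable for dump_data
ffmpeg -i input.wav -ar 16000 -ac 1 -f s16le in_speech.pcm
```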

## Training

To perform training, run the following command:
```
python ./train_rdovae.py --cuda-visible-devices 0 --sequence-length 400 --split-mode random_split --state-dim 80 --batch-size 512 --epochs 400 --lambda-max 0.04 --lr 0.003 --lr-decay-factor 0.0001 out_features.f32 output_dir
```
The final model will be in output_dir/checkpoints/checkpoint_400.pth.

The model can be converted to C using:
```
python export_rdovae_weights.py output_dir/checkpoints/checkpoint_400.pth dred_c_dir
```
which will create a number of C source and header files in the dred_c_dir directory.
Copy these files to the opus/dnn/ directory (replacing the existing ones) and recompile Opus.
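
As an illustrative sketch (the paths below are placeholders for your own checkout),
the copy-and-rebuild step could look like:
```
# Overwrite the existing DRED model sources in the Opus tree and rebuild
cp dred_c_dir/*.c dred_c_dir/*.h /path/to/opus/dnn/
cd /path/to/opus && make
```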

## Inference

DRED is integrated within the Opus codec and can be evaluated using the opus_demo
executable. For example:
```
./opus_demo voip 16000 1 64000 -loss 50 -dred 100 -sim_loss 50 input.pcm output.pcm
```
This tells the encoder to encode a 16 kHz raw audio file at 64 kb/s with up to 1 second
of redundancy (the -dred value is in 10-ms units, so 100 corresponds to 1 second) and then
simulate 50% loss. Refer to `opus_demo --help` for more details.
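
The resulting output.pcm is raw 16-bit, 16 kHz mono PCM like the input. To listen to it,
you can wrap it in a WAV container, for example with ffmpeg (shown as an illustration):
```
# Interpret the raw decoder output and write a playable WAV file
ffmpeg -f s16le -ar 16000 -ac 1 -i output.pcm output.wav
```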