Training

Training is the process by which the program learns to clone any voice or sound via self-supervised speech representations.

  • Training requires an NVIDIA GPU; if one is not available, the Training tab will be disabled.
  • Even when the Training tab is available, training is only recommended on an NVIDIA RTX 2000 series GPU or newer.
  • If you don't have a compatible GPU, we offer alternatives for running Applio in the cloud.

Training a model

Dataset preparation

Before you can start the training process, you need an audio set of the desired voice. It is recommended to have:

  • A minimum of 10 minutes of clean audio, without background noise or silences.
  • A lossless audio format, either .wav or .flac.

The recommended duration for a good dataset is 30 minutes, but if you have less audio than that, pretrained models (pretraineds) can help you get a good model with little data.

Upload your dataset using the Dataset Maker (if it is a single audio file), or set it up manually (multiple audio files) by creating a folder inside applio/assets/datasets for the program to read.
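
If you set the dataset up manually, the sketch below shows the copy step in Python (the my_voice dataset name and the ~/recordings source folder are hypothetical placeholders):

from pathlib import Path
import shutil

# Hypothetical source folder containing your raw recordings.
SOURCE_DIR = Path("~/recordings").expanduser()

# Hypothetical dataset name; Applio reads folders under applio/assets/datasets.
dataset_dir = Path("applio/assets/datasets/my_voice")
dataset_dir.mkdir(parents=True, exist_ok=True)

# Copy only .wav/.flac files into the dataset folder.
for audio in SOURCE_DIR.iterdir():
    if audio.suffix.lower() in (".wav", ".flac"):
        shutil.copy2(audio, dataset_dir / audio.name)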

For multispeaker models (Optional)

If you want to train a multispeaker model, you need to create a folder for the model, and inside that folder create a folder for each speaker.

  • Speaker folders should be named with numbers starting from 0

  • It is recommended that each speaker has a similar amount of audio

model_name/
  0/
    audio1.flac
    audio2.flac
    audio3.flac
  1/
    audio1.flac
    audio2.flac
    audio3.flac
  2/
    audio1.flac
    audio2.flac
    audio3.flac
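
A minimal Python sketch that builds the layout above from per-speaker source folders (the model_name path and the source folders are hypothetical):

from pathlib import Path
import shutil

# Hypothetical: one source folder per speaker, listed in speaker order.
SPEAKER_DIRS = [Path("~/speaker_a").expanduser(), Path("~/speaker_b").expanduser()]
model_dir = Path("applio/assets/datasets/model_name")

for i, src in enumerate(SPEAKER_DIRS):
    # Speaker folders are named 0, 1, 2, ... as described above.
    speaker_dir = model_dir / str(i)
    speaker_dir.mkdir(parents=True, exist_ok=True)
    for audio in src.glob("*.flac"):
        shutil.copy2(audio, speaker_dir / audio.name)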

Dataset processing

Once you have the audio ready, give your model a name and be sure to select the correct sample rate for your files (32k, 40k, or 48k; a quick way to check this is shown after the list below). Before running "Preprocess Dataset", consider the following options:

  • CPU cores: The default setting is the number of CPU cores on your machine, which is recommended for most cases.
  • Audio cutting: It's recommended to deactivate this option if your dataset has already been processed.
  • Process effects: It's recommended to deactivate this option if your dataset has already been processed.
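
If you are unsure which sample rate your files use, a quick check with the soundfile library (an assumed helper, installable with pip install soundfile; the dataset path is hypothetical) prints it per file:

import soundfile as sf
from pathlib import Path

dataset_dir = Path("applio/assets/datasets/my_voice")

# Print each file's sample rate so you can match it to 32k, 40k, or 48k.
for audio in sorted(dataset_dir.rglob("*")):
    if audio.suffix.lower() in (".wav", ".flac"):
        info = sf.info(str(audio))
        print(f"{audio.name}: {info.samplerate} Hz, {info.duration:.1f} s")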

Feature extraction

Now you are at the second-to-last step! But... what is feature extraction?

Extracting features is an essential step before training: this process converts each audio segment produced by the preprocessing step into files containing the features the model reads, including the F0 (fundamental frequency).

Several F0 models are available to choose from, but the best performer is RMVPE. CREPE (MANGIO-CREPE) is also good, but it requires very clean audio and a hop length of 128 or lower.
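
To make the idea concrete, here is a minimal sketch of F0 estimation using librosa's pYIN (purely an illustration; Applio uses its own RMVPE/CREPE implementations, and the file path is hypothetical):

import librosa

# Load one dataset segment; sr=None keeps the file's native sample rate.
y, sr = librosa.load("applio/assets/datasets/my_voice/audio1.flac", sr=None)

# Estimate the per-frame fundamental frequency over a typical vocal range.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),  # ~65 Hz
    fmax=librosa.note_to_hz("C7"),  # ~2093 Hz
    sr=sr,
)
print(f0[:10])  # F0 in Hz per frame; NaN where the frame is unvoiced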

  • CPU cores: The number of cores to use for the extraction process. The default setting is the number of CPU cores on your machine, which is recommended for most cases.
  • Embedder Model: Select the model used for learning the speaker embedding.
  • Pitch Guidance: Enables mirroring the intonation of the original voice, including its pitch. This feature is particularly valuable for singing and other scenarios where preserving the original melody or pitch pattern is essential.
  • GPU Number: Specify the GPUs you wish to use for extraction by entering their indices separated by hyphens (-), for example 0-1 to use GPUs 0 and 1.

Once you have selected your F0 model, press "Extract Features" to start the process, and keep an eye on your console (CMD) until you see a message indicating that the process is complete.

Model & Index training

This is where the real work begins. To start training your model, you will need to make a few small adjustments first.

  • Save Every Epoch: Set this value between 10 and 50 to determine how often the model's state is saved during training.

  • Total Epochs: The number of epochs needed varies based on your dataset. Monitor progress using TensorBoard (see the command after this list); typically, models perform well around 200-400 epochs.

  • Batch Size: Adjust based on your GPU's VRAM. For 8 GB VRAM, use a batch size between 6 and 8. Consider CUDA cores when experimenting with higher batch sizes.
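
To monitor training, launch TensorBoard from your Applio folder and point it at the logs directory (the same logs folder where your model files end up):

tensorboard --logdir logs

Then open the printed URL (by default http://localhost:6006) in your browser and watch the loss curves to decide when to stop training.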

Index Generation: Generating an index is a must; just click the "Train Index" button to perform the process (a conceptual sketch of what the index does follows).
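
Conceptually, the index is a nearest-neighbor search structure built over the features extracted from your dataset. The sketch below illustrates the idea with the faiss library (an illustration only, not Applio's exact implementation; the feature dimension and random data are placeholders):

import faiss  # assumed dependency: pip install faiss-cpu
import numpy as np

# Placeholder stand-in for the features extracted from your dataset.
features = np.random.rand(10000, 768).astype("float32")

index = faiss.IndexFlatL2(features.shape[1])  # exact L2 nearest-neighbor search
index.add(features)

# During inference, each input frame is matched to its nearest training
# features, pulling the converted voice closer to the dataset's timbre.
distances, neighbors = index.search(features[:5], 8)
print(neighbors.shape)  # (5, 8)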

Your trained model is located in the logs/model folder, and the .pth files are in the logs/zips folder. You can also export your trained model directly from the Applio interface: go to the Export Model section in the Train tab, click the Refresh button, and select the .pth file and the index file of the model to export.
