Train TTS Model of The User's Voice (Voice Clone)
Send a POST request to this endpoint to start training a model of a user's voice. Once training completes, TTS in the user's voice becomes available.
English and Chinese are the best-optimized languages. Other languages, such as Arabic, Japanese, Korean, Thai, French, and Spanish, are supported, but at lower quality. The only supported training mode is "best".
Requirements:
- The voice file must be in mp3, wav, or m4a format. Upload an audio file with a total duration of at least 15 seconds and at most 60 seconds.
- The MIME type of your audio URL must be set correctly (e.g. audio/wav, audio/mpeg). The file type is determined from the MIME type in the URL's response headers, not from the suffix of the URL. If you use an object storage service from a popular cloud provider (e.g. AWS S3), the MIME type is usually set automatically.
- Spaces are not allowed in the URL.
- Redirects are not allowed (i.e. a 3xx response code to the HTTP request). This commonly happens when an http link is provided and the server redirects it to an https address.
- Voice quality is more important than audio length. We recommend uploading high-quality audio in wav format.
- Audio: a single speaker with clear vocals and consistent volume. Avoid long silences, multiple speakers, and background noise such as air conditioners or street sounds.
- Time to finish: training usually finishes within 1 minute.
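The URL requirements above (no spaces, no redirect, correct MIME type) can be checked before submitting a request. The following is a minimal sketch using only the Python standard library, not part of this API. The function names and the set of accepted MIME types are assumptions made for illustration, and some servers may not answer HEAD requests.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit

# Assumption: the two types named in the requirements plus common variants.
# The full list the service accepts is not documented here.
ASSUMED_AUDIO_MIME_TYPES = {
    "audio/wav", "audio/x-wav", "audio/mpeg", "audio/mp4", "audio/x-m4a",
}


def static_url_problems(url: str) -> list[str]:
    """Checks that need no network access."""
    problems = []
    if any(ch.isspace() for ch in url):
        problems.append("URL contains whitespace")
    if urlsplit(url).scheme not in ("http", "https"):
        problems.append("URL is not http(s)")
    return problems


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None stops urllib from following the redirect,
        # so the 3xx response surfaces as an HTTPError.
        return None


def live_url_problems(url: str, timeout: float = 10.0) -> list[str]:
    """Send a HEAD request; report a redirect or an unexpected Content-Type."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            mime = response.headers.get_content_type()
    except urllib.error.HTTPError as err:
        if 300 <= err.code < 400:
            return [f"server redirects ({err.code})"]
        return [f"server returned HTTP {err.code}"]
    if mime not in ASSUMED_AUDIO_MIME_TYPES:
        return [f"unexpected Content-Type: {mime}"]
    return []
```

Run static_url_problems first, since it needs no network access, and call live_url_problems only if it returns an empty list.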
After training starts, it goes through two phases: (1) processing and (2) completed. The result is ready once the response shows "completed", usually within 2 minutes.
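A client can wait for the "completed" phase by polling. The sketch below assumes a caller-supplied fetch_status function that returns the current phase as a string. That function, the polling interval, and the timeout are all hypothetical, since the status endpoint itself is not described in this section.

```python
import time
from typing import Callable


def wait_until_completed(
    fetch_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 180.0,
) -> None:
    """Poll until the phase is "completed"; raise on timeout or unknown phase."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == "completed":
            return
        if status != "processing":
            # Only the two phases named in the text are treated as valid.
            raise RuntimeError(f"unexpected status: {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError("training did not complete in time")
        time.sleep(interval_s)
```

The default timeout of 180 seconds is a guess that leaves some margin over the typical completion time. Adjust it to your own needs.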
Once the voice clone is done, you can provide TTS texts in multiple languages.
Audio files may be in mp3, m4a, wav, or mp4 format, with a total duration of at least 30 seconds.
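For wav files, the duration can be checked locally before uploading. This is a small sketch using Python's standard wave module, which handles only wav input (mp3, m4a, and mp4 would need a third-party library). The function names are illustrative, and the bounds are passed in explicitly so that you can apply whichever duration limits are relevant.

```python
import wave
from typing import Optional


def wav_duration_seconds(path_or_file) -> float:
    """Duration of a wav file, given a path or a binary file object."""
    with wave.open(path_or_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_ok(seconds: float, min_s: float, max_s: Optional[float]) -> bool:
    """True if seconds is within [min_s, max_s]; max_s=None means no upper bound."""
    if seconds < min_s:
        return False
    return max_s is None or seconds <= max_s
```

For example, duration_ok(wav_duration_seconds("voice.wav"), 30, None) checks a local file named voice.wav (a placeholder name) against a 30-second minimum.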
Request
zh-CN or en-US. If you do not intend to clone the Chinese language, set it to en-US.