This endpoint accepts a POST request that starts training on a user's voice. Once training completes, the user's voice can be used for TTS.
Four models are supported: A2E, Cartesia, Minimax and Elevenlabs.
A2E Model
Language Support(13): Chinese, English, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese
English and Chinese are the best optimized. Other languages, such as Arabic, Japanese, Korean, Thai, French and Spanish, are supported, but with lower quality.
Minimax
Language Support(24): Chinese, Cantonese, English, Spanish, French, Russian, German, Portuguese, Arabic, Italian, Japanese, Korean, Indonesian, Vietnamese, Turkish, Dutch, Ukrainian, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi
Strengths: Fast inference, lightweight deployment, high efficiency.
Best for: Real-time applications, chatbots, and scalable services.
Recommended countries/regions:
China: Excellent Mandarin performance and real-time support.
Southeast Asia (e.g. Singapore, Malaysia, Vietnam): Low-latency applications and high Mandarin/English demand.
India: Efficient for voice assistants in multiple regional languages (Hindi, Tamil, etc.) via adaptation.
Cartesia
Language Support(15): English, French, German, Spanish, Portuguese, Chinese, Japanese, Hindi, Italian, Korean, Dutch, Polish, Russian, Swedish, Turkish
Strengths: Multilingual fluency, clear pronunciation, suitable for global content.
Best for: E-learning platforms, translation tools, global voice applications.
Recommended countries/regions:
Europe (EU): Strong support for multilingual output—German, French, Spanish, Italian, etc.
Latin America: Neutral Spanish voice models ideal for cross-regional content.
Middle East & Africa: Capable of handling Arabic and other local languages with clarity.
Global EdTech markets: Ideal for teaching English or other second languages due to clear enunciation.
Elevenlabs
Language Support(35): English (USA, UK, Australia, Canada), Chinese, Japanese, German, Hindi, French (France, Canada), Korean, Portuguese (Brazil, Portugal), Italian, Spanish (Spain, Mexico), and others (35+ languages and dialects in total)
Strengths: Emotionally rich, expressive, great for storytelling and long-form content.
Best for: Podcasts, audiobooks, video narration, marketing content.
Recommended countries/regions:
United States / Canada: Excellent native English support with various accents (General American, Canadian English).
United Kingdom: British English support with diverse voice personalities.
Australia / New Zealand: Natural Australian English delivery.
Germany / France / Spain: High-quality support for major European languages.
Japan / Korea: Emotionally engaging Japanese/Korean voices (selected availability).
Requirements:
The voice file must be in mp3, wav or m4a format. Upload an audio file with a total duration of at least 10 seconds and at most 60 seconds.
The MIME type of your audio URL must be set correctly (e.g. audio/wav, audio/mpeg). We determine the file type from the MIME type reported in the HTTP response header (Content-Type), not from the suffix of the URL. If you use an object storage service from a popular cloud provider (e.g. AWS S3), the MIME type is usually set automatically.
Spaces are not allowed in the URL.
Redirects are not allowed (i.e. a 3xx response code to the HTTP request). This commonly happens when an http link is provided and the server redirects it to the https address.
Voice quality is more important than audio length. We recommend uploading high-quality audio in wav format.
Audio: a single speaker with clear vocals, no background noise (e.g. air conditioners or street noise), consistent volume, and no long silences.
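The URL requirements above can be checked on the client side before submitting. The following is a minimal sketch, not part of the API: the function names and the set of accepted MIME types are assumptions made for illustration, and the service's own list of accepted types may differ.

```python
from typing import List, Optional
from urllib.parse import urlparse

# Assumption: MIME types commonly used for mp3 / wav / m4a files.
# This is an illustrative set, not the service's official list.
ASSUMED_AUDIO_MIME_TYPES = {
    "audio/mpeg", "audio/wav", "audio/x-wav", "audio/mp4", "audio/x-m4a",
}


def check_url_syntax(url: str) -> List[str]:
    """Return problems that can be detected from the URL string alone."""
    problems = []
    if any(ch.isspace() for ch in url):
        problems.append("URL contains whitespace")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        problems.append("URL is not an absolute http(s) address")
    return problems


def check_head_response(status: int, content_type: Optional[str]) -> List[str]:
    """Return problems found in the status code and Content-Type of the URL."""
    problems = []
    if 300 <= status < 400:
        # Redirects (e.g. http -> https) are rejected by the service.
        problems.append(f"server answered with redirect status {status}")
    elif status != 200:
        problems.append(f"server answered with status {status}")
    # Drop parameters such as "; charset=..." before comparing.
    mime = (content_type or "").split(";")[0].strip().lower()
    if mime not in ASSUMED_AUDIO_MIME_TYPES:
        problems.append(f"unexpected Content-Type {mime!r}")
    return problems
```

The status code and Content-Type passed to check_head_response would come from a request that does not follow redirects, so that a 3xx answer is visible instead of being silently resolved.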
Time to finish: Training usually finishes within 1 minute.
After training starts, it goes through two phases: (1) processing and (2) completed. The result is usually ready within 2 minutes, by the time you see the "completed" response.
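A client can wait for the "completed" phase by polling. The sketch below assumes a caller-supplied fetch_status function that returns the current phase as a string; that function, the polling interval and the timeout are hypothetical, since the status endpoint and its response format are not described here.

```python
import time
from typing import Callable


def wait_for_training(fetch_status: Callable[[], str],
                      poll_interval: float = 5.0,
                      timeout: float = 180.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the phase is "completed" or the timeout has been used up.

    fetch_status is a hypothetical callable that queries the training
    status and returns "processing" or "completed".
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status == "completed":
            return status
        if status != "processing":
            # Any phase other than the two documented ones is treated as an error.
            raise RuntimeError(f"unexpected training status: {status!r}")
        if waited >= timeout:
            raise TimeoutError(f"training still processing after {waited:.0f} s")
        sleep(poll_interval)
        waited += poll_interval
```

The default timeout of 180 seconds is an arbitrary margin above the typical 2 minutes; adjust it to your own needs.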
Once the voice clone is done, you can provide TTS texts in multiple languages.
Audio files may be in mp3, m4a, wav or mp4 format, with a total duration of at least 10 seconds.
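For wav files, the duration can be checked locally before upload with Python's standard wave module. This is a minimal sketch with assumed helper names; the 10-second minimum comes from the text, the 60-second maximum is taken from the requirements list, and other formats (mp3, m4a, mp4) would need a different library.

```python
import wave

MIN_SECONDS = 10.0  # minimum duration stated in the requirements
MAX_SECONDS = 60.0  # maximum duration stated in the requirements list


def wav_duration_seconds(source) -> float:
    """Return the duration of a wav file given a path or a binary file object."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def duration_is_acceptable(seconds: float,
                           minimum: float = MIN_SECONDS,
                           maximum: float = MAX_SECONDS) -> bool:
    """Check that the duration lies within the inclusive [minimum, maximum] range."""
    return minimum <= seconds <= maximum
```

Usage: duration_is_acceptable(wav_duration_seconds("voice.wav")), where voice.wav is a placeholder file name.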