Train TTS Model of The User's Voice (Voice Clone)

Global Server
https://video.a2e.ai
POST
/api/v1/userVoice/training

Send a POST request to this endpoint to start training a TTS model on a user's voice. Once training completes, the cloned voice can be used for TTS.

Four voice clone models are supported: A2E, Cartesia, Minimax and Elevenlabs.

A2E Model
Language Support (13): Chinese, English, Japanese, German, French, Spanish, Korean, Arabic, Russian, Dutch, Italian, Polish, Portuguese

English and Chinese are the best optimized. Other languages, such as Arabic, Japanese, Korean, Thai, French and Spanish, are also supported, but at lower quality.

Minimax
Language Support (24): Chinese, Cantonese, English, Spanish, French, Russian, German, Portuguese, Arabic, Italian, Japanese, Korean, Indonesian, Vietnamese, Turkish, Dutch, Ukrainian, Thai, Polish, Romanian, Greek, Czech, Finnish, Hindi
Strengths: Fast inference, lightweight deployment, high efficiency.
Best for: Real-time applications, chatbots, and scalable services.
Recommended countries/regions:
China: Excellent Mandarin performance and real-time support.
Southeast Asia (e.g. Singapore, Malaysia, Vietnam): Low-latency applications and high Mandarin/English demand.
India: Efficient for voice assistants in multiple regional languages (Hindi, Tamil, etc.) via adaptation.

Cartesia
Language Support (15): English, French, German, Spanish, Portuguese, Chinese, Japanese, Hindi, Italian, Korean, Dutch, Polish, Russian, Swedish, Turkish
Strengths: Multilingual fluency, clear pronunciation, suitable for global content.
Best for: E-learning platforms, translation tools, global voice applications.
Recommended countries/regions:
Europe (EU): Strong support for multilingual output—German, French, Spanish, Italian, etc.
Latin America: Neutral Spanish voice models ideal for cross-regional content.
Middle East & Africa: Capable of handling Arabic and other local languages with clarity.
Global EdTech markets: Ideal for teaching English or other second languages due to clear enunciation.

Elevenlabs
Language Support (35+): English (USA, UK, Australia, Canada), Chinese, Japanese, German, Hindi, French (France, Canada), Korean, Portuguese (Brazil, Portugal), Italian, Spanish (Spain, Mexico) and more, for 35+ languages and dialects in total
Strengths: Emotionally rich, expressive, great for storytelling and long-form content.
Best for: Podcasts, audiobooks, video narration, marketing content.
Recommended countries/regions:
United States / Canada: Excellent native English support with various accents (General American, Canadian English).
United Kingdom: British English support with diverse voice personalities.
Australia / New Zealand: Natural Australian English delivery.
Germany / France / Spain: High-quality support for major European languages.
Japan / Korea: Emotionally engaging Japanese/Korean voices (selected availability).
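
As a rough aid for choosing a model, the language lists above can be turned into a small lookup. This is only a sketch: the dictionary keys are the display names used on this page, not confirmed values of the `model` request field (only `"a2e"` appears in the request example), and the Elevenlabs entry contains only the languages named explicitly above. The function name `models_supporting` is hypothetical.

```python
# Languages named on this page for each model (display names, not API values).
# Elevenlabs advertises 35+ languages; only the explicitly named ones are listed.
SUPPORTED_LANGUAGES = {
    "A2E": {
        "Chinese", "English", "Japanese", "German", "French", "Spanish",
        "Korean", "Arabic", "Russian", "Dutch", "Italian", "Polish",
        "Portuguese",
    },
    "Minimax": {
        "Chinese", "Cantonese", "English", "Spanish", "French", "Russian",
        "German", "Portuguese", "Arabic", "Italian", "Japanese", "Korean",
        "Indonesian", "Vietnamese", "Turkish", "Dutch", "Ukrainian", "Thai",
        "Polish", "Romanian", "Greek", "Czech", "Finnish", "Hindi",
    },
    "Cartesia": {
        "English", "French", "German", "Spanish", "Portuguese", "Chinese",
        "Japanese", "Hindi", "Italian", "Korean", "Dutch", "Polish",
        "Russian", "Swedish", "Turkish",
    },
    "Elevenlabs": {
        "English", "Chinese", "Japanese", "German", "Hindi", "French",
        "Korean", "Portuguese", "Italian", "Spanish",
    },
}


def models_supporting(language: str) -> list[str]:
    """Return the models whose listed languages include `language`."""
    wanted = language.strip().lower()
    return [
        model
        for model, languages in SUPPORTED_LANGUAGES.items()
        if wanted in {name.lower() for name in languages}
    ]
```

For example, `models_supporting("Swedish")` returns only Cartesia, because Swedish is named in no other list on this page.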

Requirements:

  • The voice file must be in mp3, wav or m4a format, with a total duration of at least 10 seconds and at most 60 seconds.

  • The MIME type of your audio URL must be set correctly (e.g. audio/wav, audio/mpeg). We determine the file type from the MIME type in the HTTP response header of the URL, not from the suffix of the URL. If you use the object storage of a popular cloud service (e.g. AWS S3), the MIME type is usually set automatically.

  • The URL must not contain spaces.

  • Redirects are not allowed (i.e. a 3xx response code to the HTTP request). This commonly happens when an http link is provided and the server redirects it to the https address.

  • Voice quality is more important than audio length. We recommend uploading high-quality audio in wav format.

  • Audio: a single speaker with clear vocals, consistent volume, no long silences, and no background noise (e.g. from air conditioners or the street).

  • Time to finish: Training usually finishes within 1 minute.
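
Several of the requirements above can be checked on the client before submitting a URL. The following is a minimal sketch, not part of the A2E API: the function names are hypothetical, and treating any `audio/*` MIME type as acceptable is an assumption (the page only gives audio/wav and audio/mpeg as examples). The status code and Content-Type passed to `check_response` would come from your own request to the audio URL with redirects disabled.

```python
from urllib.parse import urlsplit


def check_audio_url(url: str) -> list[str]:
    """Return a list of problems found in the URL string itself."""
    problems = []
    if " " in url:
        problems.append("URL contains a space")
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        problems.append("URL is not an absolute http(s) URL")
    return problems


def check_response(status: int, content_type: str) -> list[str]:
    """Return problems found in the HTTP status and Content-Type header."""
    problems = []
    if 300 <= status < 400:
        problems.append("URL redirects (3xx response)")
    # Drop parameters such as "; charset=..." before comparing.
    mime = content_type.split(";")[0].strip().lower()
    # Assumption: any audio/* type is acceptable to the service.
    if not mime.startswith("audio/"):
        problems.append(f"unexpected MIME type: {mime or 'missing'}")
    return problems
```

An empty list from both functions means none of these checks failed. Duration and audio quality still need to be verified separately.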

After training starts, the task goes through two phases: (1) processing and (2) completed. The result is usually ready within 2 minutes, at which point the response shows "completed".

Once the voice clone is done, you can provide TTS texts in multiple languages.
Audio files may be in mp3, m4a, wav or mp4 format, with a total duration of at least 10 seconds.
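
Because training is asynchronous, a client typically polls until the status reaches "completed". The sketch below shows one way to do that under stated assumptions: `get_status` is a hypothetical callable that you supply (for example, a wrapper around one of the voice status endpoints, whose paths are not given on this page), and the interval and timeout defaults are illustrative. Failure statuses are not documented here, so the loop stops only on "completed" or on timeout.

```python
import time
from typing import Callable


def wait_until_completed(
    get_status: Callable[[], str],
    interval_s: float = 5.0,
    timeout_s: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it returns "completed" or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "completed":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"voice training still in status {status!r}")
        sleep(interval_s)
```

The `sleep` parameter exists only so the wait can be replaced in tests. In normal use the default `time.sleep` is enough.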

Request

Authorization
Provide your bearer token in the Authorization header when making requests to protected resources.
Example:
Authorization: Bearer ********************
Body Params application/json

Example
{
    "name": "your voice clone name",
    "voice_urls": [
        "https://abc.com/123.wav"
    ],
    "model": "a2e",
    "language": "en",
    "gender": "female"
}
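
The same request can be built in Python with only the standard library. This is a sketch rather than an official client: the helper name `build_training_request` and its default argument values are hypothetical, while the URL, headers and field names are taken from this page.

```python
import json
import urllib.request

TRAINING_URL = "https://video.a2e.ai/api/v1/userVoice/training"


def build_training_request(
    token: str,
    name: str,
    voice_urls: list[str],
    model: str = "a2e",
    language: str = "en",
    gender: str = "female",
) -> urllib.request.Request:
    """Build (but do not send) the POST request for voice clone training."""
    payload = {
        "name": name,
        "voice_urls": voice_urls,
        "model": model,
        "language": language,
        "gender": gender,
    }
    return urllib.request.Request(
        TRAINING_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and read the JSON body of the response.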

Request Code Samples

Shell
curl --location --request POST 'https://video.a2e.ai/api/v1/userVoice/training' \
--header 'Authorization: Bearer <your_api_token>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "name": "your voice clone name",
    "voice_urls": [
        "https://abc.com/123.wav"
    ],
    "model": "a2e",
    "language": "en",
    "gender": "female"
}'

Responses

🟢 200 training
application/json
Body

Example
{
    "code": 0,
    "data": {
        "_id": "67bc2c2cc0f5208c812f9438",
        "name": "your voice clone name",
        "voice_urls": [
            "https://dh24as48lv9ce.cloudfront.net/adam2eve/beta/user_voice_clone/2b3b0881-e9e9-4bc2-943f-c2127b4b4961/11月13日.MP3"
        ],
        "model": "a2e",
        "lang": "en",
        "gender": "female",
        "current_status": "sent",
        "createdAt": "2025-02-24T08:22:04.285Z",
        "updatedAt": "2025-02-24T08:22:04.285Z"
    },
    "trace_id": "0f4de019-e41d-4dac-a070-f7bf38722786"
}
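
A client will usually want the voice ID and the current status from this response. The sketch below extracts them from the body shown above. It assumes that `"code": 0` indicates success and that any other value is an error, which this page does not state explicitly. The function name and the keys of the returned dictionary are hypothetical.

```python
import json


def parse_training_response(body: str) -> dict:
    """Extract the voice ID, status and trace ID from a training response."""
    doc = json.loads(body)
    # Assumption: code 0 means success; other codes are treated as errors.
    if doc.get("code") != 0:
        raise RuntimeError(f"training request failed: {doc!r}")
    data = doc["data"]
    return {
        "voice_id": data["_id"],
        "status": data["current_status"],
        "trace_id": doc.get("trace_id"),
    }
```

Note that the request field `language` comes back as `lang` in the response, and that the initial status in the example is "sent".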
Modified at 2025-07-25 04:18:13