Kokoro-TTS

High-quality neural text-to-speech that runs on your computer

Kokoro-TTS

Kokoro-TTS generates premium-quality speech audio using a neural voice model that runs directly on your computer. It produces more expressive, natural-sounding narration compared to Edge-TTS — and works offline.

📝 Note

Available on: Standard and Pro plans only. All plans can use Edge-TTS instead.

When You Need It

You want the highest quality narration audio
You need to work offline (no internet required after setup)
You want more natural, expressive voice output
You have an NVIDIA graphics card for fast generation

First-Time Setup

Kokoro-TTS needs a small one-time setup before first use:

Install eSpeak-NG

Download eSpeak-NG from GitHub and install it. This is a small language processing tool that Kokoro uses behind the scenes.

Launch Kokoro-TTS

Click "Kokoro-TTS" on the toolbar. The first launch downloads the voice model files (~500 MB). This only happens once.

Ready to Go

After setup, Kokoro-TTS launches instantly and works offline.

Interface Overview

Kokoro-TTS has two tabs:

Single Generate — Type or paste text and generate one audio file. Good for testing voices or generating a single clip.
Batch Generate — Load a TXT file and generate one audio clip per line automatically. Required for the Video Compositor.

Settings at the top (Voice, Speech Rate, Remove Silence, Output Format) apply to both tabs.

ℹ️ Info

Kokoro-TTS has fewer audio controls than Edge-TTS. There is no Pitch or Volume adjustment — only Speech Rate. Output formats are MP3 and WAV (no M4A).

Single Generate

Use this to preview a voice, test your settings, or generate audio for a single piece of text.

Open Kokoro-TTS → Single Generate Tab

Click "Kokoro-TTS" on the toolbar. Wait for the model status to show it's ready, then make sure you're on the Single Generate tab.

Choose a Voice

Select a voice from the Voice dropdown. Kokoro has a smaller voice selection than Edge-TTS, but each voice is higher quality and more expressive.

Adjust Settings

Setting	Range	Recommendation
Speech Rate	0.50x – 2.00x	1.00x default, try 1.10x–1.20x for narration
Remove Silence	On/Off	Enable to trim pauses before compositing
Output Format	MP3, WAV	MP3 for smallest file size

Enter Text

Type or paste your narration text into the input box.

Screenshot: Kokoro-TTS interface with text and voice settings

Generate

Click ✨ Generate Audio. Audio is processed on your computer — no internet needed. When done, click ▶️ Play Audio to preview, then 💾 Save Audio to save the file.

Batch Generate

Use this when generating audio for a full chapter — one clip per panel. This is what you need before using the Video Compositor.

Export Narration from the Pipeline

In the main WTN Suite window, after the AI pipeline finishes, click "Export Narration". This saves a TXT file where each line is the narration for one panel.

The file looks like this:

The city sleeps as dawn breaks over the horizon.
Mia rushes through the empty streets, her breath visible in the cold air.
She stops at the gate, eyes wide, heart pounding.

Each line will become one audio clip.

Open Kokoro-TTS → Batch Generate Tab

Click "Kokoro-TTS" on the toolbar and switch to the Batch Generate tab.

Choose Voice and Settings

Select your voice and adjust Speech Rate in the header. These settings apply to every clip in the batch.

Select Your TXT File

Click 📁 Select .txt and choose the narration TXT file exported from the pipeline.

Select Output Folder

Click 📂 Browse... and choose where the audio clips will be saved.

Generate Batch Audio

Click ⚡ Generate Batch Audio. A progress bar shows Processing X / Y as each clip is generated. Click ❌ Cancel Batch at any time to stop.

ℹ️ Info

Output files are named automatically using the TXT filename and a 4-digit number. For example, if your file is narration.txt, the output will be:

narration_0001.mp3
narration_0002.mp3
narration_0003.mp3
...

Each number matches a line in the TXT file, in order. The Video Compositor uses this numbering to match audio to panels.

Performance

Generation speed depends on your hardware:

Hardware	Speed	Example (40 clips)
NVIDIA GPU (RTX 3060+)	Real-time or faster	~2–3 minutes
NVIDIA GPU (older)	Near real-time	~5 minutes
CPU only (no GPU)	2–5x slower than real-time	~15–20 minutes

💡 Tip

If you have an NVIDIA graphics card, make sure your drivers are up to date. Kokoro-TTS automatically uses your graphics card for faster processing when available.

ℹ️ Info

No NVIDIA GPU? Kokoro-TTS still works on CPU — it's just slower. For faster processing without a GPU, consider using Edge-TTS instead.

Tips

💡 Tip

Kokoro-TTS shines with English narration. If you're narrating in other languages, Edge-TTS may have better language coverage with its 50+ voice options.

💡 Tip

Enable Remove Silence before batch generating. This trims pauses from the output and produces cleaner audio timing when the Video Compositor combines clips into a video.

Common Issues

⚠️ Warning

"eSpeak-NG not found" error? You need to install eSpeak-NG separately (see First-Time Setup above). After installing, restart WTN Suite.

⚠️ Warning

Very slow generation? Your computer is likely using CPU processing instead of your graphics card. Make sure you have an NVIDIA GPU with updated drivers. AMD and Intel GPUs are not currently supported for acceleration.

⚠️ Warning

Batch generated fewer files than expected? Check that your TXT file has one narration per line with no blank lines. Empty lines are skipped, so each panel's narration must be on its own single line.

Next Steps

Video Compositor — Combine your panels and audio into a video
Edge-TTS — Alternative TTS available on all plans