PIXIV: 71888962 @Yuuri
A Brief Discussion on TTS
TTS (Text To Speech) is a branch of speech synthesis that converts ordinary written text into spoken audio. TTS systems make interaction feel more natural, and recent technical advances have produced a flourishing ecosystem of TTS projects.
From RVC’s real-time voice conversion and dynamic pitch optimization to the impressive VITS, generative adversarial networks (GANs) have made synthesized voices sound far more natural than before. With the powerful tonal control of SO-VITS and the high-quality datasets behind Diff-SVC, virtual singers can now produce voices that rival real ones, though this also brings a series of new challenges. My own rudimentary understanding of speech synthesis dates back to the Hatsune Miku era, yet progress since then has been rapid. There is now a clear trend toward integration among the various TTS technologies, as seen in fish-diffusion and the other projects under Fishaudio, all of which are advancing quickly.
With such a vibrant open-source landscape, commercial development is naturally not far behind, and in the commercial arena each major player has its own specialty. Acapela Group, for example, specializes in recreating the voices of deceased celebrities, a rather unique offering, while many other companies focus on high-quality, emotionally expressive TTS and custom voices. Among commercial offerings, Azure TTS stands out as a leader, and this article will use Microsoft’s free Edge TTS service as its example.
Here are some relevant project links:
Additionally, here are two Bilibili content creators who focus on music covers, which may be worth checking out:
Tōyōsetsu Ren (东洋雪莲) is also quite capable, but seems to disclose fewer technical details, so it’s merely mentioned here.
Note that a plethora of TTS engines is available online. Besides those mentioned above, you can refer to the TTS introduction page of the Tavern~~(though I’m not particularly fond of the Tavern)~~. This article uses the more stable Edge-TTS for its hands-on examples. If you need a fully local TTS solution, or want to have some fun training custom voices and special models, this tutorial is not the right fit; look instead at personalized TTS tutorials for projects such as GPT-SoVITS.
Edge TTS
A TTS engine accepts text as input and outputs audio. The Python version of Edge TTS works through network requests: the audio is synthesized in the cloud and sent back to you, so it runs smoothly even on low-powered edge devices. We will use the Edge TTS Python library; the GitHub project is here: edge-tts.
Installing Edge TTS
First, install it:
pip install edge-tts
Edge TTS can be used in two ways: as a Python module and as an interactive command-line tool. If you simply want the command-line mode, you can install it with pipx (per the official guide):
pipx install edge-tts
We will not delve into command line mode here.
Using Edge TTS
First, create a file named Edgetts.py, then import the necessary modules:
# Import required libraries
import asyncio
import edge_tts
import os
Next, initialize the TTS engine. To get the available languages and voice options, you can enter the following command in the command line:
edge-tts --list-voices
A long list of voices will be printed, which we will not reproduce here. Pick the desired NAME and continue with the initialization code:
# Text to convert to speech
TEXT = "你好啊,这里是lico!欢迎来到lico的元宇宙!"
# Set the voice and language; note that the name is case-sensitive
VOICE = "zh-CN-XiaoyiNeural"
# Set the output file path; the file will be saved in the folder where the script is located
OUTPUT_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), "test.mp3")
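As a side note, you do not have to scroll through the command-line output to pick a voice: the library also exposes the voice catalogue programmatically. The sketch below uses edge_tts.list_voices() and filters on the Locale and ShortName fields as they appear in the returned metadata (field names may differ between versions, so check the edge-tts README if this does not match yours):

import asyncio
import edge_tts

async def show_chinese_voices() -> None:
    # Fetch the full voice catalogue from the service
    voices = await edge_tts.list_voices()
    # Keep only Simplified-Chinese voices and print their short names
    for v in voices:
        if v["Locale"].startswith("zh-CN"):
            print(v["ShortName"], v["Gender"])

asyncio.run(show_chinese_voices())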
Use EdgeTTS’s function to generate speech. Remember to run it asynchronously:
# Define the main function
async def _main() -> None:
    # Create a Communicate object for converting text to speech
    communicate = edge_tts.Communicate(TEXT, VOICE)
    # Save the speech to a file
    await communicate.save(OUTPUT_FILE)
Finally, add the main function; the complete code will look like this:
# Import required libraries
import asyncio
import edge_tts
import os
# Text to convert to speech
TEXT = "你好啊,这里是lico!欢迎来到lico的元宇宙!"
# Set the voice and language
VOICE = "zh-CN-XiaoyiNeural"
# Set the output file path; the file will be saved in the folder where the script is located
OUTPUT_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), "test.mp3")
# Define the main function
async def _main() -> None:
    # Create a Communicate object for converting text to speech
    communicate = edge_tts.Communicate(TEXT, VOICE)
    # Save the speech to a file
    await communicate.save(OUTPUT_FILE)

# If this script is run directly and not imported, run the main function
if __name__ == "__main__":
    asyncio.run(_main())
After running this from the command line, you should find a test.mp3 generated next to your script. You can try playing it in a manner that avoids social death, or modify the text into something even more socially catastrophic.
Edge TTS also offers a streaming generation option, which I will not elaborate on here: the streaming applies only to the output, while the input cannot be streamed, because the entire text has to be known in advance to plan pitch and phonemes. We may come back to this issue in a future post.
# Official example of streaming generation
async def amain() -> None:
    """Main function"""
    communicate = edge_tts.Communicate(TEXT, VOICE)
    with open(OUTPUT_FILE, "wb") as file:
        async for chunk in communicate.stream():
            if chunk["type"] == "audio":
                file.write(chunk["data"])
            elif chunk["type"] == "WordBoundary":
                print(f"WordBoundary: {chunk}")
Combining LLM with TTS
Next, we will attempt to add TTS speech output to the code we used last time. However, before that, we need to modify the TTS script to turn it into a function, allowing us to call it whenever we need.
We want command-line interaction with audible playback. To ensure cross-platform compatibility, we will install a library called pygame:
pip install pygame
We can then modify the previous script to see if we can play sounds:
# Import required libraries
import asyncio
import edge_tts
import os
import pygame
# Text to convert to speech
TEXT = "你好啊,这里是lico!欢迎来到lico的元宇宙!"
# Set the voice and language
VOICE = "zh-CN-XiaoyiNeural"
# Set the output file path; the file will be saved in the folder where the script is located
OUTPUT_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), "test.mp3")
# Define the main function
async def _main() -> None:
    # Create a Communicate object for converting text to speech
    communicate = edge_tts.Communicate(TEXT, VOICE)
    # Save the speech to a file
    await communicate.save(OUTPUT_FILE)
    print(f"文件已保存到: {OUTPUT_FILE}")
    # Check if the file exists
    if not os.path.exists(OUTPUT_FILE):
        print("错误: 文件未生成")
        return
    # Check the file size
    if os.path.getsize(OUTPUT_FILE) == 0:
        print("错误: 文件大小为0")
        return
    # Initialize pygame mixer
    pygame.mixer.init()
    # Load audio file
    pygame.mixer.music.load(OUTPUT_FILE)
    # Play audio file
    pygame.mixer.music.play()
    # Wait for audio playback to finish
    while pygame.mixer.music.get_busy():
        await asyncio.sleep(1)

# If this script is run directly and not imported, run the main function
if __name__ == "__main__":
    asyncio.run(_main())
If the script runs successfully, the TTS voice will play through your default speakers. Note that pygame takes a moment to initialize the first time it is used, so the sound may only start a few seconds after you run the script.
Now that we have the audio playback function, we can combine this function with our previous one to achieve simultaneous TTS and text interaction. The overall code will look like this:
import asyncio # Import async IO module
import edge_tts # Import edge_tts module for text-to-speech conversion
import os # Import os module for file path operations
import pygame # Import pygame module for audio playback
from openai import OpenAI # Import OpenAI module for interacting with OpenAI's API
# Initialize the OpenAI chat model
chat_model = OpenAI(
    # Replace this with your backend API address
    base_url="https://api.openai.com/v1/",
    # This is the API Key for authentication
    api_key="sk-SbmHyhKJHt3378h9dn1145141919810D1Fbcd12d"
)
# Set the voice and language
VOICE = "zh-CN-XiaoyiNeural"
# Set the output file path; the file will be saved in the folder where the script is located, it will overwrite each time, only for testing purposes
OUTPUT_FILE = os.path.join(os.path.dirname(os.path.realpath(__file__)), "test.mp3")
# Set up the chat history list
chat_history = []
# Initialize pygame mixer at the start of the script to avoid delays
pygame.mixer.init()
# Define the main function to execute asynchronously
async def _main(text: str = '测试') -> None:
    # Create a Communicate object for converting text to speech
    communicate = edge_tts.Communicate(text, VOICE)
    # Save the speech to a file
    await communicate.save(OUTPUT_FILE)
    print(f"文件已保存到: {OUTPUT_FILE}")
    # Check if the file exists
    if not os.path.exists(OUTPUT_FILE):
        print("错误: 文件未生成")
        return
    # Check the file size
    if os.path.getsize(OUTPUT_FILE) == 0:
        print("错误: 文件大小为0")
        return
    # Load audio file
    pygame.mixer.music.load(OUTPUT_FILE)
    # Play audio file
    pygame.mixer.music.play()
    # Wait for audio playback to finish
    while pygame.mixer.music.get_busy():
        await asyncio.sleep(0.5)
# Define a function to get responses from the language model
def get_response_from_llm(question):
    # Print the current chat history
    print(f'Here is the history list: {chat_history}')
    # Get the latest chat history window
    chat_history_window = "\n".join([f"{role}: {content}" for role, content in chat_history[-2*4:-1]])
    # Generate chat history prompt
    chat_history_prompt = f"Here is the chat history:\n {chat_history_window}"
    # Construct the message list
    message = [
        {"role": "system", "content": "You are a catgirl! Output in Chinese."},
        {"role": "assistant", "content": chat_history_prompt},
        {"role": "user", "content": question},
    ]
    # Print the message sent to the backend
    print(f'Message sent to backend: {message}')
    # Call OpenAI's API for a response
    response = chat_model.chat.completions.create(
        model='gpt-4o-mini',
        messages=message,
        temperature=0.7,
    )
    # Get response content
    response_str = response.choices[0].message.content
    return response_str
# Main program entry
if __name__ == "__main__":
while True:
# Get user input
user_input = input("\n输入问题或者请输入'exit'退出:")
if user_input.lower() == 'exit':
print("再见")
break
# Add user input to chat history
chat_history.append(('human', user_input))
# Get response from language model
response = get_response_from_llm(user_input)
# Print response
print(response)
# Add response to chat history
chat_history.append(('ai', response))
# Asynchronously run the main function to convert response to speech and play it
asyncio.run(_main(response))
This integration includes a few optimizations:

- The pygame mixer is initialized outside the function, once at the start of the script, which improves speed.
- The function now takes a parameter `text` of type `str`, defaulting to "测试", so a backend failure that returns no text does not cause an error.
- Several `print` statements now use formatted string output for clearer command-line interaction.
However, there are some potential issues with this script (none of which need fixing for a tutorial, but they are worth mentioning):

- Generated files are not renamed and are overwritten on every run; adding a renaming step would let you go back through every utterance that was generated.
- The speech-generation function has no blocking or queueing mechanism, so if you ask the AI twice in quick succession, the next phrase may start playing before the previous one finishes, producing overlapping audio (an asynchrony issue). This can be solved with a playback buffer (a rough sketch follows below), but that is beyond a beginner tutorial and may be covered later in an advanced post.
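For reference, here is a minimal sketch of what such a buffer could look like, built on asyncio.Queue; the structure and names (speech_queue, tts_worker, speak) are my own illustration rather than part of the tutorial's code:

import asyncio
import edge_tts
import pygame

VOICE = "zh-CN-XiaoyiNeural"
pygame.mixer.init()

# Queue of sentences waiting to be spoken
speech_queue: asyncio.Queue = asyncio.Queue()

async def tts_worker() -> None:
    # Consume sentences one at a time so playback never overlaps
    index = 0
    while True:
        text = await speech_queue.get()
        filename = f"reply_{index}.mp3"  # unique name, so older files are not overwritten
        index += 1
        await edge_tts.Communicate(text, VOICE).save(filename)
        pygame.mixer.music.load(filename)
        pygame.mixer.music.play()
        while pygame.mixer.music.get_busy():
            await asyncio.sleep(0.2)
        speech_queue.task_done()

async def speak(text: str) -> None:
    # Producers just enqueue text and return immediately
    await speech_queue.put(text)

The main loop would then start tts_worker() once as a background task (for example with asyncio.create_task) and call speak() for each reply, instead of awaiting the whole generate-and-play cycle.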
Extensions: Full-Stream TTS (FSTTS) and Speech Synthesis Markup Language (SSML)
Full-stream TTS, or full-stream text-to-speech, means a TTS system that can accept text input and produce audio output simultaneously, in a relatively natural and smooth way. The difficulty lies in streaming the input: in natural language, the complete sentence heavily shapes tone and intonation, so how do we decide the tone of a sentence before it has fully arrived? I personally believe this requires close collaboration between the LLM and the TTS engine. On one side, the LLM can emit an emotion marker, based on its training and built-in emotional modeling, to set the emotional tone of the audio about to be produced; on the other side, the TTS engine must accept that marker and generate speech in real time. This demands tight cooperation between the two engines, each equipped with the corresponding capabilities.

Another approach is multimodal output, which neatly sidesteps the coordination problem between the two engines, but the process is uncontrollable and behaves as a black box. The same sentence can carry hundreds of different tones depending on context (think of the many ways "卧槽" can be used). If we hand this control entirely to a multimodal model, the result depends almost entirely on that model's capability, which imposes a ceiling and hampers future extensibility. And given that speech recognition (ASR), voiceprint recognition, and offline wake-word detection also need to interact with the LLM, it is very hard to fold so many modules into a single, sprawling multimodal model and still get optimal performance. For now, therefore, we adopt a modular structure in which the different functional modules operate independently.
I sketched a simple flowchart reflecting this system logic:
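To make the modular idea more concrete, here is a purely illustrative sketch of the pipeline in that flowchart; every function here (recognize_speech, chat_with_emotion, synthesize_with_emotion) is a hypothetical stand-in for a real ASR, LLM, or TTS module, not an actual API:

async def recognize_speech(audio_chunk: bytes) -> str:
    # ASR module: turn incoming audio into text (placeholder)
    ...

async def chat_with_emotion(question: str) -> tuple[str, str]:
    # LLM module: return the reply text plus an emotion tag such as "cheerful" (placeholder)
    ...

async def synthesize_with_emotion(text: str, emotion: str) -> bytes:
    # TTS module: generate audio whose tone follows the emotion tag (placeholder)
    ...

async def pipeline(audio_chunk: bytes) -> bytes:
    # Each stage is an independent module, so any one of them can be swapped out later
    question = await recognize_speech(audio_chunk)
    reply, emotion = await chat_with_emotion(question)
    return await synthesize_with_emotion(reply, emotion)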
Speech Synthesis Markup Language (SSML) is a concept found in Microsoft's Azure Speech. It is an XML-based markup language used to fine-tune text-to-speech output attributes such as pitch, pronunciation, speaking rate, and volume, offering far more control and flexibility than plain text input. Beyond simple emotion tags, SSML allows fine-grained control over vocal detail, which also makes it harder to generate: producing it well takes a fairly capable LLM, plus concurrency and asynchronous handling to keep the interaction fluid. (Hopefully open-source alternatives mature quickly, because this can get expensive.) Developing around SSML also requires large volumes of high-quality annotated data. In production, a database-matching approach could load similar configurations from existing cases to speed things up, reducing cost while improving efficiency.
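To give a sense of what this markup looks like, here is a small SSML fragment in Azure's dialect; treat it as an illustration only, since which styles (such as cheerful) are actually available through mstts:express-as depends on the specific voice:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="zh-CN">
  <voice name="zh-CN-XiaoyiNeural">
    <mstts:express-as style="cheerful">
      <prosody rate="+10%" pitch="+5%">你好啊，这里是lico！欢迎来到lico的元宇宙！</prosody>
    </mstts:express-as>
  </voice>
</speak>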