Brain-to-Speech: Speech Synthesis from Non-invasive Brain Signals



Abstract

Most communication studies on non-invasive brain signals with a brain–computer interface (BCI) have applied a stimulus-driven approach (e.g., ERP or SSVEP spellers). Speech synthesis from an invasive brain signal (i.e., electrocorticography, ECoG) was the first technology to translate neural activity into speech by decoding spoken sentences. The fundamental objective of the present study is to investigate sentence-level speech synthesis from non-invasive brain signals, here electroencephalography (EEG), toward more intuitive BCI communication systems. To this end, we present a brain-to-speech (BTS) framework that generates speech from the EEG signals of spoken sentences. The proposed BTS framework consists of brain signal processing (e.g., recording and artifact removal), frame-level linguistic-conditional feed-forward Transformer networks that generate mel-spectrograms from EEG spectrograms, and a vocoder that generates a high-quality waveform from the network-predicted mel-spectrograms. To predict the frame-level linguistic information, we extract attention alignments from an encoder-attention-decoder-based autoregressive teacher text-to-speech model and use them for frame-level target character mapping. In this manner, we demonstrate non-invasive, sentence-level BTS synthesis for an intuitive BCI communication system, and our study thereby addresses the challenging question of how to perform sentence-level speech synthesis from highly noisy non-invasive brain signals. The results indicate that the proposed BTS framework could ultimately benefit the development of intuitive BCI communication applications for severely paralyzed patients with neurological disorders or motor disabilities, including amyotrophic lateral sclerosis and spinal cord injury.
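
As an illustration of the brain-signal-processing stage described above, the following minimal sketch band-pass and notch-filters a multichannel EEG recording and frames it into a log-magnitude spectrogram, the input representation of the BTS generator. The sampling rate, filter bands, notch frequency, and window/hop sizes are illustrative assumptions, not the exact settings used in the BTS framework.

import numpy as np
from scipy.signal import butter, filtfilt, iirnotch, stft

def preprocess_eeg(eeg, fs=1000.0):
    """Band-pass and notch-filter raw EEG (channels x samples)."""
    # Band-pass 0.5-125 Hz (hypothetical range covering high-gamma activity).
    b, a = butter(4, [0.5, 125.0], btype="bandpass", fs=fs)
    eeg = filtfilt(b, a, eeg, axis=-1)
    # Notch out 60 Hz power-line interference (speech-related muscle artifacts
    # would additionally be handled by a dedicated artifact-removal step).
    bn, an = iirnotch(60.0, Q=30.0, fs=fs)
    return filtfilt(bn, an, eeg, axis=-1)

def eeg_spectrogram(eeg, fs=1000.0, win_ms=50.0, hop_ms=12.5):
    """Frame filtered EEG into a log-magnitude spectrogram per channel."""
    nperseg = int(fs * win_ms / 1000.0)
    hop = int(fs * hop_ms / 1000.0)
    _, _, Z = stft(eeg, fs=fs, nperseg=nperseg, noverlap=nperseg - hop, axis=-1)
    return np.log1p(np.abs(Z))  # (channels, freq_bins, frames)

if __name__ == "__main__":
    raw = np.random.randn(64, 5000)                 # 64 channels, 5 s of toy EEG
    spec = eeg_spectrogram(preprocess_eeg(raw))
    print(spec.shape)                               # (64, freq_bins, n_frames)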

Contents


Section 1: Video - Experimental Paradigm


Section 2: Audio Demo

Spoken speech

Note : All samples were synthesized from unseen brain signals


1. Korean : "여기에서 가까운 곳에 서점이 있나요?"

Translation : "Is there a bookstore near here?"

Pronunciation : "yeogieseo gakkaun gos-e seojeom-i issnayo?"

Raw Audio / Mel + MelGAN

Tacotron2 (TTS) / FastSpeech (TTS)

BTS (with speech-related artifacts)

Red color : Wrong pronunciation
Blue color : Correct pronunciation

Pronunciation : "yeogieseo gakkaun gos-e seojeom-i issnayo?"


BTS (with artifact removal)

Pronunciation : "yeogieseo gakkaun gos-e seojeom-i issnayo?"

2. Korean : "그는 아주 부지런한 사람이다."

Translation : "He is a very diligent person."

Pronunciation : "geuneun aju bujileonhan salam-ida."

Raw Audio / Mel + MelGAN

Tacotron2 (TTS) / FastSpeech (TTS)

BTS (with speech-related artifacts)

Pronunciation : "geuneun aju bujileonhan salam-ida."


BTS (with artifact removal)

Pronunciation : "geuneun aju bujileonhan salam-ida."

3. Korean : "대규모 구조조정이 불가피합니다."

Translation : "Massive restructuring is inevitable."

Pronunciation : "daegyumo gujojojeong-i bulgapihabnida."

Raw Audio / Mel + MelGAN

Tacotron2 (TTS) / FastSpeech (TTS)

BTS (with speech-related artifacts)

Pronunciation : "daegyumo gujojojeong-i bulgapihabnida."


BTS (with artifact removal)

Pronunciation : "daegyumo gujojojeong-i bulgapihabnida."


4. Korean : "그 둘 중의 하나가 거짓말을 하고 있는 게 분명하다."

Translation : "It is clear that one of the two is lying."

Pronunciation : "geu dul jung-ui hanaga geojismal-eul hago issneun ge bunmyeonghada."

Raw Audio / Mel + MelGAN

Tacotron2 (TTS) / FastSpeech (TTS)

BTS (with speech-related artifacts)

Pronunciation : "geu dul jung-ui hanaga geojismal-eul hago issneun ge bunmyeonghada."


BTS (with artifact removal)

Pronunciation : "geu dul jung-ui hanaga geojismal-eul hago issneun ge bunmyeonghada."


5. Korean : "이 문장은 조금 어색해요."

Translation : "This sentence is a bit awkward."

Pronunciation : "i munjang-eun jogeum eosaeghaeyo."

Raw Audio / Mel + MelGAN

Tacotron2 (TTS) / FastSpeech (TTS)

BTS (with speech-related artifacts)

Pronunciation : "i munjang-eun jogeum eosaeghaeyo."


BTS (with artifact removal)

Pronunciation : "i munjang-eun jogeum eosaeghaeyo."

Video: Result - Spoken speech


Mimed speech

Note : All samples were synthesized from unseen brain signals, and the generated sentences were not used during training


1. Korean : "범인은 이십대 중반에서 후반의 남성으로 보입니다."

Translation : "The criminal seems to be a man in his mid twenties to late twenties."

Pronunciation : "beom-in-eun isibdae jungban-eseo huban-ui namseong-eulo boibnida."

Spoken Speech
(Raw audio)
Mimed Speech
(Raw audio)

Spoken Speech
(BTS with artifact removal)

Pronunciation : "beom-in-eun isibdae jungban-eseo huban-ui namseong-eulo boibnida."


Mimed Speech
(BTS with artifact removal)

Pronunciation : "beom-in-eun i sibdae jung ban-eseo huban-ui namseong-eulo boibnida."


Section 3: Model

Brain-to-speech architecture
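
A minimal PyTorch sketch of the generator stage in the architecture above: a feed-forward Transformer that consumes frame-level EEG-spectrogram features together with frame-level character (linguistic) embeddings and predicts one 80-bin mel-spectrogram frame per input frame; a neural vocoder such as MelGAN then converts the predicted mel-spectrogram into a waveform. The layer sizes, the use of torch.nn.TransformerEncoder, and the simple linear projections are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class ConditionalFFTransformer(nn.Module):
    """Frame-level linguistic-conditional feed-forward Transformer (sketch)."""

    def __init__(self, eeg_dim, n_chars, d_model=256, n_layers=4,
                 n_heads=2, n_mels=80):
        super().__init__()
        self.eeg_proj = nn.Linear(eeg_dim, d_model)        # EEG-frame features
        self.char_emb = nn.Embedding(n_chars, d_model)     # frame-level characters
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=1024,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mel_out = nn.Linear(d_model, n_mels)

    def forward(self, eeg_frames, char_ids):
        # eeg_frames: (batch, frames, eeg_dim); char_ids: (batch, frames)
        x = self.eeg_proj(eeg_frames) + self.char_emb(char_ids)
        x = self.encoder(x)
        return self.mel_out(x)                              # (batch, frames, n_mels)

if __name__ == "__main__":
    # Hypothetical sizes: 64 channels x 26 frequency bins per EEG frame,
    # a 70-symbol character inventory, 200 frames per utterance.
    model = ConditionalFFTransformer(eeg_dim=64 * 26, n_chars=70)
    eeg = torch.randn(2, 200, 64 * 26)
    chars = torch.randint(0, 70, (2, 200))
    print(model(eeg, chars).shape)                          # torch.Size([2, 200, 80])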


Architecture details
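
The frame-level character condition used by the generator comes from the attention alignment of an autoregressive teacher TTS model, as noted in the abstract. The sketch below shows one common way (a FastSpeech-style argmax over the teacher's frame-by-character alignment matrix) to assign each mel frame the character it attends to most; the argmax rule and the array shapes are illustrative assumptions rather than the paper's exact mapping procedure.

import numpy as np

def frame_level_characters(attention, char_ids):
    """Map each mel frame to the teacher's most-attended input character.

    attention: (n_frames, n_chars) alignment from an encoder-attention-decoder
               teacher TTS model (e.g., Tacotron 2), rows normalized over chars.
    char_ids:  (n_chars,) integer ids of the input character sequence.
    """
    dominant = attention.argmax(axis=1)      # index of the most-attended character
    return char_ids[dominant]                # (n_frames,) frame-level character ids

if __name__ == "__main__":
    n_frames, n_chars = 200, 24
    attn = np.random.rand(n_frames, n_chars)                  # toy alignment
    attn /= attn.sum(axis=1, keepdims=True)
    chars = np.arange(10, 10 + n_chars)                       # toy character ids
    print(frame_level_characters(attn, chars).shape)          # (200,)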


Section 4: Code

Code for training and inference will be released after the paper is accepted.

Section 5: Data and materials availability

EEG and audio data used to create Figures 3 and 4 will be made available in a public repository when the paper is accepted. Additional materials are available from the authors upon reasonable request.