English Text to Speech Synthesizer using Concatenation
Technique

Sai Sawant1, M. S. Deshpande2

Department of Electronics and Telecommunication
Engineering

Vishwakarma Institute of Technology, Pune, India

 

[email protected]

[email protected]

Abstract. A text to speech synthesis (TTS) system is used to produce
artificial human speech. Text in any language can be converted into a speech
signal using a TTS system. This paper presents a method to design a text to
speech synthesis system for the English language using MATLAB. Simple matrix
operations and the container map data structure available in MATLAB are used
to design this system. Phoneme concatenation is performed to obtain the speech
signal for the input text. Initially, some words that contain all the phonemes
of the English language are recorded. Phonemes are extracted from these
recorded words using the PRAAT tool. The extracted phonemes are matched with
the phonemes of the input text and then concatenated sequentially to
reconstruct the desired words. Implementation of this method is simple and
requires little memory.

 

Keywords: Text to speech, English, Phonetic
concatenation, MATLAB, PRAAT Tool

1              
Introduction

A text to speech system transforms linguistic information present in the form
of data or text into a speech signal. TTS acts as an interface between digital
content and a wider population, such as people with literacy difficulties,
learning disabilities or reduced vision, and those learning a language. It is
helpful for people who are looking for simple ways to access digital content.
It can be used in telecommunication, industrial and educational applications.

Synthetic speech can be formed by concatenating recorded speech units that are
stored in a database. Systems using the concatenation technique for synthesis
differ in the size of the stored speech units. Phones, diphones, syllables,
etc. can be used as speech units. A system that uses phones or diphones
provides the largest output range. For some domains, the use of entire words
or sentences gives a high quality speech signal. A synthesizer can also
incorporate a model of the vocal tract and other human voice characteristics
to create a completely synthetic voice output.

The overall work is summarized as follows: Section 2 gives a brief description
of concatenative synthesis and its subtypes. Section 3 provides the flow
diagram of the implemented TTS. Sections 4 and 5 describe the implementation
steps and the experimental results, respectively. Section 6 concludes the
discussion by summarizing the findings and outlining the future direction of
the work.

2              
Concatenative Synthesis

Concatenative synthesis is the concatenation of segments of recorded speech.
This synthesis technique is simple to implement as it does not involve any
mathematical model. Speech is produced from natural, human recordings.
Concatenation of prerecorded speech utterances produces intelligible and
natural sounding synthesized speech. Concatenation can be done using speech
units of different sizes. There are four subtypes of this synthesis method,
depending upon the speech unit size and use [4]:

 1.  Unit selection synthesis

 2.  Domain specific synthesis

 3.  Diphone synthesis

 4.  Phoneme based synthesis

The most important aspect of concatenative synthesis is the selection of the
correct unit length. With longer speech units, higher naturalness and fewer
concatenation points are achievable, but the number of required units and the
memory usage increase. For shorter units, less memory is needed, but the
sample collection and labeling procedures become difficult and complex [10].
The present system is implemented using phonemes as speech units.

2.1          
Phoneme based Synthesis

In this synthesis technique, a sequential combination of phonemes is used to
synthesize the desired continuous speech signal. For the extraction of
phonemes, different words that contain all possible phonemes of the target TTS
language need to be recorded. From these recorded word utterances, phonemes of
specified duration are extracted, which creates a database of extracted
phoneme sounds. Whenever a word is to be synthesized, the corresponding
phonemes are fetched from the database and concatenated to obtain the required
word sound. The following figure shows how phoneme based synthesis is
performed.

Fig. 1. Phoneme based Speech Synthesis

3              
Methodology

 

Fig. 2. Flow Chart showing the Methodology

4              
Implementation
of Text to Speech System

4.1          
Recording of
Words

Different English words are recorded by a single speaker using a voice
recorder application on an Android phone. The words are selected in such a way
that they cover all the phonemes present in the English language.

4.2          
Extraction of
Phonemes

The 44 phonemes of the English language are considered as speech units for
concatenation. These phonemes are taken from the source Orchestrating Success
in Reading by Dawn Reithaug (2002) [1]. The sounds pertaining to these 44
phonemes form a database from which any English word in a standard lexicon can
be created. These 44 phoneme sounds are therefore extracted from the recorded
words using the existing PRAAT tool. This tool can be used to segment recorded
words into constituents such as syllables, phonemes, etc. The TextGrid editor
of the PRAAT tool is used for segmenting the recorded sounds and labeling the
segments [3]. Hence, with the help of this tool, words are segmented and
annotated to obtain phonemes, as shown in the following examples:

Table 1. English Phonemes with Examples

Phoneme    Example Words
a          Hat, Map, Cat
ae         Train, Eight, Day
ee         Key, Sweet
oy         Toy, Coin

Fig. 3. Extraction of phoneme /k/ from the word 'Cat' using PRAAT

4.3          
Creation of
Phoneme Database

MATLAB provides a Containers package with a Map class. A Map object, which is
an instance of the MATLAB containers.Map class, is used. A Map object is a
data structure that allows values to be retrieved using a corresponding key.
Keys can be real numbers or character vectors; they provide more flexibility
for data access than array indices, which must be positive integers. Values
can be scalar or nonscalar arrays [2]. Using this data structure, the
extracted phoneme sounds are taken as the values and their labels as the keys.
Every unique label or annotation therefore corresponds to a particular phoneme
sound. This forms the key-value pairs of phonemes and their respective
annotations.
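
A minimal MATLAB sketch of how such a phoneme database can be built is given
below. The .wav file names and the choice of labels are illustrative
assumptions, not necessarily the exact ones used in this work.

% Build the phoneme database as key-value pairs: phoneme labels are the keys
% and the extracted phoneme waveforms (column vectors) are the values.
% The file names below are assumed examples of segments exported from PRAAT.
[k,  fs] = audioread('k.wav');     % phoneme /k/ extracted from 'Cat'
[oy, ~]  = audioread('oy.wav');    % phoneme /oy/ extracted from 'Toy'
[n,  ~]  = audioread('n.wav');     % phoneme /n/ extracted from a recorded word

phonemeDB = containers.Map({'k', 'oy', 'n'}, {k, oy, n});

% Any phoneme sound can now be retrieved by its label (key):
kSound = phonemeDB('k');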

4.4          
Grapheme to Phoneme Conversion

This process is used to generate a pronunciation for a word using certain
rules. The job of a grapheme to phoneme algorithm is to convert a letter
string like 'Toy' into a phone string like t oy. The position of a letter in
the given word is considered when designing the rules. The input sequence is
processed sequentially, i.e., from left to right. For each input word, a
sequence of phoneme labels is selected. Whenever a match occurs between an
input letter (or group of letters) and a phoneme label, the phonemic
representation is stored in another variable. The decision for every letter is
taken before proceeding to the next letter; this is a technique of local
classification. It avoids the need for a search algorithm, which is generally
required to find the globally optimal solution.
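
For illustration, a minimal MATLAB sketch of such a left-to-right, local
classification rule lookup is shown below. The rule tables contain only a few
assumed grapheme-to-phoneme mappings (drawn from Table 2), not the complete
rule set of the system.

function phonemes = graphemeToPhoneme(word)
    % Left-to-right, longest-match-first rule lookup (illustrative subset).
    word = lower(word);
    % Two-letter graphemes are checked before single letters.
    digraphs = containers.Map({'oi','oy','sh','ch','ee','ea','ow','ou'}, ...
                              {'oy','oy','sh','ch','ee','e', 'ow','ow'});
    singles  = containers.Map({'a','b','c','e','k','n','t'}, ...
                              {'a','b','k','e','k','n','t'});
    phonemes = {};
    i = 1;
    while i <= length(word)
        if i < length(word) && isKey(digraphs, word(i:i+1))
            phonemes{end+1} = digraphs(word(i:i+1));   % two-letter grapheme
            i = i + 2;
        elseif isKey(singles, word(i))
            phonemes{end+1} = singles(word(i));        % single-letter grapheme
            i = i + 1;
        else
            i = i + 1;                                 % unhandled letter: skip
        end
    end
end

For example, graphemeToPhoneme('coin') returns the phoneme labels k, oy, n.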

 

Table 2. Phoneme and Grapheme Representation

Phoneme    Grapheme      Example Words
/b/        b, bb         Bag, Rubber
/sh/       sh, ss, ch    Ship, Mission, Chef
/e/        e, ea         Bed, Head
/ch/       ch, tch       Chip, Match
/ow/       ow, ou        Cow, Out

 

4.5          
Concatenation

After grapheme to phoneme conversion of the input text, the phonemic
representation is compared with the keys (recorded phoneme labels) of the map
data structure. If the representation matches the stored keys, the values
(phoneme sounds) corresponding to the respective phoneme labels are fetched.
Since all these speech units (phonemes) are simply column vectors, their
elements are placed one after another and stored in another vector; this is
how concatenation is done. In this way, all the words of the input text are
played by selecting the phonemes and placing the phoneme vectors one after
another.
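
A short sketch of this step in MATLAB is given below, reusing the assumed
phonemeDB map and graphemeToPhoneme function from the earlier sketches, and
assuming that all phoneme recordings share the same sampling rate fs.

% Synthesize a word by concatenating its phoneme column vectors.
labels = graphemeToPhoneme('coin');          % e.g. {'k','oy','n'}
speech = [];                                 % output column vector
for j = 1:numel(labels)
    if isKey(phonemeDB, labels{j})
        % Vertical concatenation places one phoneme vector after another.
        speech = [speech; phonemeDB(labels{j})];
    end
end
sound(speech, fs);                           % play the synthesized word
audiowrite('coin_synthesized.wav', speech, fs);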

 

 

5              
Experimental Results

Input text: Coin

Phoneme sequence: /k/ /oy/ /n/

For the input text 'Coin', its phoneme sequence /k/ /oy/ /n/ is used to fetch
the corresponding phoneme sound files. These sound files are concatenated to
obtain the sound file for the word 'Coin'.

 

Fig. 4. Waveform of Phoneme /k/

 

Fig. 5. Waveform of Phoneme /oy/

Fig. 6. Waveform of Phoneme /n/

 

Fig. 7. Waveform of word ‘Coin’ after Concatenation

Fig. 8. Waveform of Coin utterance

 

The waveforms of the originally uttered and the concatenated word 'Coin'
(Fig. 8 and Fig. 7) are compared and show clear similarities. The concatenated
sound is close to the original sound. The degree of similarity increases with
the precision of extracting the 44 phonemes.

6              
Conclusion

In this work, an English text to speech synthesis system using phoneme based
concatenative synthesis is developed. The system is implemented using the
MATLAB map data structure and simple matrix operations. Hence, this method is
simple and efficient to implement, unlike other methods that involve complex
algorithms and techniques. As English phonemes are used as speech units, less
memory is required. In order to bring more naturalness to the speech output,
the text analysis and prosody need to be improved.

7              
References

1.  Dawn Reithaug, Orchestrating Success in Reading, 2002.

2.  MathWorks – MATLAB and Simulink for Technical Computing, www.mathworks.com

3.  Paul Boersma & David Weenink (2013), Praat: doing phonetics by computer [Computer program], http://www.praat.org/

4.  Shaila D. Apte, Speech and Audio Processing, Wiley-India, 2012.

5.  Narendra, N. P., Rao, K. S., Ghosh, K. et al., Int J Speech Technol (2011) 14: 167. https://doi.org/10.1007/s10772-011-9094-4

6.  Panda, S. P. & Nayak, A. K., Int J Speech Technol (2017) 20: 959. https://doi.org/10.1007/s10772-017-9463-8

7.  S. D. Suryawanshi, R. R. Itkarkar and D. T. Mane, "High Quality Text to Speech Synthesizer using Phonetic Integration", International Journal of Advanced Research in Electronics and Communication Engineering (IJARECE), Volume 3, Issue 2, February 2014.

8.  Bisani, M., Ney, H., "Joint-Sequence Models for Grapheme-to-Phoneme Conversion", Speech Communication (2008), doi: 10.1016/j.specom.2008.01.002

9.  Tapas Kumar Patra, Biplab Patra and Puspanjali Mohapatra, "Text to Speech Conversion with Phonematic Concatenation", International Journal of Electronics Communication and Computer Technology (IJECCT), Volume 2, Issue 5, September 2012.

10. R. Shantha Selva Kumari, R. Sangeetha, "Conversion of English text to speech (TTS) using Indian speech signal", IJSET, Vol. 4, Issue 8, pp. 447-450, August 2015.

11. S. D. Shirbahadurkar and D. S. Bormane, "Marathi Language Speech Synthesizer Using Concatenative Synthesis Strategy (Spoken in Maharashtra, India)", Second International Conference on Machine Vision, 2009.

12. Deepshikha Mahanta, Bidisha Sharma, Priyankoo Sarmah, S. R. Mahadeva Prasanna, "Text to Speech Synthesis System in Indian English", IEEE Region 10 Conference (TENCON), 2016.

13. Hari Krishnan, Sree & Thomas, Samuel & Bommepally, Kartik & Jayanthi, Karthik & Raghavan, Hemant & Murarka, Suket & Murthy, Hema & Group, Tenet, "Design and Development of a Text-To-Speech Synthesizer for Indian Languages".

14. S. D. Shirbahadurkar and D. S. Bormane, "Marathi Language Speech Synthesizer Using Concatenative Synthesis Strategy (Spoken in Maharashtra, India)", Second International Conference on Machine Vision, pp. 181-185, 2009.

15. Vinodh M. V., Ashwin Bellur, Badri Narayan K., Deepali M. Thakare, Anila Susan, Suthakar N. M., Hema A. Murthy, "Using Polysyllabic Units for Text to Speech Synthesis in Indian Languages", 2010 National Conference on Communications (NCC).

16. Anusha Joshi, Deepa Chabbi, Suman M and Suprita Kulkarni, "Text to Speech System for Kannada Language", 2015 International Conference on Communications and Signal Processing (ICCSP).
