Richard Sproat
Thu 05 Mar 2015, 11:00 - 12:30
Informatics Forum (IF-4.31/4.33)

If you have a question about this talk, please contact: Nicola Drago-Ferrante (ndferran)

Abstract:

In this talk I will give an overview of the current state of the technology for Text-to-Speech (TTS) text normalization at Google, as well as current and future research.  I start with a quick overview of the Kestrel text normalization system. At the core of Kestrel are text-normalization grammars that are compiled into libraries of weighted finite-state transducers (WFSTs). While the use of WFSTs for text normalization is not new, Kestrel differs from previous systems in its separation of the initial tokenization and classification phase of analysis from verbalization, with communication between the two being mediated by "semiotic classes".
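The two-phase design described above can be illustrated with a minimal sketch in pure Python (the class names, digit-by-digit verbalization, and token set are illustrative assumptions, not Kestrel's actual grammars, which are compiled WFSTs):

```python
# Sketch of Kestrel-style two-phase normalization: tokens are first
# classified into semiotic classes, and verbalization is handled
# separately per class. Hypothetical classes and rules for illustration.
import re

def classify(token):
    """Phase 1: tokenize/classify -- assign each token a semiotic class."""
    if re.fullmatch(r"\d+", token):
        return ("CARDINAL", token)
    if re.fullmatch(r"\$\d+", token):
        return ("MONEY", token[1:])
    return ("PLAIN", token)

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def verbalize(semiotic_class, value):
    """Phase 2: verbalization -- map (class, value) pairs to spoken words."""
    if semiotic_class == "CARDINAL":
        # Digit-by-digit reading, for brevity of the sketch.
        return " ".join(ONES[int(d)] for d in value)
    if semiotic_class == "MONEY":
        return verbalize("CARDINAL", value) + " dollars"
    return value

def normalize(text):
    return " ".join(verbalize(*classify(tok)) for tok in text.split())

print(normalize("pay $5 now"))  # -> "pay five dollars now"
```

The point of the separation is that the classifier and the verbalizer communicate only through the (class, value) interface, so either phase can be changed independently.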

I then discuss the use of Maximum Entropy Rankers for prediction in cases where there is no predefined set of classes (so it is not naturally a classification problem), but where global features are desirable (so it is less naturally a sequence prediction problem). I give as an example the verbalization of OOV letter sequences.

One problem in TTS text normalization arises when the system is "too clever by half" and predicts the wrong normalization. This is particularly an issue with abbreviation expansion, where it is usually better to leave an abbreviation unexpanded if one is not sure of the expansion. I present a "do no harm", high-precision approach yielding few expansion errors at the cost of leaving relatively many abbreviations unexpanded. This includes methods for training classifiers to establish whether a particular expansion is apt. When combined with the baseline text normalization component of the TTS system, the approach achieves a large increase in correct abbreviation expansions, together with a substantial reduction in incorrect ones.
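The "do no harm" idea can be sketched as a simple confidence-thresholded decision rule (the candidate expansions, scores, and threshold below are invented for illustration; the actual system trains classifiers to produce such scores):

```python
# High-precision abbreviation expansion sketch: expand only when the
# top-scoring candidate clears a precision-tuned threshold; otherwise
# leave the abbreviation as is. Scores here are hypothetical.
CANDIDATES = {
    "St.": [("Street", 0.55), ("Saint", 0.40)],  # genuinely ambiguous
    "Dr.": [("Doctor", 0.92), ("Drive", 0.05)],
}

# A high threshold trades unexpanded abbreviations for few wrong expansions.
THRESHOLD = 0.9

def expand(token):
    best, score = max(CANDIDATES.get(token, [(token, 1.0)]),
                      key=lambda pair: pair[1])
    return best if score >= THRESHOLD else token

print(expand("Dr."))  # -> "Doctor"
print(expand("St."))  # -> "St."  (left unexpanded: no candidate is confident)
```

Leaving "St." alone is the intended behavior: an unexpanded abbreviation is merely unhelpful, while a wrong expansion ("Saint" for "Street") is an audible error.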

Finally, I present some of our ongoing work on minimally supervised text normalization, which we hope will be useful in languages where we cannot afford to expend the resources to build detailed hand-built grammars or large annotated databases.


Bio:

From January 2009 through October 2012, Richard Sproat was a professor at the Center for Spoken Language Understanding at the Oregon Health and Science University.

Prior to going to OHSU, he was a professor in the departments of Linguistics and Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. He was also a full-time faculty member at the Beckman Institute, and still holds adjunct positions in Linguistics and ECE at UIUC.

Before joining the faculty at UIUC, Richard worked in the Information Systems and Analysis Research Department headed by Ken Church at AT&T Labs --- Research, focusing on speech and text data mining: extracting potentially useful information from large speech or text databases using a combination of speech/NLP technology and data mining techniques.

Before joining Ken's department Richard worked in the Human/Computer Interaction Research Department headed by Candy Kamm. His most recent project in that department was WordsEye, an automatic text-to-scene conversion system. The WordsEye technology is now being developed at Semantic Light, LLC. WordsEye is particularly good for creating surrealistic images that Richard can easily conceive of but are well beyond his artistic ability to execute.

More info --- and many more publications --- can be found on Richard's external website.