IT-Forum: Statistical Language Modeling in the Era of Abundant Data

Friday, January 9, 2015 - 1:00pm to 2:00pm
Packard 202
Ciprian Chelba (Google)
Abstract / Description: 

The talk presents an overview of statistical language modeling as applied to real-world problems: speech recognition, machine translation, spelling correction, and soft keyboards, to name a few prominent ones. We summarize the most successful estimation techniques and examine how they fare for applications with abundant data, e.g. voice search. We conclude by highlighting a few open problems: getting an accurate estimate for the entropy of text produced by a very specific source (e.g. a query stream); optimally leveraging data that is of different degrees of relevance to a given "domain"; and determining whether a bound on the size of a "good" model for a given source exists.

Ciprian Chelba is a Research Scientist with Google. Previously he worked as a Researcher in the Speech Technology Group at Microsoft Research. His research interests are in statistical modeling of natural language and speech. Recent projects include: Google Audio Indexing; indexing, ranking and snippeting of speech content; Language Modeling for Google Search by Voice, and Android IME predictive keyboard.