Fri Sep 24, 2010
5101 Tolman, 11 AM–1 PM
Institute of Cognitive and Brain Sciences
Dan Klein (University of California, Berkeley)
Phylogenetic models for natural language
Languages descend in a roughly tree-structured evolutionary process. In historical linguistics, this process is manually analyzed by comparing and contrasting modern languages. Many questions arise: What does the tree of languages look like? What are the ancestral forms of modern words? What functional pressures shape language change? In this talk, I’ll describe our work on bringing large-scale computational methods to bear on these problems.
In the task of proto-word reconstruction, we infer ancestral words from their modern forms. I’ll present a statistical model in which each word’s history is traced down a phylogeny. Along each branch, words mutate according to regular, learned sound changes. Experiments in the Romance and Oceanic families show that accurate automated reconstruction is possible; using more languages leads to better results.
Standard reconstruction models assume that one already knows which words are cognate, i.e., descended from the same ancestral word. However, cognate detection is its own challenge. I’ll describe models that can automatically detect cognates (in similar languages) and translations (in divergent languages). Typical translation-learning approaches require virtual Rosetta stones (collections of bilingual texts). In contrast, I’ll discuss models that operate on monolingual texts alone.
Finally, I’ll present work on multilingual grammar induction, where many languages’ grammars are simultaneously induced. By assuming that grammar parameters vary slowly, again along a phylogenetic tree, we can obtain substantial increases in grammar quality across the board.