23.07 – The Web as a Training Set

Dr. Preslav Nakov

*** Due to last-minute changes in Preslav’s schedule, we had to move the meetup to July 23. Thanks, everyone, for your understanding!

Online streaming

We are very happy to announce our next meetup on July 23, when Preslav Nakov will join us to share some of his experience. Preslav’s time in Bulgaria is usually very limited, so this meetup will be a unique opportunity to catch up on some cutting-edge work in the field of lexical semantics.

Topic: The Web as an Implicit Training Set: Application to Noun Compounds Syntax and Semantics
Speaker: Preslav Nakov – QCRI, Berkeley

Bio: Preslav Nakov is a Senior Scientist at the Qatar Computing Research Institute (QCRI). He received his Ph.D. in Computer Science from the University of California at Berkeley in 2007 (supported by a Fulbright grant and a UC Berkeley fellowship). Before joining QCRI, Preslav was a Research Fellow at the National University of Singapore. He has also spent a few months at the Bulgarian Academy of Sciences and at Sofia University, where he was an honorary lecturer. Preslav’s research interests include lexical semantics (in particular, multi-word expressions, noun compound syntax and semantics, and semantic relation extraction), machine translation, the Web as a corpus, and biomedical text processing.

Time: Thursday, July 23, 19:00
Registration starts at 18:30
Place: Vivacom Art Hall


The 60-year-old dream of computational linguistics is to make computers capable of communicating with humans in natural language. This has proven hard, and thus research has focused on sub-problems. Even so, the field was stuck with manual rules until the early 90s, when computers became powerful enough to enable the rise of statistical approaches. Eventually, this shifted the main research attention to machine learning from text corpora, thus triggering a revolution in the field.

Today, the Web is the biggest available corpus, providing access to quadrillions of words; and, in corpus-based natural language processing, size does matter. Unfortunately, while there has been substantial research on the Web as a corpus, it has typically been restricted to using page hit counts as an estimate for n-gram word frequencies; this has led some researchers to conclude that the Web should only be used as a baseline.

In this talk, we will reveal some of the hidden potential of the Web that lies beyond the n-gram, with a focus on the syntax and semantics of English noun compounds. First, we will present a highly accurate, lightly supervised approach based on surface markers and linguistically motivated paraphrases that yields state-of-the-art results for noun compound bracketing: e.g., “[[liver cell] antibody]” is left-bracketed, while “[liver [cell line]]” is right-bracketed. Second, we will present a simple unsupervised method for mining implicit predicates that can characterize the semantic relations holding between the nouns in noun compounds, e.g., “malaria mosquito” is a “mosquito that carries/spreads/causes/transmits/brings/infects with/… malaria”. Finally, we will show how these ideas can be used to improve statistical machine translation.

To attend this event you need to register and take your FREE ticket from Eventbrite.
