The Web as a Training Set


Last Thursday the friends of the Data Science Society had the great pleasure of listening to a talk by Preslav Nakov, hosted for the first time at Vivacom Art Hall. Preslav Nakov is a Senior Scientist at the Qatar Computing Research Institute (QCRI). He received his Ph.D. in Computer Science from the University of California, Berkeley in 2007. Before joining QCRI, Preslav was a Research Fellow at the National University of Singapore. He has also spent a few months at the Bulgarian Academy of Sciences and Sofia University, where he was an honorary lecturer. Preslav’s research interests include lexical semantics, machine translation, the Web as a corpus, and biomedical text processing, and he covered the first three topics in his presentation.

Preslav introduced the field of computational linguistics (or NLP) by sharing the big dream of the people behind it – to make computers capable of understanding human language, just like the HAL 9000 computer in the movie 2001: A Space Odyssey. After going over the different services that use NLP, such as Google Translate and Siri, and the history of NLP itself starting from the 1950s, he emphasized the role of big datasets in training algorithms (the so-called statistical revolution of the 1990s). It turns out that getting more data improves the accuracy of algorithms far more than fine-tuning them does.

The largest available corpus is the Web itself, providing access to quadrillions of words. Preslav pointed out that research based on the Web has so far been restricted mostly to using page hit counts as an estimate for n-gram word frequencies, and that this has led some researchers to conclude that the Web should only be used as a baseline (in the words of the late Adam Kilgarriff, “Googleology is bad science”). Preslav is an advocate for “smart” usage of the Web that goes beyond the n-gram, which he demonstrated by focusing on the syntax and semantics of English noun compounds. Noun compounds are sequences of nouns that function as a single noun, such as “healthcare reform”. He demonstrated several ways of determining their internal syntactic structure (e.g., “[plastic water] bottle” vs. “plastic [water bottle]”).
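To make the bracketing idea concrete, here is a minimal sketch of one classic approach (the adjacency model): for a three-noun compound w1 w2 w3, compare how strongly w1 associates with w2 versus how strongly w2 associates with w3, using corpus bigram counts. The counts below are made-up illustrative numbers, not real Web frequencies, and this is only one of the methods Preslav discussed.

```python
# Hypothetical web-derived bigram frequencies (illustrative only).
BIGRAM_COUNTS = {
    ("plastic", "water"): 1_200,
    ("water", "bottle"): 98_000,
}

def bracket(w1, w2, w3, counts):
    """Bracket a three-noun compound using the adjacency model."""
    left = counts.get((w1, w2), 0)   # evidence for [w1 w2] w3
    right = counts.get((w2, w3), 0)  # evidence for w1 [w2 w3]
    if left > right:
        return f"[{w1} {w2}] {w3}"
    return f"{w1} [{w2} {w3}]"

print(bracket("plastic", "water", "bottle", BIGRAM_COUNTS))
# "water bottle" is far more frequent than "plastic water",
# so we get the right-bracketing: plastic [water bottle]
```

In practice the raw counts are replaced by association measures such as chi-squared, but the comparison logic stays the same.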

Next, Preslav presented a simple unsupervised method for determining the semantic relationship between the nouns in a noun compound – using multiple paraphrasing verbs. For example, a “malaria mosquito” is a “mosquito that carries/spreads/causes/transmits/brings/infects with/… malaria”. The verbs are extracted from Google search results. Our lecturer then showed various applications of noun compounds, the most important being machine translation: modern phrase-based machine translation systems benefit considerably from such paraphrasing techniques.
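The verb-extraction step can be sketched in a few lines: collect snippets of the form “<head> that <verb> <modifier>” and tally the linking verbs. The snippets below are invented for illustration; in the actual method they come from Web search results, and the patterns handle far more variation than this toy regex.

```python
import re
from collections import Counter

# Made-up snippets of the kind a web search might return.
snippets = [
    "mosquito that carries malaria",
    "mosquito that spreads malaria",
    "mosquito that carries malaria",
    "mosquito that transmits malaria",
]

def paraphrase_verbs(snippets, head, modifier):
    """Count verbs linking head and modifier in '<head> that <verb> <modifier>' snippets."""
    pattern = re.compile(rf"{head} that (\w+) {modifier}")
    verbs = Counter()
    for s in snippets:
        m = pattern.search(s)
        if m:
            verbs[m.group(1)] += 1
    return verbs

print(paraphrase_verbs(snippets, "mosquito", "malaria").most_common())
# e.g. [('carries', 2), ('spreads', 1), ('transmits', 1)]
```

The resulting verb distribution serves as a compact signature of the semantic relation between the two nouns.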

Preslav then discussed a related topic – different sources of data, such as the Google N-grams, the Google Books N-grams, and the Microsoft Web N-gram Service. Even though they have their shortcomings, they are superior to simply querying a search engine. Yet text is not the only useful data source out there – images can help too. If searching for a word in English and in Spanish returns similar images, then the word likely has a similar meaning in both languages.
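The image-comparison idea can be sketched as follows: represent the image results for a query in each language as an averaged feature vector and compare the vectors with cosine similarity. The feature numbers below are entirely hypothetical placeholders for real image descriptors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up averaged image features for the English and Spanish queries.
features_en_dog = [0.9, 0.1, 0.4]
features_es_perro = [0.85, 0.15, 0.5]

similarity = cosine(features_en_dog, features_es_perro)
print(f"{similarity:.3f}")  # a value close to 1.0 suggests a shared meaning
```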

Afterwards, our presenter shared his thoughts about the future. According to him, the next revolution in NLP should come from semantics, and there are already signs of this – the number of scientific papers in the field has recently exploded. Using syntax and semantics may be a big step beyond the current phrase-based machine translation. Another big improvement might come from deep neural networks.

Finally, Preslav involved the audience by challenging them to recognize whether a short text was written by a computer or by a human. Our audience guessed the author of 6 out of 8 texts correctly, beating even Preslav’s own score of 4 out of 8. You can take the test here. The point of this demonstration is that computers are already capable of producing text that can be hard to distinguish from text written by people. Preslav also touched on a topic that has recently become quite popular again – AI as a threat to humanity – citing the opinions of Stephen Hawking, Bill Gates and Elon Musk on the subject.

Take a look at part1, part2 and part3 of Preslav’s presentation if you want to learn more. A complete video from the event is also available here.

The lecture was followed by networking drinks next to the National Theater. Don’t miss our great upcoming events and projects – stay tuned by visiting our website, following our Facebook page and LinkedIn page, or following us on Twitter.

Author: Vladimir Labov
