Language Models for Text Understanding and Generation

Author: Andrea Zugarini
Date: May 2021
Topics: Language Modeling; Language Generation; Language Understanding; Information Extraction.


The ability to understand and generate language is one of the most fascinating and peculiar traits of humankind. We can discuss facts, events, stories, or the most abstract aspects of our existence with other individuals only because of the power and expressiveness of language.
Natural Language Processing (NLP) studies the intriguing properties of language, its rules, its evolution, and its connections with semantics, knowledge and generation, and tries to harness these features in automatic processes. Language is built upon a collection of symbols, and meaning arises from their composition. This symbolic nature limited the development of Machine Learning solutions for NLP for many years: most systems relied on rule-based methods, and Machine Learning models were based on handcrafted, task-specific features. In the last decade there have been remarkable advances in the study of language, thanks to the combination of Deep Learning models with Transfer Learning techniques. Deep models can learn from huge amounts of data, and they have proven particularly effective at learning feature representations automatically from high-dimensional input spaces. Transfer Learning techniques aim to reduce the need for labelled data by reusing representations learned on related tasks. In NLP, such transfer is possible thanks to Language Modeling: tasks related to Language Modeling are essential for building general-purpose representations from large unlabelled textual corpora, enabling the shift from symbolic to sub-symbolic representations of language.
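Concretely, a language model assigns a probability to a sequence of tokens. The following minimal sketch, a toy count-based bigram model with add-alpha smoothing, is purely illustrative and unrelated to the neural architectures used in the thesis; it only shows the underlying idea:

```python
import math
from collections import defaultdict

def train_bigram_lm(corpus):
    """Count unigram and bigram frequencies over whitespace-tokenized sentences."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, curr)] += 1
    return unigrams, bigrams

def sentence_logprob(sentence, unigrams, bigrams, vocab_size, alpha=1.0):
    """Log-probability of a sentence under the bigram model (add-alpha smoothing)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for prev, curr in zip(tokens, tokens[1:]):
        num = bigrams[(prev, curr)] + alpha
        den = unigrams[prev] + alpha * vocab_size
        logp += math.log(num / den)
    return logp

corpus = ["the cat sat", "the dog sat", "the cat ran"]
uni, bi = train_bigram_lm(corpus)
vocab_size = len(set(w for s in corpus for w in s.split())) + 2  # words plus <s>, </s>

# A fluent word order scores higher (less negative) than a scrambled one.
print(sentence_logprob("the cat sat", uni, bi, vocab_size))
print(sentence_logprob("sat cat the", uni, bi, vocab_size))
```

Modern approaches replace the counts with neural networks, but the objective, predicting tokens from their context, is what makes unlabelled text usable for pre-training.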

In this thesis, motivated by the need to move toward unified NLP agents capable of understanding and generating language in a human-like fashion, we face several NLP challenges and propose solutions based on Language Modeling. In summary, we develop a character-aware neural language model to learn general-purpose word and context representations, and we use its encoder to face several language understanding problems, including an agent for the extraction of entities and relations from an online stream of text. We then focus on Language Generation, addressing two different problems: Paraphrasing and Poem Generation. In the former, generation is tied to the information in input, whereas in the latter the production of text requires creativity. In addition, we show how language models can aid the analysis of language varieties, proposing a new perplexity-based indicator to measure distances between diachronic or dialectal varieties of a language.
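The intuition behind a perplexity-based indicator can be sketched as follows: train a simple model on one language variety and measure how "surprised" it is by text from another. The snippet below is only an illustration under simplifying assumptions (a smoothed character bigram model and made-up text fragments), not the actual indicator proposed in the thesis:

```python
import math
from collections import defaultdict

def char_bigram_model(text, alpha=0.5):
    """Train an add-alpha smoothed character bigram model on a text sample."""
    counts = defaultdict(int)
    context = defaultdict(int)
    for a, b in zip(text, text[1:]):
        counts[(a, b)] += 1
        context[a] += 1
    vocab_size = len(set(text))
    def prob(a, b):
        return (counts[(a, b)] + alpha) / (context[a] + alpha * vocab_size)
    return prob

def perplexity(text, prob):
    """Perplexity of a text under a character bigram model (lower = less surprising)."""
    logp = sum(math.log(prob(a, b)) for a, b in zip(text, text[1:]))
    return math.exp(-logp / max(1, len(text) - 1))

# Hypothetical fragments: a training sample, a related variety, an unrelated string.
reference = "la lingua italiana e bella e antica"
similar = "la lingua e antica e bella"
different = "qqxx zz kwy jj vv ww"

prob = char_bigram_model(reference)

# Text from a closer variety yields lower perplexity than unrelated text.
print(perplexity(similar, prob))
print(perplexity(different, prob))
```

Comparing such perplexities across pairs of corpora gives a rough, asymmetric notion of distance between varieties; the thesis develops this idea into a proper indicator.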

Research Contributions

Character-aware Representations

Based on “An Unsupervised Character-Aware Neural Approach to Word and Context Representation Learning”, International Conference on Artificial Neural Networks (ICANN 2018). Springer, Cham, 2018

Information Extraction in Text Streams

Based on “Learning in Text Streams: Discovery and Disambiguation of Entity and Relation Instances”, IEEE Transactions on Neural Networks and Learning Systems 31.11 (2019): 4475-4486

Natural Language Generation

Based on “Neural Poetry: Learning to Generate Poems Using Syllables”, International Conference on Artificial Neural Networks (ICANN 2019). Springer, Cham, 2019

Based on “Neural Paraphrasing by Automatically Crawled and Aligned Sentence Pairs”, Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS). IEEE, 2019

Language Varieties

Based on “Vulgaris: Analysis of a Corpus for Middle-Age Varieties of Italian Language”, Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects (COLING 2020)

Download: PDF – Andrea Zugarini – PhD Thesis – Language Models for Text Understanding and Generation

Category: PhD Theses, Publications