Vulgaris – SAILab

Have a look at the technical report here! Our paper was recently accepted at VarDial 2020, the Seventh Workshop on NLP for Similar Languages, Varieties and Dialects, co-located with COLING 2020.

Cite

@inproceedings{zugarini2020vulgaris,
  title={Vulgaris: Analysis of a Corpus for Middle-Age Varieties of Italian Language},
  author={Zugarini, Andrea and Tiezzi, Matteo and Maggini, Marco},
  booktitle={Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects},
  pages={150--159},
  year={2020}
}

The main goal of the Vulgaris project is to analyze the diachronic evolution and variance of the vulgar Italian language. To do so, we retrieved a heterogeneous corpus of literary texts from Biblioteca Italiana, a digital library project collecting the most significant texts of Italian literature, ranging from the Middle Ages to the 20th century.

The analyzed corpus contains poetry, prose, epistles and correspondence by the most important Italian authors, ranging from the dawn of the vulgar language to the Renaissance.

These texts represent a fundamental timeframe for the Italian language, including its first steps and its diachronic evolution away from Latin. Moreover, Vulgaris provides evidence of the early language fragmentation deriving from the complex historical and geo-political context of the Middle Ages.

Figure: Vulgaris’s Authors Map (© https://freevectormaps.com/italy/IT-EPS-01-0004)

Starting from the beginning of the 13th century, the vulgar Italian language became increasingly popular among authors, evolving over the following years into several families. The highly fragmented geo-political context gave rise to different schools, groups and communities, and hence to many language varieties and dialects that are still noticeable today.

Figure: Vulgaris’s Families Timeline

Corpus Structure

The corpus investigated in Vulgaris is extremely heterogeneous and comprises 4 million word occurrences, drawn from texts written by authors from a wide range of geographical regions and time periods. For each text type, we summarize the total number of word occurrences, the number of unique words, and the average number of occurrences per word.

Figure: Word occurrences for each family

The total number of words in poetry and prose is almost balanced, whereas their composition is remarkably different. Indeed, poetry has a richer lexicon than prose, containing almost twice as many unique words.
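To make the three statistics concrete, here is a minimal Python sketch of how total word occurrences, unique words and average occurrences per word could be computed for a set of texts. The function name `corpus_stats` and the tokenizer (lowercasing and keeping runs of letters) are illustrative assumptions, not necessarily the preprocessing used for Vulgaris.

```python
import re
from collections import Counter


def corpus_stats(texts):
    """Return (total occurrences, unique words, average occurrences per word)."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: lowercase the text and keep runs of Unicode letters.
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    total = sum(counts.values())    # total word occurrences
    unique = len(counts)            # number of distinct word forms
    average = total / unique if unique else 0.0
    return total, unique, average


# Toy input, not taken from the corpus statistics.
sample = ["la vita nova", "la selva oscura e la vita"]
print(corpus_stats(sample))
```

Running the same function separately on the poetry and the prose texts would give the per-type figures being compared: a similar total with a larger unique count means a lower average number of occurrences per word, i.e. a richer lexicon.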

Moreover, the following figures report the average distribution of text length in both styles (i.e., poetry on the left and prose on the right) across all the families. The bottom row of the figure shows the average number of words contained in each collection, that is, a group of texts sharing similar characteristics or themes.