DSLCC 3.0 data

In the 2016 DSL shared task, participants were asked to train systems to discriminate between similar languages, language varieties, and dialects.

For sub-task 1, we released version 3 of the DSL corpus collection (DSLCC). The corpus contains 20,000 instances per class (18,000 for training and 2,000 for development) as well as a test set. Each instance is an excerpt from a journalistic text, paired with the country of origin of that text.
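As a quick sanity check on the per-class counts described above, the following minimal sketch tallies instances by label. It assumes each line holds one excerpt followed by a tab and its label; that layout is an assumption made for illustration and is not stated on this page, and the function name `count_per_class` and the labels in the sample are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_per_class(lines: Iterable[str]) -> Counter:
    """Count instances per label, assuming one `text<TAB>label` pair per line."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so tabs inside the excerpt do not break parsing.
        text, sep, label = line.rpartition("\t")
        if not sep:
            raise ValueError(f"no tab-separated label in line: {line!r}")
        counts[label] += 1
    return counts


# Tiny in-memory sample; the labels are hypothetical placeholders,
# not the actual class names used in the corpus.
sample = [
    "First excerpt.\tlabel-a\n",
    "Second excerpt.\tlabel-b\n",
    "Third excerpt.\tlabel-a\n",
]
print(count_per_class(sample))
```

An open file handle can be passed in place of the sample list, since the function only iterates over lines. If the assumed layout holds, the training and development data together should yield 20,000 instances for each class.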

The languages and varieties included, grouped by similarity, are:

For sub-task 1, two test sets (A and B) were released. Each contains 1,000 unlabeled instances of each language, to be classified according to country of origin.

Download The Corpus

The entire collection of training and testing data can be downloaded here.

The data for sub-task 2, Arabic Dialect Identification, is not included here. See the shared task report paper for details on how to obtain that data.