DSLCC 3.0 data

In the 2016 DSL shared task, participants were asked to train systems to discriminate between similar languages, language varieties, and dialects.

For sub-task 1, we released version 3 of the DSL corpus collection (DSLCC). The corpus contains 20,000 instances per class (18,000 for training and 2,000 for development) as well as a test set. Each instance is an excerpt extracted from journalistic texts and labeled with the country of origin of the text.
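As a rough illustration of how such instances might be read, the sketch below parses lines into (text, label) pairs and counts instances per class. The file layout is an assumption, not something stated on this page: it supposes one instance per line, with the excerpt and its label separated by a tab. The function name `parse_instances` and the labels `LABEL_A` and `LABEL_B` are hypothetical placeholders; check the downloaded files for the actual format and label names.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_instances(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse lines of the ASSUMED form '<excerpt>\\t<label>' into pairs."""
    instances = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so that tabs inside the excerpt are preserved.
        text, sep, label = line.rpartition("\t")
        if not sep:
            raise ValueError(f"no tab separator found in line: {line!r}")
        instances.append((text, label))
    return instances


# Placeholder data; the real corpus uses its own excerpts and labels.
sample = [
    "First excerpt of news text.\tLABEL_A\n",
    "Second excerpt of news text.\tLABEL_B\n",
    "Third excerpt of news text.\tLABEL_A\n",
]

instances = parse_instances(sample)
label_counts = Counter(label for _, label in instances)
print(label_counts)
```

With the real training data, a per-class count like `label_counts` is a quick way to confirm that each class has the expected number of instances.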

The languages and varieties included, grouped by similarity, are:

Bosnian, Croatian, and Serbian
Malay and Indonesian
Portuguese: Brazil and Portugal
Spanish: Argentina, Mexico, and Spain
French: France and Canada
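
The grouping above, combined with the per-class figures given earlier, implies the overall size of the training and development data. The following is a minimal sketch of that arithmetic; the dictionary keys and variable names are illustrative only and do not correspond to labels used in the corpus files.

```python
# Language groups as listed above; names are illustrative, not corpus labels.
GROUPS = {
    "Bosnian, Croatian, and Serbian": ["Bosnian", "Croatian", "Serbian"],
    "Malay and Indonesian": ["Malay", "Indonesian"],
    "Portuguese": ["Brazil", "Portugal"],
    "Spanish": ["Argentina", "Mexico", "Spain"],
    "French": ["France", "Canada"],
}

TRAIN_PER_CLASS = 18_000  # training instances per class
DEV_PER_CLASS = 2_000     # development instances per class

# Each language or variety is one class.
num_classes = sum(len(members) for members in GROUPS.values())
train_total = num_classes * TRAIN_PER_CLASS
dev_total = num_classes * DEV_PER_CLASS

print(f"classes: {num_classes}")
print(f"training instances: {train_total}")
print(f"development instances: {dev_total}")
```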

For sub-task 1, two test sets (A and B) were released. Each of them contains 1,000 unlabeled instances of each language, to be classified according to the country of origin.

Test set A (in-domain): newspaper texts.
Test set B (out-of-domain): social media data.

Download The Corpus

The entire collection of training and testing data can be downloaded here.

The data for sub-task 2, Arabic Dialect Identification, is not included here. See the shared task report paper for details on how to obtain that data.