Dari Text Corpus

This corpus of Dari news texts was used to discriminate between Farsi and Dari in this paper.

A total of 14k sentences (each between 5 and 55 tokens long) were collected from over 1k news articles.
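
The length constraint above could be applied with a filter like the one below. This is only a minimal sketch, not the script used to build the corpus: it assumes whitespace tokenization and inclusive bounds, and the function name `filter_by_length` is hypothetical.

```python
def filter_by_length(sentences, min_tokens=5, max_tokens=55):
    """Keep sentences whose token count lies in [min_tokens, max_tokens].

    Assumptions (not stated in the corpus description): tokens are
    separated by whitespace, and both bounds are inclusive.
    """
    kept = []
    for sentence in sentences:
        n_tokens = len(sentence.split())  # whitespace tokenization
        if min_tokens <= n_tokens <= max_tokens:
            kept.append(sentence.strip())
    return kept


if __name__ == "__main__":
    # Toy input with placeholder tokens, not real corpus data.
    samples = ["w " * 3, "w " * 5, "w " * 55, "w " * 56]
    print(len(filter_by_length(samples)))
```

A different tokenizer may count tokens differently, so a filter like this will not necessarily reproduce the exact sentence set in the corpus.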

You can contact me to obtain a copy of the dataset.