This is a corpus of Dari news texts that was used to discriminate between Farsi and Dari in this paper.
A total of 14k sentences (each between 5 and 55 tokens long) were collected from over 1k news articles.
You can contact me to obtain a copy of the dataset.
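
For readers who want to apply a comparable length filter to their own text, here is a minimal sketch. It assumes whitespace tokenization and inclusive bounds, neither of which is specified above; the function name `filter_by_length` is hypothetical and is not part of the dataset or its tooling.

```python
def filter_by_length(sentences, min_tokens=5, max_tokens=55):
    """Keep sentences whose token count lies in [min_tokens, max_tokens].

    Assumption: tokens are whitespace-separated; the tokenizer actually
    used to build the corpus is not stated and may differ.
    """
    kept = []
    for sentence in sentences:
        n_tokens = len(sentence.split())  # naive whitespace tokenization
        if min_tokens <= n_tokens <= max_tokens:
            kept.append(sentence)
    return kept
```

Calling `filter_by_length(lines)` on a list of candidate sentences returns only those within the 5 to 55 token range; a different tokenizer would shift which sentences fall inside the bounds.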