Sorani Kurdish Corpus

This corpus of Sorani Kurdish news texts was used to analyze regional and subdialectal differences, as described in this paper.

A total of 200,000 sentences (each between 5 and 55 tokens long) were collected in text format from articles published by online news sources.
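The length filter can be sketched roughly as follows. This is only an illustration, not the script used to build the corpus: it assumes whitespace tokenization and inclusive bounds, and the names `token_count` and `filter_by_length` are hypothetical.

```python
from typing import Iterable, List


def token_count(sentence: str) -> int:
    """Count tokens by splitting on whitespace (an assumption; the
    corpus may have been tokenized differently)."""
    return len(sentence.split())


def filter_by_length(
    sentences: Iterable[str],
    min_tokens: int = 5,
    max_tokens: int = 55,
) -> List[str]:
    """Keep sentences whose token count lies in [min_tokens, max_tokens].

    The bounds are treated as inclusive, which is also an assumption.
    """
    return [
        s for s in sentences
        if min_tokens <= token_count(s) <= max_tokens
    ]
```

Applied to a list of candidate sentences, `filter_by_length(candidates)` would return only those that fall within the stated range, in their original order.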

You can contact me to obtain a copy of the Sorani corpus.