Resources and Downloads

These are assorted research resources developed as part of projects that I have worked on.

Jinan Chinese Learner Corpus
DSLCC 3.0 Dataset
MQ Names Corpus
Medical Word Clusters
Chinese Function Word List
Norwegian Function Word List
Dari Text Corpus
Sorani Text Corpus

Jinan Chinese Learner Corpus

The Jinan Chinese Learner Corpus (JCLC) is a 6-million-token collection of texts written by L2 learners of Chinese. Learn more about the learner corpus.

DSLCC 3.0 Dataset

The DSLCC 3.0 dataset contains journalistic text from closely related languages and language varieties, and was used in the 2016 DSL Shared Task. Click here for more details and download information.

MQ Names Corpus

The MQ Names Corpus contains over 13k names from 5 cultural groups, which have been annotated for gender. Details and access info.

Medical Word Clusters

A set of Brown clusters induced from 100 million tokens of medical research abstracts. Click here to view the clusters.

They were used to extract classification features in this paper.

Chinese Function Word List

As part of the research on Chinese Native Language Identification, we compiled a list of 449 Chinese function words to be used as features in our model. The function word list was compiled from Chinese language teaching resources. The complete list can be accessed here.

Norwegian Function Word List

As part of the work on Norwegian Native Language Identification, we used a list of 176 function words obtained from the distribution of the Apache Lucene search engine software. This list includes stop words for the Bokmål variant of the language and contains entries such as hvis (whose), ikke (not) and jeg (I). The list is available here.
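For readers who want to try a function word list as classification features, the following is a minimal sketch of one common approach: the relative frequency of each listed word in a text. It is not necessarily the feature extraction used in the work described here. The three-word sample list, the function name `function_word_features` and the regex tokenizer are all assumptions made for illustration; the full list has 176 entries.

```python
import re
from collections import Counter

# Hypothetical three-word sample taken from the entries mentioned above;
# the full list contains 176 function words.
FUNCTION_WORDS = ["hvis", "ikke", "jeg"]


def function_word_features(text, function_words=FUNCTION_WORDS):
    """Return the relative frequency of each function word in `text`.

    Tokenization is a simple lowercase word-character match, which is an
    assumption of this sketch and not a statement about the original setup.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in function_words]


# One feature per function word, in the same order as FUNCTION_WORDS.
features = function_word_features("Jeg vet ikke hvis jeg kommer")
```

The same idea would need a word segmenter in place of the regex for Chinese text, since Chinese is not written with spaces between words.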

Dari Text Corpus

This is a corpus of Dari news texts that was used to discriminate between Farsi and Dari. Learn more about the Dari corpus or click here to read the paper.

Sorani Text Corpus

This is a corpus of Sorani Kurdish sentences that was used to discriminate between two subdialects. Learn more about the Sorani corpus or click here to read the paper.