Sharon Goldwater, 3/11/08

README for corpora used in Goldwater et al. word segmentation papers.

Three files are included here, originally obtained from Michael Brent,
and redistributed with his permission.  See NOTE below for appropriate
citations.

br-text.txt: the orthographic transcript made by Brent of the
Bernstein-Ratner corpus in the CHILDES database.  Brent produced it by
cleaning up non-standard spellings and by removing partial words,
utterances not directed at the children, and similar material.

dict.txt: the phonological dictionary used to convert orthographic
forms into phonological forms, resulting in br-phono.txt.

br-phono.txt: the phonological transcript.
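
The relationship between the three files can be illustrated with a
minimal sketch of a dictionary-lookup conversion.  This is not the
script Brent used, and the layout of dict.txt is an ASSUMPTION here:
the sketch supposes one entry per line, with the orthographic form and
the phonological form separated by whitespace, and one utterance per
line in the transcript.  The function names and the sample entries are
hypothetical; check the actual files before adapting it.

```python
def parse_dictionary(lines):
    """Build a word -> phonological form mapping from dictionary lines.

    ASSUMPTION: each non-empty line is "<orthographic> <phonological>",
    separated by whitespace. The real layout of dict.txt may differ.
    """
    lexicon = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        lexicon[fields[0]] = fields[1]
    return lexicon


def convert_utterance(utterance, lexicon):
    """Replace each orthographic word with its phonological form.

    A word missing from the lexicon raises KeyError, so gaps in the
    dictionary are noticed rather than silently skipped.
    """
    return " ".join(lexicon[word] for word in utterance.split())


# Made-up placeholder entries, NOT taken from the real dict.txt.
example_dict_lines = ["you yu", "want want", "the D6"]
example_lexicon = parse_dictionary(example_dict_lines)
example_output = convert_utterance("you want the", example_lexicon)
```

To try this on the real files, the lines of dict.txt would be passed
to parse_dictionary and each line of br-text.txt to convert_utterance,
provided the assumed layout actually holds.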

****************************************

NOTE: If using these corpora in published materials, please cite
the following:

CHILDES database:
B. MacWhinney and C. Snow. 1985. The child language data
exchange system. Journal of Child Language, 12:271-296.

Bernstein-Ratner corpus (original source of data):
N. Bernstein-Ratner. 1987. The phonology of parent-child
speech. In K. Nelson and A. van Kleeck, editors, Children's
Language, volume 6. Erlbaum, Hillsdale, NJ.

Brent version of B-R corpus:
M. R. Brent and T. A. Cartwright. 1996. Distributional regularity and
phonotactic constraints are useful for segmentation. Cognition,
61:93-125.

