research/wordtuples/README

(c) Mark Johnson, 3rd May 2012

I resurrected this code in May 2012 for a project with NICTA.

wordtuples and wordtuples1 are two different collocation-finding 
programs.  They collect and print a list of word n-tuples
sorted by discounted strength of interaction.  They are called as follows:

  wordtuples tuple_size min_tuple_count significance_level

tuple_size         -- the number of words in each tuple
min_tuple_count    -- tuples that occur fewer than this many times are ignored
significance_level -- used to discount the lambda value

Both programs read lines from standard input and extract tuples of
size tuple_size from them (lines with fewer than tuple_size words are
ignored).  Each program then computes how surprising each tuple is, and
prints the surprising tuples to standard output, each prefixed by a
number that indicates how surprising that tuple is.  (This number is
the number of standard deviations by which the tuple exceeds
significance_level.)
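
The extraction and counting step described above can be sketched in
Python as follows.  This is only an illustration, not the code in this
directory: the function name count_tuples is hypothetical, and it
assumes that tuples are contiguous word sequences and that a tuple is
kept when its count is at least min_tuple_count.

```python
from collections import Counter


def count_tuples(lines, tuple_size, min_tuple_count):
    """Count contiguous word tuples of length tuple_size.

    Assumption: tuples seen fewer than min_tuple_count times are dropped.
    """
    counts = Counter()
    for line in lines:
        words = line.split()
        # A line with fewer than tuple_size words gives an empty range,
        # so it contributes no tuples.
        for i in range(len(words) - tuple_size + 1):
            counts[tuple(words[i:i + tuple_size])] += 1
    return {t: c for t, c in counts.items() if c >= min_tuple_count}


sample = [
    "mr. vinken is chairman of elsevier n.v.",
    "mr. vinken is old",
    "too short",
]
result = count_tuples(sample, 3, 2)
print(result)
```

Here only the triple "mr. vinken is" occurs twice, so it is the only
one that survives a min_tuple_count of 2.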

wordtuples calculates the "surprise" of an n-element tuple by
calculating how likely that tuple is given all of its contiguous
n-1 element sequences, while wordtuples1 calculates surprise just
using a unigram model.  These two measures identify very different
sequences as surprising!

For example, the tuples that wordtuples finds most surprising are
things like "in and out", "more and more", "the fact that", "in
exchange for", while the tuples that wordtuples1 finds most surprising
are things like "shearson lehman hutton", "dow jones industrials", etc.

The Makefile is set up to read data from a file data.txt in this
directory.  data.txt should consist of lines of text, with tokens
separated by spaces.  My data.txt (produced from the Penn treebank
using my program munge-trees) begins as follows:

      pierre vinken 61 years old will join the board as a nonexecutive director nov. 29
      mr. vinken is chairman of elsevier n.v. the dutch publishing group
      rudolph agnew 55 years old and former chairman of consolidated gold fields plc was named a nonexecutive director of this british industrial conglomerate
      a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago researchers reported

Running "make output.txt" produces the tuples found by wordtuples,
while "make output1.txt" produces the tuples found by wordtuples1.

output.txt begins as follows:

   4.24933 in and out
   2.83943 to stock of
   2.17761 said in a
   2.16194 securities and exchange
   1.90359 more and more

output1.txt begins as follows:

   26.8225 percent difference compares actual
   26.3538 source fulton prebon u.s.a
   26.2459 shearson lehman hutton inc
   25.996 shearson lehman hutton treasury
   25.5328 merrill lynch ready assets
   25.4672 shearson lehman hutton inc.

The number at the beginning of each line is the excess z-score, which
can be regarded as a measure of the strength of association among the
words of the tuple that follows it.



------------------------------------------------------------------

Mark Johnson, 27th March 2001

This directory holds code for finding significantly related word
n-tuples, although the basic technique should apply to any
categorical data.

The idea is to estimate the n-way interaction term \lambda_{X1...Xn}
of the saturated log-linear model.  A large value for this term
indicates that the combination X1...Xn occurs more frequently than the
lower-order statistics would suggest.  It turns out that this n-way
interaction term is a generalization of the log odds ratio of a 2x2
table.
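
To make the connection with the log odds ratio concrete, here is the
standard textbook form for binary variables under sum-to-zero
constraints.  The notation f_{ij} for observed cell counts is mine, and
the code here may use a different but equivalent scaling.

```latex
% 2x2 table with cell counts f_{11}, f_{12}, f_{21}, f_{22}:
\lambda_{11} = \frac{1}{4} \log \frac{f_{11}\, f_{22}}{f_{12}\, f_{21}}

% n binary variables, cells indexed by c \in \{1,2\}^n,
% where k(c) is the number of coordinates of c equal to 2:
\lambda_{1 \ldots 1} = \frac{1}{2^n} \sum_{c \in \{1,2\}^n} (-1)^{k(c)} \log f_c
```

Setting n = 2 in the second formula recovers the first, which is one
quarter of the log odds ratio.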

The n-way interaction is significant if \lambda_{X1...Xn} is a
suitable number of standard errors away from zero.  Alternatively, we
can trade variance for bias by subtracting a suitable number of
standard errors from \lambda_{X1...Xn}; this gives us a
discounted estimate of the interaction term.
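
A minimal Python sketch of the discounting idea for the 2x2 case is
given below.  It is not the code in interaction.[ch]: the function
names are hypothetical, and it assumes the usual large-sample formulas
for a 2x2 table (interaction equal to one quarter of the log odds
ratio, with standard error one quarter of the square root of the sum of
reciprocal cell counts).

```python
import math


def interaction_2x2(f11, f12, f21, f22):
    """Return (lambda, standard error) for a 2x2 table of positive counts."""
    lam = 0.25 * math.log((f11 * f22) / (f12 * f21))
    se = 0.25 * math.sqrt(1 / f11 + 1 / f12 + 1 / f21 + 1 / f22)
    return lam, se


def discounted(lam, se, significance_level):
    """Subtract significance_level standard errors from the estimate."""
    return lam - significance_level * se


lam, se = interaction_2x2(20, 5, 5, 20)
z = lam / se                     # standard errors away from zero
disc = discounted(lam, se, 2.0)  # discounted estimate
excess_z = disc / se             # same as z - 2.0
print(lam, se, z, disc, excess_z)
```

Dividing the discounted estimate by the standard error gives
z - significance_level, which matches the "number of standard
deviations beyond significance_level" wording used for the program
output, assuming that is how the printed score is formed.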

The general framework of log-linear models is introduced in Agresti
(1990).  The formulae I have actually used come from L. A. Goodman
(1970) (this article is available electronically from JSTOR).

@Book{Agresti90,
  author =	 {Alan Agresti},
  title = 	 {Categorical Data Analysis},
  publisher = 	 {John Wiley and Sons},
  year = 	 1990,
  address =	 {New York}
}

@Article{Goodman70,
  author = 	 {Leo A. Goodman},
  title = 	 {The Multivariate Analysis of Qualitative Data: 
                  Interactions among Multiple Classifications},
  journal = 	 {Journal of the American Statistical Association},
  year = 	 1970,
  volume =	 65,
  pages =	 {226-256}
}

The basic code for calculating the n-way interaction term and its
asymptotic standard error is in interaction.[ch].  The estimator used
here is asymptotically correct, but may be inaccurate for small sample
sizes.  In practice, it seems to do fairly well if you restrict
attention to cases in which all cells have counts >= 5.
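
The count restriction can be expressed as a simple guard applied
before an estimate is trusted.  The helper below is a hypothetical
Python illustration, not part of interaction.[ch]; the default
threshold of 5 is the rule of thumb mentioned above.

```python
def all_cells_at_least(cells, threshold=5):
    """Return True if every cell count in the table meets the threshold."""
    return all(count >= threshold for count in cells)


usable = all_cells_at_least([20, 5, 5, 20])    # every cell is >= 5
rejected = all_cells_at_least([20, 5, 0, 20])  # one cell is empty
print(usable, rejected)
```

An empty cell is the extreme case: both the logarithm of the count and
its reciprocal are undefined, so no estimate or standard error can be
formed at all.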

