~/research/py-cfg/README

Mark Johnson, 30th July 2012

See py-cfg.cc for instructions on running the py-cfg program after
you've compiled it, or just run "py-cfg -h".

Installation instructions for Linux:

If you have a reasonably modern g++, you should just have to run:

make

However, the standard version of g++ on Mac OS X is quite old,
so I found I had to install a newer g++ and the getopt library
to get it to build.

I used MacPorts to install the required software on my Macbook Air.
I'm doing this from memory, but I think this is what I did:

sudo port install gcc47
sudo port install libgnugetopt

export CC=gcc-mp-4.7
export CXX=g++-mp-4.7
export GCCFLAGS=""
export GCCLDFLAGS="-L/opt/local/lib -l gnugetopt"

make

I'm afraid I'm not an expert on Mac OS X, so this may not be optimal
(e.g., maybe there are optimisation flags that would speed it up
considerably).

Mark

=======================================================================

Mark Johnson, 27th August 2009

Pitman-Yor Context-Free Grammars
================================

Rules have the format

[w [a [b]]] X --> Y1 ... Yn

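where X is a nonterminal and Y1 ... Yn are either terminals or
nonterminals.  The square brackets mark optional fields: b can only
be given if a is given, and a can only be given if w is given.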

w is the Dirichlet hyper-parameter (i.e., pseudo-count) associated
with this rule (a positive real)

a is the PY "a" constant associated with X (a positive real less
than 1)

b is the PY "b" constant associated with X (a positive real)
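
To make the rule format concrete, here is a minimal sketch of how one
rule line could be split into its fields.  It is not the parser used
in py-cfg.cc.  It assumes the fields are separated by whitespace, and
the function name parse_rule and the example rules (Word, Phon, Phons)
are made up for illustration.  Fields that are missing from the line
are returned as None, because this sketch does not guess at the
program's defaults.

```python
def parse_rule(line):
    """Split one rule line into (w, a, b, lhs, rhs).

    Assumes whitespace-separated tokens of the form
        [w [a [b]]] X --> Y1 ... Yn
    Optional fields that are absent come back as None.
    """
    tokens = line.split()
    arrow = tokens.index("-->")   # raises ValueError if there is no arrow
    head = tokens[:arrow]         # up to three numbers, then the nonterminal X
    rhs = tokens[arrow + 1:]      # Y1 ... Yn
    if not 1 <= len(head) <= 4:
        raise ValueError("expected [w [a [b]]] X before '-->': %r" % line)
    lhs = head[-1]
    numbers = [float(t) for t in head[:-1]]
    # Pad with None so that w, a and b always unpack.
    w, a, b = (numbers + [None, None, None])[:3]
    return w, a, b, lhs, rhs


# Hypothetical rules, for illustration only.
full = parse_rule("1 0.5 10 Word --> Phon Phons")
only_w = parse_rule("0.1 Phon --> a")
bare = parse_rule("Phons --> Phon")
```

Here full carries all three numbers, only_w carries just the
pseudo-count w, and bare carries none of them.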


The -h flag causes the program to print out a list of options.

The -A parses-file option causes it to print out analyses of the
training data for the last few iterations (the number of iterations is
specified by the -N flag).

If you specify the -C flag, these trees are printed in "compact" format,
i.e., only cached categories are printed (I think the root node is always
printed, just so we have a tree).

If you don't specify the -C flag, cached nodes are suffixed by a 
'#' followed by a number, which is the number of customers at this
table.


Brief recap of Pitman-Yor processes
===================================

Suppose there are n samples occupying m tables, with n_k samples at
table k.  Then the probability that sample n+1 occupies existing
table k, where 1 <= k <= m, is:

  P(x_{n+1} = k) = (n_k - a)/(n + b)

and the probability that sample n+1 occupies a new table m+1 is:

  P(x_{n+1} = m+1) = (m*a + b)/(n + b)
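
As a quick check on the two formulas above, the following sketch
computes the whole seating distribution for the next sample and
confirms that it sums to one.  The function name
seating_probabilities and the example counts are illustrative only
and do not come from py-cfg.cc.

```python
def seating_probabilities(counts, a, b):
    """Return [P(table 1), ..., P(table m), P(new table m+1)].

    counts[k] is the number of samples already at table k+1;
    a and b are the PY constants.
    """
    n = sum(counts)   # samples seated so far
    m = len(counts)   # occupied tables
    probs = [(n_k - a) / (n + b) for n_k in counts]
    probs.append((m * a + b) / (n + b))
    return probs


# Example: n = 5 samples at m = 3 tables, with a = 0.5 and b = 1.
probs = seating_probabilities([3, 1, 1], a=0.5, b=1.0)
total = sum(probs)
```

The numerators add up to (n - m*a) + (m*a + b) = n + b, which is why
the probabilities always sum to one.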

The probability of a configuration in which a restaurant contains n
customers at m tables, with n_k customers at table k, is:


        G(m+b/a)   G(b)                   G(n_k-a)
  a^m   --------  ------   \prod_{k=1}^m  --------
         G(b/a)   G(n+b)                   G(1-a)

where G is the Gamma function.
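
The closed form can be checked numerically against the product of the
sequential seating probabilities.  The sketch below does this in log
space with math.lgamma.  It takes the prefactor to be a^m, because
the new-table numerators b, a+b, ..., (m-1)*a+b multiply to
a^m G(m+b/a)/G(b/a).  It assumes 0 < a < 1 and b > 0, and the
function names are illustrative only; they do not come from
py-cfg.cc.

```python
from math import exp, lgamma, log


def log_prob_closed_form(counts, a, b):
    """Log probability of a seating configuration via Gamma functions."""
    n, m = sum(counts), len(counts)
    lp = m * log(a) + lgamma(m + b / a) - lgamma(b / a)
    lp += lgamma(b) - lgamma(n + b)
    lp += sum(lgamma(n_k - a) - lgamma(1 - a) for n_k in counts)
    return lp


def log_prob_sequential(counts, a, b):
    """Same quantity, seating customers one at a time, table by table."""
    lp = 0.0
    n = 0  # customers seated so far
    for k, n_k in enumerate(counts):      # k tables are already occupied
        lp += log((k * a + b) / (n + b))  # first customer opens a new table
        n += 1
        for j in range(1, n_k):           # j customers already at this table
            lp += log((j - a) / (n + b))
            n += 1
    return lp


counts = [3, 1, 1]
closed = log_prob_closed_form(counts, a=0.5, b=1.0)
sequential = log_prob_sequential(counts, a=0.5, b=1.0)
prob = exp(closed)
```

The seating order used in log_prob_sequential is arbitrary: the
probability depends only on the table sizes, so any order gives the
same result.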
