~/research/gibbs-pcfg/README 

 (c) Mark Johnson, 19th January 2013

Gibbs and Hastings samplers for unsupervised estimation of PCFGs

gibbs-pcfg -- A Gibbs sampler that alternates between sampling parse trees 
              given rule probabilities and sampling rule probabilities given
              parse trees

hastings-pcfg -- A collapsed sampler that samples a parse tree given the other
                 parse trees (i.e., the rule probabilities are integrated out).
                 Christian Robert would call this "Metropolis-within-Gibbs".
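
As a rough illustration of what "integrated out" means here, the sketch below
computes the probability of one tree's rules given the rule counts from the
other trees. It assumes an independent Dirichlet prior for each parent
nonterminal, so each rule is scored as (count + alpha) / (parent total count +
parent total alpha). The names tree_prob, rule_counts and alphas are
illustrative and are not taken from hastings-pcfg.cc.

```python
from collections import defaultdict

def tree_prob(tree_rules, rule_counts, alphas):
    """Probability of one tree's rules given rule counts from the other trees.

    tree_rules  -- list of (parent, children) rules used in the tree
    rule_counts -- dict rule -> count in all *other* trees
    alphas      -- dict rule -> Dirichlet parameter (assumed to cover every rule)
    """
    counts = defaultdict(float, rule_counts)
    # Per-parent totals of count + alpha.
    totals = defaultdict(float)
    for rule, alpha in alphas.items():
        totals[rule[0]] += alpha + counts[rule]
    prob = 1.0
    for rule in tree_rules:
        prob *= (counts[rule] + alphas[rule]) / totals[rule[0]]
        # Rules used earlier in the same tree count towards later ones.
        counts[rule] += 1.0
        totals[rule[0]] += 1.0
    return prob
```

Because the counts change as the tree's own rules are scored, this conditional
is not a product of fixed rule probabilities. A sampler of this kind would
typically propose a tree from a fixed-probability PCFG and correct with an
accept/reject step.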

To build under Linux it should be sufficient to just run "make".

Documentation is included in the programs themselves.  Just run "gibbs-pcfg --help"
or "hastings-pcfg --help" to see the documentation, or look at the top of 
gibbs-pcfg.cc and hastings-pcfg.cc.

-------

Example: Try running gibbs-pcfg as

gibbs-pcfg -d 100 testengger.lt < testeng.yld

Each rule that gibbs-pcfg reads should be in the format

[alpha [theta_init]] Parent --> Child1 [Child2 ...]

alpha is the rule's Dirichlet prior parameter.

theta_init is the initial value of the rule probability.

hastings-pcfg integrates out the rule probabilities, so it reads rules of the format

[alpha] Parent --> Child1 [Child2 ...]

You can set default values for alpha and theta with flags on the command line.
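
To make the rule format concrete, here is a minimal Python sketch of a parser
for lines of the form "[alpha [theta_init]] Parent --> Child1 [Child2 ...]".
It is not the parser used by the programs. The function name parse_rule and
the default values of 1.0 are assumptions made for illustration; the real
defaults come from the command-line flags.

```python
def parse_rule(line, default_alpha=1.0, default_theta=1.0):
    """Parse '[alpha [theta_init]] Parent --> Child1 [Child2 ...]'.

    Returns (alpha, theta_init, parent, children). The default values are
    placeholders for this sketch, not the programs' actual defaults.
    """
    tokens = line.split()
    arrow = tokens.index("-->")  # raises ValueError if there is no arrow
    head, children = tokens[:arrow], tokens[arrow + 1:]
    # The head holds at most alpha, theta_init and the parent, in that order.
    if not 1 <= len(head) <= 3 or not children:
        raise ValueError("malformed rule: " + line)
    numbers = [float(tok) for tok in head[:-1]]
    alpha = numbers[0] if len(numbers) >= 1 else default_alpha
    theta = numbers[1] if len(numbers) == 2 else default_theta
    return alpha, theta, head[-1], tuple(children)
```

For the hastings-pcfg format the theta_init field is simply never present, so
the returned theta would be ignored.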

To actually see some output, use the -d flag to set the debugging level to a positive value (say, 100).

------

Shay Cohen asked for code that produces sampled rule probabilities simply given
the rule counts (i.e., it doesn't sample parse trees).

Here's how to do this with this code:

./gibbs-pcfg -C -d 10 -a 1.0 -n 100 testengger.lt -P testengger.consistent.rprob < /dev/null

The flags mean the following:

  -C      -- filter inconsistent grammars

  -d 10   -- write out some minimal debugging info (e.g., number of inconsistent 
             grammars filtered)

  -a 1.0  -- default Dirichlet prior parameter alpha for rules where alpha is
             not specified (with no input, the posterior equals the prior)

  -n 100  -- number of sample grammars to produce

  testengger.lt -- file containing grammar rules

  -P testengger.consistent.rprob -- write rule probabilities to this file

  /dev/null -- empty input file (i.e., there are no input strings, so we sample
               from the prior)
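
For readers who want to see the underlying computation, the following Python
sketch draws one set of rule probabilities from per-parent Dirichlet
distributions with parameters alpha + count. With no counts, the draw comes
from the prior. It is an illustration under the assumption of one independent
Dirichlet per parent nonterminal, not the code in gibbs-pcfg.cc, and it does
not perform the consistency filtering that -C requests. The name
sample_rule_probs is hypothetical.

```python
import random
from collections import defaultdict

def sample_rule_probs(alphas, counts=None, rng=random):
    """Draw one set of rule probabilities, one Dirichlet per parent.

    alphas -- dict (parent, children) -> Dirichlet parameter (must be > 0)
    counts -- optional dict (parent, children) -> rule count; None means
              no data, so the draw comes from the prior
    """
    counts = counts or {}
    # A Dirichlet draw is a vector of independent Gamma draws, normalised.
    gammas = {r: rng.gammavariate(a + counts.get(r, 0.0), 1.0)
              for r, a in alphas.items()}
    totals = defaultdict(float)
    for (parent, _), g in gammas.items():
        totals[parent] += g
    return {r: g / totals[r[0]] for r, g in gammas.items()}
```

Calling the function repeatedly (for example 100 times, to mirror -n 100)
would give a set of sampled grammars, each of which sums to one over the rules
of every parent.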