Open-source software written by
Mark Johnson
This software is open-source, but I do request acknowledgement whenever
it is used to produce published results or incorporated into other software.
This is research software, and while I have tried to write it as well
as I can, it may still contain bugs, so users beware!
If you find any bugs, please let me know.
I believe that the programs compute what I claim they compute, but I do not guarantee this.
The programs may be poorly and inconsistently documented and may contain undocumented components, features
or modifications. I make no guarantee that these programs will be suitable for any application.
This software is provided AS IS with NO SUPPORT.
These programs come with no warranty, guarantee, or representation of any kind whatsoever, whether express or implied. All other warranties, including but not limited to merchantability and fitness
for purpose, whether express, implied, or arising by operation of law, course of dealing, or trade usage, are hereby disclaimed.
- The accuracy extrapolation code and data
from our 2018 ACL paper, including the R code that generates the graphics.
- The Pitman-Yor adaptor grammar sampler
from our 2006 NIPS paper, now updated to estimate the Pitman-Yor hyperparameters
with a slice sampler (a slice-sampler sketch appears after this item), to use the C++ TR1 hash tables (August 2009),
and to allow the sampler to be trained on a subset of the training corpus.
The Makefile now builds four versions of the program: a double-precision version
(py-cfg) and a quadruple-precision version (py-cfg-quad) which will run on much
longer input strings without underflow, as well as multi-threaded versions of each
(py-cfg-mp and py-cfg-quad-mp) which run roughly twice as fast as their single-threaded
counterparts.
(last update 23/09/2013 contains multi-threaded
versions, previous versions
02/05/2013,
25/02/2013,
18/11/2012,
16/08/2012,
30/07/2012,
12/09/2011)
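For readers curious what the hyperparameter updates look like, here is a minimal slice-sampling sketch in C++ (my own illustration, not the py-cfg code itself), using the standard stepping-out and shrinkage procedure of Neal (2003); log_f is assumed to be the unnormalized log posterior of the hyperparameter, returning -infinity outside its support:

    #include <cmath>
    #include <functional>
    #include <random>

    // One slice-sampling update for a scalar hyperparameter theta.
    // log_f: unnormalized log posterior; it should return -infinity
    // outside the hyperparameter's support (e.g. a discount outside [0,1)).
    double slice_sample(double theta,
                        const std::function<double(double)>& log_f,
                        double w,                  // initial bracket width
                        std::mt19937& rng) {
      std::uniform_real_distribution<double> unif(0.0, 1.0);
      // Draw the auxiliary slice level: log y uniform below log f(theta).
      double log_y = log_f(theta) + std::log(unif(rng));
      // Step out: widen the bracket [l, r] until both ends leave the slice.
      double l = theta - w * unif(rng);
      double r = l + w;
      while (log_f(l) > log_y) l -= w;
      while (log_f(r) > log_y) r += w;
      // Shrink: sample uniformly from [l, r], narrowing toward theta on
      // rejection, until the proposal lands inside the slice.
      for (;;) {
        double theta1 = l + (r - l) * unif(rng);
        if (log_f(theta1) > log_y) return theta1;
        if (theta1 < theta) l = theta1; else r = theta1;
      }
    }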
- Data, adaptor grammars and supporting code for the
experiments described in our NAACL 2009 paper.
- Gibbs and Hastings samplers for PCFGs
(these are MCMC algorithms for computing the Bayesian version of what the
Inside-Outside algorithm computes), updated to handle non-tight PCFGs in three
different ways. The
Inside-Outside PCFG estimator available below
has been updated to optionally use Variational Bayes, so it provides an
alternative way of computing the same thing as these samplers.
As far as I can tell, on small data sets the samplers work well and are more accurate
than the Variational Bayes estimators, but on large data sets (say, more than 1 million tokens) the Variational
Bayes estimators converge much faster.
(Last updated on 22nd April 2013,
old version of 11th March 2011 also available).
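As a rough illustration of one half of such a sampler (my own sketch, not the released code): an explicit Gibbs sampler alternates between sampling parse trees given the rule probabilities and resampling the rule probabilities from their Dirichlet posterior given the current rule counts. The resampling step for one nonterminal's rules might look like this:

    #include <cstddef>
    #include <random>
    #include <vector>

    // Draw new probabilities for the rules expanding one nonterminal from
    // the posterior theta ~ Dirichlet(alpha + counts), using the standard
    // construction of a Dirichlet draw as normalized Gamma variates.
    std::vector<double> resample_rule_probs(const std::vector<double>& counts,
                                            double alpha,  // symmetric prior
                                            std::mt19937& rng) {
      std::vector<double> theta(counts.size());
      double sum = 0.0;
      for (std::size_t i = 0; i < counts.size(); ++i) {
        std::gamma_distribution<double> g(alpha + counts[i], 1.0);
        theta[i] = g(rng);
        sum += theta[i];
      }
      for (double& t : theta) t /= sum;   // normalize to a distribution
      return theta;
    }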
- A gzipped tar archive
containing the reranking parser
(this is my version of December 2011;
the version of November 2009 is available
here
), primarily written by Eugene Charniak and me
(with the assistance of many people, e.g., Matt Lease and David McClosky), as
described in
Eugene Charniak's and my ACL 2005 paper
and my 2005 CoNLL talk.
With some feature tweaking
it's now getting 91.4% f-score on section 23!
This archive contains code for completely retraining the reranker
from scratch, including:
- constructing the 20 folds of 50-best parses,
- extracting features from these 50-best parses, and
- estimating the reranker feature weights using MaxEnt, Averaged Perceptron,
etc. (a minimal averaged-perceptron sketch follows this list).
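To give the flavor of that last step (a hypothetical sketch of my own, not the reranker's actual estimator): an averaged perceptron reranker repeatedly picks the model's best candidate from each 50-best list, updates toward the oracle candidate when they differ, and averages the weight vectors over all updates to reduce variance:

    #include <cstddef>
    #include <utility>
    #include <vector>

    using Features = std::vector<std::pair<int, double>>;  // sparse (id, value)

    double score(const Features& f, const std::vector<double>& w) {
      double s = 0.0;
      for (auto [id, v] : f) s += w[id] * v;
      return s;
    }

    // candidates[i] holds the 50-best parses of sentence i as feature
    // vectors; oracle[i] is the index of the candidate with the best
    // f-score against the gold tree. w is the current weight vector,
    // avg its running average over all updates.
    void train(const std::vector<std::vector<Features>>& candidates,
               const std::vector<std::size_t>& oracle,
               std::vector<double>& w, std::vector<double>& avg,
               int epochs) {
      double n = 0.0;
      for (int e = 0; e < epochs; ++e)
        for (std::size_t i = 0; i < candidates.size(); ++i) {
          std::size_t best = 0;
          for (std::size_t j = 1; j < candidates[i].size(); ++j)
            if (score(candidates[i][j], w) > score(candidates[i][best], w))
              best = j;
          if (best != oracle[i]) {   // mistake-driven update
            for (auto [id, v] : candidates[i][oracle[i]]) w[id] += v;
            for (auto [id, v] : candidates[i][best])      w[id] -= v;
          }
          ++n;  // running average of the weight vector after every example
          for (std::size_t k = 0; k < w.size(); ++k)
            avg[k] += (w[k] - avg[k]) / n;
        }
    }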
You will need your own copy of the Penn Treebank and a machine with
4-8GB RAM to retrain the reranker (see the README and Makefiles).
Once trained, the parser+reranker should run in about 1/2 GB RAM
(and the tar file above includes a fully trained model which you
should be able to run out of the box).
The code is stored in a gzipped archive file. Download the archive file,
decompress it with gunzip and unpack it with tar.
If you are using GNU tar, you can decompress and unpack in
one step using tar -zxvf.
For fun, try the reranking parser on the Brown corpus!
Even though the reranking parser is trained only on WSJ,
it actually does surprisingly
well on Brown (better than any other parser, as far as I know,
including parsers trained on Brown). This suggests that we aren't
overtraining on our training data.
The "nonfinal" features and data used in the ACL 2007 paper with
Jianfeng Gao, Galen Andrew and Kristina Toutanova (of Microsoft
Research) can be reconstructed by setting "VERSION=nonfinal" in
the top-level Makefile, or just downloaded from
here.
If you're interested in writing new features for the reranker,
the following talk slides
may be helpful.
- The empty node restorer program
(C++ code, in a bzip2'd tar file)
from my ACL 2002 paper
A simple pattern-matching algorithm for recovering empty nodes and their antecedents.
You can also read the README
file.
- A C implementation of the
Inside-Outside algorithm for
estimating PCFGs from terminal strings.
This has an option to use Variational Bayes estimation (the -V flag) in place
of the Maximum Likelihood estimation used in the Expectation-Maximization
algorithm, which makes it comparable to the Gibbs PCFG estimators above.
This program assumes that all terminals are introduced by unary rules.
(last updated April 2013, old version is
here).
You can also download
C code for the Digamma function used
in this program.
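To illustrate roughly how the -V option differs from the Maximum Likelihood M-step (my sketch, not the program's code): Variational Bayes under a symmetric Dirichlet prior replaces the normalized expected counts with exponentiated digamma-transformed counts, which is where the Digamma code just mentioned comes in:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    double digamma(double x);  // e.g. the Digamma C code linked above

    // Mean-field VB update for the probabilities of the rules expanding
    // one nonterminal, given expected rule counts from the E-step and a
    // symmetric Dirichlet prior with concentration alpha. The Maximum
    // Likelihood M-step would instead return expected_counts[i] / total.
    std::vector<double> vb_update(const std::vector<double>& expected_counts,
                                  double alpha) {
      double total = 0.0;
      for (double c : expected_counts) total += c;
      std::size_t K = expected_counts.size();
      std::vector<double> theta(K);
      for (std::size_t i = 0; i < K; ++i)
        theta[i] = std::exp(digamma(expected_counts[i] + alpha)
                            - digamma(total + K * alpha));
      return theta;  // sub-normalized weights used in the next E-step
    }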
-
cky.tbz contains a very fast C implementation
of a CKY PCFG parser, together
with programs for extracting PCFGs from treebanks, etc.
This was used in my 1999 CL article.
(last updated 6th March, 2006)
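For orientation, here is what the core of a CKY parser looks like (a minimal Viterbi sketch of my own, far simpler than the cky.tbz code, assuming a PCFG in Chomsky normal form):

    #include <limits>
    #include <vector>

    struct BinaryRule { int lhs, left, right; double logprob; };

    // words[i][a] is the lexical log prob of nonterminal a generating
    // word i (or -infinity). Returns the best log probability of `root`
    // spanning the whole sentence.
    double cky(const std::vector<std::vector<double>>& words,
               const std::vector<BinaryRule>& rules,
               int n_nonterminals, int root) {
      const double NEG_INF = -std::numeric_limits<double>::infinity();
      int n = static_cast<int>(words.size());
      // chart[i][j][A] = best log prob of A spanning words i..j-1
      std::vector<std::vector<std::vector<double>>> chart(
          n + 1, std::vector<std::vector<double>>(
                     n + 1, std::vector<double>(n_nonterminals, NEG_INF)));
      for (int i = 0; i < n; ++i)
        for (int a = 0; a < n_nonterminals; ++a)
          chart[i][i + 1][a] = words[i][a];
      for (int len = 2; len <= n; ++len)            // span length
        for (int i = 0; i + len <= n; ++i) {        // span start
          int j = i + len;
          for (int k = i + 1; k < j; ++k)           // split point
            for (const BinaryRule& r : rules) {
              double s = r.logprob + chart[i][k][r.left] + chart[k][j][r.right];
              if (s > chart[i][j][r.lhs]) chart[i][j][r.lhs] = s;
            }
        }
      return chart[0][n][root];
    }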
- A minimum edit distance alignment
program in C++, and its readme file.
The useful part of this is actually the header file med.h, which
contains a generic dynamic programming aligner.
(Last updated 29th September, 2011, to compile with g++ 4.6.0).
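The idea in med.h is the classic dynamic program; a minimal non-generic version (my illustration, with unit costs rather than med.h's templated cost functions) looks like this:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Standard edit-distance dynamic program with unit costs.
    int edit_distance(const std::string& a, const std::string& b) {
      std::size_t m = a.size(), n = b.size();
      // d[i][j] = edit distance between a[0..i) and b[0..j)
      std::vector<std::vector<int>> d(m + 1, std::vector<int>(n + 1));
      for (std::size_t i = 0; i <= m; ++i) d[i][0] = static_cast<int>(i);
      for (std::size_t j = 0; j <= n; ++j) d[0][j] = static_cast<int>(j);
      for (std::size_t i = 1; i <= m; ++i)
        for (std::size_t j = 1; j <= n; ++j)
          d[i][j] = std::min({d[i - 1][j] + 1,        // deletion
                              d[i][j - 1] + 1,        // insertion
                              d[i - 1][j - 1]         // substitution/match
                                  + (a[i - 1] != b[j - 1])});
      return d[m][n];
    }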
- The tuple-finding software
used to find collocations in:
Don Blaheta and Mark Johnson (2001)
"Unsupervised learning of multi-word verbs."
I updated this code in 2012 to compile under g++ 4.6;
the old code is available
here should you need it.
-
This collocation-finding paper references an
unpublished draft paper on finding
"surprising" pairs; you can get the
associated code for
the exact binomial and the
odds ratio interval estimators as well.
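As a rough illustration of the second of these (my own sketch, using the common Wald approximation rather than the draft's exact construction): the log odds ratio of a 2x2 table is approximately normal with standard error sqrt(1/a + 1/b + 1/c + 1/d):

    #include <cmath>
    #include <utility>

    // Wald-style confidence interval for the odds ratio of the 2x2
    // contingency table [[a, b], [c, d]]. Assumes all cells are nonzero;
    // a common continuity correction adds 0.5 to every cell.
    std::pair<double, double> odds_ratio_interval(double a, double b,
                                                  double c, double d,
                                                  double z = 1.96) { // ~95%
      double log_or = std::log((a * d) / (b * c));
      double se = std::sqrt(1 / a + 1 / b + 1 / c + 1 / d);
      return {std::exp(log_or - z * se), std::exp(log_or + z * se)};
    }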
-
Although the code is probably more than a decade old,
I still get requests for the LALR parser generator code I wrote in
Common Lisp (lalrparser.lisp) and in
Scheme (lalr.ss), and email from
people thanking me for this stuff, so bitrot doesn't seem to
have affected them yet!
-
The draw-tree tree-drawing program (which
can draw trees in eps, pdf, fig, pgf and wish formats) and the
munge-trees tree-munging program (which
can binarize, head-percolate, strip empty nodes, etc.).
-
LDA.tgz:
An implementation of the Variational Bayes (VB) algorithm for
Latent Dirichlet Allocation (LDA) topic models.
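For the curious, the heart of VB inference for LDA is a pair of coupled mean-field updates; here is a document-level sketch of my own (not the LDA.tgz code), holding the topic-word probabilities fixed and again assuming a digamma function like the one linked above:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    double digamma(double x);  // e.g. the Digamma C code linked above

    // words: token ids of one document; beta[k][w]: topic-word
    // probabilities (held fixed here); alpha: symmetric Dirichlet prior
    // on the document's topic weights. Returns gamma, the variational
    // Dirichlet parameters over topics for this document.
    std::vector<double> infer_document(
        const std::vector<int>& words,
        const std::vector<std::vector<double>>& beta,
        double alpha, int iterations) {
      std::size_t K = beta.size(), N = words.size();
      std::vector<double> gamma(K, alpha + double(N) / K);
      std::vector<std::vector<double>> phi(N, std::vector<double>(K, 1.0 / K));
      for (int it = 0; it < iterations; ++it) {
        for (std::size_t n = 0; n < N; ++n) {
          double sum = 0.0;
          for (std::size_t k = 0; k < K; ++k) {
            // phi_{nk} proportional to beta_{k,w_n} * exp(E_q[log theta_k]);
            // the exp(-digamma(sum_k gamma_k)) factor cancels when we
            // normalize below.
            phi[n][k] = beta[k][words[n]] * std::exp(digamma(gamma[k]));
            sum += phi[n][k];
          }
          for (std::size_t k = 0; k < K; ++k) phi[n][k] /= sum;
        }
        for (std::size_t k = 0; k < K; ++k) {  // gamma_k = alpha + sum_n phi_nk
          gamma[k] = alpha;
          for (std::size_t n = 0; n < N; ++n) gamma[k] += phi[n][k];
        }
      }
      return gamma;
    }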
-
This isn't really software, but the list of incorrect
parses produced by the discriminative parser (sorted so that the worst sentences come first)
may be of interest to some of you.