Open-source software written by
Mark Johnson
This software is open-source, but I do request acknowledgement whenever
it is used to produce published results or incorporated into other software.
This is research software, and while I have tried to write it as well
as I can, it may still contain bugs, so users beware!
If you find any bugs, please let me know.
I believe that the programs compute what I claim they compute, but I do not guarantee this.
The programs may be poorly and inconsistently documented and may contain undocumented components, features
or modifications. I make no guarantee that these programs will be suitable for any application.
This software is provided AS IS with NO SUPPORT.
These programs come with no warranty, guarantee, or representation of any kind whatsoever, whether express or implied. All other warranties, including but not limited to merchantability and fitness
for purpose, whether express, implied, or arising by operation of law, course of dealing, or trade usage, are hereby disclaimed.
- The accuracy extrapolation code and data
from our 2018 ACL paper, including the R code that generates the graphics.
- The Pitman-Yor adaptor grammar sampler
from our 2006 NIPS paper, now updated to estimate the Pitman-Yor hyperparameters
with a slice sampler (a slice-sampler sketch appears after this item), to use the C++ TR1 hash tables (August 2009),
and to allow the sampler to be trained on a subset of the training corpus.
The Makefile now builds four versions of the program: a double-precision version
(py-cfg) and a quadruple-precision version (py-cfg-quad) which will run on much
longer input strings without underflow, as well as multi-threaded versions of each
(py-cfg-mp and py-cfg-quad-mp) which run roughly twice as fast as their single-threaded
counterparts.
(last update 23/09/2013 contains multi-threaded
versions, previous versions
02/05/2013,
25/02/2013,
18/11/2012,
16/08/2012,
30/07/2012,
12/09/2011)
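For readers curious what the hyperparameter updates look like, here is a minimal slice-sampling sketch in C++ (my own illustration, not the py-cfg code itself), using the standard stepping-out and shrinkage procedure of Neal (2003); log_f is assumed to be the unnormalized log posterior of the hyperparameter, returning -infinity outside its support:

    #include <cmath>
    #include <functional>
    #include <random>

    // One slice-sampling update for a scalar hyperparameter theta.
    // log_f: unnormalized log posterior; it should return -infinity
    // outside the hyperparameter's support (e.g. a discount outside [0,1)).
    double slice_sample(double theta,
                        const std::function<double(double)>& log_f,
                        double w,                  // initial bracket width
                        std::mt19937& rng) {
      std::uniform_real_distribution<double> unif(0.0, 1.0);
      // Draw the auxiliary slice level: log y uniform below log f(theta).
      double log_y = log_f(theta) + std::log(unif(rng));
      // Step out: widen the bracket [l, r] until both ends leave the slice.
      double l = theta - w * unif(rng);
      double r = l + w;
      while (log_f(l) > log_y) l -= w;
      while (log_f(r) > log_y) r += w;
      // Shrink: sample uniformly from [l, r], narrowing toward theta on
      // rejection, until the proposal lands inside the slice.
      for (;;) {
        double theta1 = l + (r - l) * unif(rng);
        if (log_f(theta1) > log_y) return theta1;
        if (theta1 < theta) l = theta1; else r = theta1;
      }
    }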
- Data, adaptor grammars and supporting code for the
experiments described in our NAACL 2009 paper.
- Gibbs and Hastings samplers for PCFGs
(these are MCMC algorithms for computing the Bayesian version of what the
Inside-Outside algorithm computes), updated to handle non-tight PCFGs in three
different ways. The
Inside-Outside PCFG estimator available below
has been updated to optionally use Variational Bayes, so it provides an
alternative way of computing the same thing as these samplers.
As far as I can tell, on small data sets the samplers work well and are more accurate
than the Variational Bayes estimators, but on large data sets (say, more than 1 million tokens) the Variational
Bayes estimators converge much faster.
(Last updated on 22nd April 2013,
old version of 11th March 2011 also available).
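As a rough illustration of one half of such a sampler (my own sketch, not the released code): an explicit Gibbs sampler alternates between sampling parse trees given the rule probabilities and resampling the rule probabilities from their Dirichlet posterior given the current rule counts. The resampling step for one nonterminal's rules might look like this:

    #include <cstddef>
    #include <random>
    #include <vector>

    // Draw new probabilities for the rules expanding one nonterminal from
    // the posterior theta ~ Dirichlet(alpha + counts), using the standard
    // construction of a Dirichlet draw as normalized Gamma variates.
    std::vector<double> resample_rule_probs(const std::vector<double>& counts,
                                            double alpha,  // symmetric prior
                                            std::mt19937& rng) {
      std::vector<double> theta(counts.size());
      double sum = 0.0;
      for (std::size_t i = 0; i < counts.size(); ++i) {
        std::gamma_distribution<double> g(alpha + counts[i], 1.0);
        theta[i] = g(rng);
        sum += theta[i];
      }
      for (double& t : theta) t /= sum;   // normalize to a distribution
      return theta;
    }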
- A gzipped tar archive
containing the reranking parser
(this is my version of December 2011;
the version of November 2009 is available
here
), primarily written by Eugene Charniak and me
(with the assistance of many people, e.g., Matt Lease and David McClosky), as
described in
Eugene Charniak's and my ACL 2005 paper
and my 2005 CoNLL talk.
With some feature tweaking
it's now getting 91.4% f-score on section 23!
This archive contains code for completely retraining the reranker
from scratch, including:
- constructing the 20 folds of 50-best parses,
- extracting features from these 50-best parses, and
- estimating the reranker feature weights using MaxEnt, Averaged Perceptron,
etc. (a minimal averaged-perceptron sketch follows this list).
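To give the flavor of that last step (a hypothetical sketch of my own, not the reranker's actual estimator): an averaged perceptron reranker repeatedly picks the model's best candidate from each 50-best list, updates toward the oracle candidate when they differ, and averages the weight vectors over all updates to reduce variance:

    #include <cstddef>
    #include <utility>
    #include <vector>

    using Features = std::vector<std::pair<int, double>>;  // sparse (id, value)

    double score(const Features& f, const std::vector<double>& w) {
      double s = 0.0;
      for (auto [id, v] : f) s += w[id] * v;
      return s;
    }

    // candidates[i] holds the 50-best parses of sentence i as feature
    // vectors; oracle[i] is the index of the candidate with the best
    // f-score against the gold tree. w is the current weight vector,
    // avg its running average over all updates.
    void train(const std::vector<std::vector<Features>>& candidates,
               const std::vector<std::size_t>& oracle,
               std::vector<double>& w, std::vector<double>& avg,
               int epochs) {
      double n = 0.0;
      for (int e = 0; e < epochs; ++e)
        for (std::size_t i = 0; i < candidates.size(); ++i) {
          std::size_t best = 0;
          for (std::size_t j = 1; j < candidates[i].size(); ++j)
            if (score(candidates[i][j], w) > score(candidates[i][best], w))
              best = j;
          if (best != oracle[i]) {   // mistake-driven update
            for (auto [id, v] : candidates[i][oracle[i]]) w[id] += v;
            for (auto [id, v] : candidates[i][best])      w[id] -= v;
          }
          ++n;  // running average of the weight vector after every example
          for (std::size_t k = 0; k < w.size(); ++k)
            avg[k] += (w[k] - avg[k]) / n;
        }
    }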
You will need your own copy of the Penn Treebank and a machine with
4-8GB RAM to retrain the reranker (see the README and Makefiles).
Once trained, the parser+reranker should run in about 1/2 GB RAM
(and the tar file above includes a fully trained model which you
should be able to run out of the box).
The code is stored in a gzipped archive file. Download the archive file,
decompress it with gunzip and unpack it with tar.
If you are using GNU tar, you can decompress and unpack in
one step using tar -zxvf.
For fun, try the reranking parser on the Brown corpus!
Even though the reranking parser is trained only on WSJ,
it actually does surprisingly
well on Brown (better than any other parser, as far as I know,
including parsers trained on Brown). This suggests that we aren't
overtraining on our training data.
The "nonfinal" features and data used in the ACL 2007 paper with
Jianfeng Gao, Galen Andrew and Kristina Toutanova (of Microsoft
Research) can be reconstructed by setting "VERSION=nonfinal" in
the top-level Makefile, or just downloaded from
here.
If you're interested in writing new features for the reranker,
the following talk slides
may be helpful.
- The empty node restorer program
(C++ code, in a bzip2'd tar file)
from my ACL 2002 paper
A simple pattern-matching algorithm for recovering empty nodes and their antecedents.
You can also read the README
file.
- A C implementation of the
Inside-Outside algorithm for
estimating PCFGs from terminal strings.
This has an option to use Variational Bayes estimation (the -V flag) in place
of the Maximum Likelihood estimation used in the Expectation-Maximization
algorithm, which makes it comparable to the Gibbs PCFG estimators above.
This program assumes that all terminals are introduced by unary rules.
(last updated April 2013, old version is
here).
You can also download
C code for the Digamma function used
in this program.
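To illustrate roughly how the -V option differs from the Maximum Likelihood M-step (my sketch, not the program's code): Variational Bayes under a symmetric Dirichlet prior replaces the normalized expected counts with exponentiated digamma-transformed counts, which is where the Digamma code just mentioned comes in:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    double digamma(double x);  // e.g. the Digamma C code linked above

    // Mean-field VB update for the probabilities of the rules expanding
    // one nonterminal, given expected rule counts from the E-step and a
    // symmetric Dirichlet prior with concentration alpha. The Maximum
    // Likelihood M-step would instead return expected_counts[i] / total.
    std::vector<double> vb_update(const std::vector<double>& expected_counts,
                                  double alpha) {
      double total = 0.0;
      for (double c : expected_counts) total += c;
      std::size_t K = expected_counts.size();
      std::vector<double> theta(K);
      for (std::size_t i = 0; i < K; ++i)
        theta[i] = std::exp(digamma(expected_counts[i] + alpha)
                            - digamma(total + K * alpha));
      return theta;  // sub-normalized weights used in the next E-step
    }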
-
cky.tbz contains a very fast C implementation
of a CKY PCFG parser, together
with programs for extracting PCFGs from treebanks, etc.
This was used in my 1999 CL article.
(last updated 6th March, 2006)
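For orientation, here is what the core of a CKY parser looks like (a minimal Viterbi sketch of my own, far simpler than the cky.tbz code, assuming a PCFG in Chomsky normal form):

    #include <limits>
    #include <vector>

    struct BinaryRule { int lhs, left, right; double logprob; };

    // words[i][a] is the lexical log prob of nonterminal a generating
    // word i (or -infinity). Returns the best log probability of `root`
    // spanning the whole sentence.
    double cky(const std::vector<std::vector<double>>& words,
               const std::vector<BinaryRule>& rules,
               int n_nonterminals, int root) {
      const double NEG_INF = -std::numeric_limits<double>::infinity();
      int n = static_cast<int>(words.size());
      // chart[i][j][A] = best log prob of A spanning words i..j-1
      std::vector<std::vector<std::vector<double>>> chart(
          n + 1, std::vector<std::vector<double>>(
                     n + 1, std::vector<double>(n_nonterminals, NEG_INF)));
      for (int i = 0; i < n; ++i)
        for (int a = 0; a < n_nonterminals; ++a)
          chart[i][i + 1][a] = words[i][a];
      for (int len = 2; len <= n; ++len)            // span length
        for (int i = 0; i + len <= n; ++i) {        // span start
          int j = i + len;
          for (int k = i + 1; k < j; ++k)           // split point
            for (const BinaryRule& r : rules) {
              double s = r.logprob + chart[i][k][r.left] + chart[k][j][r.right];
              if (s > chart[i][j][r.lhs]) chart[i][j][r.lhs] = s;
            }
        }
      return chart[0][n][root];
    }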
- A minimum edit distance alignment
program in C++, and its readme file.
The useful part of this is actually the header file med.h, which
contains a generic dynamic programming aligner.
(Last updated 29th September, 2011, to compile with g++ 4.6.0).
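The idea in med.h is the classic dynamic program; a minimal non-generic version (my illustration, with unit costs rather than med.h's templated cost functions) looks like this:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Standard edit-distance dynamic program with unit costs.
    int edit_distance(const std::string& a, const std::string& b) {
      std::size_t m = a.size(), n = b.size();
      // d[i][j] = edit distance between a[0..i) and b[0..j)
      std::vector<std::vector<int>> d(m + 1, std::vector<int>(n + 1));
      for (std::size_t i = 0; i <= m; ++i) d[i][0] = static_cast<int>(i);
      for (std::size_t j = 0; j <= n; ++j) d[0][j] = static_cast<int>(j);
      for (std::size_t i = 1; i <= m; ++i)
        for (std::size_t j = 1; j <= n; ++j)
          d[i][j] = std::min({d[i - 1][j] + 1,        // deletion
                              d[i][j - 1] + 1,        // insertion
                              d[i - 1][j - 1]         // substitution/match
                                  + (a[i - 1] != b[j - 1])});
      return d[m][n];
    }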
- The tuple-finding software
used to find collocations in:
Don Blaheta and Mark Johnson (2001)
"Unsupervised learning of multi-word verbs."
I updated this code in 2012 to compile under g++ 4.6;
the old code is available
here should you need it.
-
This collocation-finding paper references an
unpublished draft paper on finding
"surprising" pairs; you can get the
associated code for
the exact binomial and the
odds ratio interval estimators as well.
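As a rough illustration of the second of these (my own sketch, using the common Wald approximation rather than the draft's exact construction): the log odds ratio of a 2x2 table is approximately normal with standard error sqrt(1/a + 1/b + 1/c + 1/d):

    #include <cmath>
    #include <utility>

    // Wald-style confidence interval for the odds ratio of the 2x2
    // contingency table [[a, b], [c, d]]. Assumes all cells are nonzero;
    // a common continuity correction adds 0.5 to every cell.
    std::pair<double, double> odds_ratio_interval(double a, double b,
                                                  double c, double d,
                                                  double z = 1.96) { // ~95%
      double log_or = std::log((a * d) / (b * c));
      double se = std::sqrt(1 / a + 1 / b + 1 / c + 1 / d);
      return {std::exp(log_or - z * se), std::exp(log_or + z * se)};
    }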
-
Although the code is probably more than a decade old,
I still get requests for the LALR parser generator code I wrote in
Common Lisp (lalrparser.lisp) and in
Scheme (lalr.ss), and email from
people thanking me for this stuff, so bitrot doesn't seem to
have affected them yet!
-
The draw-tree tree-drawing program (which
can draw trees in eps, pdf, fig, pgf and wish formats) and the
munge-trees tree-munging program (which
can binarize, head-percolate, strip empty nodes, etc.).
-
LDA.tgz:
An implementation of the Variational Bayes (VB) algorithm for
Latent Dirichlet Allocation (LDA) topic models.
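For the curious, the heart of VB inference for LDA is a pair of coupled mean-field updates; here is a document-level sketch of my own (not the LDA.tgz code), holding the topic-word probabilities fixed and again assuming a digamma function like the one linked above:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    double digamma(double x);  // e.g. the Digamma C code linked above

    // words: token ids of one document; beta[k][w]: topic-word
    // probabilities (held fixed here); alpha: symmetric Dirichlet prior
    // on the document's topic weights. Returns gamma, the variational
    // Dirichlet parameters over topics for this document.
    std::vector<double> infer_document(
        const std::vector<int>& words,
        const std::vector<std::vector<double>>& beta,
        double alpha, int iterations) {
      std::size_t K = beta.size(), N = words.size();
      std::vector<double> gamma(K, alpha + double(N) / K);
      std::vector<std::vector<double>> phi(N, std::vector<double>(K, 1.0 / K));
      for (int it = 0; it < iterations; ++it) {
        for (std::size_t n = 0; n < N; ++n) {
          double sum = 0.0;
          for (std::size_t k = 0; k < K; ++k) {
            // phi_{nk} proportional to beta_{k,w_n} * exp(E_q[log theta_k]);
            // the exp(-digamma(sum_k gamma_k)) factor cancels when we
            // normalize below.
            phi[n][k] = beta[k][words[n]] * std::exp(digamma(gamma[k]));
            sum += phi[n][k];
          }
          for (std::size_t k = 0; k < K; ++k) phi[n][k] /= sum;
        }
        for (std::size_t k = 0; k < K; ++k) {  // gamma_k = alpha + sum_n phi_nk
          gamma[k] = alpha;
          for (std::size_t n = 0; n < N; ++n) gamma[k] += phi[n][k];
        }
      }
      return gamma;
    }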
-
This isn't really software, but the list of incorrect
parses produced by the discriminative parser (sorted so that the worst sentences come first)
may be of interest to some of you.