Galaxy Tool Generating Dataset Collections

As part of the Alveo project we’ve been using the Galaxy Workflow Engine to provide a web-based user-friendly interface to some language processing tools. Galaxy was originally developed for Bioinformatics researchers but we’ve been able to adapt it for language tools quite easily. Galaxy tools are scripts or executable command line applications that read input data from files and write results out to new files. These files are presented as data objects in the Galaxy interface. Chains of tools can be run one after another to process data from input to final results.

One of the recent updates to Galaxy is the ability to group data objects together into dataset collections (which I’ll just call datasets here). A dataset can then form the input to a workflow, which is run once for each object in the dataset. This is something we’ve wanted for Alveo for a long time, since applying the same process to all files in a collection is a common requirement for language processing. After a bit of exploration I’ve worked out how to write a tool that generates a dataset, and since the documentation for this is somewhat sparse and confusing, I thought I’d write up my findings.

To work through the issues I built the simplest tool I could that generated a collection of files: a Python script that creates three files with a bit of random data. The script takes a single required argument, which is the name of the output directory.
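As a rough idea of what such a script can look like, here is a minimal sketch. It is not the script from the gist: the function name write_sample_files, the file names sample_1.txt to sample_3.txt and the ten random numbers per file are all assumptions made for illustration.

```python
import os
import random
import sys


def write_sample_files(output_dir, count=3):
    """Create `count` small text files of random numbers in output_dir.

    The file names and contents are illustrative assumptions only.
    """
    # Create the output directory if it does not exist yet.
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for i in range(1, count + 1):
        path = os.path.join(output_dir, "sample_%d.txt" % i)
        with open(path, "w") as handle:
            # Ten random floats per file, one per line.
            for _ in range(10):
                handle.write("%f\n" % random.random())
        paths.append(path)
    return paths


if __name__ == "__main__":
    # The single required argument is the name of the output directory.
    if len(sys.argv) == 2:
        write_sample_files(sys.argv[1])
    else:
        print("usage: make_collection.py OUTPUT_DIR", file=sys.stderr)
```

Running it as "python make_collection.py SampleDataset" (the script name is also hypothetical) would leave three files in a SampleDataset directory under the current working directory.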

To turn this into a Galaxy tool we need to write an XML configuration file (see the Gist below for the code). This has a section that defines the command line used to run the tool and the names of any input options. In this case the only input is a name for the resulting dataset.

One thing I learned in getting to this solution is that Galaxy runs each tool in a newly created temporary directory. This means the output from one tool cannot overwrite the output of any other tool, so output file and directory names don’t need to be unique. However, I did find that this directory also contains three temporary files generated by Galaxy (galaxy_1.ec, galaxy_1.sh and set_metadata_7OxS74.py). This tripped me up until I worked out that I needed to write my output files to a sub-directory, which keeps them separate from Galaxy’s own files.

The important part of the configuration file is the <outputs> section. This normally just lists the expected output files of the tool, but in this case the tool writes an unknown number of files to a directory. The output section of my tool configuration is:

 <outputs>
   <collection type="list" label="$job_name" name="output1">
     <!-- "<" and ">" in the pattern are written as &lt; and &gt; so the file stays valid XML -->
     <discover_datasets pattern="(?P&lt;name&gt;.*)" directory="SampleDataset" />
   </collection>
 </outputs>

 

The <collection> tag says that we’re expecting a collection of data (a dataset). The <discover_datasets> tag describes how Galaxy can find the elements of the dataset: in this case by finding files in the directory SampleDataset that match the regular expression “(?P<name>.*)” (i.e. all files in this directory). The named group “name” captures the whole file name, which becomes the name of the data object.
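To see what the pattern does, here is a small Python sketch that imitates the discovery step. It is not Galaxy’s own code: the discover function is hypothetical, and the choice to match the whole file name with fullmatch is an assumption. It only shows how a named group in a Python regular expression turns file names into element names.

```python
import os
import re
import tempfile

# The same pattern as in the tool configuration: the named group "name"
# captures the entire file name.
PATTERN = re.compile(r"(?P<name>.*)")


def discover(directory, pattern=PATTERN):
    """Return {element_name: path} for files in `directory` matching `pattern`.

    Hypothetical helper that only imitates the idea behind discover_datasets.
    """
    found = {}
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        match = pattern.fullmatch(filename)
        if match:
            found[match.group("name")] = path
    return found


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for filename in ("alpha.txt", "beta.txt"):
            with open(os.path.join(tmp, filename), "w") as handle:
                handle.write("data\n")
        # With ".*" the element name is the full file name.
        print(sorted(discover(tmp)))  # ['alpha.txt', 'beta.txt']
        # A stricter, purely illustrative pattern drops the extension.
        stricter = re.compile(r"(?P<name>.+)\.txt")
        print(sorted(discover(tmp, stricter)))  # ['alpha', 'beta']
```

The second pattern is only there to show that whatever the “name” group captures is what ends up as the element name.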

The code for the Python script and the XML file is in the gist below. Developing Galaxy tools is relatively easy, especially with the help of planemo, a collection of scripts that help you write, test and run your new tools. Once you have planemo installed, store these two files in a directory and run “planemo serve”; planemo will download a copy of Galaxy if you don’t already have one and run the server so that you can access Galaxy at http://127.0.0.1:9090.