Extreme Markup Languages

Generalising XPath for Directed Graphs

Steve Cassidy
Centre for Language Technology
Macquarie University
Sydney NSW

Keywords: XPath; Query


XPath is a very natural and powerful way to specify locations in XML documents. This paper examines possible generalisations of XPath to allow both locations and paths through generalised labelled directed graphs to be specified. The need for such a path language is driven by work in querying Linguistic Annotations which are in general more complex in structure than XML documents. The result of this exercise is a powerful path language which reduces to XPath as a special case and which could potentially be useful in a range of query applications.

1 Background

This paper describes a generalisation of the XPath query language motivated by the design of a query language for Annotated language corpora.

1.1 Annotated Corpora

Annotated corpora have been an essential component of research and development in language-related technologies for some years. Text corpora have been used for developing information retrieval and summarisation software (e.g. MUC [MUC-7], TREC [Voorhees 98]), automatic taggers and parsers and machine translation systems [Church & Mercer 93]. In a similar way, annotated speech corpora have proliferated and have found uses across a rapidly expanding set of languages, disciplines and technologies (see
www.ldc.upenn.edu/annotation/ for a review).

Typically, such databases are specified at the level of file formats. Linguistic content is annotated with a variety of tags, attributes and values, with a specified syntax and semantics. Tools are developed for each new format and linguistic domain on an ad hoc basis. These systems are akin to the databases of the 1960s. There is a physical representation along with a hand-crafted program offering a single view on the data. Recent work [Bird & Liberman 01], [Cassidy 01] has developed an abstract data model aimed at being able to store all known annotation types; one outcome of this work is the development of toolsets which abstract away from file format details and present a uniform data model to applications.

One of these tools, the Emu system [Cassidy 01], views annotations as directed graphs where each node represents a token in the corpus and may be labelled with various attributes including a type (phoneme, word, paragraph), start and end times and one or more labels. The edges in this graph can have one of three types: a parent/child relationship defining one or more hierarchies, an explicit sequence relationship and a general associative relationship. A particular feature of linguistic annotations is that they often involve more than one hierarchy over the same set of nodes. For example, Figure shows part of a sentence which has been annotated syntactically and intonationally resulting in two hierarchies dominating a single set of word nodes.

Figure 1:

An example Emu annotation showing an utterance from the TIMIT database which has been augmented with both a syntactic and intonational annotation. The names on the left show the types of each set of nodes. The Word nodes have been duplicated to show the links to both the syntactic and intonational hierarchies. For clarity, sequence relationships are implicit in the left to right ordering of nodes.

	    An example Emu annotation showing an utterance from
	    the TIMIT database which has been augmented with both a
	    syntactic and intonational annotation.  The names on the
	    left show the types of each set of nodes. The Word nodes
	    have been duplicated to show the links to both the
	    syntactic and intonational hierarchies. For clarity,
	    sequence relationships are implicit in the left to right
	    ordering of nodes.

[Graphic entity emu-timit — cassi082001.png]

While progress has been made on implementing tools for creation and manipulation of annotation data, no suitable query language has yet been defined. While the Emu system contains a query language implementation, it is based on an earlier version of the data model and is know not to be able to express some important queries. As part of the effort to develop a new query language for annotation data, we have begun to collect use cases for annotation query and have evaluated some query language proposals against these [Cassidy 02] [Cassidy et al 00].

1.2 Relation to XML

Clearly the annotation data model shows a close resemblance to XML and indeed if only one hierarchy needs to be encoded in an annotation, it can be naturally serialised as XML. Many existing Linguistic text corpora are encoded in XML and there have been large scale efforts to develop standardised DTDs for corpus encoding []. The problem of intersecting hierarchies has been addressed using, for example, standoff markup [Thompson & McKelvie 97] where each hierarchy is stored in a seperate XML document each referring to regions in the original text using XPointer references.

XML has been less frequently used for annotation of speech data. It is used as an interchange format in some tools such as the Annotation Graph toolkit; however, in these cases the structure of the XML document does not reflect the hierarchical structure of the annotations.

Given this close relationship, an obvious question is whether the various XML query languages are appropriate for the task of querying annotations.

1.3 Annotation Query Use Cases

In an earlier paper [Cassidy 02] we presented a number of annotation query use cases were evaluated them with respect to XQuery. This showed that while a number of the queries could be evaluated within XQuery, some interesting cases were very difficult to express. Here we present two examples which illustrate differences between this application and the XML query use-cases.

Find L- Intermediate phrases containing words which form an NP Syntax phrase. An example answer to this query is the NP containing the words "your dark suit" in . This query highlights the need to be able to cope with multiple hierarchies since it requires looking at both Intermediate and Syntax parents of the Word nodes. While XML won't allow multiple parents, there's nothing in XQuery (or in particular, XPath) which precludes a traversal from parent to child to a different parent. If we allow this, the query can be written in XQuery as shown in Figure (of course the result here would be an XML fragment which might not be what is required, we'll ignore this factor for the moment).

Figure 2
FOR $np in //syntax[@label="NP"]
LET $iparent ::= $np//word[1]/parent::inter
    EVERY $int IN 
    SATISFIES ($int == $iparent) AND 
              ($int/@label = "L-")

Find sequences of any number of consonant phonemes followed by a single vowel followed by any number of consonants: C+VC+. This is a significant use case since it illustrates an important class of query for language corpora -- finding non-annotated patterns based on `grammatical' structure. At the same time, this kind of query is very difficult to express in XQuery (although not impossible, see [Cassidy 02] for a solution using functions).

The evaluation of XQuery with respect to these and other use-cases provides two interesting insights. Firstly, the document or data centric views provided for by XQuery (and other XML query languages) are almost but not quite appropriate for annotation query. Secondly, XPath is a very clear and natural way to specify locations of elements within annotations but would need to be extended to achieve completeness.

1.4 Outline

This paper reports on some work which attempts to explore the idea of extending XPath in directions which make it more useful for annotation query. In particular, an attempt to generalise XPath to address queries like the second one presented above (C+VC+). The result of this exercise is a generalised path based query language which applies not only to annotations but to generalised directed graphs and which reduces to XPath as a special case.

For the purposes of this paper I'll talk about a particular kind of directed graph which is a generalisation of the graphs described for annotations. These graphs can be described as edge labelled directed graphs. An example is the XML graph structure which contains edges labelled 'child' and 'attribute'. The nodes in these graphs have two labels: a type (cf. the XML element name) and a textual label. This is an interesting class of graphs which subsumes for example, XML, RDF and our annotation data.

The remainder of this paper reviews the literature on path based query languages and then describes the generalisation of XPath and evaluates this in the annotation domain and in more general graph based data.

2 Query Language Review

review both annotation query and graph/path query stuff

2.1 Annotation Query

For the most part the query languages proposed and developed for annotation data have been of ad-hoc design and are able to cope only with a subset of the kinds of annotation data seen in practice. For example, the Emu query language [Cassidy 01] provides a means of finding nodes based on sequential and hierarchical relations but lacks many basic query language features such as disjunction and compositionality. Most other annotation query tools have made use of general purpose structural pattern matching tools such as Sgrep
www.cs.helsinki.fi/u/jjaakkol/sgrep.html which allow for regular expression like searches in structured textual data.

One important proposal that has been made recently follows from the work on annotation structures by Bird and his colleagues [Bird & Liberman 01]. [Bird et al 00] propose a query language based on path expressions in an Annotation Graph (AG). An AG is a directed graph representation of an annotation which foregrounds the sequential structure of annotations; nodes in the graph are anchor points and edges represent linguistic elements. Hierarchical structure is implicit in AG when a dominating edge spans more than one subordinate edge. While this might seem very different to the hierarchical view of annotations described above, it can be shown [Cassidy 01] that they are able to represent the same structures.

Queries in the proposed AG query language are made up of a set of path expressions which describe paths through the annotation graph. For example in the query term X.[:word, label: cat].Y, X and Y are query variables and stand for nodes in the AG and the terms in brackets specify the type (:utt) and attribute values associated with the edge between these nodes.

While this query language proposal has a number of interesting features, it is awkward to express queries in. Since it focuses on traversing the sequential AG relations only, expressing hierarchical queries is very unnatural. Evaluation against a number of Annotation use cases has shown [Cassidy et al 00] that while it is capable of expressing all of the use cases, the constructs needed to do so are highly complex and would be difficult for naive users to work with. Part of the effort in this paper is to work with a very natural query language component (XPath) and see if it can be extended for annotation query wile retaining the naturalness of expression.

2.2 Path Based Query Languages

This will be a review of the DB literature relating to path queries. In particular looking at things like Graphlog [Consens & Mendelzon] and the G+ language described in [Mendelzon & Wood]. The general conclusion is that there is a good deal of prior work on these kind of languages but that none of them seem to quite fit the problem expressed here. For example, G+ provides a query language where regular expressions can be used to match sequences of edges in the graph but does not enable constraints to be placed on the nodes traversed. In addition to this point, the underlying goal of this paper is not to derive a new formal query language but to explore the generalisation of the existing XPath language. It may become clear during this exercise that the underlying principles are those of an already described QL; this is a good thing since we can then take advantage of the results relating to that QL in implementing our annotation QL.

3 Generalising XPath

The use-case analysis of XQuery highlights a number of deficiencies of XPath which might be addressed by a generalisation of the language. These are listed here with some commentary on the motivation for each.

  1. The result of evaluating a path expression is a nodeset. In some applications it would be useful to be able to retrieve the path that was traversed. Hence the result of a path expression would be a path set.
  2. Limited, non-extensible set of axes.
  3. Unable to state conditions on paths as opposed to nodes. This extends to the notion of defining regular expressions on paths.

3.1 Returning Path Sets

While locating nodes in a graph is the sole use of XPath, there are cases in annotation query and perhaps in other applications where selecting paths through the graph is important. The prototypical example from annotation query is find sequences of phonemes with property X. Note that no changes to the path language are needed to achieve this, but that the query processor needs to be augmented to maintain path sets rather than node sets.

3.2 Defining Axes

The set of axes defined in XPath is clearly designed to allow the set of graph traversal operations that are seen to be atomic in XML document trees. In a more general graph structure, it is clearly more difficult to foresee what axes will be needed and in general it is more useful to define what an axis is than a particular set of axes being prescribed.

An XPath axis is fundamentally a mapping from nodes to nodesets and defines a way of traversing the underlying graph. Each axis encapsulates two things: a type of edge to follow (eg. child vs. attribute) and whether to follow it transitively (eg. child vs. descendant).

One approach to extending the axis set is to allow arbitrary definitions of axes which are simply mappings from nodes to node sets. If we assume that the graph structure being searched is the edge labelled directed graph described earlier, then any edge label might be associated with an axis; in addition, axes might be defined for more complex traversal -- an example might be an uncle axis which returns siblings of a node's parent using a combination of parent and child edges. Similarly, for an XML document one might define an idref axis which followed idref links.

This approach would allow the inclusion of transitivity in the axis definition as is the case with the descendant axis in XPath. However, a more general solution is to factor this out as a modifiable parameter in a path expression. The current binary single step/transitive distinction can be generalised to a how many steps parameter which controls how many steps along an axis are allowable. For example, I could find my grandparents by stepping along the parent axis exactly twice; I could find all of my children and grandchildren by stepping along the child axis up to two times.

The proposal here is to define an axis as a function mapping nodes to pathsets; the function takes a parameter denoting how many steps to take along the axis in generating new paths.

3.3 Path Conditions

In XPath, conditions apply only to the last node on the path extended from the context node; for example, the expression /descendant:para[@style='important'] traverses along the child axis any number of steps to find paragraphs with a particular attribute. No condition is placed on any of the intermediate nodes along the axis.

Rather than limiting conditions to the final element of a path, the approach taken here is to allow conditions on any subset of the nodes traversed while extending the path. For example, a condition might apply to all nodes, the first node or the last two nodes.

Introducing this generalisation allows the use-case described earlier to be addressed. A sequence of nodes labelled C can be located by walking along the following axis an unlimited number of steps applying the condition to every node.

3.4 Extend Programs

The generalisations described above can be implemented in a new path language where a query consists of a series of calls to a single extend function which maps pathsets to pathsets. The signature of extend is as follows:
	[axis] [step] [condition] [where]
where [axis] is the name of the axis to traverse, [step] is a restriction on the number of steps taken along the axis, [condition] is a boolean condition on nodes and [where] is a restriction on where the condition must hold. These are discussed in detail below.

3.4.1 Axis

The [axis] argument defines an axis which must somehow define a mapping from nodes to nodesets in the graph. Each axis need only define what it means to take a single step from a node. We foresee a framework providing provision for definition of new axes on particular graphs in terms of a base set of axes or via a general purpose programming language.

3.4.2 Step

The [step] argument determines how man steps should be taken along the axis and takes the form of a list of positive integers or the special value inf. The interpretation of this list is that a path length is allowed if included in the list. The special value of inf means that any length of path is valid. If the list includes integers and inf then any path of length greater than the highest integer is valid. For example, [2,3,4] would allow paths of length 2, 3 or 4 but not of length 1 or 5, [1,10,inf] would allow paths of length 1 or longer than 10. It is expected that syntactic sugar would be introduced to specify ranges of integers etc. Note that when the only integer specified is zero, the path will not be extended at all; this only makes sense in combination with where = 0, see below for a discussion.

3.4.3 Condition

The [condition] argument is intended to be a general boolean condition on nodes in the same sense as XPath conditions. This would include recursive calls to the path language and any conditions on node labels that are appropriate to the application. The exact form of these conditions is not specified here; in the trial implementation we have implemented a general callout to the host programming language (Tcl). As in XPath, the condition operates on a context node; unlike XPath this might be many nodes along a path.

3.4.4 Where

The [where] argument specifies where on the path the condition must hold. As with [step] this is a list of positive integers plus the special values of inf and end. An integer in the list means that the condition is applied to that node on the path. The special value inf means that the condition must hold for all nodes in the path. The special value end means that the condition applies to the end of the path only. For example, the list [1,2,end] would find paths where the first, second and last nodes satisfied the condition.

An important special case is when the where argument is includes a zero. Path positions are indexed from one so that the first node added to a path has an index one. The zero index refers to the context node -- that is the start node for the path extension. If where includes zero then the condition must hold on the context node with the result that all existing paths who's last node does not meet the condition will be pruned. In XPath, the same effect is achieved via a special axis called self; the mechanism proposed here removes the need for this special axis. In combination with a zero step argument, this can result in a pure prune operation where no new paths are extended but any existing paths not fulfilling the condition is pruned.

3.4.5 Initial Conditions

In XPath, there is a unique starting node from which an XPath expression is evaluated. This is either a node defined by external context or the root of the XML document. Extend programs could maintain the idea of an externally provided node (or nodeset or pathset) but since the underlying data model is a generalised directed graph there is no equivalent to the document root. Two possible starting conditions seem plausible: start with the empty nodeset or start with the set of all nodes. Starting from the empty set requires some magic behaviour for each axis: walking from the empty set along any axis leads to the set of all nodes. Given this, it seems best to assume the full nodeset as a starting point for an extend program which isn't given external context.

3.4.6 Example Extend Programs

To illustrate the ideas described above, here are two example extend programs which implement an XPath and annotation query use-cases. In these examples a global path set is assumed but in practice this would be passed between calls to the extend function.

//book/chapter/following-sibling:chapter. Using the child and following-sibling axes and assuming that each node has a type label corresponding to the element name:
extend child             [inf] type='book'    [end]
extend child             [1]   type='chapter' [end]
extend following-sibling [inf] type='chapter' [end] 
In this example, the first instruction finds book nodes along the child axis and is allowed to make any number of steps; only the final node in the path needs to be a book. The second instruction finds chapter nodes one step away. The third instruction makes any number of steps along the following-sibling axis to find another chapter node; again, only the end node is tested. The result of this program would be a set of paths, to generate the equivalent set of nodes as the XPath expression, the end nodes of each path can be taken.

Find sequences of any number of consonant phonemes followed by a single vowel followed by any number of consonants: C+VC+. Assuming a following axis which connects nodes in sequence the extend program is:
extend following [0,inf] type=phoneme & label=C [0,all]
extend following [1]     type=phoneme & label=V [all]
extend following [1,inf] type=phoneme & label=C [all]
Here, the first instruction extends the start set of all nodes by between zero and infinity steps ensuring that all nodes, including the start node, are labelled C. The second instruction extends by exactly one step to a node labelled V. The third instruction is similar to the first but doesn't constrain the zeroth node (which we know is a V).

3.4.7 Syntax

While extend programs might provide the required expressive power it is clearly not suitable for end users and a more natural query syntax is desirable. An obvious choice is to stay close to the XPath syntax and add new elements for the step and where parameters. One possible syntax would be:
The step and where components would default to their XPath values (step = 1 and where = end). Using this syntax, the second example above would be written:
where path elements are broken for readability. Some short forms could be defined for step and where; for example, + for [1,inf], * for [0,inf].

4 Applications of Generalised XPath

This section will talk about potential applications for the generalised path expression syntax. An obvious example is searching RDF graphs.

5 Implementation

We have a trial implementation of an extend program interpreter written in Tcl which can operate on arbitrary graph data. Demonstrations have been build for XML and generalised graphs, and we are working on an implementation for Annotation Graphs. This section will describe the naive algorithm used to implement extend. For now, here's the Tcl code that implements extend:
itcl::body xxgraph::extend {axis range condition where paths} {
    set range [range $range]
    set where [range $where]
    set matchingpaths [list]
    foreach path $paths {
        if {[llength $path]} {
            if {[iselement 0 $where]} {
                # check condition on 0th node
                if {![uplevel 1 [list $condition [pathend $path]]]} {
            if {[iselement 0 $range]} {
                # unextended path matches
                lappend matchingpaths $path
        # start extending
        set i 1
        set currentpaths [list $path]
        while {![islarger $i $range]} {
            set newpaths [list]
            foreach p $currentpaths {
                # check all nodes along axis
                foreach node [axis_$axis [pathend $p]] {
                    # only check condition on lengths in $where
                    if {![iselement $i $where] || \
                      [uplevel 1 [list $condition $node]]} {
                        if {[llength $p]} {
                            # extend along axis
                            set n [concat $p $axis $node]
                        } else {
                            # extend an empty path 
                            set n [pathinitial $node]
                        lappend newpaths $n
                        if {[iselement $i $range]} { 
                            # extended by a length in $range,
                            # i.e. found a match
                            lappend matchingpaths $n
            incr i
            if {![llength [set currentpaths $newpaths]]} {
                # no new paths available to extend further
    set matchingpaths

This section will also acknowledge that an efficient implementation of this language might be very difficult given it's generality. However, we've not yet looked at this aspect of the problem. It may be that restrictions need to be placed on the language to enable efficient implementation.

6 Discussion

In conclusion, XPath can be generalised to a new path language for generalised edge labelled directed graphs which provides a very consistent set of semantics for axes and path extension. This new language reduces to XPath as a special case where only single steps are taken along an axis and conditions are only applied to nodes at the end point of the path. We believe that this is an interesting generalisation which might form the basis of a powerful and natural query language for Linguistic annotations and which might find application in other areas.


I'd like to thank Dan Steffan for the implementation of the prototype extend program interpreter and discussions on the graph searching problem.


[MUC-7] Message Understanding Conference Proceedings (MUC-7). Science Applications International Corporation, 1998.

[Voorhees 98] Ellen M. Voorhees and Donna K. Harman, editors. NIST Special Publication 500-242: The Seventh Text Retrieval Conference (TREC-7). NIST, Government Printing Office, 1998.

[Church & Mercer 93] K.W. Church and R.L. Mercer. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1):1--24, 1993.

[Bird & Liberman 01] A formal framework for linguistic annotation Steven Bird and Mark Liberman, Speech Communication 33(1,2), pp 23-60, 2001.

[Cassidy 01] S. Cassidy and J. Harrington, Multi-level annotation in the Emu speech database management system Speech Communication, 33, 61-77, January, 2001.

[Thompson & McKelvie 97] Thompson, H. S. and McKelvie, D., Hyperlink semantics for standoff markup of read-only documents, in SGML Europe 1997,

[TEI] Sperberg-McQueen, C.M.. and Burnard, L. (eds.). TEI P4: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative Consortium. XML Version: Oxford, Providence, Charlottesville, Bergen

[Cassidy 02] Cassidy, S. XQuery as an Annotation Query Language: a Use Case Analysis, Proceedings of the Language Resources and Evaluation Conference, Las Palmas, Spain, 2002

[Cassidy et al 00] Steve Cassidy, Pauline Welby, Julie McGory, Mary Beckman Testing the Adequacy of Query Languages Against Annotated Spoken Dialog. Proceedings of the Speech Science and Technology Conference, Canberra, December 2000.

[Bird et al 00] Steven Bird, Peter Buneman and Wang-Chiew Tan Towards a query language for annotation graphs , Proceedings of the Second International Conference on Language Resources and Evaluation, pp 807-814, 2000.

[Consens & Mendelzon] Consens, M.P. and Mendelzon, A. O. 1990. Graphlog: A Visual Formalism for Real Life Recursion, Proc. of the ACM Symp. on Principles of Database Systems.

[Mendelzon & Wood] Alberto O. Mendelzon, Peter T. Wood Finding Regular Simple Paths In Graph Databases Proceedings of the 15th Conference on Very Large Databases, Morgan Kaufman pubs. (Los Altos CA), Amsterdam, 1989