| Extreme Markup Languages |
Keywords: XPath; Query
AbstractXPath is a very natural and powerful way to specify locations in XML documents. This paper examines possible generalisations of XPath to allow both locations and paths through generalised labelled directed graphs to be specified. The need for such a path language is driven by work in querying Linguistic Annotations which are in general more complex in structure than XML documents. The result of this exercise is a powerful path language which reduces to XPath as a special case and which could potentially be useful in a range of query applications. |
This paper describes a generalisation of the XPath query language motivated by the design of a query language for Annotated language corpora.
Annotated corpora have been an essential component of
research and development in language-related technologies for
some years. Text corpora have been used for developing
information retrieval and summarisation software (e.g. MUC
[MUC-7], TREC [Voorhees 98]),
automatic taggers and parsers and machine translation systems
[Church & Mercer 93]. In a similar way, annotated
speech corpora have proliferated and have found uses across a
rapidly expanding set of languages, disciplines and
technologies (see
www.ldc.upenn.edu/annotation/ for a
review).
Typically, such databases are specified at the level of file formats. Linguistic content is annotated with a variety of tags, attributes and values, with a specified syntax and semantics. Tools are developed for each new format and linguistic domain on an ad hoc basis. These systems are akin to the databases of the 1960s. There is a physical representation along with a hand-crafted program offering a single view on the data. Recent work [Bird & Liberman 01], [Cassidy 01] has developed an abstract data model aimed at being able to store all known annotation types; one outcome of this work is the development of toolsets which abstract away from file format details and present a uniform data model to applications.
One of these tools, the Emu system [Cassidy 01], views annotations as directed graphs where each node represents a token in the corpus and may be labelled with various attributes including a type (phoneme, word, paragraph), start and end times and one or more labels. The edges in this graph can have one of three types: a parent/child relationship defining one or more hierarchies, an explicit sequence relationship and a general associative relationship. A particular feature of linguistic annotations is that they often involve more than one hierarchy over the same set of nodes. For example, Figure shows part of a sentence which has been annotated syntactically and intonationally resulting in two hierarchies dominating a single set of word nodes.
While progress has been made on implementing tools for creation and manipulation of annotation data, no suitable query language has yet been defined. While the Emu system contains a query language implementation, it is based on an earlier version of the data model and is know not to be able to express some important queries. As part of the effort to develop a new query language for annotation data, we have begun to collect use cases for annotation query and have evaluated some query language proposals against these [Cassidy 02] [Cassidy et al 00].
Clearly the annotation data model shows a close resemblance to XML and indeed if only one hierarchy needs to be encoded in an annotation, it can be naturally serialised as XML. Many existing Linguistic text corpora are encoded in XML and there have been large scale efforts to develop standardised DTDs for corpus encoding []. The problem of intersecting hierarchies has been addressed using, for example, standoff markup [Thompson & McKelvie 97] where each hierarchy is stored in a seperate XML document each referring to regions in the original text using XPointer references.
XML has been less frequently used for annotation of speech data. It is used as an interchange format in some tools such as the Annotation Graph toolkit; however, in these cases the structure of the XML document does not reflect the hierarchical structure of the annotations.
Given this close relationship, an obvious question is whether the various XML query languages are appropriate for the task of querying annotations.
In an earlier paper [Cassidy 02] we presented a number of annotation query use cases were evaluated them with respect to XQuery. This showed that while a number of the queries could be evaluated within XQuery, some interesting cases were very difficult to express. Here we present two examples which illustrate differences between this application and the XML query use-cases.
Find L- Intermediate phrases containing words which form an NP Syntax phrase. An example answer to this query is the NP containing the words "your dark suit" in . This query highlights the need to be able to cope with multiple hierarchies since it requires looking at both Intermediate and Syntax parents of the Word nodes. While XML won't allow multiple parents, there's nothing in XQuery (or in particular, XPath) which precludes a traversal from parent to child to a different parent. If we allow this, the query can be written in XQuery as shown in Figure (of course the result here would be an XML fragment which might not be what is required, we'll ignore this factor for the moment).
Figure 2FOR $np in //syntax[@label="NP"]
LET $iparent ::= $np//word[1]/parent::inter
WHERE
EVERY $int IN
$np//word/parent::inter
SATISFIES ($int == $iparent) AND
($int/@label = "L-")
RETURN
$iparent
|
Find sequences of any number of consonant phonemes followed by a single vowel followed by any number of consonants: C+VC+. This is a significant use case since it illustrates an important class of query for language corpora -- finding non-annotated patterns based on `grammatical' structure. At the same time, this kind of query is very difficult to express in XQuery (although not impossible, see [Cassidy 02] for a solution using functions).
The evaluation of XQuery with respect to these and other use-cases provides two interesting insights. Firstly, the document or data centric views provided for by XQuery (and other XML query languages) are almost but not quite appropriate for annotation query. Secondly, XPath is a very clear and natural way to specify locations of elements within annotations but would need to be extended to achieve completeness.
This paper reports on some work which attempts to explore the idea of extending XPath in directions which make it more useful for annotation query. In particular, an attempt to generalise XPath to address queries like the second one presented above (C+VC+). The result of this exercise is a generalised path based query language which applies not only to annotations but to generalised directed graphs and which reduces to XPath as a special case.
For the purposes of this paper I'll talk about a particular kind of directed graph which is a generalisation of the graphs described for annotations. These graphs can be described as edge labelled directed graphs. An example is the XML graph structure which contains edges labelled 'child' and 'attribute'. The nodes in these graphs have two labels: a type (cf. the XML element name) and a textual label. This is an interesting class of graphs which subsumes for example, XML, RDF and our annotation data.
The remainder of this paper reviews the literature on path based query languages and then describes the generalisation of XPath and evaluates this in the annotation domain and in more general graph based data.
review both annotation query and graph/path query stuff
For the most part the query languages proposed and
developed for annotation data have been of ad-hoc design and
are able to cope only with a subset of the kinds of annotation
data seen in practice. For example, the Emu query language
[Cassidy 01] provides a means of
finding nodes based on sequential and hierarchical relations
but lacks many basic query language features such as
disjunction and compositionality. Most other annotation query
tools have made use of general purpose structural pattern
matching tools such as Sgrep
www.cs.helsinki.fi/u/jjaakkol/sgrep.html
which allow for regular expression like searches in structured
textual data.
One important proposal that has been made recently follows from the work on annotation structures by Bird and his colleagues [Bird & Liberman 01]. [Bird et al 00] propose a query language based on path expressions in an Annotation Graph (AG). An AG is a directed graph representation of an annotation which foregrounds the sequential structure of annotations; nodes in the graph are anchor points and edges represent linguistic elements. Hierarchical structure is implicit in AG when a dominating edge spans more than one subordinate edge. While this might seem very different to the hierarchical view of annotations described above, it can be shown [Cassidy 01] that they are able to represent the same structures.
Queries in the proposed AG query language are made up of a set of path expressions which describe paths through the annotation graph. For example in the query term X.[:word, label: cat].Y, X and Y are query variables and stand for nodes in the AG and the terms in brackets specify the type (:utt) and attribute values associated with the edge between these nodes.
While this query language proposal has a number of interesting features, it is awkward to express queries in. Since it focuses on traversing the sequential AG relations only, expressing hierarchical queries is very unnatural. Evaluation against a number of Annotation use cases has shown [Cassidy et al 00] that while it is capable of expressing all of the use cases, the constructs needed to do so are highly complex and would be difficult for naive users to work with. Part of the effort in this paper is to work with a very natural query language component (XPath) and see if it can be extended for annotation query wile retaining the naturalness of expression.
This will be a review of the DB literature relating to path queries. In particular looking at things like Graphlog [Consens & Mendelzon] and the G+ language described in [Mendelzon & Wood]. The general conclusion is that there is a good deal of prior work on these kind of languages but that none of them seem to quite fit the problem expressed here. For example, G+ provides a query language where regular expressions can be used to match sequences of edges in the graph but does not enable constraints to be placed on the nodes traversed. In addition to this point, the underlying goal of this paper is not to derive a new formal query language but to explore the generalisation of the existing XPath language. It may become clear during this exercise that the underlying principles are those of an already described QL; this is a good thing since we can then take advantage of the results relating to that QL in implementing our annotation QL.
The use-case analysis of XQuery highlights a number of deficiencies of XPath which might be addressed by a generalisation of the language. These are listed here with some commentary on the motivation for each.
While locating nodes in a graph is the sole use of XPath, there are cases in annotation query and perhaps in other applications where selecting paths through the graph is important. The prototypical example from annotation query is find sequences of phonemes with property X. Note that no changes to the path language are needed to achieve this, but that the query processor needs to be augmented to maintain path sets rather than node sets.
The set of axes defined in XPath is clearly designed to allow the set of graph traversal operations that are seen to be atomic in XML document trees. In a more general graph structure, it is clearly more difficult to foresee what axes will be needed and in general it is more useful to define what an axis is than a particular set of axes being prescribed.
An XPath axis is fundamentally a mapping from nodes to nodesets and defines a way of traversing the underlying graph. Each axis encapsulates two things: a type of edge to follow (eg. child vs. attribute) and whether to follow it transitively (eg. child vs. descendant).
One approach to extending the axis set is to allow arbitrary definitions of axes which are simply mappings from nodes to node sets. If we assume that the graph structure being searched is the edge labelled directed graph described earlier, then any edge label might be associated with an axis; in addition, axes might be defined for more complex traversal -- an example might be an uncle axis which returns siblings of a node's parent using a combination of parent and child edges. Similarly, for an XML document one might define an idref axis which followed idref links.
This approach would allow the inclusion of transitivity in the axis definition as is the case with the descendant axis in XPath. However, a more general solution is to factor this out as a modifiable parameter in a path expression. The current binary single step/transitive distinction can be generalised to a how many steps parameter which controls how many steps along an axis are allowable. For example, I could find my grandparents by stepping along the parent axis exactly twice; I could find all of my children and grandchildren by stepping along the child axis up to two times.
The proposal here is to define an axis as a function mapping nodes to pathsets; the function takes a parameter denoting how many steps to take along the axis in generating new paths.
In XPath, conditions apply only to the last node on the path extended from the context node; for example, the expression /descendant:para[@style='important'] traverses along the child axis any number of steps to find paragraphs with a particular attribute. No condition is placed on any of the intermediate nodes along the axis.
Rather than limiting conditions to the final element of a path, the approach taken here is to allow conditions on any subset of the nodes traversed while extending the path. For example, a condition might apply to all nodes, the first node or the last two nodes.
Introducing this generalisation allows the use-case described earlier to be addressed. A sequence of nodes labelled C can be located by walking along the following axis an unlimited number of steps applying the condition to every node.
The generalisations described above can be implemented in
a new path language where a query consists of a series of calls
to a single extend function
which maps pathsets to pathsets. The signature of extend is as follows:
extend [axis] [step] [condition] [where] |
The [axis] argument defines an axis which must somehow define a mapping from nodes to nodesets in the graph. Each axis need only define what it means to take a single step from a node. We foresee a framework providing provision for definition of new axes on particular graphs in terms of a base set of axes or via a general purpose programming language.
The [step] argument determines how man steps should be taken along the axis and takes the form of a list of positive integers or the special value inf. The interpretation of this list is that a path length is allowed if included in the list. The special value of inf means that any length of path is valid. If the list includes integers and inf then any path of length greater than the highest integer is valid. For example, [2,3,4] would allow paths of length 2, 3 or 4 but not of length 1 or 5, [1,10,inf] would allow paths of length 1 or longer than 10. It is expected that syntactic sugar would be introduced to specify ranges of integers etc. Note that when the only integer specified is zero, the path will not be extended at all; this only makes sense in combination with where = 0, see below for a discussion.
The [condition] argument is intended to be a general boolean condition on nodes in the same sense as XPath conditions. This would include recursive calls to the path language and any conditions on node labels that are appropriate to the application. The exact form of these conditions is not specified here; in the trial implementation we have implemented a general callout to the host programming language (Tcl). As in XPath, the condition operates on a context node; unlike XPath this might be many nodes along a path.
The [where] argument specifies where on the path the condition must hold. As with [step] this is a list of positive integers plus the special values of inf and end. An integer in the list means that the condition is applied to that node on the path. The special value inf means that the condition must hold for all nodes in the path. The special value end means that the condition applies to the end of the path only. For example, the list [1,2,end] would find paths where the first, second and last nodes satisfied the condition.
An important special case is when the where argument is includes a zero. Path positions are indexed from one so that the first node added to a path has an index one. The zero index refers to the context node -- that is the start node for the path extension. If where includes zero then the condition must hold on the context node with the result that all existing paths who's last node does not meet the condition will be pruned. In XPath, the same effect is achieved via a special axis called self; the mechanism proposed here removes the need for this special axis. In combination with a zero step argument, this can result in a pure prune operation where no new paths are extended but any existing paths not fulfilling the condition is pruned.
In XPath, there is a unique starting node from which an XPath expression is evaluated. This is either a node defined by external context or the root of the XML document. Extend programs could maintain the idea of an externally provided node (or nodeset or pathset) but since the underlying data model is a generalised directed graph there is no equivalent to the document root. Two possible starting conditions seem plausible: start with the empty nodeset or start with the set of all nodes. Starting from the empty set requires some magic behaviour for each axis: walking from the empty set along any axis leads to the set of all nodes. Given this, it seems best to assume the full nodeset as a starting point for an extend program which isn't given external context.
To illustrate the ideas described above, here are two example extend programs which implement an XPath and annotation query use-cases. In these examples a global path set is assumed but in practice this would be passed between calls to the extend function.
//book/chapter/following-sibling:chapter.
Using the child and following-sibling axes and assuming that
each node has a type
label corresponding to the element name:
extend child [inf] type='book' [end] extend child [1] type='chapter' [end] extend following-sibling [inf] type='chapter' [end] |
Find sequences of any number
of consonant phonemes followed by a single vowel followed by
any number of consonants: C+VC+. Assuming a
following axis which
connects nodes in sequence the extend program is:
extend following [0,inf] type=phoneme & label=C [0,all] extend following [1] type=phoneme & label=V [all] extend following [1,inf] type=phoneme & label=C [all] |
While extend programs might provide the required
expressive power it is clearly not suitable for end users and
a more natural query syntax is desirable. An obvious choice
is to stay close to the XPath syntax and add new elements for
the step and where parameters. One possible syntax would be:
/axis{step}:condition:where |
/following{0,inf}:phoneme[label=C]:{0,all}
/following:phoneme[label=C]
/following{0,inf}:phoneme[label=C]:{0,all}
|
This section will talk about potential applications for the generalised path expression syntax. An obvious example is searching RDF graphs.
We have a trial implementation of an extend program
interpreter written in Tcl which can operate on arbitrary graph
data. Demonstrations have been build for XML and generalised
graphs, and we are working on an implementation for Annotation
Graphs. This section will describe the naive algorithm used to
implement extend. For now, here's the Tcl code that implements extend:
itcl::body xxgraph::extend {axis range condition where paths} {
set range [range $range]
set where [range $where]
set matchingpaths [list]
foreach path $paths {
if {[llength $path]} {
if {[iselement 0 $where]} {
# check condition on 0th node
if {![uplevel 1 [list $condition [pathend $path]]]} {
continue
}
}
if {[iselement 0 $range]} {
# unextended path matches
lappend matchingpaths $path
}
}
# start extending
set i 1
set currentpaths [list $path]
while {![islarger $i $range]} {
set newpaths [list]
foreach p $currentpaths {
# check all nodes along axis
foreach node [axis_$axis [pathend $p]] {
# only check condition on lengths in $where
if {![iselement $i $where] || \
[uplevel 1 [list $condition $node]]} {
if {[llength $p]} {
# extend along axis
set n [concat $p $axis $node]
} else {
# extend an empty path
set n [pathinitial $node]
}
lappend newpaths $n
if {[iselement $i $range]} {
# extended by a length in $range,
# i.e. found a match
lappend matchingpaths $n
}
}
}
}
incr i
if {![llength [set currentpaths $newpaths]]} {
# no new paths available to extend further
break
}
}
}
set matchingpaths
}
|
This section will also acknowledge that an efficient implementation of this language might be very difficult given it's generality. However, we've not yet looked at this aspect of the problem. It may be that restrictions need to be placed on the language to enable efficient implementation.
In conclusion, XPath can be generalised to a new path language for generalised edge labelled directed graphs which provides a very consistent set of semantics for axes and path extension. This new language reduces to XPath as a special case where only single steps are taken along an axis and conditions are only applied to nodes at the end point of the path. We believe that this is an interesting generalisation which might form the basis of a powerful and natural query language for Linguistic annotations and which might find application in other areas.
I'd like to thank Dan Steffan for the implementation of the prototype extend program interpreter and discussions on the graph searching problem.
[MUC-7] Message Understanding
Conference Proceedings (MUC-7). Science
Applications International Corporation,
1998.
www.muc.saic.com/proceedings/muc_7_toc.html.
[Voorhees 98] Ellen M. Voorhees and Donna K. Harman,
editors. NIST Special Publication 500-242: The
Seventh Text Retrieval Conference (TREC-7). NIST,
Government Printing Office, 1998.
trec.nist.gov/pubs/trec7/t7_proceedings.html.
[Church & Mercer 93] K.W. Church and R.L. Mercer. Introduction to the special issue on computational linguistics using large corpora. Computational Linguistics, 19(1):1--24, 1993.
[Bird & Liberman 01] A formal framework for linguistic annotation Steven Bird and Mark Liberman, Speech Communication 33(1,2), pp 23-60, 2001.
[Cassidy 01] S. Cassidy and J. Harrington, Multi-level annotation in the Emu speech database management system Speech Communication, 33, 61-77, January, 2001.
[Thompson & McKelvie 97] Thompson, H. S. and
McKelvie, D., Hyperlink semantics for standoff
markup of read-only documents, in SGML Europe 1997,
www.ltg.ed.ac.uk/~ht/sgmleu97.html
[TEI] Sperberg-McQueen, C.M.. and Burnard,
L. (eds.). TEI P4: Guidelines for
Electronic Text Encoding and Interchange. Text
Encoding Initiative Consortium. XML Version: Oxford,
Providence, Charlottesville, Bergen
www.tei-c.org/
[Cassidy 02] Cassidy, S. XQuery as an Annotation Query Language: a Use Case Analysis, Proceedings of the Language Resources and Evaluation Conference, Las Palmas, Spain, 2002
[Cassidy et al 00] Steve Cassidy, Pauline Welby, Julie McGory, Mary Beckman Testing the Adequacy of Query Languages Against Annotated Spoken Dialog. Proceedings of the Speech Science and Technology Conference, Canberra, December 2000.
[Bird et al 00] Steven Bird, Peter Buneman and Wang-Chiew Tan Towards a query language for annotation graphs , Proceedings of the Second International Conference on Language Resources and Evaluation, pp 807-814, 2000.
[Consens & Mendelzon]
Consens, M.P. and Mendelzon, A. O. 1990. Graphlog: A Visual Formalism for Real Life
Recursion, Proc. of the ACM Symp. on Principles
of Database
Systems.
citeseer.nj.nec.com/consens90graphlog.html
[Mendelzon & Wood] Alberto O. Mendelzon, Peter T. Wood Finding Regular Simple Paths In Graph Databases Proceedings of the 15th Conference on Very Large Databases, Morgan Kaufman pubs. (Los Altos CA), Amsterdam, 1989