toolkit useful for working with XML documents

How To Use XML Utils

Read An XML Document

Read an XML document and create a TreeDocument:

// provide a file
File file = new File ("/path/to/doc");
TreeDocument document = new TreeDocument (XmlTools.readDocument (file), file.toURI ());
// or a string containing the XML code
TreeDocument document = new TreeDocument (XmlTools.readDocument ("<xml> [...] </xml>"), null);

The second parameter of the constructor is a URI, which is used to resolve relative links to resources referenced from within the document.
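To get a feeling for what such a base URI is good for, here is a minimal sketch that uses only the JDK's java.net.URI. It does not show how XML Utils resolves links internally; the class BaseUriSketch, the method resolveLink and the link targets are made up for illustration.

```java
import java.io.File;
import java.net.URI;

class BaseUriSketch {
    /** Resolves a link found inside a document against the document's base URI. */
    static URI resolveLink(URI base, String link) {
        return base.resolve(link);
    }

    public static void main(String[] args) {
        // hypothetical location, mirroring the placeholder path used above
        URI base = new File("/path/to/doc").toURI();
        // a relative link is interpreted relative to the document's directory
        System.out.println(resolveLink(base, "figures/model.png"));
        // ".." steps out of the document's directory
        System.out.println(resolveLink(base, "../shared/units.xml"));
        // an absolute link is returned unchanged
        System.out.println(resolveLink(base, "http://example.org/schema.xsd"));
    }
}
```

If you pass null as the base (as in the string example above), there is nothing to resolve relative links against, so this only makes sense for documents without relative references.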

Object Structure


The main object you’ll deal with is the TreeDocument (see the JavaDoc). The TreeDocument maintains several maps to easily access certain nodes in the tree. To look up nodes by key you can use methods such as getNodeById, getNodesByTag, getNodeByPath and getNodesByHash, all of which appear in the examples on this page.
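The idea of keeping lookup maps next to the tree can be sketched with the plain JDK DOM API. This is only an illustration of the concept, not the TreeDocument implementation; the class NodeIndexSketch and its fields byId and byTag are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeIndexSketch {
    // id attribute -> element carrying that id
    final Map<String, Element> byId = new HashMap<>();
    // tag name -> all elements with that tag, in document order
    final Map<String, List<Element>> byTag = new HashMap<>();

    NodeIndexSketch(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            index(root);
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    /** Walks the tree once and registers every element in both maps. */
    private void index(Element element) {
        byTag.computeIfAbsent(element.getTagName(), k -> new ArrayList<>()).add(element);
        if (element.hasAttribute("id"))
            byId.put(element.getAttribute("id"), element);
        for (Node kid = element.getFirstChild(); kid != null; kid = kid.getNextSibling())
            if (kid instanceof Element)
                index((Element) kid);
    }

    public static void main(String[] args) {
        NodeIndexSketch idx = new NodeIndexSketch(
            "<conversation><message id='messageone'/><message id='messagetwo'/></conversation>");
        System.out.println(idx.byTag.get("message").size());         // 2
        System.out.println(idx.byId.get("messageone").getTagName()); // message
    }
}
```

Building the maps once costs a single pass over the tree, and every later lookup by id or tag is a plain map access.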


Small example to see the usage of TreeDocuments:

File file = new File ("/path/to/doc");
TreeDocument document = new TreeDocument (XmlTools.readDocument (file), file.toURI ());
System.out.println ("There are " + document.getNumNodes () + " nodes in this tree");

// get all subtrees (i.e. nodes rooting these trees) ordered by size, biggest first:
TreeNode[] subTrees = document.getSubtreesBySize ();

// get the root of the tree
TreeNode root = document.getRoot ();

// get the node with id="sems"
// and print the number of nodes below this node
DocumentNode semsNode = document.getNodeById ("sems");
System.out.println ("#nodes below sems-node: " + semsNode.getSizeSubtree ());

// get the first node having a tag name of "example"
// and print the level of its parent
DocumentNode node = document.getNodesByTag ("example").get (0);
System.out.println ("level of parent of first <example> node: " + node.getParent ().getLevel ());

// compare two trees
TreeDocument document2 = new TreeDocument (XmlTools.readDocument (new File ("/path/to/doc")), null);
document2.equals (document); // same document -> true

TreeDocument document3 = new TreeDocument (XmlTools.readDocument (new File ("/path/to/other/doc")), null);
document3.equals (document); // different document -> most likely false

Find the full example in /src/main/java/de/unirostock/sems/xmlutils/eg/TreeUsageExample.java.

The objects housed in a tree document are of type TreeNode.


A TreeNode (see the JavaDoc) represents a node in a document. There are two types of nodes: a TextNode represents textual content in a document and a DocumentNode represents an XML node, i.e. an element with a tag name and attributes. These classes define a number of getters and setters; just have a look at the corresponding JavaDoc. Here are some more details for less common use cases:

Node Hashes

Each node in the document has two hash values that characterise the node and its subtree (note: they are not necessarily unique within the document). You can access these hash values using getOwnHash and getSubTreeHash. The OwnHash identifies the TreeNode itself, so <node> and <node> have the same OwnHash, while the OwnHash of <node attr='value'> is different. In a similar fashion, the SubTreeHash identifies the subtree rooted in the corresponding node.
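The difference between the two hashes can be sketched with the JDK alone. Note that this is an assumption-laden illustration: the hashing scheme below (SHA-256 over tag name and sorted attributes, children combined in order) is made up and not necessarily what XML Utils computes, and NodeHashSketch, ownHash and subTreeHash are hypothetical names.

```java
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeHashSketch {
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    /** Hash of the node alone: tag name plus attributes (sorted by name), children ignored. */
    static String ownHash(Element element) {
        TreeMap<String, String> attrs = new TreeMap<>();
        NamedNodeMap map = element.getAttributes();
        for (int i = 0; i < map.getLength(); i++)
            attrs.put(map.item(i).getNodeName(), map.item(i).getNodeValue());
        return sha256(element.getTagName() + attrs);
    }

    /** Hash of the whole subtree: the node's own hash combined with its children's subtree hashes. */
    static String subTreeHash(Node node) {
        if (!(node instanceof Element))
            return sha256("text:" + node.getNodeValue());
        StringBuilder sb = new StringBuilder(ownHash((Element) node));
        for (Node kid = node.getFirstChild(); kid != null; kid = kid.getNextSibling())
            sb.append(subTreeHash(kid));
        return sha256(sb.toString());
    }

    static String sha256(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest)
                hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element plain = parse("<node/>");
        Element withAttr = parse("<node attr='value'/>");
        Element withKid = parse("<node><kid/></node>");
        System.out.println(ownHash(plain).equals(ownHash(parse("<node/>"))));  // true
        System.out.println(ownHash(plain).equals(ownHash(withAttr)));          // false
        System.out.println(ownHash(plain).equals(ownHash(withKid)));           // true
        System.out.println(subTreeHash(plain).equals(subTreeHash(withKid)));   // false
    }
}
```

The last two lines show why both values are useful: two nodes can look identical on their own while the subtrees below them differ.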

Node Weights

The nodes in the document have weights depending on their subtrees; roughly, the weight grows with the size of the subtree. The object that computes the weight of a node can be defined when creating the tree document (see the extra constructor). It needs to extend the Weighter class and defaults to the SemsWeighter.
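To make "weight ~ size of subtree" a bit more concrete, here is a small sketch on top of the JDK DOM API. It does not use the Weighter class and does not claim to reproduce what the SemsWeighter computes; SubtreeWeightSketch, weight and the ownWeight function are hypothetical and only mimic the idea of a pluggable weighting strategy.

```java
import java.io.StringReader;
import java.util.function.ToDoubleFunction;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class SubtreeWeightSketch {
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    /** Weight of a subtree: the node's own weight plus the weights of all subtrees below it. */
    static double weight(Node node, ToDoubleFunction<Node> ownWeight) {
        double sum = ownWeight.applyAsDouble(node);
        for (Node kid = node.getFirstChild(); kid != null; kid = kid.getNextSibling())
            sum += weight(kid, ownWeight);
        return sum;
    }

    public static void main(String[] args) {
        Node root = parse("<a><b>text</b><c/></a>");
        // every node counts 1, so the weight equals the number of nodes in the subtree
        System.out.println(weight(root, n -> 1.0)); // 4.0
        // alternative strategy: text nodes weigh as much as they have characters
        System.out.println(weight(root, n -> n instanceof Text ? ((Text) n).getLength() : 1.0)); // 7.0
    }
}
```

Swapping the function changes how much certain kinds of nodes contribute while the recursion over the subtree stays the same.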

See a small example to get an idea of nodes and their usage:

// get root node
DocumentNode root = document.getRoot ();
// root's children
Vector<TreeNode> firstLevel = root.getChildren ();
System.out.println ("There are " + firstLevel.size () + " children in " + root.getXPath () + " :");
for (TreeNode kid : firstLevel)
	System.out.println ("\t" + kid.getXPath () + " having " + ((DocumentNode) kid).getNumLeaves () + " leaves and a weight of " + kid.getWeight ());

// get first message node
DocumentNode message = document.getNodeById ("messageone");
// you can also get access to the same node using its path:
TreeNode sameNode = document.getNodeByPath (message.getXPath ());
// let's test if it's really the same:
System.out.println ("found same node by id and by XPath? " + (sameNode == message));
// you can also get this node by its signature (here I know it's the first node having this hash value)
sameNode = document.getNodesByHash (message.getSubTreeHash ()).get (0);
// test:
System.out.println ("found same node by id and by hash? " + (sameNode == message));

// let's print some information about this node
System.out.println ("Path to the message node: " + message.getXPath ());
System.out.println ("Path to its parent: " + message.getParent ().getXPath ());
System.out.println ("Weight of the message node: " + message.getWeight ());
System.out.println ("Signature of the message node: " + message.getOwnHash ());
System.out.println ("Signature of the subtree rooted in the message node: " + message.getSubTreeHash ());
System.out.println ("#number nodes in its subtree: " + (message.getSizeSubtree () + 1));
System.out.println ("number of direct children: " + message.getNumChildren ());
System.out.println ("id of the node: " + message.getId ());
System.out.println ("tag name: " + message.getTagName ());
System.out.println ("attributes in this node:");
for (String attr : message.getAttributes ())
	System.out.println ("\t" + attr + " => " + message.getAttribute (attr));

// remove the node from the tree
DocumentNode parent = message.getParent ();
parent.rmChild (message);
// and reinsert it
parent.addChild (message);

// note how the path of the node has changed
System.out.println ("New path to the message node: " + message.getXPath ());
// but everything else is still the same
System.out.println ("Weight of the message node: " + message.getWeight ());
System.out.println ("Signature of the subtree rooted in the message node: " + message.getSubTreeHash ());

Find the full example in /src/main/java/de/unirostock/sems/xmlutils/eg/NodeUsageExample.java



The class DocumentTools provides some static functions to print (parts of) a document. Just pass the node rooting the tree to print as an argument (e.g. document.getRoot () to print the whole document). Use printSubDoc to print a tree on a single line and printPrettySubDoc to get the tree as a pretty string (i.e. indented etc.):

String prettyDoc = DocumentTools.printPrettySubDoc (document.getRoot ());
String justOneLineDoc = DocumentTools.printSubDoc (document.getRoot ());
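For comparison, the same distinction between a compact and an indented serialisation can be sketched with the JDK's own Transformer API. This is not how DocumentTools works internally, and the output of the JDK serialiser may differ in detail from what printSubDoc and printPrettySubDoc return; PrintSketch and its print method are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class PrintSketch {
    static Node parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    /** Serialises the subtree rooted in the given node, either compact or indented. */
    static String print(Node root, boolean pretty) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.INDENT, pretty ? "yes" : "no");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(root), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("cannot serialise node", e);
        }
    }

    public static void main(String[] args) {
        Node root = parse("<a><b>x</b><c/></a>");
        System.out.println(print(root, false)); // everything on one line
        System.out.println(print(root, true));  // child elements on separate lines
    }
}
```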

MathML conversion

DocumentTools also contains a method to convert content MathML to presentation MathML: transformMathML. Just pass the DocumentNode that roots the MathML tree and you get a string containing the presentation MathML, e.g.:

String presentationMathML = DocumentTools.transformMathML (contentMathMLFile.getRoot ());