Behind-the-Scenes Secrets of Jsoup II: Traverse A Tree Before It Was Built

Here we go! We’ve got an HTML or XML snippet. It may be clean or dirty, small or large, safe or dangerous, but our task won’t change: we need to extract and manipulate data using the best of DOM, CSS, and jQuery-like methods. The basic idea is crystal clear: build a DOM tree and traverse it, with or without filtering. That’s it. Sounds easy, right? 😅 Then just do it… and you would probably… cry.

A tree consists of a root node and its children, from top to bottom. So it’s straightforward to get the root node first, retrieve its attributes, then traverse to its child nodes while maintaining the parent-child relationships along the way. The class diagram of org.jsoup.nodes, generated by IntelliJ IDEA, looks like this:

PS: LeafNode was introduced in 1.11.1 as a memory-usage optimization: “By refactoring the node hierarchy to not track child nodes or attributes by default for leaf nodes. For the average document, that’s about a 30% memory reduction.” You can get more details in #911, which is a good example of a performance improvement.

Other classes will be introduced later; for now, note that Element and Document are the most frequently used, while Node is the root of the inheritance tree. Fields and methods worth noticing are listed below:

  • attributes: a key-value collection, with keys like ‘type’, ‘id’, ‘name’, ‘value’, etc. Jsoup also maintains a list of boolean attribute keys such as ‘checked’ (for example: <input type="checkbox" checked value="Registered" />). Two classes, Attributes and Attribute, map to this concept. A node can read and write attributes, which provides the ability to clean and correct the given HTML
  • parentNode, childNodes and siblingNodes are used to access nodes related to the current one
  • baseUri(), which is used to convert relative URL addresses to absolute ones (see the sketch below)
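
To make those fields concrete, here is a minimal sketch (the HTML snippet and base URI are made up for illustration) that reads and writes attributes and resolves a relative href against the base URI, using jsoup’s public API:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NodeBasicsDemo {
    public static void main(String[] args) {
        // Made-up snippet; the second argument to parse() is the base URI.
        String html = "<a id=\"home\" href=\"/index.html\">Home</a>";
        Document doc = Jsoup.parse(html, "https://example.com/");

        Element link = doc.selectFirst("a");
        System.out.println(link.attr("id"));          // "home" -- read an attribute
        link.attr("rel", "nofollow");                 // write (or correct) an attribute
        System.out.println(link.hasAttr("rel"));      // true
        System.out.println(link.baseUri());           // "https://example.com/"
        System.out.println(link.absUrl("href"));      // "https://example.com/index.html"
        System.out.println(link.parent().nodeName()); // "body" -- parent/child/sibling access
    }
}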

I will talk more about node traversal in the “CSS Selector” part, but let’s touch on it a little here first. Slightly different from Yihua Huang‘s interpretation, Jonathan has made some refactorings here, but the concept is still the same. These lines of code in the class NodeTraversor play an important role in the whole picture:

public static void traverse(NodeVisitor visitor, Node root) {
    Node node = root;
    int depth = 0;

    while (node != null) {
        visitor.head(node, depth);
        if (node.childNodeSize() > 0) {
            node = node.childNode(0);
            depth++;
        } else {
            while (node.nextSibling() == null && depth > 0) {
                visitor.tail(node, depth);
                node = node.parentNode();
                depth--;
            }
            visitor.tail(node, depth);
            if (node == root)
                break;
            node = node.nextSibling();
        }
    }
}

Using recursion here would be much easier, but dangerous: a StackOverflowError might occur if the DOM tree is too deep. Iteration is a must here. Don’t tell me you want to use -Xss256M to silence the world. 😅
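
For contrast, a naive recursive version (my own sketch, not jsoup’s code) would look like the snippet below. Every level of nesting costs one stack frame, which is exactly what the iterative loop above avoids:

import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

// A sketch of the recursive equivalent -- NOT jsoup's implementation.
// Each nested node adds a stack frame, so a pathologically deep document
// (e.g. thousands of nested <div>s) can blow the stack.
class RecursiveTraversal {
    static void traverse(NodeVisitor visitor, Node node, int depth) {
        visitor.head(node, depth);
        for (int i = 0; i < node.childNodeSize(); i++)
            traverse(visitor, node.childNode(i), depth + 1);
        visitor.tail(node, depth);
    }
}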

I want to emphasize that NodeVisitor is an interface. It defines what you would do when a node is first visited and last visited.

public interface NodeVisitor {
    /**
     * Callback for when a node is first visited.
     */
    void head(Node node, int depth);

    /**
     * Callback for when a node is last visited, after all of its descendants have been visited.
     */
    void tail(Node node, int depth);
}
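
As a quick illustration, here is a tiny anonymous NodeVisitor (my own example, not from the jsoup source) that prints every node indented by its depth, driven by NodeTraversor:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeTraversor;
import org.jsoup.select.NodeVisitor;

public class PrintVisitorDemo {
    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div><p>Hello <b>there</b>!</p></div>");
        NodeTraversor.traverse(new NodeVisitor() {
            public void head(Node node, int depth) {   // entering the node
                System.out.println(indent(depth) + "-> " + node.nodeName());
            }
            public void tail(Node node, int depth) {   // leaving the node
                System.out.println(indent(depth) + "<- " + node.nodeName());
            }
        }, doc);
    }

    private static String indent(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) sb.append("  ");
        return sb.toString();
    }
}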

Implementations such as Accumulator are used to collect the nodes that pass the evaluations parsed from a query; that’s actually the key point of document.select("${selector}"). Go to the class Collector and you will find it:

/**
 * Build a list of elements, by visiting root and every descendant of root, and testing it against the evaluator.
 * @param eval Evaluator to test elements against
 * @param root root of tree to descend
 * @return list of matches; empty if none
 */
public static Elements collect(Evaluator eval, Element root) {
    Elements elements = new Elements();
    NodeTraversor.traverse(new Accumulator(root, elements, eval), root);
    return elements;
}

Accumulator keeps a reference to the node where the traversal begins (which means it doesn’t necessarily need to be the document root), and updates elements, which is essentially an ArrayList of Element (the Elements class extends ArrayList<Element>).

private static class Accumulator implements NodeVisitor {
    private final Element root;
    private final Elements elements;
    private final Evaluator eval;

    Accumulator(Element root, Elements elements, Evaluator eval) {
        this.root = root;
        this.elements = elements;
        this.eval = eval;
    }

    public void head(Node node, int depth) {
        if (node instanceof Element) {
            Element el = (Element) node;
            if (eval.matches(root, el))
                elements.add(el);
        }
    }

    public void tail(Node node, int depth) {
        // void
    }
}

Other NodeVisitor implementations such as FormattingVisitor, CleaningVisitor, OuterHtmlVisitor and W3CBuilder are also used in certain scenarios. NodeTraversor centralizes the depth-first search so that each visitor only supplies its own extra behavior, which keeps the code clean and consistent.

However, I guess you are pretty confused by now. Why the heck do we begin with node traversal? We don’t even have a tree built yet!

Yes, exactly. But remember, you have to begin with the end in mind. You ultimately want to find DOM elements matching given search criteria, so the first thing you think about is actually traversal, and that also affects how you build the tree (perhaps I should say “parse the tree” here).

And that touches on the topic of compilers. Fortunately, we don’t need to dig deep into that mysterious area: lexing and parsing are enough to achieve the goal. We will talk about the state machine and the State pattern in the next article.

-To Be Continued-