Introduction
A ternary search tree is an interesting data structure that behaves like a hashtable in many respects, but for smaller datasets tends to use less memory. Additionally it's ideal for some interesting search problems.
In this post I'll explore how to use a TST to solve a word chains problem.
The Problem
The challenge is--given a flat wordlist--to find a path from one arbitrary word to another. Each step in the path must be word in the list and must only differ from the previous step by exactly one letter. e.g.
"dog" -> "cog" -> "cot" -> "cat"
The start word and the end word must be the same length, and there isn't necessarily a path between two words.
The Tree
While there are other data structures that could be used to solve the problem, a Ternary Search Tree has some unique attributes that'll be shown later, making it an exceptionally elegant solution.
What is a TST? The best way to describe a data structure, in my opinion is with pictures, so without further ado...
B
|
A
|
S
| \
| \
S D
/ | |
/ * *
E \
| \
* E
|
*
["bass", "base", "bad", "bade"]
The left and right branches each represent characters that may be used in place of the one at their parent. Astrix nodes means a word is present at that location.
Constructing the Tree
First we have to define the basic class structure:
class Node:
def __init__(self, data):
self.data = data
self.left = self.mid = self.right = None
def __repr__(self):
return "%s -> (%s, %s, %s)" % (self.data, self.left, self.mid, self.right)
Node.data is either the character that is represented by this node or a sentinel value that tells us there's a word at this position (more on this later...)
And the tree itself, is simple enough.
class TernaryTree:
def __init__(self, root = None):
self.root = root
A tree isn't much use without a way to insert more elements into it.
class TernaryTree:
...
def __rinsert(self, node, remains, orig):
try:
if node is None:
node = Node(remains[0])
if remains[0] < node.data:
node.left = self.__rinsert(node.left, remains, orig)
elif remains[0] > node.data:
node.right = self.__rinsert(node.right, remains, orig)
else:
node.mid = self.__rinsert(node.mid, remains[1:], orig)
except IndexError:
node = Node("")
node.mid = orig
return node
The recursive insert takes three arguments:
- node is the current position in the tree.
- remains is the remaining characters of the string that is being inserted. Since nodes are only ever a single character, each time a node is inserted that character gets chomped off the remaining string.
- orig is the original string. This will be used later.
The first if statement checks to see if the node at this position exists. If it doesn't then create a new node.
The next if block determines whether the current character needs to go to the left, right, or center of the current node. The only difference between the three is that when the split character (remains[0]) is the same as this node, the character has been inserted and can be popped off the stack of remaining characters.
The try block is a bit tricky. If remains is an empty string and remains[0] is referenced it will raise an IndexError. And remains should only ever be an empty string when all of the letters have been inserted. The lines:
node = Node("")
node.mid = orig
Create a sentinel node that marks the location of a word in the tree. The advantage of using the empty string is that no character in a string can ever be the empty string. Plus, when further inserts are done the empty string will just act like any other character in binary comparisons. Finally the node.mid property of a sentinel node is overridden to be the value of the string at that branch.
One more small addition and the tree can actually make some claim as a data structure:
class TernaryTree:
...
def insert(self, s):
assert(len(s) > 0)
self.root = self.__rinsert(self.root, s, s)
Searching the Tree
The tree isn't much use if its data can't be accessed (in a creative way!) Searching a TST is actually quite similar to search a binary search tree--or really, any recursive data structure for that matter. The trickiness is, this problem applies a couple of interesting constraints.
It's not particularly helpful in this case to search for a single element. Instead, the search needs to find all words that differ from the search term by only one character.
class TernaryTree:
...
def __rnear_search(self, node, remains, depth, results):
if not node or depth < 0: return results
try:
if depth > 0 or remains[0] < node.data:
results = self.__rnear_search(node.left, remains, depth, results)
if remains[0] == node.data:
results = self.__rnear_search(node.mid, remains[1:], depth, results)
elif remains[0] != node.data and "" != node.data:
results = self.__rnear_search(node.mid, remains[1:], depth-1, results)
if depth > 0 or remains[0] > node.data:
results = self.__rnear_search(node.right, remains, depth, results)
except IndexError:
if "" == node.data:
results += [node.mid]
return results
The code is somewhat complex. The recursive near search takes four arguments.
- node is the node currently being examined.
- remains is the remaining stack of characters to search for (similar to __rinsert).
- depth, functionally within the method, is best defined as the number of remaining errors (characters that differ in the tree from the string) that are allowed before a branch is ruled invalid (hopefully this will make more sense later).
- results is a sequence that contains the current result set.
The method is very similar to __rinsert in structure. The first and third if statements in the try block are symmetrical. If the number of substitutions hasn't been reached or the current character on the stack could be in that respective branch, it's searched.
The second if statement is where any complication resides. If the current letter on the remaining search stack is at this branch, the search stack is trimmed and the middle branch is searched. If the current letter is not at this node and this node is not a word node, trim the stack and search the middle branch: but with a decremented depth.
Decrementing the depth conceptually counts the current node as one of the allowed errors. Recall that in the ternary tree the left and right branches are optional replacements and the center branch is the next in sequence. So whenever a non-optional branch is followed which is not the same as the character at the corresponding location, one of the allowed errors is used up.
Finally, in the same fashion as __rinsert, an IndexError means the search stack has been exhausted. If, after being exhausted the current node is a word node, it's a hit and it added to the results.
And last, add an external method to the tree class so the search is accessible
class TernaryTree:
...
def near_search(self, s, d):
return self.__rnear_search(self.root, s, d, [])
Meta Searching
So far, so good. At this point the ternary search tree has all of the functionality needed to implement the path search on top of it.
The most obvious way to search for the shortest path between any two words is to do a breadth-first search. Treating the TST like a graph is simple enough: a node is just a word, and it's outgoing edges are tst.near_search(node, 1).
def path(node, parents, results):
if parents[node] is None:
return [node] + results
else:
results = [node] + results
return path(parents[node], parents, results)
def bfs(tst, start, end):
parents = {}
seen = {}
parents[start] = None
seen[start] = True
stack = [start]
while len(stack) > 0:
node = stack[0]
stack = stack[1:]
seen[node] = True
if node == end:
return path(node, parents, [])
else:
siblings = filter(lambda s: not seen.has_key(s) and s not in stack, tst.near_search(node, 1))
for n in siblings:
parents[n] = node
stack += siblings
return None
The code is quite self-explanatory, and since this article's focus is on ternary search trees, I'm not going to spend any time explaining it.
Conclusion / Putting it All Together
All the pieces are now in place to finish the solution, it's just a matter of putting them together.
Hopefully this has been a helpful article for anyone interested in ternary search trees. They were quite confusing to me when I first took a look at them, hopefully this clears some things up.
Source Code
Source code is available in the public domain here
Related Links
- http://www.ddj.com/windows/184410528;jsessionid=CF2RX40RWFQ1MQSNDLRSKHSCJUNN2JVN?_requestid=2827
- http://code.google.com/p/ternary-search-tree/
- http://www.geocities.jp/m_hiroi/light/pyalgo10.html