Posted by Sam Gibson
Thu, 20 Dec 2007 01:00:00 GMT
Introduction
A ternary search tree is an interesting data structure that behaves like a hashtable in many respects, but for smaller datasets tends to use less memory. Additionally it's ideal for some interesting search problems.
In this post I'll explore how to use a TST to solve a word chains problem.
The Problem
The challenge is--given a flat wordlist--to find a path from one arbitrary word to another. Each step in the path must be word in the list and must only differ from the previous step by exactly one letter. e.g.
"dog" -> "cog" -> "cot" -> "cat"
The start word and the end word must be the same length, and there isn't necessarily a path between two words.
The Tree
While there are other data structures that could be used to solve the problem, a Ternary Search Tree has some unique attributes that'll be shown later, making it an exceptionally elegant solution.
What is a TST? The best way to describe a data structure, in my opinion is with pictures, so without further ado...
B
|
A
|
S
| \
| \
S D
/ | |
/ * *
E \
| \
* E
|
*
["bass", "base", "bad", "bade"]
The left and right branches each represent characters that may be used in place of the one at their parent. Astrix nodes means a word is present at that location.
Constructing the Tree
First we have to define the basic class structure:
class Node:
def __init__(self, data):
self.data = data
self.left = self.mid = self.right = None
def __repr__(self):
return "%s -> (%s, %s, %s)" % (self.data, self.left, self.mid, self.right)
Node.data is either the character that is represented by this node or a sentinel value that tells us there's a word at this position (more on this later...)
And the tree itself, is simple enough.
class TernaryTree:
def __init__(self, root = None):
self.root = root
A tree isn't much use without a way to insert more elements into it.
class TernaryTree:
...
def __rinsert(self, node, remains, orig):
try:
if node is None:
node = Node(remains[0])
if remains[0] < node.data:
node.left = self.__rinsert(node.left, remains, orig)
elif remains[0] > node.data:
node.right = self.__rinsert(node.right, remains, orig)
else:
node.mid = self.__rinsert(node.mid, remains[1:], orig)
except IndexError:
node = Node("")
node.mid = orig
return node
The recursive insert takes three arguments:
- node is the current position in the tree.
- remains is the remaining characters of the string that is being inserted. Since nodes are only ever a single character, each time a node is inserted that character gets chomped off the remaining string.
- orig is the original string. This will be used later.
The first if statement checks to see if the node at this position exists. If it doesn't then create a new node.
The next if block determines whether the current character needs to go to the left, right, or center of the current node. The only difference between the three is that when the split character (remains[0]) is the same as this node, the character has been inserted and can be popped off the stack of remaining characters.
The try block is a bit tricky. If remains is an empty string and remains[0] is referenced it will raise an IndexError. And remains should only ever be an empty string when all of the letters have been inserted. The lines:
node = Node("")
node.mid = orig
Create a sentinel node that marks the location of a word in the tree. The advantage of using the empty string is that no character in a string can ever be the empty string. Plus, when further inserts are done the empty string will just act like any other character in binary comparisons. Finally the node.mid property of a sentinel node is overridden to be the value of the string at that branch.
One more small addition and the tree can actually make some claim as a data structure:
class TernaryTree:
...
def insert(self, s):
assert(len(s) > 0)
self.root = self.__rinsert(self.root, s, s)
Searching the Tree
The tree isn't much use if its data can't be accessed (in a creative way!) Searching a TST is actually quite similar to search a binary search tree--or really, any recursive data structure for that matter. The trickiness is, this problem applies a couple of interesting constraints.
It's not particularly helpful in this case to search for a single element. Instead, the search needs to find all words that differ from the search term by only one character.
class TernaryTree:
...
def __rnear_search(self, node, remains, depth, results):
if not node or depth < 0: return results
try:
if depth > 0 or remains[0] < node.data:
results = self.__rnear_search(node.left, remains, depth, results)
if remains[0] == node.data:
results = self.__rnear_search(node.mid, remains[1:], depth, results)
elif remains[0] != node.data and "" != node.data:
results = self.__rnear_search(node.mid, remains[1:], depth-1, results)
if depth > 0 or remains[0] > node.data:
results = self.__rnear_search(node.right, remains, depth, results)
except IndexError:
if "" == node.data:
results += [node.mid]
return results
The code is somewhat complex. The recursive near search takes four arguments.
- node is the node currently being examined.
- remains is the remaining stack of characters to search for (similar to __rinsert).
- depth, functionally within the method, is best defined as the number of remaining errors (characters that differ in the tree from the string) that are allowed before a branch is ruled invalid (hopefully this will make more sense later).
- results is a sequence that contains the current result set.
The method is very similar to __rinsert in structure. The first and third if statements in the try block are symmetrical. If the number of substitutions hasn't been reached or the current character on the stack could be in that respective branch, it's searched.
The second if statement is where any complication resides. If the current letter on the remaining search stack is at this branch, the search stack is trimmed and the middle branch is searched. If the current letter is not at this node and this node is not a word node, trim the stack and search the middle branch: but with a decremented depth.
Decrementing the depth conceptually counts the current node as one of the allowed errors. Recall that in the ternary tree the left and right branches are optional replacements and the center branch is the next in sequence. So whenever a non-optional branch is followed which is not the same as the character at the corresponding location, one of the allowed errors is used up.
Finally, in the same fashion as __rinsert, an IndexError means the search stack has been exhausted. If, after being exhausted the current node is a word node, it's a hit and it added to the results.
And last, add an external method to the tree class so the search is accessible
class TernaryTree:
...
def near_search(self, s, d):
return self.__rnear_search(self.root, s, d, [])
Meta Searching
So far, so good. At this point the ternary search tree has all of the functionality needed to implement the path search on top of it.
The most obvious way to search for the shortest path between any two words is to do a breadth-first search. Treating the TST like a graph is simple enough: a node is just a word, and it's outgoing edges are tst.near_search(node, 1).
def path(node, parents, results):
if parents[node] is None:
return [node] + results
else:
results = [node] + results
return path(parents[node], parents, results)
def bfs(tst, start, end):
parents = {}
seen = {}
parents[start] = None
seen[start] = True
stack = [start]
while len(stack) > 0:
node = stack[0]
stack = stack[1:]
seen[node] = True
if node == end:
return path(node, parents, [])
else:
siblings = filter(lambda s: not seen.has_key(s) and s not in stack, tst.near_search(node, 1))
for n in siblings:
parents[n] = node
stack += siblings
return None
The code is quite self-explanatory, and since this article's focus is on ternary search trees, I'm not going to spend any time explaining it.
Conclusion / Putting it All Together
All the pieces are now in place to finish the solution, it's just a matter of putting them together.
Hopefully this has been a helpful article for anyone interested in ternary search trees. They were quite confusing to me when I first took a look at them, hopefully this clears some things up.
Source Code
Source code is available in the public domain here
Related Links
- http://www.ddj.com/windows/184410528;jsessionid=CF2RX40RWFQ1MQSNDLRSKHSCJUNN2JVN?_requestid=2827
- http://code.google.com/p/ternary-search-tree/
- http://www.geocities.jp/m_hiroi/light/pyalgo10.html
Tags code, data structure, puzzle, ternary search tree | no comments | no trackbacks
Posted by Sam Gibson
Fri, 14 Dec 2007 18:12:00 GMT
Over a year ago I'd written up a preliminary specification for the REST design of a distributed social network. The rationale should be obvious to anyone who's paid attention to internet technology in the last decade: freedom from "walled gardens". Because of work though, I never was able to devote much time for the project. Recently several discussions with Scott have led to interest re-sparked.
The first step is to identify what Facebook/MySpace et. al. offer that can't be offered simply by adding XFN style links to a blogroll. As I see it, closed communities offer the following features.
-
Information exposure
Information is granular, and non-standard (i.e. The information exposed by Facebook is not the same as that exposed by LinkedIn, but there is generally some overlap)
-
Privacy controls (though some users of Facebook might argue this...)
-
Web real-estate.
-
"Secure" bi-directional relationship mappings. You can't say someone's your friend on MySpace unless they also agree that you're their friend.
-
Searching (in a generic sense, not a textbox, type in a name sense). Network construction might be a better term.
-
Defined (closed) communication protocol. When Alice friends Bob, both sides know what to do (well, duh--they're both on the same system).
Of these problems the most difficult to solve is search. There are two approaches:
- Push architecture. When a client changes data, it makes an RPC call to all of it's peers informing them that it has changed.
- Pull architecture. When a client changes data, it modifies a feed (RSS/Atom is the natural choice) which are periodically polled by peers looking for changes.
Coupled with a strong client-side caching mechanism and search becomes a more approachable problem.
Bi-directional relations are actually not as analogous to real-world interactions as uni-directional. There's no reason why Mallory can't consider Eve a friend and she doesn't reciprocate.
Communication protocol is another major hurdle. I think there are two options:
-
All web applications that function as a social node have a well defined REST hierarchy underneath them.
-
All URLs that function as social nodes have some meta data attached to them which tells clients where their accessors/methods can be found.
For the time being I've decided to stick with the former. The advantage of the later is that it's possibly easier to integrate with legacy systems (MovableType, WordPress, etc.)
I've started working on a proof of concept in Rails, though there's no reason it has to remain there. I've placed a bzr branch here. (yes, it's still infantile).
I haven't forgotten privacy is a complicated issue. I wish to return to it as some point, once I have a better solution mentally hashed out.
There is at least one project (DiSo) out there that is doing something similar to this. [details]
Tags distributed, project, social networks | 2 comments | no trackbacks
Posted by Sam Gibson
Wed, 05 Dec 2007 23:38:00 GMT
I was inspired several days ago thinking about one of the tactics (illegitimately) employed by crowd control peace officers to consider how such a poisoning of the well could be applied to a political race. I came up with several attacks that I think could work quite well. Admittedly the attacks aren't directly analogous to police instigators in a crowd, but it was that which started the thought process.
One such attack could be the following:
Candidate Foo is running a close race with Candidate Bar and either he, or his staff wishes to destroy Bar's public image.
- Candidate Foo sends one or more of his loyal staff members (individuals henceforth unconnected with the campaign would obviously be best) to work for Candidate Bar's campaign.
- Foo's clandestine operative waits for a signal or a predetermined time to announce allegations of improper sexual conduct on the part of Bar--experimenting with the sex of the covert op in relation to Bar's is an interesting tangent.
Note that the timing is critical. The key would be to release the allegations at just the right time so that it's critical to the outcome of the election, but not so early that the actual truthiness of the allegations could be assessed--at least not in the eyes of the public.
Other attacks could involve releasing materials which are not approved by Candidate Bar on behalf of his campaign office at a critical time, or--if the operative were extremely dedicated--committing a crime whose blame could be laid upon the candidate himself.
I have to assume this sort of tactic has been employed prior to the present. I'd be curious to know it's rate of effectiveness.
Tags brainstorming, hacking, politics | 1 comment | no trackbacks