Reddit Bombing

Sam Gibson Wed, 23 Jan 2008 01:03:00 GMT

What's more fun than Karma whoring on social networking sites? Karma-burning on social networking sites!

The other day I posted a story on reddit about Martin Luther King Jr. It was a pretty obvious troll and got 67 down votes from people who have no soul or sense of humor.

I don't really have a basis for comparison, but that seems like a reasonable number of down votes.

Here's the new sport:

Who can get the most number of downvotes on any given story on reddit? What types of stories get the most downvotes? Empty content? Explicit content? Obvious trolls?

Rules?

  1. You have to post the reddit link to your submission.

(I really want there to be a trolling community version of reddit... trollit, stories devoted to awesome trolls. Meta trolls encouraged.)

(Simple, Applied) Ternary Search Tree in Python

Sam Gibson Thu, 20 Dec 2007 01:00:00 GMT

Introduction

A ternary search tree is an interesting data structure that behaves like a hashtable in many respects, but for smaller datasets tends to use less memory. Additionally it's ideal for some interesting search problems.

In this post I'll explore how to use a TST to solve a word chains problem.

The Problem

The challenge is--given a flat wordlist--to find a path from one arbitrary word to another. Each step in the path must be word in the list and must only differ from the previous step by exactly one letter. e.g.

"dog" -> "cog" -> "cot" -> "cat"

The start word and the end word must be the same length, and there isn't necessarily a path between two words.

The Tree

While there are other data structures that could be used to solve the problem, a Ternary Search Tree has some unique attributes that'll be shown later, making it an exceptionally elegant solution.

What is a TST? The best way to describe a data structure, in my opinion is with pictures, so without further ado...

    B
    |
    A
    | 
    S    
    | \  
    |  \   
    S   D  
  / |   |  
 /  *   *
E        \ 
|         \
*          E
           |
           *

["bass", "base", "bad", "bade"]

The left and right branches each represent characters that may be used in place of the one at their parent. Astrix nodes means a word is present at that location.

Constructing the Tree

First we have to define the basic class structure:

class Node:
  def __init__(self, data):
    self.data = data 
    self.left = self.mid = self.right = None

  def __repr__(self):
    return "%s -> (%s, %s, %s)" % (self.data, self.left, self.mid, self.right)

Node.data is either the character that is represented by this node or a sentinel value that tells us there's a word at this position (more on this later...)

And the tree itself, is simple enough.

class TernaryTree:
  def __init__(self, root = None):
    self.root = root

A tree isn't much use without a way to insert more elements into it.

class TernaryTree:
  ...
  def __rinsert(self, node, remains, orig):
    try:
      if node is None:
        node = Node(remains[0])

      if remains[0] < node.data:
        node.left = self.__rinsert(node.left, remains, orig)
      elif remains[0] > node.data:
        node.right = self.__rinsert(node.right, remains, orig)
      else:
        node.mid = self.__rinsert(node.mid, remains[1:], orig)
    except IndexError:
      node = Node("")
      node.mid = orig
    
    return node

The recursive insert takes three arguments:

  1. node is the current position in the tree.
  2. remains is the remaining characters of the string that is being inserted. Since nodes are only ever a single character, each time a node is inserted that character gets chomped off the remaining string.
  3. orig is the original string. This will be used later.

The first if statement checks to see if the node at this position exists. If it doesn't then create a new node.

The next if block determines whether the current character needs to go to the left, right, or center of the current node. The only difference between the three is that when the split character (remains[0]) is the same as this node, the character has been inserted and can be popped off the stack of remaining characters.

The try block is a bit tricky. If remains is an empty string and remains[0] is referenced it will raise an IndexError. And remains should only ever be an empty string when all of the letters have been inserted. The lines:

node = Node("")
node.mid = orig
Create a sentinel node that marks the location of a word in the tree. The advantage of using the empty string is that no character in a string can ever be the empty string. Plus, when further inserts are done the empty string will just act like any other character in binary comparisons. Finally the node.mid property of a sentinel node is overridden to be the value of the string at that branch.

One more small addition and the tree can actually make some claim as a data structure:

class TernaryTree:
  ...
  def insert(self, s):
    assert(len(s) > 0)
    self.root = self.__rinsert(self.root, s, s)

Searching the Tree

The tree isn't much use if its data can't be accessed (in a creative way!) Searching a TST is actually quite similar to search a binary search tree--or really, any recursive data structure for that matter. The trickiness is, this problem applies a couple of interesting constraints.

It's not particularly helpful in this case to search for a single element. Instead, the search needs to find all words that differ from the search term by only one character.

class TernaryTree:
  ...
  def __rnear_search(self, node, remains, depth, results):
    if not node or depth < 0: return results

    try:
      if depth > 0 or remains[0] < node.data:
        results = self.__rnear_search(node.left, remains, depth, results)
      
      if remains[0] == node.data:
        results = self.__rnear_search(node.mid, remains[1:], depth, results)
      elif remains[0] != node.data and "" != node.data:
        results = self.__rnear_search(node.mid, remains[1:], depth-1, results)

      if depth > 0 or remains[0] > node.data:
        results = self.__rnear_search(node.right, remains, depth, results)
    except IndexError:
      if "" == node.data:
        results += [node.mid]

    return results

The code is somewhat complex. The recursive near search takes four arguments.

  1. node is the node currently being examined.
  2. remains is the remaining stack of characters to search for (similar to __rinsert).
  3. depth, functionally within the method, is best defined as the number of remaining errors (characters that differ in the tree from the string) that are allowed before a branch is ruled invalid (hopefully this will make more sense later).
  4. results is a sequence that contains the current result set.

The method is very similar to __rinsert in structure. The first and third if statements in the try block are symmetrical. If the number of substitutions hasn't been reached or the current character on the stack could be in that respective branch, it's searched.

The second if statement is where any complication resides. If the current letter on the remaining search stack is at this branch, the search stack is trimmed and the middle branch is searched. If the current letter is not at this node and this node is not a word node, trim the stack and search the middle branch: but with a decremented depth.

Decrementing the depth conceptually counts the current node as one of the allowed errors. Recall that in the ternary tree the left and right branches are optional replacements and the center branch is the next in sequence. So whenever a non-optional branch is followed which is not the same as the character at the corresponding location, one of the allowed errors is used up.

Finally, in the same fashion as __rinsert, an IndexError means the search stack has been exhausted. If, after being exhausted the current node is a word node, it's a hit and it added to the results.

And last, add an external method to the tree class so the search is accessible

class TernaryTree:
  ...
  def near_search(self, s, d):
    return self.__rnear_search(self.root, s, d, [])

Meta Searching

So far, so good. At this point the ternary search tree has all of the functionality needed to implement the path search on top of it.

The most obvious way to search for the shortest path between any two words is to do a breadth-first search. Treating the TST like a graph is simple enough: a node is just a word, and it's outgoing edges are tst.near_search(node, 1).

def path(node, parents, results):
  if parents[node] is None:
    return [node] + results
  else:
    results = [node] + results
    return path(parents[node], parents, results)

def bfs(tst, start, end):
  parents = {}
  seen = {}

  parents[start] = None
  seen[start] = True
  stack = [start]

  while len(stack) > 0:
    node = stack[0]
    stack = stack[1:]
    seen[node] = True
    if node == end:
      return path(node, parents, [])
    else:
      siblings = filter(lambda s: not seen.has_key(s) and s not in stack, tst.near_search(node, 1))
      for n in siblings:
        parents[n] = node
      stack += siblings

  return None

The code is quite self-explanatory, and since this article's focus is on ternary search trees, I'm not going to spend any time explaining it.

Conclusion / Putting it All Together

All the pieces are now in place to finish the solution, it's just a matter of putting them together.

Hopefully this has been a helpful article for anyone interested in ternary search trees. They were quite confusing to me when I first took a look at them, hopefully this clears some things up.

Source Code

Source code is available in the public domain here

Related Links

  1. http://www.ddj.com/windows/184410528;jsessionid=CF2RX40RWFQ1MQSNDLRSKHSCJUNN2JVN?_requestid=2827
  2. http://code.google.com/p/ternary-search-tree/
  3. http://www.geocities.jp/m_hiroi/light/pyalgo10.html

Distributed Social Network Brainstorming

Sam Gibson Fri, 14 Dec 2007 18:12:00 GMT

Over a year ago I'd written up a preliminary specification for the REST design of a distributed social network. The rationale should be obvious to anyone who's paid attention to internet technology in the last decade: freedom from "walled gardens". Because of work though, I never was able to devote much time for the project. Recently several discussions with Scott have led to interest re-sparked.

The first step is to identify what Facebook/MySpace et. al. offer that can't be offered simply by adding XFN style links to a blogroll. As I see it, closed communities offer the following features.

  1. Information exposure
      Information is granular, and non-standard (i.e. The information exposed by Facebook is not the same as that exposed by LinkedIn, but there is generally some overlap)
  2. Privacy controls (though some users of Facebook might argue this...)
  3. Web real-estate.
  4. "Secure" bi-directional relationship mappings. You can't say someone's your friend on MySpace unless they also agree that you're their friend.
  5. Searching (in a generic sense, not a textbox, type in a name sense). Network construction might be a better term.
  6. Defined (closed) communication protocol. When Alice friends Bob, both sides know what to do (well, duh--they're both on the same system).

Of these problems the most difficult to solve is search. There are two approaches:

  1. Push architecture. When a client changes data, it makes an RPC call to all of it's peers informing them that it has changed.
  2. Pull architecture. When a client changes data, it modifies a feed (RSS/Atom is the natural choice) which are periodically polled by peers looking for changes.
Coupled with a strong client-side caching mechanism and search becomes a more approachable problem.

Bi-directional relations are actually not as analogous to real-world interactions as uni-directional. There's no reason why Mallory can't consider Eve a friend and she doesn't reciprocate.

Communication protocol is another major hurdle. I think there are two options:

  1. All web applications that function as a social node have a well defined REST hierarchy underneath them.
  2. All URLs that function as social nodes have some meta data attached to them which tells clients where their accessors/methods can be found.
For the time being I've decided to stick with the former. The advantage of the later is that it's possibly easier to integrate with legacy systems (MovableType, WordPress, etc.)

I've started working on a proof of concept in Rails, though there's no reason it has to remain there. I've placed a bzr branch here. (yes, it's still infantile).

I haven't forgotten privacy is a complicated issue. I wish to return to it as some point, once I have a better solution mentally hashed out.

There is at least one project (DiSo) out there that is doing something similar to this. [details]

Poisoning the Well for Fun and Profit

Sam Gibson Wed, 05 Dec 2007 23:38:00 GMT

I was inspired several days ago thinking about one of the tactics (illegitimately) employed by crowd control peace officers to consider how such a poisoning of the well could be applied to a political race. I came up with several attacks that I think could work quite well. Admittedly the attacks aren't directly analogous to police instigators in a crowd, but it was that which started the thought process.

One such attack could be the following:

Candidate Foo is running a close race with Candidate Bar and either he, or his staff wishes to destroy Bar's public image.

  1. Candidate Foo sends one or more of his loyal staff members (individuals henceforth unconnected with the campaign would obviously be best) to work for Candidate Bar's campaign.
  2. Foo's clandestine operative waits for a signal or a predetermined time to announce allegations of improper sexual conduct on the part of Bar--experimenting with the sex of the covert op in relation to Bar's is an interesting tangent.

Note that the timing is critical. The key would be to release the allegations at just the right time so that it's critical to the outcome of the election, but not so early that the actual truthiness of the allegations could be assessed--at least not in the eyes of the public.

Other attacks could involve releasing materials which are not approved by Candidate Bar on behalf of his campaign office at a critical time, or--if the operative were extremely dedicated--committing a crime whose blame could be laid upon the candidate himself.

I have to assume this sort of tactic has been employed prior to the present. I'd be curious to know it's rate of effectiveness.