16 comments:

Anonymous said…

Thanks for sharing this. Very useful.

Wednesday, January 19, 2011 11:38:00 PM

Anonymous said…

Hey thanx …… 8)

Sunday, October 02, 2011 3:59:00 PM

Anonymous said…

Nice Article

Monday, October 03, 2011 12:42:00 AM

Nicolai Diethelm said…
This comment has been removed by the author.
Thursday, May 24, 2012 3:35:00 AM

Nicolai Diethelm said…
This comment has been removed by the author.
Thursday, May 24, 2012 3:40:00 AM

Nicolai Diethelm said…

Your statement about Suggest Trees (“Though these look promising, treaps in practice can degenerate into a linear list.”) refers to an old version of SuggestTree. Since 2011, the structure has not been based on a treap but on a compressed ternary search tree with precomputed top-k lists. While the additional space cost is small, this guarantees much better performance than the slow Segment Tree approach you are recommending.

Thursday, May 24, 2012 8:14:00 AM

dhruv said…

@Nicolai The treap (randomized heap keys) has a very low probability (polynomially low in fact) of degenerating into a List, but the way treaps are used here is different. An adversary can manipulate the values and priorities to get to the degenerate case pretty easily.
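The adversarial case is easy to demonstrate. Below is a minimal, hypothetical treap sketch (illustrative code only, not part of SuggestTree): with the standard rotation-based insert, an adversary who sets each node’s priority equal to its key and inserts keys in ascending order forces every new node to rotate up to the root, leaving a left spine, i.e. a linked list.

```java
// Toy treap: BST order on key, max-heap order on priority.
// Adversarial priorities (priority == key, keys ascending) degenerate it into a list.
class TreapDemo {
    static final class Node {
        final int key, priority;
        Node left, right;
        Node(int key, int priority) { this.key = key; this.priority = priority; }
    }

    // Standard treap insert with rotations to restore the heap property.
    static Node insert(Node root, int key, int priority) {
        if (root == null) return new Node(key, priority);
        if (key < root.key) {
            root.left = insert(root.left, key, priority);
            if (root.left.priority > root.priority) root = rotateRight(root);
        } else {
            root.right = insert(root.right, key, priority);
            if (root.right.priority > root.priority) root = rotateLeft(root);
        }
        return root;
    }

    static Node rotateRight(Node y) { Node x = y.left;  y.left  = x.right; x.right = y; return x; }
    static Node rotateLeft(Node x)  { Node y = x.right; x.right = y.left;  y.left  = x; return y; }

    static int depth(Node n) { return n == null ? 0 : 1 + Math.max(depth(n.left), depth(n.right)); }

    public static void main(String[] args) {
        Node root = null;
        int n = 1000;
        // Adversarial choice: priority equals key, so each new (largest) key
        // becomes the root and the tree is a left spine of depth n.
        for (int k = 1; k <= n; k++) root = insert(root, k, k);
        System.out.println(depth(root)); // prints 1000 — a linear list
    }
}
```

With random priorities, by contrast, the expected depth is O(log n); the degeneration above requires control over the priorities.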

With respect to the ternary tree representation, do you know if ‘k’ needs to be fixed when the structure is built up (it seems so), or can we query for any ‘k’ once the structure is built up?

If the ‘k’ has to be pre-decided, then the constant cost per node (space) is at least O(k), which seems to be pretty high.

Thursday, May 24, 2012 9:58:00 AM

Nicolai Diethelm said…

“With respect to the ternary tree representation, do you know if ‘k’ needs to be fixed when the structure is built up (it seems so), or can we query for any ‘k’ once the structure is built up?”

With a SuggestTree, ‘k’ has to be fixed when the structure is built.

“If the ‘k’ has to be pre-decided, then the constant cost per node (space) is at least O(k), which seems to be pretty high.”

Since most of the nodes are at the bottom of the tree, most of them only hold a very short suggestion list. In my tests with real world data, the average list length was about 2 for k = 10 (and the number of nodes was about 1.3 n). So O(k) is only a theoretical upper bound for the space cost of a node (like, for example, O(n) is an upper bound for the time cost of searching in a hash table).

Thursday, May 24, 2012 1:17:00 PM

dhruv said…

“Since most of the nodes are at the bottom of the tree, most of them only hold a very short suggestion list. In my tests with real world data, the average list length was about 2 for k = 10 (and the number of nodes was about 1.3 n). So O(k) is only a theoretical upper bound for the space cost of a node (like, for example, O(n) is an upper bound for the time cost of searching in a hash table).”

I thought that even internal nodes would hold suggestion lists since internal nodes would correspond to prefixes of strings. Isn’t it so?

Also, I don’t understand the claim “Since most of the nodes are at the bottom of the tree”: if a tree has a fanout of 3 (ternary tree), then the number of leaf nodes is O(number of internal nodes), because of the constant fanout.

Thursday, May 24, 2012 1:41:00 PM

Nicolai Diethelm said…

“I thought that even internal nodes would hold suggestion lists since internal nodes would correspond to prefixes of strings. Isn’t it so?”

Yes, even internal nodes hold a suggestion list. Did you read the SuggestTree documentation? Nodes (prefixes) with the same completions are compressed into one node. So for each suggestion inserted into the tree, at most one new node is added and at most one existing node is split into two nodes. This is why usually most of the nodes “are at the bottom of the tree” (have no middle child node) and thus hold a suggestion list of length 1.

Perhaps this is easier to understand if you do not imagine a ternary search tree, but a simpler trie data structure where the child nodes of a node are not ordered as a binary search tree. The number of nodes in such a tree and the length of the suggestion lists would be the same.
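The “at most one new node, at most one split” invariant can be sketched with a toy compressed (radix) trie. This is hypothetical illustrative code, not the actual SuggestTree implementation, and it omits the precomputed top-k lists; it only counts nodes, to show why a tree over n inserted strings has at most about 2n nodes, most of which are leaves holding a single suggestion.

```java
import java.util.*;

// Toy radix trie: edges carry compressed string labels. Each insert either
// (a) attaches one new leaf, or (b) splits one existing node and attaches
// at most one new leaf — so n inserts create at most ~2n nodes.
class RadixTrieDemo {
    static final class Node {
        String edge;                                   // compressed label leading to this node
        final Map<Character, Node> children = new HashMap<>();
        boolean isWord;                                // a whole suggestion ends here
        Node(String edge) { this.edge = edge; }
    }

    final Node root = new Node("");

    void insert(String s) {
        Node cur = root;
        int i = 0;
        while (i < s.length()) {
            Node child = cur.children.get(s.charAt(i));
            if (child == null) {                       // case (a): one new leaf
                Node leaf = new Node(s.substring(i));
                leaf.isWord = true;
                cur.children.put(s.charAt(i), leaf);
                return;
            }
            String edge = child.edge;
            int j = 0;                                 // length of the common prefix
            while (j < edge.length() && i + j < s.length()
                   && edge.charAt(j) == s.charAt(i + j)) j++;
            if (j == edge.length()) {                  // full edge match: descend
                cur = child;
                i += j;
            } else {                                   // case (b): split child at j
                Node mid = new Node(edge.substring(0, j));
                cur.children.put(edge.charAt(0), mid);
                child.edge = edge.substring(j);
                mid.children.put(child.edge.charAt(0), child);
                if (i + j == s.length()) {
                    mid.isWord = true;                 // s ends exactly at the split
                } else {
                    Node leaf = new Node(s.substring(i + j));
                    leaf.isWord = true;
                    mid.children.put(leaf.edge.charAt(0), leaf);
                }
                return;
            }
        }
        cur.isWord = true;                             // s was a prefix of an existing path
    }

    int countNodes(Node n) {
        int c = 1;
        for (Node child : n.children.values()) c += countNodes(child);
        return c;
    }

    public static void main(String[] args) {
        RadixTrieDemo t = new RadixTrieDemo();
        for (String w : new String[] {"car", "card", "care", "cat"}) t.insert(w);
        System.out.println(t.countNodes(t.root));      // prints 6 (root, "ca", "r", "d", "e", "t")
    }
}
```

Note how "car", "card", and "care" share the compressed node "car" until "cat" forces a single split into "ca" + "r"; three of the six nodes are leaves with exactly one completion.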

Friday, May 25, 2012 1:03:00 AM

dhruv said…

Yes, I have read the documentation, but I am still unclear on a few things.

“For each suggestion inserted into the tree, at most one new node is added and at most one existing node is split into two nodes.”

I understand this. In fact, a direct consequence of this is that every leaf node has just 1 suggestion in its list.

However, if the branching happens at depth ‘d’, then the suggestion lists of *potentially* ‘d’ nodes might need to be updated when such an insert happens and a new string might find itself in at most ‘d’ suggestion lists. This is where I don’t quite understand the claim of a small number of entries in suggestion lists.

I’m not talking empirically; I’m talking about the worst case. If you can prove the average (or expected) case, that would satisfy me.

Friday, May 25, 2012 12:12:00 PM

Nicolai Diethelm said…
This comment has been removed by the author.
Sunday, May 27, 2012 2:32:00 AM

Nicolai Diethelm said…

Again, imagine a simpler trie data structure with direct edges between a node and its child nodes. Because of branching, the number of leaf nodes (suggestion lists of length 1) usually vastly exceeds the number of internal nodes with a long suggestion list. Or, to put it differently, an “average” (middle-ranking) suggestion is usually listed at only a few nodes.
Of course, one can construct a worst-case trie that branches only near the root. The average length of the suggestion lists in such a trie would be approximately k/2. But on the other hand, the number of nodes would be reduced to almost n. And if one assumes that the probability of a non-branching node is 1/a, with a being the size of the alphabet, the probability of getting such a trie is practically zero for large n.

Sunday, May 27, 2012 2:39:00 AM

Nicolai Diethelm said…

If this does not satisfy you, then ask yourself why hash tables with a worst-case time complexity of O(n) are one of the most commonly used data structures.

Sunday, May 27, 2012 2:50:00 AM

dhruv said…

@Nicolai When you say “hash tables with a worst-case time complexity of O(n)”, do you mean space complexity, or the total cost of querying O(n) elements in aggregate? If it is the former, then any hash table (even one with a single bucket) would use just O(n) space.

Sunday, May 27, 2012 9:08:00 AM

Nicolai Diethelm said…

@dhruv: O(n) is the worst case time cost for searching a single element in a hash table like java.util.HashMap.
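To make the analogy concrete, here is a toy separate-chaining table (a hypothetical sketch, not java.util.HashMap) with a deliberately degenerate hash function: every key lands in the same bucket, so a single lookup scans a chain of all n entries — the O(n) worst case for one search.

```java
import java.util.*;

// Toy separate-chaining hash table, instrumented to count key comparisons.
// With a constant hash function, every key collides and get() degrades to
// a linear scan over all n entries — the O(n) worst case for one lookup.
class ChainedTableDemo {
    static final class Entry {
        final String key, value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    final List<List<Entry>> buckets = new ArrayList<>();
    int comparisons;                        // key comparisons made by the last get()

    ChainedTableDemo(int capacity) {
        for (int i = 0; i < capacity; i++) buckets.add(new ArrayList<>());
    }

    // Degenerate hash: every key maps to bucket 0.
    int hash(String key) { return 0; }

    void put(String key, String value) {
        buckets.get(hash(key)).add(new Entry(key, value));
    }

    String get(String key) {
        comparisons = 0;
        for (Entry e : buckets.get(hash(key))) {
            comparisons++;
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedTableDemo t = new ChainedTableDemo(16);
        int n = 100;
        for (int i = 0; i < n; i++) t.put("key" + i, "v" + i);
        t.get("key99");                     // the last key: scans the whole chain
        System.out.println(t.comparisons);  // prints 100
    }
}
```

With a well-distributed hash function and resizing, the expected chain length stays O(1), which is why the O(n) bound, like the O(k) per-node bound above, is a worst case rarely seen in practice.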

Sunday, May 27, 2012 11:47:00 AM
