Improving Joran's RuleStore performance

Hi Ceki, Just wondering about the SUB-STRING patterns, such as "*/if/then/*" and "*/if/else/*". I'm currently working on adding a permuterm tree to the concurrent-trees[1] project. It will support fast retrieval from the tree for wildcard queries like: X, X*, *X, *X*, X*Y. These patterns would be fully accelerated. It looks like the *X* variant might fit your sub-string patterns. It's also possible to have partial acceleration for additional patterns such as X*Y*Z, *X*Y*Z, *X*Y*Z* and so on with any number of terms, which involves using the first and last terms to find a candidate set in the tree and then doing on-the-fly filtering for the terms in the middle. Would that help? I'm about 80% of the way to completion. [1] https://code.google.com/p/concurrent-trees/ Niall

On 20.06.2013 23:27, Niall Gallagher wrote:
Hi Ceki,
Just wondering about the SUB-STRING patterns, such as "*/if/then/*" and "*/if/else/*".
I'm currently working on adding a permuterm tree to the concurrent-trees[1] project. It will support fast retrieval from the tree for wildcard queries like: X, X*, *X, *X*, X*Y. These patterns would be fully accelerated.
The *X* variant fits the sub-string patterns used in Joran. While traversing the elements of an XML document, I would like to compute if *X* is the path defined by current location in the XML document. This seems to be different (or even unrelated) to the problem of checking whether *X* matches one of the keys inserted previously into an ADT such as a trie.
It looks like the *X* variant might fit your sub-string patterns.
It's also possible to have partial acceleration for additional patterns such as X*Y*Z, *X*Y*Z, *X*Y*Z* and so on with any number of terms, which involves using the first and last terms to find a candidate set in the tree and then doing on-the-fly filtering for the terms in the middle.
I am curious. Is "autocomplete" in presence of typos the use case for such patterns?
Would that help? I'm about 80% of the way to completion.
[1] https://code.google.com/p/concurrent-trees/
Niall
-- Ceki 65% of statistics are made up on the spot

On 20 Jun 2013, at 23:15, ceki <ceki@qos.ch> wrote:
On 20.06.2013 23:27, Niall Gallagher wrote:
Hi Ceki,
Just wondering about the SUB-STRING patterns, such as "*/if/then/*" and "*/if/else/*".
I'm currently working on adding a permuterm tree to the concurrent-trees[1] project. It will support fast retrieval from the tree for wildcard queries like: X, X*, *X, *X*, X*Y. These patterns would be fully accelerated.
The *X* variant fits the sub-string patterns used in Joran. While traversing the elements of an XML document, I would like to compute if *X* is the path defined by current location in the XML document. This seems to be different (or even unrelated) to the problem of checking whether *X* matches one of the keys inserted previously into an ADT such as a trie.
So you want to store patterns in the tree, and then run a document over the tree to see which patterns matched. Yes it's different. The same idea as InvertedRadixTree. In this case I don't think such an "inverted" permuterm tree is possible, as rotating the query (the document) would be expensive.
It looks like the *X* variant might fit your sub-string patterns.
It's also possible to have partial acceleration for additional patterns such as X*Y*Z, *X*Y*Z, *X*Y*Z* and so on with any number of terms, which involves using the first and last terms to find a candidate set in the tree and then doing on-the-fly filtering for the terms in the middle.
I am curious. Is "autocomplete" in presence of typos the use case for such patterns?
Spell correction and auto-complete in the presence of typos are some use cases (though actually this would require higher-level logic on top of the tree also). There's a discussion about spell correction on the concurrent-trees mailing list. The major use case for me though, is the ability to search a product catalog very quickly using queries containing wildcards, or using queries containing typos. The trees in concurrent-trees[1] were originally part of another project, CQEngine[2] before I spun it out as a separate project. I'm writing the concurrent permuterm tree in concurrent-trees, so that I can then use it as a basis for a new type of index in CQEngine, allowing free-text search of product catalogs etc. [1] http://code.google.com/p/concurrent-trees/ [2] http://code.google.com/p/cqengine/
Would that help? I'm about 80% of the way to completion.
[1] https://code.google.com/p/concurrent-trees/
Niall
-- Ceki 65% of statistics are made up on the spot _______________________________________________ logback-dev mailing list logback-dev@qos.ch http://mailman.qos.ch/mailman/listinfo/logback-dev
participants (2)
-
ceki
-
Niall Gallagher