Recursive Data Mining
Algorithm Overview
RDM captures statistically significant patterns based on the initial input, a set of sequences of tokens. The deemed significant patterns are assigned new tokens. Next, the initial sequences are rewritten by collapsing each significant pattern found in the sequences to its newly assigned token. The algorithm now operates on the re-written sequences, and iterate until either the sequence can not be rewritten any further, or the predefined number of iterations is reached. The outline of the algorithm can be seen from figure 1.
Figure 1: RDM algorithm

For better viewing, figure 1 can be download here.
Pattern Generation
The initial sequence of tokens corresponding to the data is called the level-0 sequence (S0). A sliding window of length lw moves over Sv (v = 0 initially). At each position p of the window over the sequence, all possible (lw,max gap)-sequence patterns are generated. The number of patterns generated equals the number of combinations of tokens covered by the window along with the gap token. Since we do not allow a gap for token s1 of a pattern, the number of patterns generated at each window position is given by the expression 2lw-1 - 2lw-max gap-1 - 1. So, for a lw = 5, with max gap = 2, 11 patterns are generated. Figure 1, shows the patterns generated at position 1 of the sequence. A bounded hash keeps count of number of occurrences of each pattern at level v, as the sliding window moves over S0. This forms the first pass over sequence Sv. An illustrate example is shown in figure 2, where the patterns on the left and right columns are are generated from the first and second windows respectively.
Figure 2: Patterns

Pattern Significance
Of all the patterns generated during the pattern generation steps, usually, only a handful of them give us any advantage in capturing any significant information. To advoid wasting computational efforts on "non-useful" patterns, RDM determines which pattern is statistically significant and which is not. There are many ways to compute a statistical significant of a pattern, a permutation test, probabilistic models, expected and actual occurance ratio, and etc. One can also imagine a domain knownledge can also give out a set of informative patterns prior to starting the program. The selection of a significant test is a domain dependent choice.
Dominant Patterns
At any given level of iteration in RDM, once all significant patterns are defined, the algorithm make a pass over each input sequence of tokens. At each position in the sequence, the tokens in the significant patterns are matched against the tokens in the sequence. Again, the matching method can be chosen to fit the task. One example of such method is edit distance, which the scoring function can be adjust to fit the task at hand, i.e. using a simple Hamming distance, a BLOSUM substitution matrix, or etc. Once the highest matching score starting at first matching position is found the matching pattern is considered a dominant pattern. The term dominant pattern is coined from the fact that this pattern dominates over all other significant patterns for this position in the sequence. Dominant patterns, at consecutive positions over the sequence can be merged to form longer dominant patterns. The merging process is continued till no further dominant patterns can be merged.
Rewriting Step
After dominant patterns are located for each sequence, the matching tokens in the sequence are simply replaced by the corresponding dominant patterns' tokens, as shown by an example here.
RDM captures statistically significant patterns based on the initial input, a set of sequences of tokens. The deemed significant patterns are assigned new tokens. Next, the initial sequences are rewritten by collapsing each significant pattern found in the sequences to its newly assigned token. The algorithm now operates on the re-written sequences, and iterate until either the sequence can not be rewritten any further, or the predefined number of iterations is reached. The outline of the algorithm can be seen from figure 1.
Figure 1: RDM algorithm
For better viewing, figure 1 can be download here.
Pattern Generation
The initial sequence of tokens corresponding to the data is called the level-0 sequence (S0). A sliding window of length lw moves over Sv (v = 0 initially). At each position p of the window over the sequence, all possible (lw,max gap)-sequence patterns are generated. The number of patterns generated equals the number of combinations of tokens covered by the window along with the gap token. Since we do not allow a gap for token s1 of a pattern, the number of patterns generated at each window position is given by the expression 2lw-1 - 2lw-max gap-1 - 1. So, for a lw = 5, with max gap = 2, 11 patterns are generated. Figure 1, shows the patterns generated at position 1 of the sequence. A bounded hash keeps count of number of occurrences of each pattern at level v, as the sliding window moves over S0. This forms the first pass over sequence Sv. An illustrate example is shown in figure 2, where the patterns on the left and right columns are are generated from the first and second windows respectively.
Figure 2: Patterns
Pattern Significance
Of all the patterns generated during the pattern generation steps, usually, only a handful of them give us any advantage in capturing any significant information. To advoid wasting computational efforts on "non-useful" patterns, RDM determines which pattern is statistically significant and which is not. There are many ways to compute a statistical significant of a pattern, a permutation test, probabilistic models, expected and actual occurance ratio, and etc. One can also imagine a domain knownledge can also give out a set of informative patterns prior to starting the program. The selection of a significant test is a domain dependent choice.
Dominant Patterns
At any given level of iteration in RDM, once all significant patterns are defined, the algorithm make a pass over each input sequence of tokens. At each position in the sequence, the tokens in the significant patterns are matched against the tokens in the sequence. Again, the matching method can be chosen to fit the task. One example of such method is edit distance, which the scoring function can be adjust to fit the task at hand, i.e. using a simple Hamming distance, a BLOSUM substitution matrix, or etc. Once the highest matching score starting at first matching position is found the matching pattern is considered a dominant pattern. The term dominant pattern is coined from the fact that this pattern dominates over all other significant patterns for this position in the sequence. Dominant patterns, at consecutive positions over the sequence can be merged to form longer dominant patterns. The merging process is continued till no further dominant patterns can be merged.
Rewriting Step
After dominant patterns are located for each sequence, the matching tokens in the sequence are simply replaced by the corresponding dominant patterns' tokens, as shown by an example here.