Skip to content

Suffix Pattern Recognition

Python CI

1. Design Summary

For suffix pattern recognition with \(K\) suffix patterns (length \(W\)) and 1 query pattern (length \(N\)), using reverse-traversal prefix tree search is efficient because it can model common ancestor ordering with reverse encoding (by inserting suffix patterns in reverse order during setup and searching in reverse order during queries) and check the multiple patterns over a single root-to-leaf pass (taking runtime \(O(K \cdot W)\) for one-time setup and \(O(W)\) for each query). This improves over using direct suffix pattern search which must check the multiple patterns over separate array passes (taking \(O(K \cdot W)\) for each query) or using forward-traversal prefix tree search which cannot model common ancestor ordering and must check the multiple patterns over separate root-to-leaf passes (taking runtime \(O(K \cdot W)\) for one-time setup and \(O(K \cdot W)\) for each query).

For batch data (offline scenario), reading the query pattern from a static array (size \(N\)) with a reverse iterator takes \(O(1)\) auxiliary space. For streaming data (online scenario), reading the query pattern from a history array (size \(W\)) reconstructed from the stream takes \(O(W)\) auxiliary space.

📊 Complexity Analysis

Operation Direct Search (v1) Forward-traversal (v2) Reverse-traversal (v3) Comparison
Initial Setup \(O(1)\) \(O(K \cdot W)\) \(O(K \cdot W)\) v1 is superior
Batch Queries (N) \(O(K \cdot W)\) \(O(K \cdot W)\) \(O(W)\) v3 is superior
Streaming Queries (1) \(O(K \cdot W)\) \(O(K \cdot W)\) \(O(W)\) v3 is superior
Space Complexity \(O(1)\) \(O(K \cdot W)\) \(O(K \cdot W)\) v1 is superior

🚀 Reverse Encoding (Visual)

The advantage of reverse-traversal prefix tree search is that it can check multiple suffix patterns at once, ensuring \(O(W)\) query time (and is independent of the suffix pattern count \(K\) and the query pattern length \(N\)). This is possible because of reverse encoding.

graph TD
    subgraph "Reverse Prefix Tree (Stored Structure)"
    direction TD
    Root((Root)) --> C((c))
    C --> E_tail((e*))
    Root --> E((e))
    E --> C_tail((c*))
    Root --> R1((r))
    R1 --> A1((a))
    A1 --> C2((c))
    C2 --> E2((e))
    E2 --> C3((c))
    C3 --> A2((a))
    A2 --> R2_tail((r*))
    end

    subgraph "Streaming Timeline: 'racecar'"
    T1[T=1: 'r'] --> H1["History: [r] <br/> Search: r -> (No Tail)"]
    T2[T=2: 'a'] --> H2["History: [r, a] <br/> Search: a -> (Fail)"]
    T3[T=3: 'c'] --> H3["History: [r, a, c] <br/> Search: c -> (No Tail)"]
    T4[T=4: 'e'] --> H4["History: [r, a, c, e] <br/> Search: e -> c -> MATCH 'ce'"]
    T5[T=5: 'c'] --> H5["History: [r, a, c, e, c] <br/> Search: c -> e -> MATCH 'ec'"]
    T6[T=6: 'a'] --> H6["History: [..., c, a] <br/> Search: a -> (Fail)"]
    T7[T=7: 'r'] --> H7["History: [r,a,c,e,c,a,r] <br/> Search: r->a->c->e->c->a->r -> MATCH 'racecar'"]
    end

    style H4 fill:#ccffcc,stroke:#333
    style H5 fill:#ccffcc,stroke:#333
    style H7 fill:#ccffcc,stroke:#333
    style E_tail fill:#f9f,stroke:#333
    style C_tail fill:#f9f,stroke:#333
    style R2_tail fill:#f9f,stroke:#333

2. Design Approaches & Trade-offs

When Less is More

Although Direct Search (v1) requires ~90% less lines of code, it performs worse than Reverse-traversal (v3) in every comparison other than initial setup and space complexity. This trade-off shows that Direct Search may be preferable for low-memory systems that cannot afford higher space usage (like embedded systems) or greater maintenance scope (like regulated industries) and that Reverse-traversal may be preferable for systems with more resources (like compute nodes in distributed systems).

Python Object-Model Tuning

To reduce object-overhead due to the Python object-model, I implemented a nested hash map for the reverse-traversal prefix tree, with each "node" using a character for the hash map key and a reference to the next "node" for the hash map value. This approach reduced object lookup during queries.

3. API Reference

data_streaming_accelerators.core.suffix_pattern_recognition

Classes

SuffixPatternRecognitionV1

Bases: SuffixPatternRecognitionBase

Source code in src/data_streaming_accelerators/core/suffix_pattern_recognition.py
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
class SuffixPatternRecognitionV1(SuffixPatternRecognitionBase):

    class _ReversePrefixTree:
        @classmethod
        def as_nested_dict(cls, words: list[str]) -> dict:
            """Make reverse prefix tree as nested dict.
            Takes runtime O(N*W) and memory O(N*W)."""
            node_fact = lambda: {'is_tail': False}
            head = node_fact()
            for word in words:
                p_node = head; word_length = len(word)
                for i in range(word_length-1, -1, -1):
                    if (char := word[i]) not in p_node:
                        p_node[char] = node_fact()  # add new key here
                    p_node = p_node[char]
                    if i == 0:
                        p_node['is_tail'] = True
            return head

    def __init__(self, words: list[str]) -> None:
        """Make new object for stream query api.
        Takes runtime O(N*W) and memory O(N*W)."""
        _ReversePrefixTree = self.__class__._ReversePrefixTree  # type alias
        # use nested dict for reverse prefix tree.
        # takes runtime O(N*W) and memory O(N*W).
        self.suffixes = _ReversePrefixTree.as_nested_dict(words)
        # use list for max-size stream history.
        # takes runtime O(W) and memory O(1).
        self.history_maxsize = max(len(word) for word in words)
        self.history = []

    @validators.validate_args([None, check_character])
    def query(self, char: str) -> bool:
        """Entry point for `query` api.
        Takes runtime O(W) and memory O(W)."""
        # push next char to history (max size W).
        # takes runtime O(1) and memory O(1).
        if len(self.history) == self.history_maxsize:
            _ = self.history.pop(0)
        self.history.append(char)
        # read last chars from history to search the reverse prefix tree.
        # takes runtime O(W) and memory O(1).
        _history = self.history  # reduce self lookup in hot loop
        p_node = self.suffixes; history_length = len(_history)
        for i in range(history_length-1, -1, -1):
            if (last_char := _history[i]) not in p_node:
                return False  # early exit on match fail
            p_node = p_node[last_char]
            if p_node['is_tail']:
                return True  # early exit on match success
        return False
Functions
__init__(words)

Make new object for stream query api. Takes runtime O(NW) and memory O(NW).

Source code in src/data_streaming_accelerators/core/suffix_pattern_recognition.py
32
33
34
35
36
37
38
39
40
41
42
def __init__(self, words: list[str]) -> None:
    """Make new object for stream query api.
    Takes runtime O(N*W) and memory O(N*W)."""
    _ReversePrefixTree = self.__class__._ReversePrefixTree  # type alias
    # use nested dict for reverse prefix tree.
    # takes runtime O(N*W) and memory O(N*W).
    self.suffixes = _ReversePrefixTree.as_nested_dict(words)
    # use list for max-size stream history.
    # takes runtime O(W) and memory O(1).
    self.history_maxsize = max(len(word) for word in words)
    self.history = []
query(char)

Entry point for query api. Takes runtime O(W) and memory O(W).

Source code in src/data_streaming_accelerators/core/suffix_pattern_recognition.py
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
@validators.validate_args([None, check_character])
def query(self, char: str) -> bool:
    """Entry point for `query` api.
    Takes runtime O(W) and memory O(W)."""
    # push next char to history (max size W).
    # takes runtime O(1) and memory O(1).
    if len(self.history) == self.history_maxsize:
        _ = self.history.pop(0)
    self.history.append(char)
    # read last chars from history to search the reverse prefix tree.
    # takes runtime O(W) and memory O(1).
    _history = self.history  # reduce self lookup in hot loop
    p_node = self.suffixes; history_length = len(_history)
    for i in range(history_length-1, -1, -1):
        if (last_char := _history[i]) not in p_node:
            return False  # early exit on match fail
        p_node = p_node[last_char]
        if p_node['is_tail']:
            return True  # early exit on match success
    return False