COMP9024 ♢ Week 04b ♢ String Algorithms ♢ (22T0)

Strings
Strings
Pattern Matching
Pattern Matching
Analysis of Naive Pattern Matching
Boyer-Moore Algorithm
Knuth-Morris-Pratt Algorithm
Boyer-Moore vs KMP
Word Matching With Tries
Preprocessing Strings
Tries
Trie Operations
Trie Operations
Word Matching With Tries
Word Matching with Tries
Compressed Tries
Pattern Matching With Suffix Tries
Summary

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [0/55]

∧ >>

❖ Strings

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [1/55]

<< ∧ >>

❖ Strings

A string is a sequence of characters.

An alphabet Σ is the set of possible characters in strings.

Examples of strings:

C program

HTML document

DNA sequence

Digitised image

Examples of alphabets:

ASCII
Unicode
{0,1}
{A,C,G,T}

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [2/55]

<< ∧ >>

❖ ... Strings

Notation:

length(P) … #characters in P
λ … empty string (length(λ) = 0)
Σ^m … set of all strings of length m over alphabet Σ
Σ^* … set of all strings over alphabet Σ

νω denotes the concatenation of strings ν and ω

Note: length(νω) = length(ν)+length(ω) λω = ω = ωλ

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [3/55]

<< ∧ >>

❖ ... Strings

Notation:

substring of P … any string Q such that P = νQω, for some ν,ω∈Σ^*
prefix of P … any string Q such that P = Qω, for some ω∈Σ^*
suffix of P … any string Q such that P = ωQ, for some ω∈Σ^*

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [4/55]

<< ∧ >>

❖ Exercise : Strings

The string a/a of length 3 over the ASCII alphabet has

how many prefixes?
how many suffixes?
how many substrings?

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [5/55]

<< ∧ >>

4 prefixes: "" "a" "a/" "a/a"
4 suffixes: "a/a" "/a" "a" ""
6 substrings: "" "a" "/" "a/" "/a" "a/a"

Note:
"" means the same as λ (= empty string)

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [6/55]

<< ∧ >>

❖ ... Strings

ASCII (American Standard Code for Information Interchange)

Specifies mapping of 128 characters to integers 0..127
The characters encoded include:
- upper and lower case English letters: A-Z and a-z
- digits: 0-9
- common punctuation symbols
- special non-printing characters: e.g. newline and space

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [7/55]

<< ∧ >>

❖ ... Strings

Reminder:

In C a string is an array of chars containing ASCII codes

these arrays have an extra element containing a 0
the extra 0 can also be written '\0' (null character or null-terminator)
convenient because don't have to track the length of the string

Because strings are so common, C provides convenient syntax:


char str[] = "hello";  // same as char str[] = {'h','e','l','l','o','\0'};

Note: str[] will have 6 elements

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [8/55]

<< ∧ >>

❖ ... Strings

C provides a number of string manipulation functions via #include <string.h>, e.g.

strlen()   // length of string
strncpy()  // copy one string to another
strncat()  // concatenate two strings
strstr()   // find substring inside string

Example:

char *strncat(char *dest, char *src, int n)

appends string src to the end of dest overwriting the '\0' at the end of dest and adds terminating '\0'
returns start of string dest
will never add more than n characters
(If src is less than n characters long, the remainder of dest is filled with '\0' characters. Otherwise, dest is not null-terminated.)

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [9/55]

<< ∧ >>

❖ Pattern Matching

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [10/55]

<< ∧ >>

❖ Pattern Matching

Example (pattern checked backwards):

Text … abacaab
Pattern … abacab

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [11/55]

<< ∧ >>

❖ ... Pattern Matching

Given two strings T (text) and P (pattern),
the pattern matching problem consists of finding a substring of T equal to P

Applications:

Text editors

Search engines

Biological research

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [12/55]

<< ∧ >>

❖ ... Pattern Matching

Naive pattern matching algorithm

checks for each possible shift of P relative to T
- until a match is found, or
- all placements of the pattern have been tried

NaiveMatching(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  for all i=0..n-m do
|  |  j=0                            // check from left to right
|  |  while j<m and T[i+j]=P[j] do  // test i^th shift of pattern
|  |  |  j=j+1
|  |  |  if j=m then
|  |  |     return i                 // entire pattern checked
|  |  |  end if
|  |  end while
|  end for
|  return -1                         // no match found

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [13/55]

<< ∧ >>

❖ Analysis of Naive Pattern Matching

Naive pattern matching runs in O(n·m)

Examples of worst case (forward checking):

T = aaa…ah
P = aaah
may occur in DNA sequences
unlikely in English text

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [14/55]

<< ∧ >>

❖ Exercise : Naive Matching

Suppose all characters in P are different.

Can you accelerate NaiveMatching to run in O(n) on an n-character text T?

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [15/55]

<< ∧ >>

When a mismatch occurs between P[j] and T[i+j], shift the pattern all the way to align P[0] with T[i+j]

⇒ each character in T checked at most twice

Example:

abcdabcdeabcc abcdabcdeabcc
abcdexxxxxxxx xxxxabcde

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [16/55]

<< ∧ >>

❖ Boyer-Moore Algorithm

The Boyer-Moore pattern matching algorithm is based on two heuristics:

Looking-glass heuristic: Compare P with subsequence of T moving backwards
Character-jump heuristic: When a mismatch occurs at T[i]=c
- if P contains c ⇒ shift P so as to align the last occurrence of c in P with T[i]
- otherwise ⇒ shift P so as to align P[0] with T[i+1] (a.k.a. "big jump")

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [17/55]

<< ∧ >>

❖ ... Boyer-Moore Algorithm

Example:

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [18/55]

<< ∧ >>

❖ ... Boyer-Moore Algorithm

Boyer-Moore algorithm preprocesses pattern P and alphabet Σ to build

last-occurrence function L
- L maps Σ to integers such that L(c) is defined as
  - the largest index i such that P[i]=c, or
  - -1 if no such index exists

Example: Σ = {a,b,c,d}, P = acab

c	`a`	`b`	`c`	`d`
L(c)	2	3	1	-1

L can be represented by an array indexed by the numeric codes of the characters
L can be computed in O(m+s) time (m … length of pattern, s … size of Σ)

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [19/55]

<< ∧ >>

❖ ... Boyer-Moore Algorithm

BoyerMooreMatch(T,P,Σ):
|  Input  text T of length n, pattern P of length m, alphabet Σ
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  L=lastOccurenceFunction(P,Σ)
|  i=m-1, j=m-1                 // start at end of pattern
|  repeat
|  |  if T[i]=P[j] then
|  |  |  if j=0 then
|  |  |     return i            // match found at i
|  |  |  else
|  |  |     i=i-1, j=j-1
|  |  |  end if
|  |  else                      // character-jump
|  |  |  i=i+m-min(j,1+L[T[i]])
|  |  |  j=m-1
|  |  end if
|  until i≥n
|  return -1                    // no match

Biggest jump (m characters ahead) occurs when L[T[i]] = -1

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [20/55]

<< ∧ >>

❖ ... Boyer-Moore Algorithm

Case 1: j ≤ 1+L[c]

Case 2: 1+L[c] < j

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [21/55]

<< ∧ >>

❖ Exercise : Boyer-Moore algorithm

For the alphabet Σ = {a,b,c,d}

compute last-occurrence function L for pattern P = abacab
trace Boyer-More on P and text T = abacaabadcabacabaabb
- how many comparisons are needed?

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [22/55]

<< ∧ >>

c	`a`	`b`	`c`	`d`
L(c)	4	5	3	-1

13 comparisons in total

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [23/55]

<< ∧ >>

❖ ... Boyer-Moore Algorithm

Analysis of Boyer-Moore algorithm:

Runs in O(nm+s) time
- m … length of pattern n … length of text s … size of alphabet
Example of worst case:
- T = aaa … a
- P = baaa
Worst case may occur in images and DNA sequences but unlikely in English texts
⇒ Boyer-Moore significantly faster than naive matching on English text

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [24/55]

<< ∧ >>

❖ Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt algorithm …

compares the pattern to the text left-to-right
but shifts the pattern more intelligently than the naive algorithm

Reminder:

Q is a prefix of P … P = Qω, for some ω∈Σ^*
Q is a suffix of P … P = ωQ, for some ω∈Σ^*

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [25/55]

<< ∧ >>

❖ ... Knuth-Morris-Pratt Algorithm

When a mismatch occurs …

what is the most we can shift the pattern to avoid redundant comparisons?
Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [26/55]

<< ∧ >>

❖ ... Knuth-Morris-Pratt Algorithm

KMP preprocesses the pattern P[0..m-1] to find matches of its prefixes with itself

Failure function F(j) defined as
- the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
for each position j=0..m-1
if mismatch occurs at P_j ⇒ advance j to F(j-1)

Example: P = abaaba

j	0	1	2	3	4	5
P_j	a	b	a	a	b	a
F(j)	0	0	1	1	2	3

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [27/55]

<< ∧ >>

❖ ... Knuth-Morris-Pratt Algorithm

KMPMatch(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  F=failureFunction(P)
|  i=0, j=0                    // start from left
|  while i<n do
|  |  if T[i]=P[j] then
|  |     if j=m-1 then
|  |        return i-j         // match found at i-j
|  |     else
|  |        i=i+1, j=j+1
|  |     end if
|  |  else                     // mismatch at P[j]
|  |     if j>0 then
|  |        j=F[j-1]           // resume comparing P at F[j-1]
|  |     else
|  |        i=i+1
|  |     end if
|  |  end if
|  end while
|  return -1                   // no match

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [28/55]

<< ∧ >>

❖ Exercise : KMP-Algorithm

compute failure function F for pattern P = abacab
trace Knuth-Morris-Pratt on P and text T = abacaabaccabacabaabb
- how many comparisons are needed?

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [29/55]

<< ∧ >>

j	0	1	2	3	4	5
P_j	a	b	a	c	a	b
F(j)	0	0	1	0	1	2

19 comparisons in total

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [30/55]

<< ∧ >>

❖ ... Knuth-Morris-Pratt Algorithm

Construction of the failure function matches pattern against itself:

failureFunction(P):
|  Input  pattern P of length m
|  Output failure function for P
|
|  F[0]=0
|  i=1, j=0
|  while i<m do
|  |  if P[i]=P[j] then   // we have matched j+1 characters
|  |     F[i]=j+1
|  |     i=i+1, j=j+1
|  |  else if j>0 then    // use failure function to shift P
|  |     j=F[j-1]
|  |  else
|  |     F[i]=0           // no match
|  |     i=i+1
|  |  end if
|  end while
|  return F

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [31/55]

<< ∧ >>

❖ ... Knuth-Morris-Pratt Algorithm

Analysis of failure function computation:

At each iteration of the while-loop, either
- i increases by one, or
- the "shift amount" i-j increases by at least one (observe that F(j-1)<j)
Hence, there are no more than 2·m iterations of the while-loop

⇒ failure function can be computed in O(m) time

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [32/55]

<< ∧ >>

❖ ... Knuth-Morris-Pratt Algorithm

Analysis of Knuth-Morris-Pratt algorithm:

Failure function can be computed in O(m) time
At each iteration of the while-loop, either
- i increases by one, or
- the "shift amount" i-j increases by at least one (observe that F(j-1)<j)
Hence, there are no more than 2·n iterations of the while-loop

⇒ KMP's algorithm runs in optimal time O(m+n)

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [33/55]

<< ∧ >>

❖ Boyer-Moore vs KMP

Boyer-Moore algorithm

decides how far to jump ahead based on the mismatched character in the text
works best on large alphabets and natural language texts (e.g. English)

Knuth-Morris-Pratt algorithm

uses information embodied in the pattern to determine where the next match could begin
works best on small alphabets (e.g. A,C,G,T)

For the keen: The article "Average running time of the Boyer-Moore-Horspool algorithm" shows that the time is inversely proportional to size of alphabet

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [34/55]

<< ∧ >>

❖ Word Matching With Tries

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [35/55]

<< ∧ >>

❖ Preprocessing Strings

Preprocessing the pattern speeds up pattern matching queries

After preprocessing P, KMP algorithm performs pattern matching in time proportional to the text length

If the text is large, immutable and searched for often (e.g., works by Shakespeare)

we can preprocess the text instead of the pattern

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [36/55]

<< ∧ >>

❖ ... Preprocessing Strings

A trie …

is a compact data structure for representing a set of strings
- e.g. all the words in a text, a dictionary etc.
supports pattern matching queries in time proportional to the pattern size

Note: Trie comes from retrieval, but is pronounced like "try" to distinguish it from "tree"

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [37/55]

<< ∧ >>

❖ Tries

Tries are trees organised using parts of keys (rather than whole keys)

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [38/55]

<< ∧ >>

❖ ... Tries

Each node in a trie …

contains one part of a key (typically one character)
may have up to 26 children
may be tagged as a "finishing" node
but even "finishing" nodes may have children

Depth d of trie = length of longest key value

Cost of searching O(d) (independent of n)

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [39/55]

<< ∧ >>

❖ ... Tries

Possible trie representation:

#define ALPHABET_SIZE 26

typedef struct Node *Trie;

typedef struct Node {
   bool finish;      // last char in key?
   Item data;        // no Item if !finish
   Trie child[ALPHABET_SIZE];
} Node;

typedef char *Key;

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [40/55]

<< ∧ >>

❖ ... Tries

Note: Can also use BST-like nodes for more space-efficient implementation of tries

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [41/55]

<< ∧ >>

❖ Trie Operations

Basic operations on tries:

search for a key
insert a key

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [42/55]

<< ∧ >>

❖ Trie Operations

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [43/55]

<< ∧ >>

❖ ... Trie Operations

Traversing a path, using char-by-char from Key:

find(trie,key):
|  Input  trie, key
|  Output pointer to element in trie if key found
|         NULL otherwise
|
|  node=trie
|  for each char in key do
|  |  if node.child[char] exists then
|  |     node=node.child[char]  // move down one level
|  |  else
|  |     return NULL
|  |  end if
|  end for
|  if node.finish then          // "finishing" node reached?
|     return node
|  else
|     return NULL
|  end if

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [44/55]

<< ∧ >>

❖ ... Trie Operations

Insertion into Trie:

insert(trie,item,key):
|  Input  trie, item with key of length m
|  Output trie with item inserted
|
|  if trie is empty then
|     t=new trie node
|  end if
|  if m=0 then
|     t.finish=true, t.data=item
|  else
|     t.child[key[0]]=insert(t.child[key[0]],item,key[1..m-1])
|  end if
|  return t

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [45/55]

<< ∧ >>

❖ ... Trie Operations

Analysis of standard tries:

O(n) space
insertion and search in time O(d·m)
- n … total size of text (e.g. sum of lengths of all strings in a given dictionary)
- m … size of the string parameter of the operation (the "key")
- d … size of the underlying alphabet (e.g. 26)

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [46/55]

<< ∧ >>

❖ Word Matching With Tries

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [47/55]

<< ∧ >>

❖ Word Matching with Tries

Preprocessing the text:

Insert all searchable words of a text into a trie
Each leaf stores the occurrence(s) of the associated word in the text

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [48/55]

<< ∧ >>

❖ ... Word Matching with Tries

Example text and corresponding trie of searchable words:

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [49/55]

<< ∧ >>

❖ Compressed Tries

Compressed tries …

have internal nodes of degree ≥ 2
are obtained from standard tries by compressing "redundant" chains of nodes

Example:

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [50/55]

<< ∧ >>

❖ ... Compressed Tries

Possible compact representation of a compressed trie to encode an array S of strings:

nodes store ranges of indices instead of substrings
- use triple (i,j,k) to represente substring S[i][j..k]
requires O(s) space (s = #strings in array S)

Example:

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [51/55]

<< ∧ >>

❖ Pattern Matching With Suffix Tries

The suffix trie of a text T is the compressed trie of all the suffixes of T

Example:

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [52/55]

<< ∧ >>

❖ ... Pattern Matching With Suffix Tries

Compact representation:

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [53/55]

<< ∧ >>

❖ ... Pattern Matching With Suffix Tries

Analysis of pattern matching using suffix tries:

Suffix trie for a text of size n …

can be constructed in O(n) time
uses O(n) space
supports pattern matching queries in O(s·m) time
- m … length of the pattern
- s … size of the alphabet

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [54/55]

<< ∧

❖ Summary

Alphabets and words
Pattern matching
- Boyer-Moore, Knuth-Morris-Pratt
Tries

Suggested reading:
- tries … Sedgewick, Ch. 15.2

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [55/55]

Produced: 20 Dec 2021