COMP9024 ♢ Week 04b ♢ String Algorithms ♢ (22T0)

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [0/55]
❖ Strings

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [1/55]
❖ Strings

A string is a sequence of characters.

An alphabet Σ is the set of possible characters in strings.

Examples of strings:

Examples of alphabets:
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [2/55]
❖ ... Strings

Notation:

νω denotes the concatenation of strings ν and ω

Note: length(νω) = length(ν)+length(ω)    λω = ω = ωλ

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [3/55]
❖ ... Strings

Notation:

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [4/55]
❖ Exercise : Strings

The string  a/a  of length 3 over the ASCII alphabet has

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [5/55]
Note:
"" means the same as λ   (= empty string)
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [6/55]
❖ ... Strings

ASCII (American Standard Code for Information Interchange)

[Diagram:Pic/ascii.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [7/55]
❖ ... Strings

Reminder:

In C a string is an array of chars containing ASCII codes

Because strings are so common, C provides convenient syntax:


char str[] = "hello";  // same as char str[] = {'h','e','l','l','o','\0'};

Note: str[] will have 6 elements

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [8/55]
❖ ... Strings

C provides a number of string manipulation functions via #include <string.h>, e.g.

strlen()   // length of string
strncpy()  // copy one string to another
strncat()  // concatenate two strings
strstr()   // find substring inside string

Example:

char *strncat(char *dest, char *src, int n)

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [9/55]
❖ Pattern Matching

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [10/55]
❖ Pattern Matching

Example (pattern checked backwards):

[Diagram:Pic/pattern-matching.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [11/55]
❖ ... Pattern Matching

Given two strings T (text) and P (pattern),
the pattern matching problem consists of finding a substring of T equal to P

Applications:

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [12/55]
❖ ... Pattern Matching

Naive pattern matching algorithm

NaiveMatching(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  for all i=0..n-m do
|  |  j=0                            // check from left to right
|  |  while j<m and T[i+j]=P[j] do  // test ith shift of pattern
|  |  |  j=j+1
|  |  |  if j=m then
|  |  |     return i                 // entire pattern checked
|  |  |  end if
|  |  end while
|  end for
|  return -1                         // no match found

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [13/55]
❖ Analysis of Naive Pattern Matching

Naive pattern matching runs in O(n·m)

Examples of worst case (forward checking):

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [14/55]
❖ Exercise : Naive Matching

Suppose all characters in P are different.

Can you accelerate NaiveMatching to run in O(n) on an n-character text T?

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [15/55]

When a mismatch occurs between P[j] and T[i+j], shift the pattern all the way to align P[0] with T[i+j]

⇒ each character in T checked at most twice

Example:

abcdabcdeabcc    abcdabcdeabcc
abcdexxxxxxxx    xxxxabcde

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [16/55]
❖ Boyer-Moore Algorithm

The Boyer-Moore pattern matching algorithm is based on two heuristics:

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [17/55]
❖ ... Boyer-Moore Algorithm

Example:

[Diagram:Pic/boyer-moore.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [18/55]
❖ ... Boyer-Moore Algorithm

Boyer-Moore algorithm preprocesses pattern P and alphabet Σ to build

Example: Σ = {a,b,c,d}, P = acab

cabcd
L(c)231-1
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [19/55]
❖ ... Boyer-Moore Algorithm

BoyerMooreMatch(T,P,Σ):
|  Input  text T of length n, pattern P of length m, alphabet Σ
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  L=lastOccurenceFunction(P,Σ)
|  i=m-1, j=m-1                 // start at end of pattern
|  repeat
|  |  if T[i]=P[j] then
|  |  |  if j=0 then
|  |  |     return i            // match found at i
|  |  |  else
|  |  |     i=i-1, j=j-1
|  |  |  end if
|  |  else                      // character-jump
|  |  |  i=i+m-min(j,1+L[T[i]])
|  |  |  j=m-1
|  |  end if
|  until i≥n
|  return -1                    // no match

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [20/55]
❖ ... Boyer-Moore Algorithm

Case 1: j ≤ 1+L[c]

[Diagram:Pic/boyer-moore-case1.png]

Case 2: 1+L[c] < j

[Diagram:Pic/boyer-moore-case2.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [21/55]
❖ Exercise : Boyer-Moore algorithm

For the alphabet Σ = {a,b,c,d}

  1. compute last-occurrence function L for pattern P = abacab
  2. trace Boyer-More on P and text T = abacaabadcabacabaabb
    • how many comparisons are needed?
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [22/55]

cabcd
L(c)453-1

[Diagram:Pic/boyer-moore-example.png]


13 comparisons in total

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [23/55]
❖ ... Boyer-Moore Algorithm

Analysis of Boyer-Moore algorithm:

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [24/55]
❖ Knuth-Morris-Pratt Algorithm

The Knuth-Morris-Pratt algorithm …

Reminder:

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [25/55]
❖ ... Knuth-Morris-Pratt Algorithm

When a mismatch occurs …


[Diagram:Pic/kmp-shift.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [26/55]
❖ ... Knuth-Morris-Pratt Algorithm

KMP preprocesses the pattern P[0..m-1] to find matches of its prefixes with itself

Example: P = abaaba
j012345
Pjabaaba
F(j)001123

[Diagram:Pic/kmp-failure-function.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [27/55]
❖ ... Knuth-Morris-Pratt Algorithm

KMPMatch(T,P):
|  Input  text T of length n, pattern P of length m
|  Output starting index of a substring of T equal to P
|         -1 if no such substring exists
|
|  F=failureFunction(P)
|  i=0, j=0                    // start from left
|  while i<n do
|  |  if T[i]=P[j] then
|  |     if j=m-1 then
|  |        return i-j         // match found at i-j
|  |     else
|  |        i=i+1, j=j+1
|  |     end if
|  |  else                     // mismatch at P[j]
|  |     if j>0 then
|  |        j=F[j-1]           // resume comparing P at F[j-1]
|  |     else
|  |        i=i+1
|  |     end if
|  |  end if
|  end while
|  return -1                   // no match

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [28/55]
❖ Exercise : KMP-Algorithm

  1. compute failure function F for pattern P = abacab
  2. trace Knuth-Morris-Pratt on P and text T = abacaabaccabacabaabb
    • how many comparisons are needed?
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [29/55]

j012345
Pjabacab
F(j)001012

[Diagram:Pic/kmp-example.png]


19 comparisons in total

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [30/55]
❖ ... Knuth-Morris-Pratt Algorithm

Construction of the failure function matches pattern against itself:

failureFunction(P):
|  Input  pattern P of length m
|  Output failure function for P
|
|  F[0]=0
|  i=1, j=0
|  while i<m do
|  |  if P[i]=P[j] then   // we have matched j+1 characters
|  |     F[i]=j+1
|  |     i=i+1, j=j+1
|  |  else if j>0 then    // use failure function to shift P
|  |     j=F[j-1]
|  |  else
|  |     F[i]=0           // no match
|  |     i=i+1
|  |  end if
|  end while
|  return F

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [31/55]
❖ ... Knuth-Morris-Pratt Algorithm

Analysis of failure function computation:

⇒  failure function can be computed in O(m) time
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [32/55]
❖ ... Knuth-Morris-Pratt Algorithm

Analysis of Knuth-Morris-Pratt algorithm:

⇒  KMP's algorithm runs in optimal time O(m+n)
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [33/55]
❖ Boyer-Moore vs KMP

Boyer-Moore algorithm

Knuth-Morris-Pratt algorithm

For the keen: The article "Average running time of the Boyer-Moore-Horspool algorithm" shows that the time is inversely proportional to size of alphabet

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [34/55]
❖ Word Matching With Tries

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [35/55]
❖ Preprocessing Strings

Preprocessing the pattern speeds up pattern matching queries

If the text is large, immutable and searched for often (e.g., works by Shakespeare)
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [36/55]
❖ ... Preprocessing Strings

A trie

Note: Trie comes from retrieval, but is pronounced like "try" to distinguish it from "tree"
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [37/55]
❖ Tries

Tries are trees organised using parts of keys (rather than whole keys)

[Diagram:Pic/trie-example.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [38/55]
❖ ... Tries

Each node in a trie …

Depth d of trie = length of longest key value

Cost of searching O(d)   (independent of n)

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [39/55]
❖ ... Tries

Possible trie representation:

#define ALPHABET_SIZE 26

typedef struct Node *Trie;

typedef struct Node {
   bool finish;      // last char in key?
   Item data;        // no Item if !finish
   Trie child[ALPHABET_SIZE];
} Node;

typedef char *Key;

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [40/55]
❖ ... Tries

Note: Can also use BST-like nodes for more space-efficient implementation of tries

[Diagram:Pic/trie-as-bst.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [41/55]
❖ Trie Operations

Basic operations on tries:

  1. search for a key
  2. insert a key
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [42/55]
❖ Trie Operations

[Diagram:Pic/trie-search.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [43/55]
❖ ... Trie Operations

Traversing a path, using char-by-char from Key:

find(trie,key):
|  Input  trie, key
|  Output pointer to element in trie if key found
|         NULL otherwise
|
|  node=trie
|  for each char in key do
|  |  if node.child[char] exists then
|  |     node=node.child[char]  // move down one level
|  |  else
|  |     return NULL
|  |  end if
|  end for
|  if node.finish then          // "finishing" node reached?
|     return node
|  else
|     return NULL
|  end if

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [44/55]
❖ ... Trie Operations

Insertion into Trie:

insert(trie,item,key):
|  Input  trie, item with key of length m
|  Output trie with item inserted
|
|  if trie is empty then
|     t=new trie node
|  end if
|  if m=0 then
|     t.finish=true, t.data=item
|  else
|     t.child[key[0]]=insert(t.child[key[0]],item,key[1..m-1])
|  end if
|  return t

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [45/55]
❖ ... Trie Operations

Analysis of standard tries:

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [46/55]
❖ Word Matching With Tries

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [47/55]
❖ Word Matching with Tries

Preprocessing the text:

  1. Insert all searchable words of a text into a trie
  2. Each leaf stores the occurrence(s) of the associated word in the text
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [48/55]
❖ ... Word Matching with Tries

Example text and corresponding trie of searchable words:

[Diagram:Pic/text-trie-example.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [49/55]
❖ Compressed Tries

Compressed tries

Example:

[Diagram:Pic/compressed-trie.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [50/55]
❖ ... Compressed Tries

Possible compact representation of a compressed trie to encode an array S of strings:

Example:

[Diagram:Pic/compact-trie.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [51/55]
❖ Pattern Matching With Suffix Tries

The suffix trie of a text T is the compressed trie of all the suffixes of T

Example:

[Diagram:Pic/suffix-trie.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [52/55]
❖ ... Pattern Matching With Suffix Tries

Compact representation:

[Diagram:Pic/suffix-trie-compact.png]

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [53/55]
❖ ... Pattern Matching With Suffix Tries

Analysis of pattern matching using suffix tries:

Suffix trie for a text of size n

COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [54/55]
❖ Summary


COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [55/55]


Produced: 20 Dec 2021