COMP9024 ♢ Week 04b ♢ String Algorithms ♢ (22T0)
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [0/55]
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [1/55]
A string is a sequence of characters.
An alphabet Σ is the set of possible characters in strings.
Examples of strings:
- C program
- HTML document
- DNA sequence
- Digitised image
Examples of alphabets:
- ASCII
- Unicode
- {0,1}
- {A,C,G,T}
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [2/55]
Notation:
- length(P) … #characters in P
- λ … empty string (length(λ) = 0)
- Σm … set of all strings of length m over alphabet Σ
- Σ* … set of all strings over alphabet Σ
νω denotes the
concatenation of strings ν and ω
Note: length(νω) = length(ν)+length(ω) λω = ω = ωλ
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [3/55]
Notation:
- substring of P … any string Q such that P = νQω, for some ν,ω∈Σ*
- prefix of P … any string Q such that P = Qω, for some ω∈Σ*
- suffix of P … any string Q such that P = ωQ, for some ω∈Σ*
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [4/55]
The string a/a
of length 3 over the ASCII alphabet has
- how many prefixes?
- how many suffixes?
- how many substrings?
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [5/55]
- 4 prefixes:
""
"a"
"a/"
"a/a"
- 4 suffixes:
"a/a"
"/a"
"a"
""
- 6 substrings:
""
"a"
"/"
"a/"
"/a"
"a/a"
Note:
""
means the same as λ (= empty string)
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [6/55]
ASCII (American Standard Code for Information Interchange)
- Specifies mapping of 128 characters to integers 0..127
- The characters encoded include:
- upper and lower case English letters: A-Z and a-z
- digits: 0-9
- common punctuation symbols
- special non-printing characters: e.g. newline and space
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [7/55]
Reminder:
In C a string is an array of char
s containing ASCII codes
- these arrays have an extra element containing a 0
- the extra 0 can also be written
'\0'
(null character or null-terminator)
- convenient because don't have to track the length of the string
Because strings are so common, C provides convenient syntax:
char str[] = "hello";
Note: str[]
will have 6 elements
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [8/55]
C provides a number of string manipulation functions via #include <string.h>
, e.g.
strlen()
strncpy()
strncat()
strstr()
Example:
char *strncat(char *dest, char *src, int n)
- appends string
src
to the end of dest
overwriting the '\0'
at the end of dest
and adds terminating '\0'
- returns start of string
dest
- will never add more than
n
characters
(If src
is less than n
characters long, the remainder of dest
is filled with '\0'
characters. Otherwise, dest
is not null-terminated.)
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [9/55]
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [10/55]
Example (pattern checked backwards):
- Text …
abacaab
- Pattern …
abacab
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [11/55]
Given two strings T (text) and P (pattern),
the pattern matching problem consists of finding a substring of T equal to P
Applications:
- Text editors
- Search engines
- Biological research
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [12/55]
Naive pattern matching algorithm
- checks for each possible shift of P relative to T
- until a match is found, or
- all placements of the pattern have been tried
NaiveMatching(T,P):
| Input text T of length n, pattern P of length m
| Output starting index of a substring of T equal to P
| -1 if no such substring exists
|
| for all i=0..n-m do
| | j=0
| | while j<m and T[i+j]=P[j] do
| | | j=j+1
| | | if j=m then
| | | return i
| | | end if
| | end while
| end for
| return -1
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [13/55]
❖ Analysis of Naive Pattern Matching | |
Naive pattern matching runs in O(n·m)
Examples of worst case (forward checking):
- T =
aaa…ah
- P =
aaah
- may occur in DNA sequences
- unlikely in English text
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [14/55]
❖ Exercise : Naive Matching | |
Suppose all characters in P are different.
Can you accelerate NaiveMatching
to run in O(n) on an n-character text T?
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [15/55]
When a mismatch occurs between P[j] and T[i+j], shift the pattern all the way to align P[0] with T[i+j]
⇒ each character in T checked at most twice
Example:
abcd
a
bcdeabcc
abcd
a
bcdeabcc
abcd
e
xxxxxxxx
xxxx
abcde
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [16/55]
The Boyer-Moore pattern matching algorithm is based on two heuristics:
- Looking-glass heuristic: Compare P with subsequence of T moving backwards
- Character-jump heuristic: When a mismatch occurs at T[i]=
c
- if P contains
c
⇒ shift P so as to align the last occurrence of c
in P with T[i]
- otherwise ⇒ shift P so as to align P[0] with T[i+1] (a.k.a. "big jump")
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [17/55]
❖ ... Boyer-Moore Algorithm | |
Example:
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [18/55]
❖ ... Boyer-Moore Algorithm | |
Boyer-Moore algorithm preprocesses pattern P and alphabet Σ to build
- last-occurrence function L
- L maps Σ to integers such that L(c) is defined as
- the largest index i such that P[i]=c, or
- -1 if no such index exists
Example: Σ = {a,b,c,d
}, P = acab
- L can be represented by an array indexed by the numeric codes of the characters
- L can be computed in O(m+s) time (m … length of pattern, s … size of Σ)
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [19/55]
❖ ... Boyer-Moore Algorithm | |
BoyerMooreMatch(T,P,Σ):
| Input text T of length n, pattern P of length m, alphabet Σ
| Output starting index of a substring of T equal to P
| -1 if no such substring exists
|
| L=lastOccurenceFunction(P,Σ)
| i=m-1, j=m-1
| repeat
| | if T[i]=P[j] then
| | | if j=0 then
| | | return i
| | | else
| | | i=i-1, j=j-1
| | | end if
| | else
| | | i=i+m-min(j,1+L[T[i]])
| | | j=m-1
| | end if
| until i≥n
| return -1
- Biggest jump (m characters ahead) occurs when L[T[i]] = -1
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [20/55]
❖ ... Boyer-Moore Algorithm | |
Case 1: j ≤ 1+L[c]
Case 2: 1+L[c] < j
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [21/55]
❖ Exercise : Boyer-Moore algorithm | |
For the alphabet Σ = {a,b,c,d
}
- compute last-occurrence function L for pattern P =
abacab
- trace Boyer-More on P and text T =
abacaabadcabacabaabb
- how many comparisons are needed?
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [22/55]
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [23/55]
❖ ... Boyer-Moore Algorithm | |
Analysis of Boyer-Moore algorithm:
- Runs in O(nm+s) time
- m … length of pattern n … length of text s … size of alphabet
- Example of worst case:
- Worst case may occur in images and DNA sequences but unlikely in English texts
⇒ Boyer-Moore significantly faster than naive matching on English text
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [24/55]
❖ Knuth-Morris-Pratt Algorithm | |
The Knuth-Morris-Pratt algorithm …
- compares the pattern to the text left-to-right
- but shifts the pattern more intelligently than the naive algorithm
Reminder:
- Q is a prefix of P … P = Qω, for some ω∈Σ*
- Q is a suffix of P … P = ωQ, for some ω∈Σ*
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [25/55]
❖ ... Knuth-Morris-Pratt Algorithm | |
When a mismatch occurs …
- what is the most we can shift the pattern to avoid redundant comparisons?
- Answer: the largest prefix of P[0..j] that is a suffix of P[1..j]
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [26/55]
❖ ... Knuth-Morris-Pratt Algorithm | |
KMP preprocesses the pattern P[0..m-1] to find matches of its prefixes with itself
- Failure function F(j) defined as
- the size of the largest prefix of P[0..j] that is also a suffix of P[1..j]
for each position j=0..m-1
- if mismatch occurs at Pj ⇒ advance j to F(j-1)
Example:
P =
abaaba
j | 0 | 1 | 2 | 3 | 4 | 5 |
Pj | a | b | a | a | b | a |
F(j) | 0 | 0 | 1 | 1 | 2 | 3 |
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [27/55]
❖ ... Knuth-Morris-Pratt Algorithm | |
KMPMatch(T,P):
| Input text T of length n, pattern P of length m
| Output starting index of a substring of T equal to P
| -1 if no such substring exists
|
| F=failureFunction(P)
| i=0, j=0
| while i<n do
| | if T[i]=P[j] then
| | if j=m-1 then
| | return i-j
| | else
| | i=i+1, j=j+1
| | end if
| | else
| | if j>0 then
| | j=F[j-1]
| | else
| | i=i+1
| | end if
| | end if
| end while
| return -1
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [28/55]
❖ Exercise : KMP-Algorithm | |
- compute failure function F for pattern P =
abacab
- trace Knuth-Morris-Pratt on P and text T =
abacaabaccabacabaabb
- how many comparisons are needed?
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [29/55]
j | 0 | 1 | 2 | 3 | 4 | 5 |
Pj | a | b | a | c | a | b |
F(j) | 0 | 0 | 1 | 0 | 1 | 2 |
19 comparisons in total
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [30/55]
❖ ... Knuth-Morris-Pratt Algorithm | |
Construction of the failure function matches pattern against itself:
failureFunction(P):
| Input pattern P of length m
| Output failure function for P
|
| F[0]=0
| i=1, j=0
| while i<m do
| | if P[i]=P[j] then
| | F[i]=j+1
| | i=i+1, j=j+1
| | else if j>0 then
| | j=F[j-1]
| | else
| | F[i]=0
| | i=i+1
| | end if
| end while
| return F
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [31/55]
❖ ... Knuth-Morris-Pratt Algorithm | |
Analysis of failure function computation:
- At each iteration of the while-loop, either
- i increases by one, or
- the "shift amount" i-j increases by at least one (observe that F(j-1)<j)
- Hence, there are no more than 2·m iterations of the while-loop
⇒ failure function can be computed in
O(m) time
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [32/55]
❖ ... Knuth-Morris-Pratt Algorithm | |
Analysis of Knuth-Morris-Pratt algorithm:
- Failure function can be computed in O(m) time
- At each iteration of the while-loop, either
- i increases by one, or
- the "shift amount" i-j increases by at least one (observe that F(j-1)<j)
- Hence, there are no more than 2·n iterations of the while-loop
⇒ KMP's algorithm runs in optimal time O(m+n)
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [33/55]
Boyer-Moore algorithm
- decides how far to jump ahead based on the mismatched character in the text
- works best on large alphabets and natural language texts (e.g. English)
Knuth-Morris-Pratt algorithm
- uses information embodied in the pattern to determine where the next match could begin
- works best on small alphabets (e.g.
A,C,G,T
)
For the keen: The article "Average running time of the Boyer-Moore-Horspool algorithm" shows that the time is inversely proportional to size of alphabet
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [34/55]
❖ Word Matching With Tries | |
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [35/55]
Preprocessing the pattern speeds up pattern matching queries
- After preprocessing P, KMP algorithm performs pattern matching in time proportional to the text length
If the text is large, immutable and searched for often (e.g., works by Shakespeare)
- we can preprocess the text instead of the pattern
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [36/55]
❖ ... Preprocessing Strings | |
A trie …
- is a compact data structure for representing a set of strings
- e.g. all the words in a text, a dictionary etc.
- supports pattern matching queries in time proportional to the pattern size
Note: Trie comes from retrieval, but is pronounced like "try" to distinguish it from "tree"
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [37/55]
Tries are trees organised using parts of keys (rather than whole keys)
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [38/55]
Each node in a trie …
- contains one part of a key (typically one character)
- may have up to 26 children
- may be tagged as a "finishing" node
- but even "finishing" nodes may have children
Depth d of trie = length of longest key value
Cost of searching O(d) (independent of n)
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [39/55]
Possible trie representation:
#define ALPHABET_SIZE 26
typedef struct Node *Trie;
typedef struct Node {
bool finish;
Item data;
Trie child[ALPHABET_SIZE];
} Node;
typedef char *Key;
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [40/55]
Note: Can also use BST-like nodes for more space-efficient implementation of tries
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [41/55]
Basic operations on tries:
- search for a key
- insert a key
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [42/55]
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [43/55]
Traversing a path, using char-by-char from Key:
find(trie,key):
| Input trie, key
| Output pointer to element in trie if key found
| NULL otherwise
|
| node=trie
| for each char in key do
| | if node.child[char] exists then
| | node=node.child[char]
| | else
| | return NULL
| | end if
| end for
| if node.finish then
| return node
| else
| return NULL
| end if
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [44/55]
Insertion into Trie:
insert(trie,item,key):
| Input trie, item with key of length m
| Output trie with item inserted
|
| if trie is empty then
| t=new trie node
| end if
| if m=0 then
| t.finish=true, t.data=item
| else
| t.child[key[0]]=insert(t.child[key[0]],item,key[1..m-1])
| end if
| return t
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [45/55]
Analysis of standard tries:
- O(n) space
- insertion and search in time O(d·m)
- n … total size of text (e.g. sum of lengths of all strings in a given dictionary)
- m … size of the string parameter of the operation (the "key")
- d … size of the underlying alphabet (e.g. 26)
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [46/55]
❖ Word Matching With Tries | |
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [47/55]
❖ Word Matching with Tries | |
Preprocessing the text:
- Insert all searchable words of a text into a trie
- Each leaf stores the occurrence(s) of the associated word in the text
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [48/55]
❖ ... Word Matching with Tries | |
Example text and corresponding trie of searchable words:
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [49/55]
Compressed tries …
- have internal nodes of degree ≥ 2
- are obtained from standard tries by compressing "redundant" chains of nodes
Example:
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [50/55]
Possible compact representation of a compressed trie to encode an array S of strings:
- nodes store ranges of indices instead of substrings
- use triple (i,j,k) to represente substring S[i][j..k]
- requires O(s) space (s = #strings in array S)
Example:
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [51/55]
❖ Pattern Matching With Suffix Tries | |
The suffix trie of a text T is the compressed trie of all the suffixes of T
Example:
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [52/55]
❖ ... Pattern Matching With Suffix Tries | |
Compact representation:
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [53/55]
❖ ... Pattern Matching With Suffix Tries | |
Analysis of pattern matching using suffix tries:
Suffix trie for a text of size n …
- can be constructed in O(n) time
- uses O(n) space
- supports pattern matching queries in O(s·m) time
- m … length of the pattern
- s … size of the alphabet
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [54/55]
- Alphabets and words
- Pattern matching
- Boyer-Moore, Knuth-Morris-Pratt
- Tries
- Suggested reading:
- tries … Sedgewick, Ch. 15.2
COMP9024 (22T0) ♢ Week 04b ♢ String Algorithms ♢ [55/55]
Produced: 20 Dec 2021