TY - GEN
T1 - Bit-parallel approach to approximate string matching in compressed texts
AU - Matsumoto, T.
AU - Kida, T.
AU - Takeda, M.
AU - Shinohara, A.
AU - Arikawa, S.
N1 - Publisher Copyright:
© 2000 IEEE.
PY - 2000
Y1 - 2000
N2 - Addresses the problem of approximate string matching on compressed text. We consider this problem for a text string described in terms of a collage system, which is a formal system proposed by T. Kida et al. (1999) that captures various dictionary-based compression methods. We present an algorithm that exploits bit-parallelism, assuming that our problem fits in a single machine word, e.g. (m - k)(k + 1) ≤ L, where m is the pattern length, k is the number of allowed errors and L is the length, in bits, of the machine word. For a class of simple collage systems, the algorithm runs in O(k2(∥D∥ + |S|) + km) time using O(k2∥D∥) space, where ∥D∥ is the size of dictionary D and |S| is the number of tokens in the sequence S. The LZ78 (Lempel-Ziv, 1978) and the LZW (Lempel-Ziv-Welch, 1984) compression methods are covered by this class. Since we can regard ∥D∥ + |S| as the compressed length, the time and space complexities are O(k2n + km) and O(k2n), respectively. For general k and m, they become O(k3mn/L + km) and O(k3mn/L). Thus, our algorithm is competitive to the algorithm proposed by J. Kärkkäinen, et al. (2000), which runs in O(km) time using O(kmn) space, when k = O(√L).
AB - Addresses the problem of approximate string matching on compressed text. We consider this problem for a text string described in terms of a collage system, which is a formal system proposed by T. Kida et al. (1999) that captures various dictionary-based compression methods. We present an algorithm that exploits bit-parallelism, assuming that our problem fits in a single machine word, e.g. (m - k)(k + 1) ≤ L, where m is the pattern length, k is the number of allowed errors and L is the length, in bits, of the machine word. For a class of simple collage systems, the algorithm runs in O(k2(∥D∥ + |S|) + km) time using O(k2∥D∥) space, where ∥D∥ is the size of dictionary D and |S| is the number of tokens in the sequence S. The LZ78 (Lempel-Ziv, 1978) and the LZW (Lempel-Ziv-Welch, 1984) compression methods are covered by this class. Since we can regard ∥D∥ + |S| as the compressed length, the time and space complexities are O(k2n + km) and O(k2n), respectively. For general k and m, they become O(k3mn/L + km) and O(k3mn/L). Thus, our algorithm is competitive to the algorithm proposed by J. Kärkkäinen, et al. (2000), which runs in O(km) time using O(kmn) space, when k = O(√L).
UR - http://www.scopus.com/inward/record.url?scp=84964461893&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84964461893&partnerID=8YFLogxK
U2 - 10.1109/SPIRE.2000.878198
DO - 10.1109/SPIRE.2000.878198
M3 - Conference contribution
AN - SCOPUS:84964461893
T3 - Proceedings - 7th International Symposium on String Processing and Information Retrieval, SPIRE 2000
SP - 221
EP - 228
BT - Proceedings - 7th International Symposium on String Processing and Information Retrieval, SPIRE 2000
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 7th International Symposium on String Processing and Information Retrieval, SPIRE 2000
Y2 - 27 September 2000 through 29 September 2000
ER -