TY - GEN
T1 - Processing text files as is
T2 - 9th International Symposium on String Processing and Information Retrieval, SPIRE 2002
AU - Takeda, Masayuki
AU - Miyamoto, Satoru
AU - Kida, Takuya
AU - Shinohara, Ayumi
AU - Fukamachi, Shuichi
AU - Shinohara, Takeshi
AU - Arikawa, Setsuo
N1 - Publisher Copyright:
© Springer-Verlag Berlin Heidelberg 2002.
PY - 2002
Y1 - 2002
N2 - Techniques in processing text files “as is” are presented, in which given text files are processed without modification. The compressed pattern matching problem, first defined by Amir and Benson (1992), is a good example of the “as-is” principle. Another example is string matching over multi-byte character texts, which is a significant problem common to oriental languages such as Japanese, Korean, Chinese, and Taiwanese. A text file from such languages is a mixture of single-byte characters and multi-byte characters. Naive solution would be (1) to convert a given text into a fixed length encoded one and then apply any string matching routine to it; or (2) to directly search the text file byte after byte for (the encoding of) a pattern in which an extra work is needed for synchronization to avoid false detection. Both the solutions, however, sacrifice the searching speed. Our algorithm runs on such a multi-byte character text file at the same speed as on an ordinary ASCII text file, without false detection. The technique is applicable to any prefix code such as the Huffman code and variants of Unicode. We also generalize the technique so as to handle structured texts such as XML documents. Using this technique, we can avoid false detection of keyword even if it is a substring of a tag name or of an attribute description, without any sacrifice of searching speed.
AB - Techniques in processing text files “as is” are presented, in which given text files are processed without modification. The compressed pattern matching problem, first defined by Amir and Benson (1992), is a good example of the “as-is” principle. Another example is string matching over multi-byte character texts, which is a significant problem common to oriental languages such as Japanese, Korean, Chinese, and Taiwanese. A text file from such languages is a mixture of single-byte characters and multi-byte characters. Naive solution would be (1) to convert a given text into a fixed length encoded one and then apply any string matching routine to it; or (2) to directly search the text file byte after byte for (the encoding of) a pattern in which an extra work is needed for synchronization to avoid false detection. Both the solutions, however, sacrifice the searching speed. Our algorithm runs on such a multi-byte character text file at the same speed as on an ordinary ASCII text file, without false detection. The technique is applicable to any prefix code such as the Huffman code and variants of Unicode. We also generalize the technique so as to handle structured texts such as XML documents. Using this technique, we can avoid false detection of keyword even if it is a substring of a tag name or of an attribute description, without any sacrifice of searching speed.
UR - http://www.scopus.com/inward/record.url?scp=84958774017&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84958774017&partnerID=8YFLogxK
U2 - 10.1007/3-540-45735-6_16
DO - 10.1007/3-540-45735-6_16
M3 - Conference contribution
AN - SCOPUS:84958774017
SN - 3540441581
SN - 9783540441588
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 170
EP - 186
BT - String Processing and Information Retrieval - 9th International Symposium, SPIRE 2002, Proceedings
A2 - Laender, Alberto H. F.
A2 - Oliveira, Arlindo L.
PB - Springer Verlag
Y2 - 11 September 2002 through 13 September 2002
ER -