Simple and efficient algorithm for approximate dictionary matching

Naoaki Okazaki, Jun'ichi Tsujii

Research output: Contribution to conferencePaperpeer-review

59 Citations (Scopus)

Abstract

This paper presents a simple and efficient algorithm for approximate dictionary matching designed for similarity measures such as cosine, Dice, Jaccard, and overlap coefficients. We propose this algorithm, called CPMerge, for the τ - overlap join of inverted lists. First we show that this task is solvable exactly by a τ -overlap join. Given inverted lists retrieved for a query, the algorithm collects fewer candidate strings and prunes unlikely candidates to efficiently find strings that satisfy the constraint of the τ -overlap join. We conducted experiments of approximate dictionary matching on three large-scale datasets that include person names, biomedical names, and general English words. The algorithm exhibited scalable performance on the datasets. For example, it retrieved strings in 1.1 ms from the string collection of Google Web1T unigrams (with cosine similarity and threshold 0.7).

Original languageEnglish
Pages851-859
Number of pages9
Publication statusPublished - 2010 Dec 1
Event23rd International Conference on Computational Linguistics, Coling 2010 - Beijing, China
Duration: 2010 Aug 232010 Aug 27

Other

Other23rd International Conference on Computational Linguistics, Coling 2010
Country/TerritoryChina
CityBeijing
Period10/8/2310/8/27

ASJC Scopus subject areas

  • Language and Linguistics
  • Computational Theory and Mathematics
  • Linguistics and Language

Fingerprint

Dive into the research topics of 'Simple and efficient algorithm for approximate dictionary matching'. Together they form a unique fingerprint.

Cite this