JoDINew papersPast issuesIndex of papers

Fault-tolerant Fulltext Search for Large Multilingual Scientific Text Corpora

Wolfram M. Esser
Chair of Computer Science II, University of Würzburg,
Am Hubland, 97074 Würzburg, Germany
Email: esser@informatik.uni-wuerzburg.de
Key features: References; Figures; Tables
This is a summary version of the paper. The author's authoritative fulltext is available as PDF (9 pages, 210 kb). Download latest PDF viewer 

Contents

  • 1 Introduction
  • 2 Related Work 
    • Phonetic Codes 
    • Edit Distance Based Approaches
  • 3 Weighted Pattern Morphing
  • 4 Automatic Weight Adjustments 
    • Performance Measuring Mehtod 
    • Morph Feedback for Weight Adjustments
  • 5 Experiments on Filter Efficiency and Speed
  • 6 Conclusion
  • References

Abstract

In the work reported here, we present a new way of performing fault-tolerant fulltext retrieval on large text corpora, such as scientific encyclopedias. The weighted pattern morphing (WPM) technique introduced in this paper overcomes disadvantages of both the popular edit distance measure and the Soundex code approaches, yet keeping their flexibility. This algorithm handles phonetic similarities; common typing errors such as omission or transposition of letters, and inconsistent usage of abbreviations and hyphenation. After showing how WPM can be implemented efficiently, we present a novel method of how the weights of the internal penalty matrix can be automatically adjusted for even better results. Though the described technique can be applied without prior knowledge of actual user patterns, re-examination with a large number of online-user's patterns proves the portability of this fine-tuning approach. We further show how shifting the penalty matrix from one language to another can be accomplished. The described WPM technique is integrated into a large commercial pharmaceutical encyclopedia CDROM, an online dermatological encyclopedia, and an online-reference encyclopedia of parasitology research, thus also proving its "road capability".

Index of Figures

Figure 1. Workflow of weighted pattern morphing
Figure 2. Example of WPM at work on Parasitology
Figure 3. Altmeyer: Performance of different weight sets on 62,385 words drawn from the text corpus
Figure 4. User queries: Performance of different weight sets on 5,424 different search terms collected from the online logs of Altmeyer

Index of Tables

Table 1. Two example penalty weight matrices: phoneme group "c/g/k..."(top) and numbers (bottom)
Table 2. Adjusting submorph weights by morph feedback
Table 3. Examples of WPM results on the English text corpus of "Parasitology Research"

References

Altmeyer, P. and Bacharach-Buhles, M. (2002) Springer Enzyklopädie Dermatologie, Allergologie, Umweltmedizin (Springer-Verlag: Berlin, Heidelberg)

Blaschek, W., Ebel, S., Hackenthal, E., Holzgrabe, U., Keller, K. and Reichling J. (Hrg.) (2003) HagerROM 2003 - Hagers Handbuch der Drogen und Arzneistoffe, CD-ROM (Springer-Verlag: Heidelberg) http://www.hagerrom.de

Esser, W. (2004) "Fault-tolerant Fulltext Information Retrieval in Digital Multilingual Encyclopedias with Weighted Pattern Morphing". Advances in Information Retrieval, Procedings of the 26th ECIR, LNCS 2997, edited by S. McDonald and J. Tait (Springer), pp. 338-351 http://www.derwok.de/papers/esser_wpm_ecir04.pdf

Fredkin, E. (1960) "Trie Memory". Communications of the ACM, Vol. 3 (9), September, 490-499

Gadd, T. (1990) "PHONIX: The algorithm". Program, 24(3), 363-366

Grimm, M. (2001) "Random Access und Caching für q-Gramm-Suchverfahren". Diplomarbeit Lehrstuhl für Informatik II, Universität Würzburg

Hodge, V. and Austin, J. (2001) "An Evaluation of Phonetic Spell Checkers". Tech. Report YCS338, Dept. of CS, University of York, UK http://citeseer.ist.psu.edu/463597.html

Jokinen, P. and Ukkonen, E. (1991) "Two algorithms for approximate string matching in static texts". In Proceedings of the 2nd Mathematical Foundations of Computer Science (MFCS '91), LNCS 520 (Springer-Verlag: Berlin), pp. 240-248

Knuth, D. (1998) The Art of Computer Programming, Volume 3: Sorting and Searching, second edition (Addison Wesley)

Levenshtein, V. (1965) "Binary codes capable of correcting deletions, insertions, and reversals". Problems in Information Transmission, 1, 8-17

Myers, E. (1994) "A sublinear algorithm for approximate keyword searching". Algorithmica, 12(4/5), 345-374

Navarro, G. and Baeza-Yates, R. (1998) "A Practical q-Gram Index for Text Retrieval Allowing Errors". CLEI Electronic Journal, Vol. 1, No. 2, 1 http://www.clei.cl/cleiej/paper.php?id=32

Navarro, G., Baeza-Yates, R., Sutinen, E. and Tarhio, J. (2001) "Indexing Methods for Approximate String Matching". IEEE Data Engineering Bulletin, Vol. 24, No. 4, 19-27 http://citeseer.ist.psu.edu/navarro00indexing.html

Philips, L. (2000) "The Double Metaphone Search Algorithm". C/C++ Users Journal, Vol. 18, No. 6, June, 1 http://www.cuj.com/documents/s=8038/cuj0006philips/

Rosenfelder, M. (2000) "Hou tu pranownse Inglish". Zompist.com http://www.zompist.com/spell.html

Russell, R. (1918) INDEX (Soundex Patent), US Patent No. 1,261,167, pp.1-4 http://patimg2.uspto.gov/.piw?Docid=01261167&idkey=E7455D7DADEF

Stephen, G. (1994) String Searching Algorithms, Lecture Notes Series on Computing, Vol. 3 (World Scientific Publishing)

Swanson, D. (1988) "Historical note: Information retrieval and the future of an illusion". Journal of the American Society for Information Science, 39 (2), 92-98

Zobel, J. and Dart, P. (1996) "Phonetic String Matching: Lessons from Information Retrieval". Proceedings of the 19th annual international ACM SIGIR conference on Research and Development in Information Retrieval, Zurich, Switzerland (ACM Press), pp. 166-172

Zobel, J. and Dart, P. (1995) "Finding Approximate Matches in Large Lexicons". Software Practice and Experience, Vol. 25, No. 3, March, 331-345 http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue3/spe948jz.pdf


The author's authoritative fulltext is available as PDF (9 pages, 210 kb). Download latest PDF viewer 

Top of pageNew papersPast issuesIndex of papers