



![]() |
Fault-tolerant Fulltext Search for Large Multilingual Scientific Text CorporaChair of Computer Science II, University of Würzburg, Am Hubland, 97074 Würzburg, Germany Email: esser@informatik.uni-wuerzburg.de Key features: References; Figures; Tables This is a summary version of the paper. The author's authoritative fulltext is available as PDF (9 pages, 210 kb). Download latest PDF viewer
AbstractIn the work reported here, we present a new way of performing fault-tolerant fulltext retrieval on large text corpora, such as scientific encyclopedias. The weighted pattern morphing (WPM) technique introduced in this paper overcomes disadvantages of both the popular edit distance measure and the Soundex code approaches, yet keeping their flexibility. This algorithm handles phonetic similarities; common typing errors such as omission or transposition of letters, and inconsistent usage of abbreviations and hyphenation. After showing how WPM can be implemented efficiently, we present a novel method of how the weights of the internal penalty matrix can be automatically adjusted for even better results. Though the described technique can be applied without prior knowledge of actual user patterns, re-examination with a large number of online-user's patterns proves the portability of this fine-tuning approach. We further show how shifting the penalty matrix from one language to another can be accomplished. The described WPM technique is integrated into a large commercial pharmaceutical encyclopedia CDROM, an online dermatological encyclopedia, and an online-reference encyclopedia of parasitology research, thus also proving its "road capability".Index of FiguresFigure 1. Workflow of weighted pattern morphingFigure 2. Example of WPM at work on Parasitology Figure 3. Altmeyer: Performance of different weight sets on 62,385 words drawn from the text corpus Figure 4. User queries: Performance of different weight sets on 5,424 different search terms collected from the online logs of Altmeyer Index of TablesTable 1. Two example penalty weight matrices: phoneme group "c/g/k..."(top) and numbers (bottom)Table 2. Adjusting submorph weights by morph feedback Table 3. Examples of WPM results on the English text corpus of "Parasitology Research" ReferencesAltmeyer, P. and Bacharach-Buhles, M. (2002) Springer Enzyklopädie Dermatologie, Allergologie, Umweltmedizin (Springer-Verlag: Berlin, Heidelberg)Blaschek, W., Ebel, S., Hackenthal, E., Holzgrabe, U., Keller, K. and Reichling J. (Hrg.) (2003) HagerROM 2003 - Hagers Handbuch der Drogen und Arzneistoffe, CD-ROM (Springer-Verlag: Heidelberg) http://www.hagerrom.de Esser, W. (2004) "Fault-tolerant Fulltext Information Retrieval in Digital Multilingual Encyclopedias with Weighted Pattern Morphing". Advances in Information Retrieval, Procedings of the 26th ECIR, LNCS 2997, edited by S. McDonald and J. Tait (Springer), pp. 338-351 http://www.derwok.de/papers/esser_wpm_ecir04.pdf Fredkin, E. (1960) "Trie Memory". Communications of the ACM, Vol. 3 (9), September, 490-499 Gadd, T. (1990) "PHONIX: The algorithm". Program, 24(3), 363-366 Grimm, M. (2001) "Random Access und Caching für q-Gramm-Suchverfahren". Diplomarbeit Lehrstuhl für Informatik II, Universität Würzburg Hodge, V. and Austin, J. (2001) "An Evaluation of Phonetic Spell Checkers". Tech. Report YCS338, Dept. of CS, University of York, UK http://citeseer.ist.psu.edu/463597.html Jokinen, P. and Ukkonen, E. (1991) "Two algorithms for approximate string matching in static texts". In Proceedings of the 2nd Mathematical Foundations of Computer Science (MFCS '91), LNCS 520 (Springer-Verlag: Berlin), pp. 240-248 Knuth, D. (1998) The Art of Computer Programming, Volume 3: Sorting and Searching, second edition (Addison Wesley) Levenshtein, V. (1965) "Binary codes capable of correcting deletions, insertions, and reversals". Problems in Information Transmission, 1, 8-17 Myers, E. (1994) "A sublinear algorithm for approximate keyword searching". Algorithmica, 12(4/5), 345-374 Navarro, G. and Baeza-Yates, R. (1998) "A Practical q-Gram Index for Text Retrieval Allowing Errors". CLEI Electronic Journal, Vol. 1, No. 2, 1 http://www.clei.cl/cleiej/paper.php?id=32 Navarro, G., Baeza-Yates, R., Sutinen, E. and Tarhio, J. (2001) "Indexing Methods for Approximate String Matching". IEEE Data Engineering Bulletin, Vol. 24, No. 4, 19-27 http://citeseer.ist.psu.edu/navarro00indexing.html Philips, L. (2000) "The Double Metaphone Search Algorithm". C/C++ Users Journal, Vol. 18, No. 6, June, 1 http://www.cuj.com/documents/s=8038/cuj0006philips/ Rosenfelder, M. (2000) "Hou tu pranownse Inglish". Zompist.com http://www.zompist.com/spell.html Russell, R. (1918) INDEX (Soundex Patent), US Patent No. 1,261,167, pp.1-4 http://patimg2.uspto.gov/.piw?Docid=01261167&idkey=E7455D7DADEF Stephen, G. (1994) String Searching Algorithms, Lecture Notes Series on Computing, Vol. 3 (World Scientific Publishing) Swanson, D. (1988) "Historical note: Information retrieval and the future of an illusion". Journal of the American Society for Information Science, 39 (2), 92-98 Zobel, J. and Dart, P. (1996) "Phonetic String Matching: Lessons from Information Retrieval". Proceedings of the 19th annual international ACM SIGIR conference on Research and Development in Information Retrieval, Zurich, Switzerland (ACM Press), pp. 166-172 Zobel, J. and Dart, P. (1995) "Finding Approximate Matches in Large Lexicons". Software Practice and Experience, Vol. 25, No. 3, March, 331-345 http://www.cs.ubc.ca/local/reading/proceedings/spe91-95/spe/vol25/issue3/spe948jz.pdf
|