



Managing Content with Automatic Document Classification

School of Electrical and Information Engineering, University of Sydney, Australia
*School of Computer Engineering, Hansung University, Korea
Web: http://www.weg.ee.usyd.edu.au/people/rafa

Key features: References, Figures 1-3, Tables 1-5

This is a summary version of the paper. The author's authoritative full-text is available as PDF (15 pages, 232kb).

Abstract

News articles and Web directories represent some of the most popular and commonly accessed content on the Web. Information designers normally define categories that model these knowledge domains (i.e. news topics or Web categories), and domain experts assign documents to these categories. The paper describes how machine learning and automatic document classification techniques can be used to manage large numbers of news articles or Web page descriptions, lightening the load on domain experts. The paper uses two datasets, one with more than 800,000 Reuters news stories and another with over 41,000 Web sites, and classifies them into predefined categories using a Naïve Bayes algorithm. We discuss the parameters and design decisions that commonly arise when building automatic classifiers, including stemming, stop-words, thresholding, the amount of data, and approaches for improving performance using the structure in XML documents. The methodology developed would enable Web-based applications or workflow systems to manage information more efficiently, for example by assigning documents to topics automatically or by assisting humans in doing so.

Figures

Figure 1. Packages in AI::Categorizer
Figure 2. UML diagram for the AI::Categorizer framework
Figure 3. Naïve Bayes performance on the Reuters Corpus Volume 1 dataset

Tables

Table 1: Contingency table for class j
Table 2: Accuracy measures for Naïve Bayes and kNN on Reuters Corpus Volume 1
Table 3: Performance results averaged over six different partitionings of the dataset using flat and hierarchical Naïve Bayes and the Scut thresholding strategy
Table 4: Accuracy results for the Rcut strategy with different values of k on the hierarchical classifier
Table 5: Time efficiency of flat Naïve Bayes and hierarchical Naïve Bayes

Acknowledgements

Jae-Moon Lee and Rafael A. Calvo acknowledge the Australian Research Council and Hansung University for their financial support. Xiaobo Li and Rafael A. Calvo acknowledge the support of the Capital Markets Collaborative Research Centre.

References

[1] Calvo, Rafael A. and H. A. Ceccatto (2000) "Intelligent document classification". Intelligent Data Analysis, 4(5) http://www.weg.ee.usyd.edu.au/people/rafa/papers/ida2k/ida2k.pdf
[2] Calvo, Rafael A. and Jae-Moon Lee (2003) "Coping with the news: the machine learning way". In Proceedings of Ausweb 2003 Conference, The Ninth Australian World Wide Web Conference, Gold Coast, July, edited by A. Treloar and A. Ellis http://ausweb.scu.edu.au/aw03/papers/calvo/paper.htm
[3] Fayad, Mohamed and Douglas C. Schmidt (eds) (1999) Building Application Frameworks (John Wiley & Sons)
[4] Lewis, David D. (1998) "Naive (Bayes) at forty: The independence assumption in information retrieval". In Proceedings of ECML-98, 10th European Conference on Machine Learning, Chemnitz, edited by Claire Nédellec and Céline Rouveirol, LNCS No. 1398 (Springer-Verlag: Heidelberg), pp. 4-15
[5] Lewis, David D. and Mark Ringuette (1994) "A comparison of two learning algorithms for text categorization". In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV
[6] Li, Xiaobo and Rafael A. Calvo (2003) "Hierarchical document classification using naive bayes". In 8th Australasian Document Computing Symposium, CSIRO, Canberra, December
[7] Rose, Tony G., Mark Stevenson and Miles Whitehead (2002) "The Reuters Corpus Volume 1 - from Yesterday's News to Tomorrow's Language Resources". In 3rd International Conference on Language Resources and Evaluation, May, p. 7 http://about.reuters.com/researchandstandards/corpus/LREC_camera_ready.pdf
[8] Sebastiani, Fabrizio (2002) "Machine learning in automated text categorization". ACM Computing Surveys (CSUR), 34(1):1-47
[9] Williams, Ken and Rafael A. Calvo (2002) "A framework for text categorization". In 7th Australasian Document Computing Symposium, Sydney http://www.weg.ee.usyd.edu.au/people/rafa/papers/adcs2002/ADCS-framework.pdf
[10] Yang, Yiming (2001) "A study on thresholding strategies for text categorization". In Proceedings of SIGIR-01, 24th ACM International Conference on Research and Development in Information Retrieval, New Orleans, LA, September (ACM) http://citeseer.ist.psu.edu/449456.html
[11] Yang, Yiming and X. Liu (1999) "A re-examination of text categorization methods". In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, August (ACM), pp. 42-49
[12] Yang, Yiming and Jan O. Pedersen (1997) "A comparative study on feature selection in text categorization". In Proceedings of ICML-97, 14th International Conference on Machine Learning, edited by Douglas H. Fisher (Morgan Kaufmann Publishers: San Francisco), pp. 412-420 http://citeseer.ist.psu.edu/yang97comparative.html