DOI: 10.1145/345508.345550

Improving text categorization methods for event tracking

Published: 01 July 2000

Abstract

Automated tracking of events from chronologically ordered document streams is a new challenge for statistical text classification. Existing learning techniques must be adapted or improved in order to effectively handle difficult situations where the number of positive training instances per event is extremely small, the majority of training documents are unlabelled, and most of the events have a short duration in time. We adapted several supervised text categorization methods, specifically several new variants of the k-Nearest Neighbor (kNN) algorithm and a Rocchio approach, to track events. All of these methods showed significant improvement (up to 71% reduction in weighted error rates) over the performance of the original kNN algorithm on TDT benchmark collections, making kNN among the top-performing systems in the recent TDT3 official evaluation. Furthermore, by combining these methods, we significantly reduced the variance in performance of our event tracking system over different data collections, suggesting a robust solution for parameter optimization.
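To make the two method families named in the abstract concrete, the following is a minimal sketch of textbook kNN and Rocchio scoring applied to a tracking decision. It is not the paper's implementation and does not reproduce its kNN variants: the function names, the plain term-frequency weighting (no IDF), the signed-similarity kNN score, the `gamma` weight, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

def vectorize(text):
    """Bag-of-words term-frequency vector (assumption: no IDF, for brevity)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_score(doc, training, k=3):
    """Sum the similarities of the k nearest training documents,
    counting on-event neighbours positively and off-event ones negatively."""
    sims = sorted(((cosine(doc, v), pos) for v, pos in training),
                  key=lambda pair: pair[0], reverse=True)[:k]
    return sum(s if pos else -s for s, pos in sims)

def rocchio_prototype(training, gamma=0.25):
    """Event prototype: centroid of positives minus gamma times centroid of negatives."""
    pos = [v for v, is_pos in training if is_pos]
    neg = [v for v, is_pos in training if not is_pos]
    proto = defaultdict(float)
    for v in pos:
        for t, w in v.items():
            proto[t] += w / len(pos)
    for v in neg:
        for t, w in v.items():
            proto[t] -= gamma * w / len(neg)
    return dict(proto)

# Toy stream (hypothetical data): one positive example, two negatives,
# mirroring the "very few positive instances" setting.
training = [
    (vectorize("earthquake hits city rescue teams"), True),
    (vectorize("stock market falls sharply"), False),
    (vectorize("election results announced today"), False),
]
on_topic = vectorize("rescue teams search city after earthquake")
off_topic = vectorize("stock market rallies today")

proto = rocchio_prototype(training)
knn_on, knn_off = knn_score(on_topic, training, k=2), knn_score(off_topic, training, k=2)
roc_on, roc_off = cosine(on_topic, proto), cosine(off_topic, proto)
print(knn_on > 0, knn_off < 0, roc_on > 0, roc_off < 0)
```

In this sketch a document is declared on-event when its score exceeds a threshold (here zero). In practice the threshold and `k` would be tuned on held-out data, which is the parameter-optimization concern the abstract raises.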


Published In

SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
July 2000
396 pages
ISBN: 1581132263
DOI: 10.1145/345508

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. evaluation
  2. event detection and tracking
  3. machine learning and IR
  4. statistical/probabilistic models
  5. text categorization

Qualifiers

  • Article

Conference

SIGIR '00
Sponsor:
  • Greek Com Soc
  • SIGIR
  • Athens U of Econ & Business

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
