DOI: 10.1145/2499178.2499196
research-article

Exploiting Forum Thread Structures to Improve Thread Clustering

Published: 29 September 2013 Publication History

Abstract

Automated clustering of threads within and across web forums will greatly benefit both users and forum administrators in efficiently seeking, managing, and integrating the huge volume of content being generated. While clustering has been studied for other types of data, little work has been done on clustering forum threads; the informal nature and special structure of forum data make it interesting to study how to cluster forum threads effectively. In this paper, we apply three state-of-the-art clustering methods (hierarchical agglomerative clustering, k-Means, and probabilistic latent semantic analysis) to cluster forum threads and study how to leverage the structure of threads to improve clustering accuracy. We propose three different methods for assigning weights to the posts in a forum thread to achieve a more accurate representation of a thread. We evaluate all the methods on data collected from three different Linux forums for both within-forum and across-forum clustering. Our results show that the state-of-the-art methods perform reasonably well for this task, but that performance can be further improved by exploiting thread structures. In particular, a parabolic weighting method that assigns higher weights to both the beginning posts and the end posts of a thread is shown to consistently outperform a standard clustering method.
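
The abstract does not give the weighting formula, so the following is only a minimal sketch of what a parabolic post-weighting scheme could look like, not the authors' method. The function names `parabolic_weights` and `thread_vector`, the `floor` parameter, and the whitespace tokenizer are all assumptions made for illustration. Each post gets a weight that is highest at the first and last positions and lowest in the middle, and the thread is represented as a weighted bag of words that a standard clustering method such as k-Means could consume.

```python
from collections import Counter


def parabolic_weights(n_posts, floor=0.2):
    """Return one weight per post position (hypothetical scheme).

    The weight is 1.0 for the first and last post and falls
    quadratically to `floor` at the middle of the thread.
    """
    if n_posts <= 0:
        return []
    if n_posts == 1:
        return [1.0]
    mid = (n_posts - 1) / 2.0
    return [
        floor + (1.0 - floor) * ((i - mid) / mid) ** 2
        for i in range(n_posts)
    ]


def thread_vector(posts, floor=0.2):
    """Build a weighted bag-of-words vector for a thread.

    `posts` is a list of post strings in thread order. Each term
    count is scaled by the weight of the post it appears in.
    Tokenization is a plain lowercase whitespace split (assumption).
    """
    weights = parabolic_weights(len(posts), floor)
    vec = Counter()
    for weight, post in zip(weights, posts):
        for term in post.lower().split():
            vec[term] += weight
    return dict(vec)


if __name__ == "__main__":
    thread = ["Wifi broken after update", "try a reboot", "wifi fixed thanks"]
    print(parabolic_weights(len(thread)))
    print(thread_vector(thread))
```

With this shape, terms from the opening question and the closing resolution dominate the thread vector, while mid-thread chatter contributes less. The `floor` value controls how strongly the middle posts are discounted and would need tuning on real data.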

Cited By

  • (2023) Automatic Recommendation of Forum Threads and Reinforcement Activities in a Data Structure and Programming Course. Applied System Innovation 6:5 (83). DOI: 10.3390/asi6050083. Online publication date: 21-Sep-2023
  • (2019) Thread Structure Learning on Online Health Forums With Partially Labeled Data. IEEE Transactions on Computational Social Systems 6:6 (1273-1282). DOI: 10.1109/TCSS.2019.2946498. Online publication date: Dec-2019
  • (2014) Resolving healthcare forum posts via similar thread retrieval. Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (33-42). DOI: 10.1145/2649387.2649399. Online publication date: 20-Sep-2014

    Information & Contributors

    Information

    Published In

    ICTIR '13: Proceedings of the 2013 Conference on the Theory of Information Retrieval
    September 2013
    148 pages
    ISBN:9781450321075
    DOI:10.1145/2499178

    Sponsors

    • Findwise AB
    • Google Inc.
    • Spinque
    • University of Copenhagen
    • LARM Audio Research Archive
    • Royal School of Library and Information Science
    • Yahoo! Labs

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. forums
    2. k-Means
    3. text mining
    4. thread clustering
    5. web 2.0

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICTIR '13
    Sponsor:
    • Findwise
    • Spinque
    • Univ. of Copenhagen
    • LARM
    • Royal School of Library and Information Science

    Acceptance Rates

    ICTIR '13 paper acceptance rate: 11 of 51 submissions (22%)
    Overall acceptance rate: 235 of 527 submissions (45%)
