Exploration of Document Relation Quality with Consideration of Term Representation Basis, Term Weighting and Association Measure

Nichnan Kittiphattanabawon,
Thanaruk Theeramunkong &
Ekawit Nantajeewarawat

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 6122)

Included in the following conference series:

Pacific-Asia Workshop on Intelligence and Security Informatics

Abstract

Tracking and relating news articles from several sources can help counter misinformation from deceptive news stories, since a single source cannot establish whether a piece of information is true. Preventing misinformation in computer systems is an interesting research topic in intelligence and security informatics. Association rule mining has recently been applied to this task because of its performance and scalability. This paper explores how term representation basis, term weighting and association measure affect the quality of relations discovered among news articles from several sources. Twenty-four combinations, formed from two term representation bases, four term weightings and three association measures, are explored, and their results are compared with human judgement. A number of evaluations compare each combination's performance with that of the others with regard to top-k ranks. The experimental results indicate that the combination of bigram (BG), term frequency with inverse document frequency (TFIDF) and confidence (CONF), as well as the combination of BG, TFIDF and conviction (CONV), performs best at finding related documents by placing them in the upper ranks, with a 0.41% rank-order mismatch on the top-50 mined relations. However, the combination of unigram (UG), TFIDF and lift (LIFT) performs best at placing irrelevant relations in the lower ranks (top-1100), with a rank-order mismatch of 9.63%.
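To make the three association measures named in the abstract concrete, the following is a minimal sketch that scores a candidate relation between two toy documents under a unigram and a bigram representation. It uses the standard textbook definitions of confidence, lift and conviction, and it rests on several assumptions that are not taken from the paper: each distinct term is treated as a transaction and each document as an item, binary term presence stands in for the term weighting (TFIDF is not modelled), and the documents, the helper names `bigrams` and `association_measures`, and the vocabulary construction are all hypothetical.

```python
from math import inf


def bigrams(tokens):
    """Return the set of adjacent word pairs (bigram terms) in a token list."""
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}


def association_measures(terms_a, terms_b, vocabulary):
    """Return (confidence, lift, conviction) for the rule A -> B.

    Assumption: every term in `vocabulary` is one transaction, and a
    document "occurs" in a transaction when it contains that term.
    """
    n = len(vocabulary)
    n_a = len(terms_a & vocabulary)
    n_b = len(terms_b & vocabulary)
    n_ab = len(terms_a & terms_b & vocabulary)
    if n_a == 0 or n_b == 0:
        raise ValueError("both documents must contain at least one term")

    supp_b = n_b / n
    confidence = n_ab / n_a                 # P(B | A)
    lift = confidence / supp_b              # P(B | A) / P(B)
    # Conviction is unbounded when the rule never fails.
    conviction = inf if n_ab == n_a else (1 - supp_b) / (1 - confidence)
    return confidence, lift, conviction


# Hypothetical toy corpus: two related stories and one unrelated story.
doc_a = ["flood", "hits", "city", "centre"]
doc_b = ["flood", "hits", "city", "suburbs"]
doc_c = ["election", "results", "announced", "today"]

# Unigram (UG) basis: terms are single words.
ug_vocab = set(doc_a) | set(doc_b) | set(doc_c)
ug_scores = association_measures(set(doc_a), set(doc_b), ug_vocab)

# Bigram (BG) basis: terms are adjacent word pairs.
bg_a, bg_b, bg_c = bigrams(doc_a), bigrams(doc_b), bigrams(doc_c)
bg_vocab = bg_a | bg_b | bg_c
bg_scores = association_measures(bg_a, bg_b, bg_vocab)

print("UG (conf, lift, conv):", ug_scores)
print("BG (conf, lift, conv):", bg_scores)
```

Ranking all candidate document pairs by one of these scores yields a top-k list of the kind the abstract evaluates; the choice of representation basis changes which terms count as shared and therefore changes the ranking.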

Author information

Authors and Affiliations

Sirindhorn International Institute of Technology, Thammasat University, Thailand
Nichnan Kittiphattanabawon, Thanaruk Theeramunkong & Ekawit Nantajeewarawat

Editor information

Editors and Affiliations

The University of Arizona, Tucson, AZ, USA
Hsinchun Chen
The University of Hong Kong, Hong Kong, China
Michael Chau
National Taiwan University, Taipei, Taiwan, R.O.C.
Shu-hsing Li
International School of Information Management, University of Mysore, Mysore, India
Shalini Urs
International Institute of Information Technology, Bangalore, India
Srinath Srinivasa
Virginia Tech, Blacksburg, VA, USA
G. Alan Wang

About this paper

Cite this paper

Kittiphattanabawon, N., Theeramunkong, T., Nantajeewarawat, E. (2010). Exploration of Document Relation Quality with Consideration of Term Representation Basis, Term Weighting and Association Measure. In: Chen, H., Chau, M., Li, Sh., Urs, S., Srinivasa, S., Wang, G.A. (eds) Intelligence and Security Informatics. PAISI 2010. Lecture Notes in Computer Science, vol 6122. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13601-6_15

DOI: https://doi.org/10.1007/978-3-642-13601-6_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13600-9
Online ISBN: 978-3-642-13601-6
eBook Packages: Computer Science, Computer Science (R0)
