Sentence Embeddings and Semantic Entity Extraction for Identification of Topics of Short Fact-Checked Claims
Figure 1. Hierarchy of classes.
Figure 2. Workflow of the system.
Figure 3. Distribution of words among topics for PolitiFact subset modeled with LDA with 20 topics and DMM with 80 topics.
Figure 4. Distribution of terms among topics for custom methods.
Figure 5. Distribution of ontology terms among topics for custom methods.
Figure 6. Accuracy of various topic modeling methods, for PolitiFact with 80 topics.
Figure 7. Accuracy of custom methods, for various tag assignment approaches.
Figure 8. Coherence of 20 topics for various topic modeling methods.
Figure 9. Coherence of topics produced by our methods.
Figure A1 (with continuation). Full set of term frequencies for two datasets, with two clustering methods and various generalization schemes, part 1. The same terms are encoded with the same color.
Figure A2 (with continuation). Full set of term frequencies for two datasets, with two clustering methods and various generalization schemes, part 2. The same terms are encoded with the same color.
Abstract
1. Introduction
2. Background
3. Method
3.1. Collection of Claims
3.2. Semantic Embedding
3.3. Clustering of Claims
3.4. Semantic Entity Extraction
{ "id": 4274, "entity": ["http://dbpedia.org/resource/NATO", "http://dbpedia.org/resource/Madrid", "http://dbpedia.org/resource/Spain"] }
- dbpedia(entities_only): only the DBpedia entities extracted from claims are used
- dbpedia(rdf_type_all): DBpedia instances are mapped to classes by following rdf:type
- dbpedia(rdf_type_no_yago): same as above, but all classes from the YAGO ontology are excluded
- dbpedia(rdf_type_yago): only superclasses from YAGO are used
- dbpedia(rdf_type_ontology): only classes from the DBpedia ontology are used
- wikidata(all): analogous to rdf:type, Wikidata classes are used, extended with all their superclasses obtained by following Wikidata’s P279 (subclass of) property
- wikidata(no_equivalents): same as above, but in the case of equivalent classes only one class is considered, e.g., either http://dbpedia.org/ontology/Person or http://www.wikidata.org/entity/Q5 (accessed on 31 July 2024).
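The generalization schemes listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `RDF_TYPES` and `SUBCLASS_OF` lookups are made-up placeholders standing in for data that would normally be queried from DBpedia and Wikidata, and the function names `generalize` and `with_superclasses` are hypothetical. Only the two namespace prefixes and the JSON record come from DBpedia and the example above. The sketch also assumes that the rdf:type schemes return classes only, without the entities themselves, and it does not cover the merging of equivalent classes in wikidata(no_equivalents).

```python
import json

YAGO_PREFIX = "http://dbpedia.org/class/yago/"  # YAGO classes as exposed by DBpedia
DBO_PREFIX = "http://dbpedia.org/ontology/"     # DBpedia ontology classes

# The example record from the text.
RECORD = json.loads(
    '{"id": 4274, "entity": ["http://dbpedia.org/resource/NATO",'
    ' "http://dbpedia.org/resource/Madrid", "http://dbpedia.org/resource/Spain"]}'
)

# Illustrative rdf:type assignments (placeholders, not queried from DBpedia).
RDF_TYPES = {
    "http://dbpedia.org/resource/NATO": [
        DBO_PREFIX + "Organisation", YAGO_PREFIX + "ExampleAlliance"],
    "http://dbpedia.org/resource/Madrid": [
        DBO_PREFIX + "City", YAGO_PREFIX + "ExampleCapital"],
    "http://dbpedia.org/resource/Spain": [
        DBO_PREFIX + "Country", YAGO_PREFIX + "ExampleCountry"],
}

# Illustrative P279 (subclass of) edges between placeholder class names.
SUBCLASS_OF = {
    "city": ["human settlement"],
    "human settlement": ["geographic location"],
}


def generalize(entities, scheme, types=RDF_TYPES):
    """Map extracted entities to terms according to a DBpedia scheme name."""
    if scheme == "entities_only":
        return list(entities)
    # Follow rdf:type for every entity, keeping first-seen order without duplicates.
    classes = []
    for entity in entities:
        for cls in types.get(entity, []):
            if cls not in classes:
                classes.append(cls)
    if scheme == "rdf_type_all":
        return classes
    if scheme == "rdf_type_no_yago":
        return [c for c in classes if not c.startswith(YAGO_PREFIX)]
    if scheme == "rdf_type_yago":
        return [c for c in classes if c.startswith(YAGO_PREFIX)]
    if scheme == "rdf_type_ontology":
        return [c for c in classes if c.startswith(DBO_PREFIX)]
    raise ValueError(f"unknown scheme: {scheme}")


def with_superclasses(classes, subclass_of):
    """Extend classes with all transitive superclasses (as in wikidata(all))."""
    result, queue = [], list(classes)
    while queue:
        cls = queue.pop(0)
        if cls not in result:  # the visited check also guards against cycles
            result.append(cls)
            queue.extend(subclass_of.get(cls, []))
    return result
```

For instance, `generalize(RECORD["entity"], "rdf_type_ontology")` keeps only the three DBpedia ontology classes of the placeholder lookup, while `with_superclasses(["city"], SUBCLASS_OF)` walks the placeholder hierarchy upward from the most specific class.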
3.5. Identification of Important Terms
4. Experiment and Results
4.1. Evaluation
4.2. Uniqueness of Words in Topics
4.3. Classification Evaluation
4.4. Coherence Evaluation
5. Discussion and Future Work
6. Limitations
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A. Additional Figures
Appendix B. Computing Times
Method | Head | Description | Ratio |
---|---|---|---|
LDA | 0.08 | 0.1 | 1.25 |
DMM | 0.10 | 0.13 | 1.33 |
BTM | 0.20 | 0.67 | 3.33 |
WNTM | 0.72 | 3.73 | 5.21 |
SATM | 1.87 | 3.80 | 2.04 |
PTM | 1.03 | 1.47 | 1.42 |
GPUDMM | 1.05 | 1.47 | 1.40 |
GPUPDMM | 39.73 | 55.95 | 1.41 |
LFLDA | 41.73 | 53.27 | 1.28 |
LFDMM | 26.52 | inf | n/a |
Fact Checker | Number of Claims |
---|---|
Snopes.com | 7000 |
PolitiFact | 6068 |
AFP Fact Check | 4308 |
FactCheck.org | 2245 |
USA Today | 1964 |
The Washington Post | 1216 |
POLYGRAPH.info | 963 |
Newsweek | 504 |
others | 47 |
Total | 24,315 |
Measure | UMAP + HDBSCAN | UMAP + HDBSCAN corr | PCA + K-Means |
---|---|---|---|
Silhouette coefficient | −0.037 | 0.040 | 0.023 |
Calinski–Harabasz index | 46.0 | 54.0 | 92.3 |
Davies–Bouldin index | 3.38 | 3.04 | 4.19 |
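To make the first row of the table above easier to interpret, the following pure-Python sketch computes the mean silhouette coefficient from its standard definition (Euclidean distance; singleton clusters score 0 by convention). The function name `silhouette_coefficient` and the toy points are assumptions made for illustration; the source does not state which library or distance metric produced the reported values.

```python
from math import dist


def silhouette_coefficient(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points.

    a: mean distance from point i to the other members of its own cluster
    b: smallest mean distance from point i to the members of any other cluster
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # dist(point, point) is 0, so summing over the whole cluster is safe.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two tight, well-separated groups.
POINTS = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
GOOD = silhouette_coefficient(POINTS, [0, 0, 1, 1])  # labels match the groups
BAD = silhouette_coefficient(POINTS, [0, 1, 0, 1])   # labels cut across the groups
```

The coefficient lies between −1 and 1: values close to 1 indicate compact, well-separated clusters, negative values indicate points that sit closer to another cluster than to their own, and values near 0, such as those reported in the table, are typical of overlapping clusters.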
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Węcel, K.; Sawiński, M.; Lewoniewski, W.; Stróżyna, M.; Księżniak, E.; Abramowicz, W. Sentence Embeddings and Semantic Entity Extraction for Identification of Topics of Short Fact-Checked Claims. Information 2024, 15, 659. https://doi.org/10.3390/info15100659