research-article

A Visual Approach for Interactive Keyterm-Based Clustering

Published: 20 February 2018

Abstract

The keyterm-based approach is arguably an intuitive way for users to direct text-clustering processes and adapt the results to various applications in text analysis. Its means of markedly influencing the results, for instance, by listing important terms in order of relevance, requires little knowledge of the algorithm and has a predictable effect, which speeds up the task. This article first presents a text-clustering algorithm that can easily be extended into an interactive algorithm. We evaluate its performance against state-of-the-art clustering algorithms in unsupervised mode. Next, we propose three interactive versions of the algorithm based on keyterm labeling, document labeling, and hybrid labeling. We then demonstrate that keyterm labeling is more effective than document labeling in text clustering. Finally, we propose a visual approach to support the keyterm-based version of the algorithm. Visualizations are provided for the whole collection as well as for detailed views of document and cluster relationships. We show the effectiveness and flexibility of our framework, Vis-Kt, by presenting typical clustering cases on real text document collections. We also report a user study that reveals overwhelmingly positive acceptance of keyterm-based clustering.
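To make the idea of keyterm labeling concrete, the following is a minimal sketch of how terms listed in order of relevance could steer the assignment of documents to clusters. It is not the algorithm presented in the article or the Vis-Kt implementation. The function names (`rank_weights`, `assign_by_keyterms`), the 1/rank weighting, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def rank_weights(keyterms):
    """Map keyterms listed in relevance order to weights 1, 1/2, 1/3, ...

    The 1/rank scheme is an illustrative assumption, not taken from the article.
    """
    return {term: 1.0 / (rank + 1) for rank, term in enumerate(keyterms)}


def assign_by_keyterms(documents, cluster_keyterms):
    """Assign each document to the cluster whose weighted keyterms it matches best.

    documents: list of strings (whitespace-tokenized, lowercased here)
    cluster_keyterms: dict of cluster name -> keyterms, most relevant first
    Returns one cluster name per document, or None when no keyterm occurs.
    """
    weights = {name: rank_weights(terms) for name, terms in cluster_keyterms.items()}
    labels = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        best_name, best_score = None, 0.0
        for name, term_weights in weights.items():
            # Score = sum of (keyterm weight * term frequency in the document).
            score = sum(w * counts[t] for t, w in term_weights.items())
            if score > best_score:
                best_name, best_score = name, score
        labels.append(best_name)
    return labels


# Hypothetical toy collection and user-supplied keyterms.
docs = [
    "the striker scored a late goal in the match",
    "the court heard the appeal and the judge ruled",
    "goal line technology helped the referee in the match",
]
keyterms = {
    "sports": ["goal", "match", "referee"],
    "law": ["court", "judge", "appeal"],
}
labels = assign_by_keyterms(docs, keyterms)
print(labels)
```

In this sketch the user only edits the ordered term lists, which mirrors the abstract's point that the interaction requires little knowledge of the underlying algorithm. A real system would also handle documents that match no keyterm, for example by falling back to an unsupervised clustering step.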

Supplementary Material

MP4 File (a6-nourashrafeddin.mp4)



Published In

ACM Transactions on Interactive Intelligent Systems, Volume 8, Issue 1
Special Issue on Interactive Visual Analysis of Human Crowd Behaviors and Regular Paper
March 2018
132 pages
ISSN: 2160-6455
EISSN: 2160-6463
DOI: 10.1145/3185338
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 February 2018
Accepted: 01 December 2017
Revised: 01 June 2017
Received: 01 October 2016
Published in TIIS Volume 8, Issue 1


Author Tags

  1. Document clustering
  2. interactive
  3. keyterm-based clustering
  4. visualization

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Boeing Company, CNPq and FAPESP (Brazil)
  • Natural Sciences and Engineering Research Council of Canada
  • International Development Research Centre, Ottawa, Canada
