A Visual Analytics Approach for Interactive Document Clustering

Published: 09 August 2019

Abstract

Document clustering is a necessary step in various analytical and automated activities. When guided by the user, algorithms are tailored to imprint a perspective on the clustering process that reflects the user’s understanding of the dataset. Beyond allowing customized adjustment of the clusters, a visual analytics approach provides tools for the user to draw new insights from the collection. While contributing his or her perspective, the user also acquires a deeper understanding of the dataset. To that effect, we propose a novel visual analytics system for interactive document clustering. We built our system on top of clustering algorithms that can adapt to user feedback. In the proposed system, an initial clustering is created based on the user-defined number of clusters and the selected clustering algorithm. A set of coordinated visualizations allows the user to examine the dataset and the results of the clustering. The visualizations provide the user with the highlights of individual documents and an understanding of how the documents evolve over the time period to which they relate. Users then interact with the process by changing the key-terms that drive it, according to their knowledge of the document domain. In key-term-based interaction, the user assigns a set of key-terms to each target cluster to guide the clustering algorithm. We have improved that process with a novel algorithm for choosing proper seeds for the clustering. Results demonstrate that the system has considerably improved not only its precision but also its effectiveness in document-based decision making. A set of quantitative experiments and a user study were conducted to show the advantages of the approach for document analytics based on clustering. We also report on the use of the framework in a real decision-making scenario that relates users’ email discussions to decision making in improving patient care. Results show that the framework is useful even for more complex datasets such as email conversations.
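To make the idea of key-term-based seeding more concrete, the following is a minimal sketch only, not the seeding algorithm proposed in the article. It assumes that each cluster's seed is simply the normalized bag-of-words vector of the user's key-terms and that documents are assigned by cosine similarity in a plain k-means loop. The function names (tf_vector, cosine, centroid, keyterm_kmeans) and the toy documents are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term counts, scaled to unit length (term -> weight)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(vectors):
    """Sum of sparse vectors, rescaled to unit length."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else {}


def keyterm_kmeans(docs, keyterms_per_cluster, max_iter=20):
    """K-means on cosine similarity, seeded from user key-terms (assumption)."""
    vectors = [tf_vector(d) for d in docs]
    # Assumed seeding rule: one seed per cluster, built from its key-terms.
    centroids = [tf_vector(" ".join(terms)) for terms in keyterms_per_cluster]
    labels = None
    for _ in range(max_iter):
        # Assign every document to its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Recompute centroids; an empty cluster keeps its previous centroid.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = centroid(members)
    return labels


# Toy collection and key-terms, invented for this example.
docs = [
    "patient care ward nurse patient",
    "nurse shift ward schedule",
    "cluster seed algorithm cluster",
    "algorithm seed kmeans",
]
keyterms = [["patient", "nurse"], ["cluster", "algorithm"]]
labels = keyterm_kmeans(docs, keyterms)
print(labels)  # [0, 0, 1, 1]
```

Because the seeds come from the key-terms rather than from random draws, the result is deterministic for a given input, and editing a cluster's key-terms and rerunning is one simple way to picture the interaction loop described above.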

Information & Contributors

Information

Published In

ACM Transactions on Interactive Intelligent Systems, Volume 10, Issue 1
Special Issue on IUI 2018
March 2020
347 pages
ISSN: 2160-6455
EISSN: 2160-6463
DOI: 10.1145/3352585
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 August 2019
Accepted: 01 October 2018
Revised: 01 August 2018
Received: 01 May 2018
Published in TIIS Volume 10, Issue 1

Author Tags

  1. Interactive document clustering
  2. deterministic
  3. document projection
  4. email list
  5. key-term
  6. seeding
  7. text
  8. user study
  9. visualization

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Natural Sciences and Engineering Research Council of Canada
  • International Development Research Center, Ottawa, Canada
  • Boeing Company (Canada)
  • CNPq and FAPESP (Brazil)

Cited By

  • (2023) Addressing the gap between current language models and key-term-based clustering. Proceedings of the ACM Symposium on Document Engineering 2023 (1-10). DOI: 10.1145/3573128.3604900. Online publication date: 22-Aug-2023
  • (2023) Evaluating Visual Analytics for Relevant Information Retrieval in Document Collections. Interacting with Computers 35:2 (247-261). DOI: 10.1093/iwc/iwad019. Online publication date: 3-Mar-2023
  • (2023) Network Analysis and Natural Language Processing to Obtain a Landscape of the Scientific Literature on Materials Applications. ACS Applied Materials & Interfaces 15:23 (27437-27446). DOI: 10.1021/acsami.3c01632. Online publication date: 4-Jun-2023
  • (2022) Interactive clustering and high-recall information retrieval using language models. Proceedings of the 2022 International Conference on Advanced Visual Interfaces (1-5). DOI: 10.1145/3531073.3531174. Online publication date: 6-Jun-2022
  • (2022) ClioQuery: Interactive Query-oriented Text Analytics for Comprehensive Investigation of Historical News Archives. ACM Transactions on Interactive Intelligent Systems 12:3 (1-49). DOI: 10.1145/3524025. Online publication date: 26-Jul-2022
  • (2022) BIGexplore: Bayesian Information Gain Framework for Information Exploration. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (1-16). DOI: 10.1145/3491102.3517729. Online publication date: 29-Apr-2022
  • (2022) CreativeSearch: Proactive design exploration system with Bayesian information gain and information entropy. Automation in Construction 142 (104502). DOI: 10.1016/j.autcon.2022.104502. Online publication date: Oct-2022
  • (2021) Evaluating visual analytics for text information retrieval. Proceedings of the XX Brazilian Symposium on Human Factors in Computing Systems (1-11). DOI: 10.1145/3472301.3484320. Online publication date: 18-Oct-2021
  • (2021) Multi-view document clustering based on geometrical similarity measurement. International Journal of Machine Learning and Cybernetics 13:3 (663-675). DOI: 10.1007/s13042-021-01295-8. Online publication date: 22-Mar-2021
