Abstract
The novel procedure presented here, SuDoC (Semi-unsupervised Document Classification), offers an alternative to standard clustering techniques when a very large set of text documents must be separated into groups that reflect the documents' semantics. Unlike conventional clustering, SuDoC starts from a small initial set of typical specimens, which can be created manually and which provides the bias needed to generate appropriate classes. SuDoC begins with a larger number of generated clusters and, to avoid over-fitting, iteratively reduces that number, which makes the resulting classification more general. Unlabeled instances are labeled automatically according to their similarity to the labeled samples, with the aim of achieving higher classification accuracy in the future. The results of this strengthened clustering procedure are demonstrated on a real-world data set of unstructured hotel guest reviews written in natural language.
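The labeling step mentioned in the abstract, assigning labels to unlabeled instances by their similarity to a few labeled samples, can be illustrated with a minimal Python sketch. This is not the SuDoC procedure itself: the abstract does not specify the similarity measure or the clustering steps, so the sketch assumes a plain bag-of-words representation and cosine similarity, and simply copies the label of the most similar seed. The function names and the example reviews are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Turn a text into a word-frequency vector (lowercased, whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_by_nearest_seed(seeds, unlabeled):
    """Give each unlabeled text the label of its most similar labeled seed.

    Assumption: nearest-seed cosine similarity stands in for whatever
    similarity-based labeling the paper actually uses.
    """
    seed_vectors = [(bag_of_words(text), label) for text, label in seeds]
    result = []
    for text in unlabeled:
        vec = bag_of_words(text)
        best_label = max(seed_vectors, key=lambda sv: cosine(vec, sv[0]))[1]
        result.append((text, best_label))
    return result


# Hypothetical seed set: a few manually labeled "typical specimens".
seeds = [
    ("clean room friendly staff great location", "positive"),
    ("dirty room rude staff noisy street", "negative"),
]
# Hypothetical unlabeled reviews.
unlabeled = [
    "the staff was friendly and the room was clean",
    "noisy and dirty room",
]
labeled = label_by_nearest_seed(seeds, unlabeled)
for text, label in labeled:
    print(f"{label}: {text}")
```

In a fuller pipeline the newly labeled reviews could then serve as training data for a classifier, which is one way to read the abstract's remark about classification accuracy in the future.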
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dařena, F., Žižka, J. (2013). SuDoC: Semi-unsupervised Classification of Text Document Opinions Using a Few Labeled Examples and Clustering. In: Larsen, H.L., Martin-Bautista, M.J., Vila, M.A., Andreasen, T., Christiansen, H. (eds) Flexible Query Answering Systems. FQAS 2013. Lecture Notes in Computer Science, vol 8132. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40769-7_54
DOI: https://doi.org/10.1007/978-3-642-40769-7_54
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40768-0
Online ISBN: 978-3-642-40769-7
eBook Packages: Computer Science (R0)