Authors:
Giacomo Domeniconi; Gianluca Moro; Roberto Pasolini and Claudio Sartori
Affiliation:
University of Bologna, Italy
Keyword(s):
Term Weighting, Supervised Term Weighting Scheme, Text Categorization, tfidf, Text Representation.
Related Ontology Subjects/Areas/Topics:
Applications; Artificial Intelligence; Big Data; Biomedical Engineering; Business Analytics; Data Engineering; Data Management and Quality; Data Mining; Databases and Information Systems Integration; Datamining; Enterprise Information Systems; Health Information Systems; Information Retrieval; Ontologies and the Semantic Web; Pattern Recognition; Semi-Structured and Unstructured Data; Sensor Networks; Signal Processing; Soft Computing; Software Engineering; Text Analytics
Abstract:
Within text categorization and other data mining tasks, the use of suitable methods for term weighting can
bring a substantial boost in effectiveness. Several term weighting methods have been presented throughout
the literature, based on assumptions commonly derived from observing the distribution of words in documents.
For example, the idf assumption states that words appearing in many documents are usually not as important
as less frequent ones. Unlike tf.idf and other weighting methods derived from information retrieval,
schemes proposed more recently are supervised, i.e. based on knowledge of the membership of training documents
in categories. We propose here a supervised variant of the tf.idf scheme, based on computing the
usual idf factor without considering documents of the category to be recognized, so that importance of terms
frequently appearing only within it is not underestimated. A further proposed variant is additionally based on
relevance frequency, considering occurrences of words within the category itself. In extensive experiments on
two recurring text collections with several unsupervised and supervised weighting schemes, we show that the
ones we propose generally perform better than or comparably to other ones in terms of accuracy, using two
different learning methods.
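
The core idea of the first proposed variant, computing the idf factor while leaving out the documents of the category to be recognized, can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the smoothing used here (adding 1 to the document count and flooring the document frequency at 1), the function names, and the toy corpus are all assumptions made for illustration, not the authors' definition.

```python
import math
from collections import Counter

def standard_idf(docs):
    """Plain idf: log(N / df), computed over all training documents."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(n / df[t]) for t in df}

def idf_excluding_category(docs, labels, category):
    """idf computed only over training documents outside `category`.

    Assumed smoothing (not taken from the abstract): +1 on the document
    count and a floor of 1 on the document frequency, so that terms
    unseen outside the category still get a finite weight.
    """
    outside = [set(d) for d, l in zip(docs, labels) if l != category]
    df = Counter(t for d in outside for t in d)
    vocab = {t for d in docs for t in d}
    return {t: math.log((len(outside) + 1) / max(1, df[t])) for t in vocab}

# Hypothetical toy corpus: "goal" occurs only in the "sport" category,
# while "team" occurs in every document.
docs = [
    ["goal", "match", "team"],
    ["goal", "team", "coach"],
    ["goal", "league", "team"],
    ["market", "stock", "team"],
    ["stock", "bank", "team"],
]
labels = ["sport", "sport", "sport", "finance", "finance"]

w_std = standard_idf(docs)
w_ec = idf_excluding_category(docs, labels, "sport")
```

In this toy corpus, "goal" is frequent only within the "sport" category, so the standard idf penalizes it, whereas the category-excluding variant does not; "team", which is common everywhere, still receives a lower weight than "goal". The full term weight would multiply this factor by the term frequency in each document.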