Abstract
Research in the financial domain has shown that the sentiment of stock news has a profound impact on trading volume, volatility, stock prices, and firm earnings. In-depth analysis of stock news is now sourced from financial reviews on various social networking and marketing sites to help improve decision making. However, such reviews are unstructured text, so natural language processing (NLP) is required to extract their sentiment. Accordingly, this study investigates the use of NLP tasks to improve the performance of sentiment classification when evaluating the information content of financial news as an instrument in investment decision support systems. At present, feature extraction is mainly based on the occurrence frequency of words, so low-frequency linguistic features that could be critical for sentiment classification are typically ignored. In this research, we attempt to improve current sentiment analysis approaches for financial news classification by focusing on low-frequency but informative linguistic expressions. Our proposed combination of low- and high-frequency linguistic expressions contributes a novel set of features for sentiment classification. The experimental results show that optimal n-gram feature selection (a combination of optimal unigram and bigram features) improves sentiment classification accuracy compared with other types of feature sets.
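To make the idea of combining selected unigram and bigram features concrete, the following is a minimal sketch rather than the authors' method. It assumes a binary labelling ("pos"/"neg") and scores n-grams with an add-one smoothed log-odds ratio, a stand-in criterion chosen here because it rewards class-skewed n-grams regardless of raw frequency; the abstract does not state the actual selection criterion. The function names, the thresholds `k_uni` and `k_bi`, and the toy headlines are all hypothetical.

```python
import math
from collections import Counter


def ngram_set(text, n):
    """Return the set of word n-grams in a text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def score_ngrams(docs, labels, n):
    """Score each n-gram by absolute smoothed log-odds between the two classes.

    Assumption: this criterion is illustrative only. It depends on how unevenly
    an n-gram is spread across classes, not on its raw frequency, so a rare but
    one-sided n-gram can still rank high.
    """
    n_pos = sum(1 for label in labels if label == "pos")
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for text, label in zip(docs, labels):
        (pos_df if label == "pos" else neg_df).update(ngram_set(text, n))
    scores = {}
    for gram in set(pos_df) | set(neg_df):
        p = (pos_df[gram] + 1) / (n_pos + 2)  # add-one smoothing
        q = (neg_df[gram] + 1) / (n_neg + 2)
        scores[gram] = abs(math.log(p) - math.log(q))
    return scores


def select_features(docs, labels, k_uni, k_bi):
    """Combine the top-k unigrams and the top-k bigrams into one feature list."""
    selected = []
    for n, k in ((1, k_uni), (2, k_bi)):
        scores = score_ngrams(docs, labels, n)
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda g: (-scores[g], g))
        selected.extend(ranked[:k])
    return selected


def vectorize(text, features):
    """Binary presence vector of the selected n-grams for one document."""
    present = ngram_set(text, 1) | ngram_set(text, 2)
    return [1 if f in present else 0 for f in features]


# Toy headlines, invented for illustration only.
docs = [
    "net profit rises sharply",
    "shares rise on strong net profit",
    "net profit falls sharply",
    "shares fall on weak net profit",
]
labels = ["pos", "pos", "neg", "neg"]

features = select_features(docs, labels, k_uni=6, k_bi=12)
vector = vectorize("net profit rises", features)
```

In this toy set, n-grams that appear equally in both classes (such as "net profit") score zero and fall outside the selected lists, while n-grams seen only once but in a single class are retained. The resulting binary vectors could then be passed to any standard classifier.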
Acknowledgements
This work is supported in part by Universiti Tun Hussein Onn Malaysia.
About this article
Cite this article
Foroozan Yazdani, S., Tan, Z., Kakavand, M. et al. NgramPOS: a bigram-based linguistic and statistical feature process model for unstructured text classification. Wireless Netw 28, 1251–1261 (2022). https://doi.org/10.1007/s11276-018-01909-0
DOI: https://doi.org/10.1007/s11276-018-01909-0