Abstract
Social media technologies are open to users who intend to create a community and publish their opinions on recent incidents. Participants on online social networking sites often remain unaware of the risks of disclosing personal data to a public audience. Users' private data are at high risk, which can lead to adverse effects such as cyberbullying, identity theft, and job loss. This research work defines user entities or data such as phone number, email address, family details, and health-related information as the user's sensitive private data (SPD) on a social media platform. The proposed system, Tweet-Scan-Post (TSP), focuses on identifying the presence of SPD in users' posts under the personal, professional, and health domains. The TSP framework is built on the standards and privacy regulations established by social networking sites and by organizations and regulations such as NIST, DHS, and GDPR. The TSP approach addresses the prevailing challenges in detecting sensitive PII and protecting user privacy within the bounds of confidentiality and trustworthiness. The TSP framework uses a novel layered classification approach with various state-of-the-art machine learning models to classify tweets as sensitive or insensitive. The findings of the TSP system include 201 Sensitive Privacy Keywords, identified using a boosting strategy, and a sensitivity scale that measures the degree of sensitivity associated with a tweet. The experimental results revealed that personal tweets were highly related to mother and children, professional tweets to apology, and health tweets to concern over the father's health condition.
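As a rough illustration of what keyword-driven sensitivity scaling across the three domains might look like, the sketch below scores a tweet by the share of its tokens that match a domain keyword. It is not the TSP implementation: the keyword lists, the function names (`domain_hits`, `sensitivity_score`, `classify`), and the zero threshold are all hypothetical, and the lists are not the paper's 201 Sensitive Privacy Keywords.

```python
import re

# Hypothetical keyword lists, for illustration only.
# These are NOT the 201 Sensitive Privacy Keywords reported by the paper.
DOMAIN_KEYWORDS = {
    "personal": {"mother", "children", "phone", "email", "address"},
    "professional": {"boss", "fired", "salary", "apology"},
    "health": {"diagnosed", "surgery", "hospital", "cancer"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def domain_hits(text):
    """Count keyword matches per domain for a single tweet."""
    tokens = tokenize(text)
    return {
        domain: sum(1 for token in tokens if token in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }


def sensitivity_score(text):
    """Return the fraction of tokens matching any domain keyword (0.0 to 1.0)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    all_keywords = set().union(*DOMAIN_KEYWORDS.values())
    matched = sum(1 for token in tokens if token in all_keywords)
    return matched / len(tokens)


def classify(text, threshold=0.0):
    """Label a tweet 'sensitive' if its score exceeds the (assumed) threshold."""
    return "sensitive" if sensitivity_score(text) > threshold else "insensitive"


if __name__ == "__main__":
    tweet = "My father was diagnosed at the hospital"
    print(domain_hits(tweet), sensitivity_score(tweet), classify(tweet))
```

A keyword ratio is only a stand-in for the learned, layered classifiers the abstract describes. It is used here because it makes the idea of a graded sensitivity score concrete without assuming any particular model.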
About this article
Cite this article
Geetha, R., Karthika, S. & Kumaraguru, P. Tweet-scan-post: a system for analysis of sensitive private data disclosure in online social media. Knowl Inf Syst 63, 2365–2404 (2021). https://doi.org/10.1007/s10115-021-01592-2
DOI: https://doi.org/10.1007/s10115-021-01592-2