[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Where You Are Is Who You Are: User Identification by Matching Statistics

Published: 01 February 2016 Publication History

Abstract

Most users of online services have unique behavioral or usage patterns. These behavioral patterns can be exploited to identify and track users by using only the observed patterns in the behavior. We study the task of identifying users from statistics of their behavioral patterns. In particular, we focus on the setting in which we are given histograms of users&#x2019; data collected during two different experiments. We assume that, in the first data set, the users&#x2019; identities are anonymized or hidden and that, in the second data set, their identities are known. We study the task of identifying the users by matching the histograms of their data in the first data set with the histograms from the second data set. In recent works, the optimal algorithm for this user identification task is introduced. In this paper, we evaluate the effectiveness of this method on three different types of data sets with up to 50 000 users, and in multiple scenarios. Using data sets such as call data records, web browsing histories, and GPS trajectories, we demonstrate that a large fraction of users can be easily identified given only histograms of their data; hence, these histograms can act as users&#x2019; fingerprints. We also verify that simultaneous identification of users achieves better performance compared with one-by-one user identification. Furthermore, we show that using the optimal method for identification indeed gives higher identification accuracy than the heuristics-based approaches in the practical scenarios. The accuracy obtained under this optimal method can thus be used to quantify the maximum level of user identification that is possible in such settings. We show that the key factors affecting the accuracy of the optimal identification algorithm are the duration of the data collection, the number of users in the anonymized data set, and the resolution of the data set. We also analyze the effectiveness of <inline-formula> <tex-math notation="LaTeX">$k$ </tex-math></inline-formula>-anonymization in resisting user identification attacks on these data sets.

References

[1]
J. Unnikrishnan and F. M. Naini, “De-anonymizing private data by matching statistics,” in Proc. 51th Annu. Allerton Conf. Commun., Control, Comput., Oct. 2013, pp. 1616–1623.
[2]
J. Unnikrishnan, “Asymptotically optimal matching of multiple sequences to source distributions and training sequences,” IEEE Trans. Inf. Theory, vol. 61, no. 1, pp. 452–468, Jan. 2015.
[3]
P. Resnick and H. R. Varian, “Recommender systems,” Commun. ACM, vol. 40, no. 3, pp. 56–58, 1997.
[4]
A. Narayanan and V. Shmatikov, “Robust de-anonymization of large sparse datasets,” in Proc. IEEE Symp. Secur. Privacy, May 2008, pp. 111–125.
[5]
A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios, “Duplicate record detection: A survey,” IEEE Trans. Knowl. Data Eng., vol. 19, no. 1, pp. 1–16, Jan. 2007.
[6]
D. V. Kalashnikov, Z. Chen, S. Mehrotra, and R. Nuray-Turan, “Web people search via connection analysis,” IEEE Trans. Knowl. Data Eng., vol. 20, no. 11, pp. 1550–1565, Nov. 2008.
[7]
E. Bengtson and D. Roth, “Understanding the value of features for coreference resolution,” in Proc. Conf. EMNLP, 2008, pp. 294–303.
[8]
A. Stolerman, R. Overdorf, S. Afroz, and R. Greenstadt, “Breaking the closed-world assumption in stylometric authorship attribution,” in Advances in Digital Forensics X. Berlin, Germany: Springer-Verlag, 2014, pp. 185–205.
[9]
S. Afroz, A. C. Islam, A. Stolerman, R. Greenstadt, and D. Mccoy, “Doppelgänger finder: Taking stylometry to the underground,” in Proc. IEEE Symp. Secur. Privacy (SP), May 2014, pp. 212–226.
[10]
O. Peled, M. Fire, L. Rokach, and Y. Elovici, “Entity matching in online social networks,” in Proc. IEEE SocialCom, Sep. 2013, pp. 339–344.
[11]
J. Liu, F. Zhang, X. Song, Y.-I. Song, C.-Y. Lin, and H.-W. Hon, “What’s in a name? An unsupervised approach to link users across communities,” in Proc. 6th ACM Int. Conf. WSDM, 2013, pp. 495–504.
[12]
L. Sweeney, “Weaving technology and policy together to maintain confidentiality,” J. Law, Med., Ethics, vol. 25, nos. 2–3, pp. 98–110, 1997.
[13]
H. Zang and J. Bolot, “Anonymization of location data does not work: A large-scale measurement study,” in Proc. 17th Annu. Int. Conf. ACM MobiCom, 2011, pp. 145–156.
[14]
X. Xiao, Y. Zheng, Q. Luo, and X. Xie, “Finding similar users using category-based location history,” in Proc. ACM 18th SIGSPATIAL Int. Conf. Adv. Geograph. Inf. Syst., 2010, pp. 442–445.
[15]
C. Y. T. Ma, D. K. Y. Yau, N. K. Yip, and N. S. V. Rao, “Privacy vulnerability of published anonymous mobility traces,” in Proc. 16th Annu. Int. Conf. MobiCom, 2010, pp. 185–196.
[16]
J. Freudiger, R. Shokri, and J.-P. Hubaux, “Evaluating the privacy risk of location-based services,” in Financial Cryptography and Data Security. Berlin, Germany: Springer-Verlag, 2012.
[17]
S. Gambs, M.-O. Killijian, and M. Núñez del Prado Cortez, “De-anonymization attack on geolocated data,” J. Comput. Syst. Sci., vol. 80, no. 8, pp. 1597–1614, 2014.
[18]
Y. De Mulder, G. Danezis, L. Batina, and B. Preneel, “Identification via location-profiling in GSM networks,” in Proc. 7th ACM WPES, 2008, pp. 23–32.
[19]
A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, and L. Vilhuber, “Privacy: Theory meets practice on the map,” in Proc. IEEE 24th ICDE, Apr. 2008, pp. 277–286.
[20]
L. Olejnik, C. Castelluccia, and A. Janc, “On the uniqueness of Web browsing history patterns,” Ann. Telecommun., vol. 69, nos. 1–2, pp. 63–74, 2014.
[21]
T.-F. Yen, Y. Xie, F. Yu, R. P. Yu, and M. Abadi, “Host fingerprinting and tracking on the Web: Privacy and security implications,” in Proc. 19th Annu. NDSS, 2012, pp. 1–16.
[22]
K. Sharad and G. Danezis, “De-anonymizing D4D datasets,” in Proc. 6th Workshop HotPETs, 2013, pp. 1–18.
[23]
M. Srivatsa and M. Hicks, “Deanonymizing mobility traces: Using social network as a side-channel,” in Proc. ACM Conf. CCS, 2012, pp. 628–637.
[24]
Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, “Unique in the crowd: The privacy bounds of human mobility,” Sci. Rep., vol. 3, Mar. 2013, Art. ID.
[25]
R. Shokri, G. Theodorakopoulos, J.-Y. Le Boudec, and J.-P. Hubaux, “Quantifying location privacy,” in Proc. IEEE Symp. Secur. Privacy (SP), May 2011, pp. 247–262.
[26]
G. Danezis and C. Troncoso, “Vida: How to use Bayesian inference to de-anonymize persistent communications,” in Privacy Enhancing Technologies. Berlin, Germany: Springer-Verlag, 2009, pp. 56–72.
[27]
C. Troncoso, B. Gierlichs, B. Preneel, and I. Verbauwhede, “Perfect matching disclosure attacks,” in Privacy Enhancing Technologies. Berlin, Germany: Springer-Verlag, 2008, pp. 2–23.
[28]
P. Pedarsani and M. Grossglauser, “On the privacy of anonymized networks,” in Proc. 17th ACM Int. Conf. KDD, 2011, pp. 1235–1243.
[29]
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, vol. 2. New York, NY, USA: Springer-Verlag, 2009, no. 1.
[30]
T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ, USA: Wiley, 2006.
[31]
L. Ramshaw and R. E. Tarjan, “A weight-scaling algorithm for min-cost imperfect matchings in bipartite graphs,” in Proc. IEEE 53rd Annu. Symp. FOCS, Oct. 2012, pp. 581–590.
[32]
M. L. Fredman and R. E. Tarjan, “Fibonacci heaps and their uses in improved network optimization algorithms,” J. ACM, vol. 34, no. 3, pp. 596–615, 1987.
[33]
W. J. Cook, W. H. Cunningham, W. R. Pulleyblank, and A. Schrijver, Combinatorial Optimization (Wiley-Interscience Series in Discrete Mathematics and Optimization). Hoboken, NJ, USA: Wiley, 2011.
[34]
R. Duan and S. Pettie, “Linear-time approximation for maximum weight matching,” J. ACM, vol. 61, no. 1, Jan. 2014, Art. ID. [Online]. Available: http://doi.acm.org/10.1145/2529989
[35]
V. D. Blondel et al. (Jan. 2013). “Data for development: The D4D challenge on mobile phone data.” [Online]. Available: http://arxiv.org/abs/1210.0137
[36]
K. L. Huang, S. S. Kanhere, and W. Hu, “Preserving privacy in participatory sensing systems,” Comput. Commun., vol. 33, no. 11, pp. 1266–1280, 2010.
[37]
D. Christin, A. Reinhardt, S. S. Kanhere, and M. Hollick, “A survey on privacy in mobile participatory sensing applications,” J. Syst. Softw., vol. 84, no. 11, pp. 1928–1946, 2011.
[38]
E. Herder, R. Kawase, and G. Papadakis, “Experiences in building the public Web history repository,” in Proc. DataTEL Workshop, 2011, pp. 1–2.
[39]
Y. Zheng, Q. Li, Y. Chen, X. Xie, and W.-Y. Ma, “Understanding mobility based on GPS data,” in Proc. ACM 10th Int. Conf. UbiComp, 2008, pp. 312–321.
[40]
C. C. Aggarwal and P. S. Yu, A General Survey of Privacy-Preserving Data Mining Models and Algorithms. New York, NY, USA: Springer-Verlag, 2008.
[41]
L. Sweeney, “k-anonymity: A model for protecting privacy,” Int. J. Uncertainty, Fuzziness, Knowl. Syst., vol. 10, no. 5, pp. 557–570, 2002.
[42]
J. Domingo-Ferrer and J. M. Mateo-Sanz, “Practical data-oriented microaggregation for statistical disclosure control,” IEEE Trans. Knowl. Data Eng., vol. 14, no. 1, pp. 189–201, Jan./Feb. 2002.
[43]
J. Domingo-Ferrer, F. Sebé, and A. Solanas, “A polynomial-time approximation to optimal multivariate microaggregation,” Comput. Math. Appl., vol. 55, no. 4, pp. 714–732, 2008.
[44]
C. Panagiotakis and G. Tziritas, “Successive group selection for microaggregation,” IEEE Trans. Knowl. Data Eng., vol. 25, no. 5, pp. 1191–1195, May 2013.

Cited By

View all
  • (2024)Large and Small Deviations for Statistical Sequence MatchingIEEE Transactions on Information Theory10.1109/TIT.2024.346458670:11(7532-7562)Online publication date: 1-Nov-2024
  • (2024)Database Matching Under Noisy Synchronization ErrorsIEEE Transactions on Information Theory10.1109/TIT.2024.338899070:6(4335-4367)Online publication date: 15-Apr-2024
  • (2024)PRIMϵ: Novel Privacy-Preservation Model With Pattern Mining and Genetic AlgorithmIEEE Transactions on Information Forensics and Security10.1109/TIFS.2023.332476919(571-585)Online publication date: 1-Jan-2024
  • Show More Cited By

Index Terms

  1. Where You Are Is Who You Are: User Identification by Matching Statistics
      Index terms have been assigned to the content through auto-classification.

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image IEEE Transactions on Information Forensics and Security
      IEEE Transactions on Information Forensics and Security  Volume 11, Issue 2
      Feb. 2016
      210 pages

      Publisher

      IEEE Press

      Publication History

      Published: 01 February 2016

      Qualifiers

      • Research-article

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)0
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 15 Jan 2025

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Large and Small Deviations for Statistical Sequence MatchingIEEE Transactions on Information Theory10.1109/TIT.2024.346458670:11(7532-7562)Online publication date: 1-Nov-2024
      • (2024)Database Matching Under Noisy Synchronization ErrorsIEEE Transactions on Information Theory10.1109/TIT.2024.338899070:6(4335-4367)Online publication date: 15-Apr-2024
      • (2024)PRIMϵ: Novel Privacy-Preservation Model With Pattern Mining and Genetic AlgorithmIEEE Transactions on Information Forensics and Security10.1109/TIFS.2023.332476919(571-585)Online publication date: 1-Jan-2024
      • (2024)User re-identification via human mobility trajectories with siamese transformer networksApplied Intelligence10.1007/s10489-023-05234-854:1(815-834)Online publication date: 1-Jan-2024
      • (2023)The Trajectory Interval Forest Classifier for Trajectory ClassificationProceedings of the 31st ACM International Conference on Advances in Geographic Information Systems10.1145/3589132.3625617(1-4)Online publication date: 13-Nov-2023
      • (2023)Novel trajectory privacy protection method against prediction attacksExpert Systems with Applications: An International Journal10.1016/j.eswa.2022.118870213:PAOnline publication date: 1-Mar-2023
      • (2022)User Identity Linkage via Co-Attentive Neural Network From Heterogeneous Mobility DataIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2020.298973234:2(954-968)Online publication date: 1-Feb-2022
      • (2022)Planting a Poison SEAD: Using Social Engineering Active Defense (SEAD) to Counter CybercriminalsAugmented Cognition10.1007/978-3-031-05457-0_4(48-57)Online publication date: 26-Jun-2022
      • (2021)EDENProceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies10.1145/34635025:2(1-25)Online publication date: 24-Jun-2021
      • (2021)Database Matching Under Column Deletions2021 IEEE International Symposium on Information Theory (ISIT)10.1109/ISIT45174.2021.9518145(2720-2725)Online publication date: 12-Jul-2021
      • Show More Cited By

      View Options

      View options

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media