Abstract
Similar document detection is a well-studied problem with important application domains, such as plagiarism detection, document archiving, and patent/copyright protection. Recently, the research focus has shifted towards the privacy-preserving version of the problem, in which two parties want to identify similar documents within their respective datasets. These methods apply to scenarios such as patent protection or intelligence collaboration, where the contents of the documents at both parties should be kept secret. Nevertheless, existing protocols on secure similar document detection suffer from high computational and/or communication costs, which renders them impractical for large datasets. In this work, we introduce a solution based on simhash document fingerprints, which essentially reduce the problem to a secure XOR computation between two bit vectors. Our experimental results demonstrate that the proposed method improves the computational and communication costs by at least one order of magnitude compared to the current state-of-the-art protocol. Moreover, it achieves a high level of precision and recall.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
We follow the simhash definition of Charikar [2].
- 2.
- 3.
Note that the EsPRESSo protocols are also implemented on top of group \(G\).
- 4.
References
Blundo, C., De Cristofaro, E., Gasti, P.: EsPRESSo: efficient privacy-preserving evaluation of sample set similarity. In: Di Pietro, R., Herranz, J., Damiani, E., State, R. (eds.) DPM 2012 and SETOP 2012. LNCS, vol. 7731, pp. 89–103. Springer, Heidelberg (2013)
Charikar, M.: Similarity estimation techniques from rounding algorithms. In: STOC, pp. 380–388 (2002)
Cramer, R., Gennaro, R., Schoenmakers, B.: A secure and optimally efficient multi-authority election scheme. Eur. Trans. Telecommun. 8(5), 481–490 (1997)
Cristofaro, E.D., Gasti, P., Tsudik, G.: Fast and private computation of set intersection cardinality. IACR Cryptology ePrint Archive 2011, 141 (2011)
ElGamal, T.: A public-key cryptosystem and a signature scheme based on discrete logarithms. IEEE Trans. Inf. Theory 31(4), 469–472 (1985)
Huang, L., Wang, L., Li, X.: Achieving both high precision and high recall in near-duplicate detection. In: CIKM, pp. 63–72 (2008)
Jiang, W., Murugesan, M., Clifton, C., Si, L.: Similar document detection with limited information disclosure. In: ICDE, pp. 735–743 (2008)
Jiang, W., Samanthula, B.K.: N-gram based secure similar document detection. In: Li, Y. (ed.) DBSec. LNCS, vol. 6818, pp. 239–246. Springer, Heidelberg (2011)
Lindell, Y., Pinkas, B.: Secure multiparty computation for privacy-preserving data mining. J. Priv. Confidentiality 1(1), 59–98 (2009)
Manber, U.: Finding similar files in a large file system. In: USENIX Winter, pp. 1–10 (1994)
Manku, G.S., Jain, A., Sarma, A.D.: Detecting near-duplicates for web crawling. In: WWW, pp. 141–150 (2007)
Murugesan, M., Jiang, W., Clifton, C., Si, L., Vaidya, J.: Efficient privacy-preserving similar document detection. VLDB J. 19(4), 457–475 (2010)
Naor, M., Pinkas, B.: Computationally secure oblivious transfer. J. Cryptology 18(1), 1–35 (2005)
Yao, A.C.C.: How to generate and exchange secrets. In: FOCS, pp. 162–167 (1986)
Acknowledgments
This research has been funded by the NSF CAREER Award IIS-0845262.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Buyrukbilen, S., Bakiras, S. (2014). Secure Similar Document Detection with Simhash. In: Jonker, W., Petković, M. (eds) Secure Data Management. SDM 2013. Lecture Notes in Computer Science(), vol 8425. Springer, Cham. https://doi.org/10.1007/978-3-319-06811-4_12
Download citation
DOI: https://doi.org/10.1007/978-3-319-06811-4_12
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06810-7
Online ISBN: 978-3-319-06811-4
eBook Packages: Computer ScienceComputer Science (R0)