Abstract
Semantic hashing is an effective technique for large-scale information retrieval. Several recent methods learn high-quality binary hash codes for documents by leveraging both document content and neighborhood information. However, the provided neighborhood information often contains erroneous connections, which these models never take into account. To alleviate the negative impact of such connections on hash code learning, we first build a basic generative model that simultaneously models document content and neighborhood. We then show that this basic generative model can be placed under a more general framework, dubbed the mutual-information (MI) preserving variational auto-encoder (VAE). Capitalizing on this connection, we further develop a new hashing method that tolerates noise in the neighborhood information by proposing a novel fault-tolerant lower bound for MI. Extensive experiments on six real-world datasets show significant performance gains over current state-of-the-art models.
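To make the retrieval setting concrete, the following is a minimal sketch of how binary hash codes are typically used at query time: documents are ranked by Hamming distance to the query code. It illustrates only the generic lookup step, not the model proposed in this paper. The 8-bit codes and the helper names `hamming` and `retrieve` are hypothetical and chosen purely for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")


def retrieve(query: int, codes: list[int], k: int) -> list[int]:
    """Return indices of the k codes closest to the query in Hamming distance.

    sorted() is stable, so ties keep their original index order.
    """
    order = sorted(range(len(codes)), key=lambda i: hamming(query, codes[i]))
    return order[:k]


# Hypothetical 8-bit hash codes for four documents (illustrative values only).
codes = [0b00110101, 0b00110111, 0b11001010, 0b01110101]
query = 0b00110101

distances = [hamming(query, c) for c in codes]
top2 = retrieve(query, codes, k=2)
```

Because the comparison reduces to an XOR followed by a bit count, this kind of lookup stays cheap even for large collections, which is the usual motivation for learning binary codes rather than real-valued embeddings.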
Acknowledgement
This work is supported by the National Natural Science Foundation of China (No. 62276280, U1811264), the Guangzhou Science and Technology Planning Project (No. 2024A04J9967), and the Fundamental Research Funds of the Central Universities, Sun Yat-Sen University (No. 23ptpy78). Qinliang Su is the corresponding author.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chen, J., Su, Q., Li, Z., Wan, H., Lian, D. (2025). Document Hashing by Exploiting Noisy Neighborhood Information with Fault-Tolerant Mutual-Information-Preserving VAE. In: Onizuka, M., et al. Database Systems for Advanced Applications. DASFAA 2024. Lecture Notes in Computer Science, vol 14851. Springer, Singapore. https://doi.org/10.1007/978-981-97-5779-4_32
DOI: https://doi.org/10.1007/978-981-97-5779-4_32
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5778-7
Online ISBN: 978-981-97-5779-4
eBook Packages: Computer Science, Computer Science (R0)