Research Article
DOI: 10.1145/3459637.3482343

Fast Extraction of Word Embedding from Q-contexts

Published: 30 October 2021

Abstract

Word embeddings play a fundamental role in natural language processing (NLP). However, pre-training word embeddings for very large-scale vocabularies is computationally challenging for most existing methods. In this work, we show that with merely a small fraction of contexts (Q-contexts) that are typical of the whole corpus, together with their mutual information with words, one can construct high-quality word embeddings with negligible error. Because the mutual information between contexts and words can be encoded canonically as a sampling state, Q-contexts can be constructed quickly. Furthermore, we present an efficient and effective method, WEQ, that extracts word embeddings directly from these typical contexts. In practical scenarios, our algorithm runs 11~13 times faster than well-established methods. Comparing against well-known methods such as matrix factorization, word2vec, GloVe, and fastText, we demonstrate that our method achieves comparable performance on a variety of downstream NLP tasks while maintaining run-time and resource advantages over all these baselines.
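
The abstract describes the pipeline only at a high level. The sketch below illustrates the general idea under stated assumptions: sample a small set of "typical" context columns from a word-context mutual-information (PMI) matrix in proportion to their squared norms (the length-squared sampling used in quantum-inspired low-rank methods), then factor only the sampled submatrix to obtain word embeddings. This is a minimal illustration, not the authors' exact WEQ algorithm; the function name, the use of a dense PMI matrix, and all parameters are hypothetical.

```python
# Minimal sketch (NOT the paper's exact WEQ algorithm): build word embeddings
# from a small sample of "typical" context columns of a word-context PMI
# matrix, using length-squared (norm-proportional) column sampling as in
# quantum-inspired low-rank approximation. All names/parameters hypothetical.
import numpy as np

def embeddings_from_q_contexts(pmi, q=512, dim=100, seed=0):
    """pmi: (n_words, n_contexts) mutual-information matrix.
    Returns an (n_words, dim) embedding matrix."""
    rng = np.random.default_rng(seed)
    # Contexts with large column norm carry more of the matrix's "mass";
    # sampling proportionally to squared norms favors the typical ones.
    sq_norms = np.einsum("ij,ij->j", pmi, pmi)
    probs = sq_norms / sq_norms.sum()
    idx = rng.choice(pmi.shape[1], size=q, replace=True, p=probs)
    # Rescale so the q sampled columns give an unbiased estimate of
    # pmi @ pmi.T (the standard length-squared sampling correction).
    sub = pmi[:, idx] / np.sqrt(q * probs[idx])
    # Factor only the small (n_words, q) submatrix instead of the full matrix.
    u, s, _ = np.linalg.svd(sub, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])

if __name__ == "__main__":
    # Toy usage on a random matrix; a real pipeline would build the PMI
    # matrix from corpus co-occurrence counts.
    toy_pmi = np.random.default_rng(1).standard_normal((1000, 5000))
    vecs = embeddings_from_q_contexts(toy_pmi, q=256, dim=50)
    print(vecs.shape)  # (1000, 50)
```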

Supplementary Material

MP4 File (CIKM21-rgfp0831.mp4)


Cited By

  • (2024) Optimizing Hate Speech Detection in the Indonesian Language using FastText and LSTM Algorithms. 2024 2nd International Symposium on Information Technology and Digital Innovation (ISITDI), 299-305. https://doi.org/10.1109/ISITDI62380.2024.10796810. Online publication date: 24-Jul-2024.



    Published In

    CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
    October 2021
    4966 pages
    ISBN: 9781450384469
    DOI: 10.1145/3459637
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. fast extraction
    2. large-scale
    3. q-contexts
    4. word embedding

    Qualifiers

    • Research-article

    Conference

    CIKM '21

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Article Metrics

    • Downloads (Last 12 months): 8
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 02 Mar 2025

