Research Article
DOI: 10.1145/3459637.3482343

Fast Extraction of Word Embedding from Q-contexts

Published: 30 October 2021

Abstract

Word embeddings play a fundamental role in natural language processing (NLP). However, pre-training word embeddings for very large-scale vocabularies is computationally challenging for most existing methods. In this work, we show that with merely a small fraction of contexts (Q-contexts) that are typical of the whole corpus, together with their mutual information with words, one can construct high-quality word embeddings with negligible error. Because the mutual information between contexts and words can be encoded canonically as a sampling state, Q-contexts can be constructed quickly. Furthermore, we present an efficient and effective method, WEQ, that extracts word embeddings directly from these typical contexts. In practical scenarios, our algorithm runs 11~13 times faster than well-established methods. Comparing against well-known methods such as matrix factorization, word2vec, GloVe, and fastText, we demonstrate that our method achieves comparable performance on a variety of downstream NLP tasks while maintaining run-time and resource advantages over all these baselines.
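
The abstract describes the pipeline only at a high level. The sketch below illustrates the general idea under stated assumptions: sample a small set of "typical" context columns from a word-context mutual-information (PMI) matrix in proportion to their squared norms (the length-squared sampling used in quantum-inspired low-rank methods), then factor only the sampled submatrix to obtain word embeddings. This is a minimal illustration, not the authors' exact WEQ algorithm; the function name, the use of a dense PMI matrix, and all parameters are hypothetical.

```python
# Minimal sketch (NOT the paper's exact WEQ algorithm): build word embeddings
# from a small sample of "typical" context columns of a word-context PMI
# matrix, using length-squared (norm-proportional) column sampling as in
# quantum-inspired low-rank approximation. All names/parameters hypothetical.
import numpy as np

def embeddings_from_q_contexts(pmi, q=512, dim=100, seed=0):
    """pmi: (n_words, n_contexts) mutual-information matrix.
    Returns an (n_words, dim) embedding matrix."""
    rng = np.random.default_rng(seed)
    # Contexts with large column norm carry more of the matrix's "mass";
    # sampling proportionally to squared norms favors the typical ones.
    sq_norms = np.einsum("ij,ij->j", pmi, pmi)
    probs = sq_norms / sq_norms.sum()
    idx = rng.choice(pmi.shape[1], size=q, replace=True, p=probs)
    # Rescale so the q sampled columns give an unbiased estimate of
    # pmi @ pmi.T (the standard length-squared sampling correction).
    sub = pmi[:, idx] / np.sqrt(q * probs[idx])
    # Factor only the small (n_words, q) submatrix instead of the full matrix.
    u, s, _ = np.linalg.svd(sub, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])

if __name__ == "__main__":
    # Toy usage on a random matrix; a real pipeline would build the PMI
    # matrix from corpus co-occurrence counts.
    toy_pmi = np.random.default_rng(1).standard_normal((1000, 5000))
    vecs = embeddings_from_q_contexts(toy_pmi, q=256, dim=50)
    print(vecs.shape)  # (1000, 50)
```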

Supplementary Material

MP4 File (CIKM21-rgfp0831.mp4)


Cited By

  • (2024) Optimizing Hate Speech Detection in the Indonesian Language using FastText and LSTM Algorithms. 2024 2nd International Symposium on Information Technology and Digital Innovation (ISITDI), 299-305. https://doi.org/10.1109/ISITDI62380.2024.10796810. Online publication date: 24-Jul-2024.



    Published In

    CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
    October 2021
    4966 pages
    ISBN: 9781450384469
    DOI: 10.1145/3459637
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. fast extraction
    2. large-scale
    3. q-contexts
    4. word embedding

    Qualifiers

    • Research-article

    Conference

    CIKM '21

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

    Article Metrics

    • Downloads (Last 12 months): 8
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 02 Mar 2025

