
Composing Word Embeddings for Compound Words Using Linguistic Knowledge

Published: 30 March 2023

Abstract

In recent years, distributed representations have become a fundamental technology for natural language processing. However, Japanese has many compound words, and we often must compare the meanings of a word and a compound word. Moreover, word boundaries in Japanese are ambiguous because Japanese does not place delimiters between words: e.g., “ぶどう狩り” (grape picking) is one word according to one dictionary, whereas “ぶどう” (grape) and “狩り” (picking) are two separate words according to another. This study describes an attempt to compose the word embedding of a Japanese compound word from the embeddings of its constituent words. We used the “short unit” and the “long unit,” two word units defined in UniDic (a Japanese dictionary compiled by the National Institute for Japanese Language and Linguistics), for constituent and compound words, respectively. We composed the word embedding of a compound word from the word embeddings of its two constituent words using a neural network. The training data for the compound-word embeddings were created from a corpus built by concatenating two versions of the same corpus, one segmented into constituent words and the other segmented into compound words. We propose using linguistic knowledge when composing word embeddings and demonstrate how it improves composition performance. To assess models with and without linguistic knowledge, we compared the cosine similarity between the composed and the correct word embeddings of compound words; we also evaluated our methods by ranking synonyms using a thesaurus. We compared several frameworks and algorithms that use three types of linguistic knowledge, namely semantic patterns, parts-of-speech patterns, and compositionality scores, and investigated which type of knowledge improves composition performance. The experiments demonstrated that multitask models combining the classification task for parts-of-speech patterns with the estimation task for compositionality scores achieved the highest performance.
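
The following is a minimal sketch, not the authors' code, of the composition setup the abstract describes: a feed-forward network maps the concatenated embeddings of two constituent (“short unit”) words to the embedding of the compound (“long unit”) word, with auxiliary multitask heads for the parts-of-speech-pattern classification task and the compositionality-score estimation task. The PyTorch framework, embedding dimension, hidden size, number of patterns, and loss weights are all illustrative assumptions.

    # Hypothetical sketch of the multitask composition model described in
    # the abstract; dimensions and loss weights are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CompoundComposer(nn.Module):
        def __init__(self, dim=200, hidden=400, n_pos_patterns=10):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(2 * dim, hidden), nn.Tanh())
            self.compose = nn.Linear(hidden, dim)              # main task: compound embedding
            self.pos_head = nn.Linear(hidden, n_pos_patterns)  # aux task: POS-pattern classification
            self.score_head = nn.Linear(hidden, 1)             # aux task: compositionality score

        def forward(self, w1, w2):
            # w1, w2: embeddings of the two constituent ("short unit") words.
            h = self.encoder(torch.cat([w1, w2], dim=-1))
            return self.compose(h), self.pos_head(h), self.score_head(h).squeeze(-1)

    def multitask_loss(composed, gold, pos_logits, pos_gold,
                       score_pred, score_gold, alpha=0.1, beta=0.1):
        # Main loss: cosine distance between the composed vector and the
        # compound word's directly trained ("correct") embedding.
        main = 1.0 - F.cosine_similarity(composed, gold, dim=-1).mean()
        aux_pos = F.cross_entropy(pos_logits, pos_gold)    # POS-pattern classification
        aux_score = F.mse_loss(score_pred, score_gold)     # compositionality estimation
        return main + alpha * aux_pos + beta * aux_score

At evaluation time, as described in the abstract, a composed vector can be scored by its cosine similarity to the compound word's directly trained embedding, or by how highly the compound's thesaurus synonyms rank among its nearest neighbors.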




    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2
    February 2023
    624 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3572719

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 30 March 2023
    Online AM (Accepted Manuscript): 07 September 2022
    Accepted: 29 August 2022
    Revised: 22 August 2022
    Received: 07 October 2021
    Published in TALLIP Volume 22, Issue 2


    Author Tags

    1. Word embedding
    2. compound word
    3. multitask learning
    4. linguistic knowledge
    5. Japanese
    6. parts of speech
    7. constituent word

    Qualifiers

    • Research-article

    Funding Sources

    • JSPS KAKENHI
    • Younger Researchers Grants from Ibaraki University

