DOI: 10.1145/3488560.3498430 · Research article · WSDM '22 Conference Proceedings

A Sequence-to-Sequence Model for Large-scale Chinese Abbreviation Database Construction

Published: 15 February 2022

Abstract

Abbreviations, which are common in daily communication, play an important role in natural language processing. Most existing studies treat Chinese abbreviation prediction as a sequence labeling problem. However, sequence labeling models usually ignore label dependencies: the label predicted for each character should be conditioned on the labels already predicted. In this paper, we formalize the Chinese abbreviation prediction task as a sequence generation problem and design a novel sequence-to-sequence model. To further boost performance, we propose a multi-level pre-trained model that incorporates character-, word-, and concept-level embeddings. To evaluate our methods, we automatically build a new dataset for Chinese abbreviation prediction that contains 81,351 pairs of full forms and abbreviations. Extensive experiments on a public dataset and the newly built dataset show that our model outperforms the state-of-the-art methods on both. More importantly, we build a large-scale database for a specific domain, i.e., life services at Meituan Inc., containing 4,134,142 pairs of full forms and abbreviations with an accuracy of about 82.7%. Online A/B testing on the Meituan and Dianping apps shows that Click-Through Rate increases by 0.59% and 0.86%, respectively, when the database is used in the search system. We have released our API at http://kw.fudan.edu.cn/ddemos/abbr/, which has served over 87k calls in 9 months.
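
To make the labeling-vs-generation distinction in the abstract concrete, here is a minimal sketch (function and label names are hypothetical, not from the paper) of how a full-form/abbreviation pair reduces to the per-character keep/drop labels that sequence-labeling baselines predict; the paper's model instead generates the abbreviation directly, so each output character can condition on the characters already emitted:

```python
def keep_drop_labels(full_form: str, abbreviation: str) -> list[str]:
    """Greedily align an abbreviation to its full form, labeling each
    character of the full form 'K' (kept in the abbreviation) or 'D'
    (dropped). Chinese abbreviations are typically character subsequences
    of the full form, which is what makes this labeling view possible."""
    labels = []
    i = 0  # index of the next abbreviation character to match
    for ch in full_form:
        if i < len(abbreviation) and ch == abbreviation[i]:
            labels.append('K')
            i += 1
        else:
            labels.append('D')
    if i != len(abbreviation):
        raise ValueError("abbreviation is not a subsequence of the full form")
    return labels

# 北京大学 ("Peking University") abbreviates to 北大
print(keep_drop_labels("北京大学", "北大"))  # ['K', 'D', 'K', 'D']
```

A labeling model scores each 'K'/'D' decision largely independently, whereas a generation model's choice of the next output character depends on the prefix produced so far, which is the dependency the paper exploits.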

Supplementary Material

MP4 File (WSDM-fp330.mp4)
The video covers four parts: motivation, methodology, experiments, and conclusion. In the motivation, we introduce the task with examples from Chinese news media and contrast Chinese abbreviations with English ones. In the methodology, noting the inadequacy of sequence labeling models, we formalize the Chinese abbreviation prediction task as a sequence generation problem and propose a novel seq2seq model together with a multi-level pre-trained model to address it. Extensive experiments show that our model outperforms the baselines on almost all metrics, demonstrating its effectiveness. Finally, we focus on the application in the field of e-commerce and describe the A/B testing in detail; the results demonstrate the model's positive contribution to real e-commerce.
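
The multi-level pre-trained model summarized above combines character-, word-, and concept-level information. A minimal sketch of concatenating the three embedding levels per input character (toy dimensions, random stand-in tables, and all lookup mappings are hypothetical illustrations, not the paper's actual pre-trained embeddings):

```python
import random

random.seed(0)

CHAR_DIM, WORD_DIM, CONCEPT_DIM = 4, 4, 4

def toy_table(keys, dim):
    """Stand-in for a pre-trained embedding table: key -> fixed random vector."""
    return {k: [random.uniform(-1, 1) for _ in range(dim)] for k in keys}

char_emb = toy_table(list("北京大学"), CHAR_DIM)
word_emb = toy_table(["北京", "大学"], WORD_DIM)
concept_emb = toy_table(["City", "Institution"], CONCEPT_DIM)

def multi_level_embed(chars, char_to_word, word_to_concept):
    """Build one input vector per character by concatenating its character-,
    word-, and concept-level embeddings, mirroring the multi-level input
    representation described in the abstract."""
    out = []
    for ch in chars:
        word = char_to_word[ch]          # word containing this character
        concept = word_to_concept[word]  # concept the word belongs to
        out.append(char_emb[ch] + word_emb[word] + concept_emb[concept])
    return out

vectors = multi_level_embed(
    "北京大学",
    {"北": "北京", "京": "北京", "大": "大学", "学": "大学"},
    {"北京": "City", "大学": "Institution"},
)
print(len(vectors), len(vectors[0]))  # 4 characters, each a 12-dim vector
```

The design point is that every character's representation carries its word and concept context before the encoder sees it, so segmentation and taxonomy knowledge need not be relearned from the abbreviation data alone.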




    Published In

    WSDM '22: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining
    February 2022, 1690 pages
    ISBN: 9781450391320
    DOI: 10.1145/3488560

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. chinese abbreviation
    2. sequence-to-sequence model

    Qualifiers

    • Research-article

    Conference

    WSDM '22

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%

    Article Metrics

    • Total Citations: 0
    • Total Downloads: 225
    • Downloads (last 12 months): 32
    • Downloads (last 6 weeks): 6

    Reflects downloads up to 11 Dec 2024
