Task’s Choice: Pruning-Based Feature Sharing (PBFS) for Multi-Task Learning
Figure 1. Most existing multi-task learning architectures: (a) hard parameter sharing; (b) soft parameter sharing; (c) hierarchical sharing; (d) sparse sharing; (e) the proposed PBFS model.
Figure 2. Structure of the gated network.
Figure 3. Model architecture of PBFS.
Figure 4. Results with different task correlation scores on the synthetic dataset.
Figure 5. Results with different sparsity on the census-income dataset: (a) Task 1-income's performance with subnet sparsity from 0 to 1; (b) Task 2-marital's performance with subnet sparsity from 0 to 1.
Figure 6. Results with different sparsity on the MovieLens dataset.
Figure 7. Results with different sparsity on the student dataset.
Figure 8. Results with different sparsity on the synthetic dataset.
Abstract
1. Introduction
- A soft parameter sharing model combined with parameter pruning is designed, in which each task owns an independent subnet and pruning is used to learn the information of the shared experts at the parameter level;
- A task-specific pruning strategy is used to find the optimal subnet for each task, so that each task automatically learns to select useful hidden representations and cut off harmful noise (a minimal sketch of this pruning primitive follows this list);
- A sharing pool is established based on the differences in task relevance, and limits are defined on whether diverse tasks may prune the same base shared expert, so that more than two tasks can follow an adaptive sharing scheme;
- Experiments on three public benchmark datasets demonstrate that the proposed model achieves considerable improvement over single-task learning and several soft parameter sharing models. Meanwhile, we found that our pruning strategy can avoid the seesaw phenomenon that appears in many MTL models.
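The task-specific pruning referred to above keeps, for each task, only the largest-magnitude weights of a shared expert and masks out the rest. Below is a minimal NumPy sketch of this magnitude-pruning primitive; the function name, the binary-mask formulation, and the example shapes are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a binary mask that zeroes the `sparsity` fraction of weights
    with the smallest absolute value (illustrative sketch)."""
    assert 0.0 <= sparsity < 1.0
    k = int(np.floor(sparsity * weights.size))  # number of weights to drop
    if k == 0:
        return np.ones_like(weights)
    # Threshold = magnitude of the k-th smallest |weight|.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

# Each task keeps its own sparse view of the same shared expert.
shared_expert = np.random.randn(64, 32)
subnet_task1 = shared_expert * magnitude_prune(shared_expert, sparsity=0.3)
subnet_task2 = shared_expert * magnitude_prune(shared_expert, sparsity=0.7)
```

Because each task derives its own mask, two tasks can keep different (possibly overlapping) subsets of the same expert's parameters, which is how feature sharing becomes a per-task choice.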
2. Related Work
2.1. Multi-Task Learning for Deep Learning
2.2. Sparse Networks
3. Pruning-Based Feature Sharing Architecture
3.1. Base Model
3.2. Model Architecture
3.3. Pruning-Based Feature Sharing Strategy
Algorithm 1: Generate sharing pools for tasks

Input: M tasks, threshold parameter ε, task correlation matrix R
Output: set of sharing pools Q
Generate the task pair set T = {<i, j>}; S = 0
for <i, j> in T:
    if Q == NULL: S = 1; Q[1] ← <i, j>
    else:
        for s = 1 to S:
            if |R[i, j] − R_s| ≤ ε: Q[s] ← <i, j>; update R_s; break
            else if s == S: S = S + 1; Q[S] ← <i, j>; R_S = R[i, j]; break
        end for
end for
Return: Q
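A possible Python rendering of Algorithm 1 is sketched below. It assumes the pairwise task correlations come as an M × M matrix and that a pool's representative correlation is updated to the running mean of its member pairs; the function name and that update rule are assumptions, since the exact rule is not fully legible in the extracted pseudocode.

```python
from itertools import combinations

def generate_sharing_pools(corr, eps):
    """Group task pairs into sharing pools by correlation similarity.

    corr: M x M task-correlation matrix (nested lists or an ndarray)
    eps:  threshold on the allowed correlation gap within a pool
    Returns a list of pools, each holding its member pairs and a
    representative correlation value.
    """
    M = len(corr)
    pools = []
    for i, j in combinations(range(M), 2):
        c = corr[i][j]
        for pool in pools:
            if abs(c - pool["corr"]) <= eps:
                pool["pairs"].append((i, j))
                # Assumed update: running mean of member correlations.
                pool["corr"] += (c - pool["corr"]) / len(pool["pairs"])
                break
        else:
            # No existing pool is close enough: open a new one.
            pools.append({"pairs": [(i, j)], "corr": c})
    return pools

# Example with three tasks: tasks 0 and 1 are strongly correlated,
# task 2 is weakly correlated with both, so two pools emerge.
corr = [[1.00, 0.90, 0.20],
        [0.90, 1.00, 0.25],
        [0.20, 0.25, 1.00]]
print(generate_sharing_pools(corr, eps=0.1))
```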
Algorithm 2: Generate a sparse shared expert for each task

Input: M tasks, target sparsity for each shared pool s, initial sparsity, sharing pool set Q
Output: sparse shared-expert networks
1: for s in Q: generate the base shared-expert network E_s (its input is the data of all tasks in pool s).
2: for m = 1 to M: for s in Q: prune the initial percentage of the lowest-magnitude parameters of E_s and save the resulting subnet for task m. end for end for Then train the model with these subnets.
3: Prune parameters with a polynomial-decay schedule during training until the shared expert's sparsity reaches the target sparsity; record the performance of the training process at different sparsity levels.
4: Replace task m's subnet with the sparse shared expert that reached the best performance for task m.
5: Repeat steps 3–4 until training is over.
Return: sparse shared-expert networks
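Algorithm 2 gradually raises each shared expert's sparsity during training and lets every task keep the sparse version that performed best for it. The self-contained sketch below illustrates the polynomial-decay sparsity schedule and the per-task bookkeeping on dummy data; the cubic schedule, the placeholder scoring function, and all variable names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def polynomial_sparsity(step, total_steps, s_init=0.0, s_final=0.9, power=3):
    """Polynomial-decay schedule: sparsity grows from s_init to s_final."""
    frac = min(step / total_steps, 1.0)
    return s_final + (s_init - s_final) * (1.0 - frac) ** power

def magnitude_mask(w, sparsity):
    """Binary mask keeping the (1 - sparsity) fraction of largest-|w| entries."""
    k = int(sparsity * w.size)
    if k == 0:
        return np.ones_like(w)
    thr = np.sort(np.abs(w), axis=None)[k - 1]
    return (np.abs(w) > thr).astype(w.dtype)

# Dummy stand-ins for the real training loop: one shared expert, two tasks,
# and a placeholder per-task score instead of validation F1/MSE.
rng = np.random.default_rng(0)
expert = rng.normal(size=(64, 32))
task_refs = {0: rng.normal(size=(64, 32)), 1: rng.normal(size=(64, 32))}
best = {m: {"score": -np.inf, "sparsity": None} for m in task_refs}

total_steps = 100
for step in range(total_steps):
    s = polynomial_sparsity(step, total_steps)
    mask = magnitude_mask(expert, s)
    for m, ref in task_refs.items():
        score = float(np.sum((expert * mask) * ref))  # placeholder metric
        if score > best[m]["score"]:
            best[m] = {"score": score, "sparsity": s}
    # In the real model, a gradient update on `expert` would happen here.

print({m: round(best[m]["sparsity"], 3) for m in best})
```

The point of the bookkeeping is that different tasks typically settle on different sparsity levels, which is consistent with the per-task sparsity values reported in Section 5.2.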
4. Experiments and Results
4.1. Datasets
4.2. Comparison Model
4.3. Experiments Setup
4.4. Experiments Results
4.4.1. Results in Public Datasets
4.4.2. Result in Synthetic Dataset
5. Analysis and Discussions
5.1. Hyperparameters Tuning
5.2. Subnet’s Sparsity
5.3. Adjustable Parameter
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
MTL | Multi-task Learning |
NLP | Natural Language Processing |
DNN | Deep Neural Networks |
RNN | Recurrent Neural Networks |
LSTM | Long Short-Term Memory |
NAS | Neural Architecture Search |
AutoML | Automated Machine Learning |
UCI | University of California, Irvine |
MLP | Multilayer Perceptron |
MSE | Mean Squared Error |
AUC | Area Under the Curve |
References
| Grade | Math G1 | Math G2 | Math G3 | Portuguese G1 | Portuguese G2 | Portuguese G3 |
|---|---|---|---|---|---|---|
| G1 | 1.0000 | 0.8521 | 0.8015 | 1.0000 | 0.8650 | 0.8263 |
| G2 | 0.8521 | 1.0000 | 0.9049 | 0.8650 | 1.0000 | 0.9185 |
| G3 | 0.8015 | 0.9049 | 1.0000 | 0.8263 | 0.9185 | 1.0000 |
| Model | Task 1-Income F1-Score | Task 1-Income ACC | Task 2-Marital F1-Score | Task 2-Marital ACC |
|---|---|---|---|---|
| Single-task | 0.6931 | 0.9520 | 0.9270 | 0.9283 |
| Shared-bottom | 0.6436 | 0.8451 | 0.9313 | 0.9327 |
| Cross-stitch | 0.7423 | 0.9505 | 0.9334 | 0.9345 |
| MMOE | 0.6790 | 0.9482 | 0.9325 | 0.9336 |
| PLE | 0.7139 | 0.9509 | 0.9272 | 0.9290 |
| PBFS (ours) | 0.7466 | 0.9511 | 0.9537 | 0.9548 |
| | Task 1-Income F1-score | Task 1-Income ACC | Task 2-Marital F1-score | Task 2-Marital ACC |
|---|---|---|---|---|
| Sparsity | 0.2891 | 0.6691 | 0.7002 | 0.7002 |
Model | AUC-Rating | MSE-Age |
---|---|---|
Single-task | 0.6154 | 0.0115 |
Shared-bottom | 0.5387 | 0.0133 |
Cross-stitch | 0.5977 | 0.0117 |
MMOE | 0.5836 | 0.0110 |
PLE | 0.6046 | 0.0110 |
PBFS (ours) | 0.6173 | 0.0078 |
| | Task 1-Rating | Task 2-Age |
|---|---|---|
| Sparsity | 0.1016 | 0.6992 |
| Model | Portuguese MSE-G1 | Portuguese MSE-G2 | Portuguese MSE-G3 | Math MSE-G1 | Math MSE-G2 | Math MSE-G3 |
|---|---|---|---|---|---|---|
| Shared-bottom | 0.0005 | 0.0008 | 0.0009 | 0.0010 | 0.0014 | 0.0021 |
| MMOE | 0.0008 | 0.0006 | 0.0007 | 0.0015 | 0.0020 | 0.0053 |
| PBFS | 0.0005 | 0.0007 | 0.0006 | 0.0008 | 0.0017 | 0.0019 |
Units | 8 | 16 | 32 | 64 | 128 |
---|---|---|---|---|---|
Income-F1 | 0.7179 | 0.7331 | 0.7441 | 0.7448 | 0.7138 |
Marital-F1 | 0.9457 | 0.9456 | 0.9537 | 0.9550 | 0.9536 |
| Units | Metric | Sparsity 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | Number of Parameters |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 32 units | F1-income | 0.7425 | 0.7405 | 0.7466 | 0.7448 | 0.7448 | 0.7228 | 0.7441 | 0.7433 | 0.7128 | 18,682 |
| 32 units | F1-marital | 0.9536 | 0.9535 | 0.9534 | 0.9534 | 0.9534 | 0.9525 | 0.9537 | 0.9535 | 0.9533 | |
| 64 units | F1-income | 0.7415 | 0.7413 | 0.7400 | 0.7448 | 0.7439 | 0.7391 | 0.7448 | 0.7475 | 0.7452 | 61,402 |
| 64 units | F1-marital | 0.9541 | 0.9528 | 0.9545 | 0.9539 | 0.9537 | 0.9548 | 0.9550 | 0.9538 | 0.9548 | |
| Number of Sharing Pools | Portuguese MSE-G1 | Portuguese MSE-G2 | Portuguese MSE-G3 | Math MSE-G1 | Math MSE-G2 | Math MSE-G3 |
|---|---|---|---|---|---|---|
| 1 | 0.0005 | 0.0007 | 0.0006 | 0.0008 | 0.0017 | 0.0019 |
| 3 | 0.0005 | 0.0006 | 0.0009 | 0.0007 | 0.0019 | 0.0045 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).