PEnBayes: A Multi-Layered Ensemble Approach for Learning Bayesian Network Structure from Big Data
Figure 1. A Bayesian network example: the Cancer Network.
Figure 2. An overview of the PEnBayes approach.
Figure 3. Full PEnBayes workflow in Kepler.
Figure 4. Data Preprocessing sub-workflow.
Figure 5. Local Learner sub-workflow.
Figure 6. Global Ensemble sub-workflow.
Figure 7. Structural Hamming distance (SHD) of different dataset sizes using the calculated ALS (Table 3) as the reference value (red circle), Child dataset.
Figure 8. Structural Hamming distance (SHD) of different dataset sizes using the calculated ALS (Table 3) as the reference value (red circle), Insurance dataset.
Figure 9. Structural Hamming distance (SHD) of different dataset sizes using the calculated ALS (Table 3) as the reference value (red circle), Alarm dataset.
Figure 10. Alarm set execution time.
Figure 11. Child set execution time.
Figure 12. Insurance set execution time.
Figure 13. Alarm dataset accuracy results. Negative values indicate that the algorithm was unsuccessful in learning a network for the dataset.
Figure 14. Child dataset accuracy results.
Figure 15. Insurance dataset accuracy results.
Figure 16. Alarm set standard deviation results.
Figure 17. Alarm set execution time vs. number of Local Learners.
Figure 18. Child set execution time vs. number of Local Learners.
Figure 19. Insurance set execution time vs. number of Local Learners.
Abstract
1. Introduction
- A greedy data size calculation algorithm is proposed for adaptively partitioning a big dataset into data slices of appropriate size for distributed BN learning.
- A distributed three-layered ensemble approach called PEnBayes is proposed to achieve stable and accurate Bayesian network learning from big datasets at both the data and algorithm levels.
2. Background
2.1. Distributed Data-Parallel Patterns and Supporting Systems for Scalable Big Data Application
2.2. Scientific Workflow System
2.3. Bayesian Network
2.4. Bayesian Network Learning Algorithms
3. Related Work
4. Problem Formulation
Algorithm 1 CalculateALS.
Input:
    D: Dataset;
    ϵ1, ϵ2: Thresholds;
    mstep: Maximum loop steps.
Output:
    AMBS: Average Markov blanket size;
    ALS: Appropriate Learning Size.
1:  bestAMBS = 1; bestES = −1; step = 0;
2:  sliceSize = InitialSize × number of attributes in D; // initial data slice size
3:  Dsliced = read sliceSize rows from D;
4:  BNDS = LearnBNStructure(Dsliced);
5:  currentAMBS = average Markov blanket size of BNDS;
6:  currentES = Edge Strength of BNDS;
7:  while (step ≤ mstep) AND ((|currentAMBS − bestAMBS| > bestAMBS × ϵ1) OR (|currentES − bestES| > bestES × ϵ2)) do
8:      sliceSize = sliceSize × 2;
9:      bestAMBS = currentAMBS;
10:     bestES = currentES;
11:     Dsliced = read sliceSize rows from D;
12:     BNDS = LearnBNStructure(Dsliced);
13:     currentAMBS = average Markov blanket size of BNDS;
14:     currentES = Edge Strength of BNDS;
15:     step = step + 1;
16: end while
17: ALS = number of records in Dsliced;
18: return ALS.
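The doubling loop of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch only: `learn_bn_structure`, `avg_markov_blanket_size`, and `edge_strength` are hypothetical callables standing in for the BN structure learner and the two convergence statistics, and the dataset is assumed to be a plain list of rows.

```python
def calculate_als(dataset, learn_bn_structure, avg_markov_blanket_size,
                  edge_strength, eps1=0.1, eps2=0.1, mstep=10,
                  initial_rows_per_attr=100):
    # Initial slice size: a fixed number of rows per attribute (line 2).
    slice_size = initial_rows_per_attr * len(dataset[0])
    best_ambs, best_es, step = 1.0, -1.0, 0

    d_sliced = dataset[:slice_size]
    bn = learn_bn_structure(d_sliced)
    cur_ambs = avg_markov_blanket_size(bn)
    cur_es = edge_strength(bn)

    # Keep doubling the slice until both statistics stabilize
    # within their relative thresholds, or mstep is reached.
    while (step <= mstep and
           (abs(cur_ambs - best_ambs) > best_ambs * eps1 or
            abs(cur_es - best_es) > best_es * eps2)):
        slice_size *= 2
        best_ambs, best_es = cur_ambs, cur_es
        d_sliced = dataset[:slice_size]
        bn = learn_bn_structure(d_sliced)
        cur_ambs = avg_markov_blanket_size(bn)
        cur_es = edge_strength(bn)
        step += 1

    return cur_ambs, len(d_sliced)  # (AMBS, ALS)
```

With stub statistics that converge after one doubling, the returned ALS is twice the initial slice size, mirroring the greedy growth of the slice.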
5. The Proposed Approach
5.1. Overview of PEnBayes
5.1.1. Adaptive Two-Stage Data Slicing
5.1.2. Local Learner
5.1.3. Global Ensemble
5.2. Structure Ensemble Method
Algorithm 2 StructureEnsemble.
Input:
    BN: BN structures;
    D: Data set;
    T: Threshold factor.
Output:
    BNE: Ensembled BN structure.
1:  Obtain the adjacency matrix of each structure in BN;
2:  Compute the weight of each structure from its score on D;
3:  Compute the Final Weighted Adjacent Matrix (FWAM) as the weighted sum of the adjacency matrices;
4:  Compute the edge threshold from T;
5:  Initialize BNE as an empty structure;
6:  for each candidate edge i→j do
7:      if FWAM(i, j) exceeds the threshold and adding i→j does not form a cycle in BNE then
8:          BNE(i, j) = 1;
9:      end if
10: end for
11: return BNE.
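One plausible concrete reading of the StructureEnsemble step is a weighted edge vote with a cycle guard. The sketch below is an assumption-laden illustration, not the paper's exact definition: the per-structure weights and the threshold form `t * sum(weights)` are placeholders, and edges are taken in decreasing order of support so that weaker edges lose cycle conflicts.

```python
def would_cycle(adj, i, j):
    # Adding i->j creates a cycle iff i is already reachable from j.
    n = len(adj)
    stack, seen = [j], set()
    while stack:
        v = stack.pop()
        if v == i:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(w for w in range(n) if adj[v][w])
    return False

def structure_ensemble(structures, weights, t=0.5):
    """Merge candidate 0/1 adjacency matrices into one acyclic structure.

    structures: list of adjacency matrices (lists of lists);
    weights:    one weight per structure (e.g., a normalized score on D);
    t:          threshold factor -- an edge needs weighted support above
                t * sum(weights) to be accepted.
    """
    n = len(structures[0])
    # Final Weighted Adjacent Matrix: weighted vote for every edge.
    fwam = [[sum(w * s[i][j] for w, s in zip(weights, structures))
             for j in range(n)] for i in range(n)]
    threshold = t * sum(weights)
    bne = [[0] * n for _ in range(n)]
    # Best-supported edges first, so weaker edges lose cycle conflicts.
    for support, i, j in sorted(((fwam[i][j], i, j)
                                 for i in range(n) for j in range(n) if i != j),
                                reverse=True):
        if support > threshold and not would_cycle(bne, i, j):
            bne[i][j] = 1
    return bne
```

The cycle guard is what keeps the ensembled structure a valid DAG even when individual learners disagree on edge direction.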
5.3. Data Slice Learner
Algorithm 3 DataSliceLearner.
Input:
    DS: Data slice.
Output:
    BNDS: Merged network structure in matrix form.
1:  BN1 = HC(DS);
2:  BN2 = TPDA(DS);
3:  BN3 = MMHC(DS);
4:  BN = {BN1, BN2, BN3};
5:  Compute the threshold factor T;
6:  BNDS = StructureEnsemble(BN, DS, T);
7:  return BNDS.
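A minimal sketch of the data slice learner's algorithm-level ensemble. It substitutes a simple unweighted majority vote for the weighted StructureEnsemble merge, and the `learners` argument stands in for the three base algorithms (the abbreviation list suggests HC, TPDA, and MMHC), each assumed to return a 0/1 adjacency matrix for the slice.

```python
def data_slice_learner(data_slice, learners, vote_threshold=2):
    """Learn one candidate structure per base algorithm on the same data
    slice and keep the edges that at least vote_threshold of the
    candidates agree on (a simplified, unweighted stand-in for the
    weighted StructureEnsemble merge)."""
    candidates = [learn(data_slice) for learn in learners]
    n = len(candidates[0])
    votes = [[sum(c[i][j] for c in candidates) for j in range(n)]
             for i in range(n)]
    return [[1 if votes[i][j] >= vote_threshold else 0 for j in range(n)]
            for i in range(n)]
```

Running three learners on the same slice and keeping 2-of-3 edges damps the idiosyncratic errors of any single structure-learning algorithm.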
5.4. Local Learner
Algorithm 4 LocalLearner.
Input:
    DS: Data slices;
    Nd: Number of data slices.
Output:
    BNLocal: Local network structure.
1:  for each data slice DSi, i = 1, …, Nd do
2:      BNDSi = DataSliceLearner(DSi);
3:      DSBL = the data slice with the best Edge Strength;
4:  end for
5:  BNLocal = StructureEnsemble({BNDS1, …, BNDSNd}, DSBL, T);
6:  return BNLocal.
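The local learner layer can be sketched as a pass over the data slices. In this sketch, `slice_learner`, `edge_strength`, and `ensemble` are placeholders for DataSliceLearner, the Edge Strength statistic, and StructureEnsemble; the function returns both the merged local structure and the best slice, which the global ensemble layer would reuse.

```python
def local_learner(data_slices, slice_learner, edge_strength, ensemble):
    """Second layer: learn one structure per data slice, remember the
    slice whose learned structure has the best Edge Strength, and merge
    the per-slice structures into a single local structure."""
    structures = [slice_learner(ds) for ds in data_slices]
    strengths = [edge_strength(bn, ds)
                 for bn, ds in zip(structures, data_slices)]
    best_idx = max(range(len(strengths)), key=strengths.__getitem__)
    dsbl = data_slices[best_idx]  # data slice with the best Edge Strength
    return ensemble(structures, dsbl), dsbl
```

In a distributed run, each Local Learner executes this function independently on its partition; only the resulting local structures and best slices travel to the global layer.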
5.5. Global Ensemble
Algorithm 5 GlobalEnsemble.
Input:
    LS: Local structures;
    DSBG: Data slice with the best global Edge Strength;
    K: Number of Local Learners.
Output:
    BNfinal: Final network structure.
1:  BNfinal = StructureEnsemble(LS, DSBG, T), with the threshold factor T determined by K;
2:  return BNfinal.
5.6. The Time Complexity of PEnBayes
6. PEnBayes Workflow in Kepler
6.1. Overall Workflow
6.2. ALS Calculation Sub-Workflow
6.3. Local Learner Sub-Workflow
6.4. Global Ensemble Sub-Workflow
7. Experimental Setup
7.1. Hardware Specification
7.2. Datasets
7.3. PEnBayes Experimental Setup
7.4. Baseline Experimental Setup
8. Experimental Results
8.1. ALS Calculation Results
8.2. PEnBayes and Baseline Experimental Result Comparison
8.3. PEnBayes Scalability Experiments
9. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
BN | Bayesian Networks
PEnBayes | Parallel Ensemble-based Bayesian network learning
MB | Markov Blanket
ALS | Appropriate Learning Size
DAG | Directed Acyclic Graph
DDP | Distributed Data-Parallel
UDF | User Defined Functions
GUI | Graphical User Interface
BDeu | Bayesian Dirichlet equivalence with uniform prior
HC | Hill Climbing
TPDA | Three-Phase Dependency Analysis
MMHC | Max-Min Hill-Climbing
AMBS | Average Markov Blanket Size
ES | Edge Strength
FWAM | Final Weighted Adjacent Matrix
ALARM | A Logical Alarm Reduction Mechanism
SHD | Structural Hamming Distance
References
| Name | Nodes | Edges | AMBS | Edge Strength |
|---|---|---|---|---|
| Alarm | 37 | 46 | 3.51 | 0.23 |
| Child | 20 | 25 | 3.0 | 0.49 |
| Insurance | 27 | 52 | 5.19 | 0.25 |

| Dataset | 10 M | 20 M | 50 M | 100 M | 150 M | 200 M |
|---|---|---|---|---|---|---|
| Alarm | 1.828 | 3.656 | 9.140 | 18.280 | 27.421 | 36.561 |
| Child | 1.073 | 2.146 | 5.364 | 10.728 | 16.093 | 21.457 |
| Insurance | 1.829 | 3.658 | 9.144 | 18.288 | 27.432 | 36.576 |

| Network | Calculated ALS | Calculated AMBS | Actual AMBS |
|---|---|---|---|
| Alarm | 14,800 | 3.656 | 3.51 |
| Child | 4000 | 3.00 | 3.00 |
| Insurance | 43,200 | 4.66 | 5.19 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Tang, Y.; Wang, J.; Nguyen, M.; Altintas, I. PEnBayes: A Multi-Layered Ensemble Approach for Learning Bayesian Network Structure from Big Data. Sensors 2019, 19, 4400. https://doi.org/10.3390/s19204400