Abstract
As data volumes grow at an unprecedented rate, cloud computing technologies have become critical to the progress of research. MapReduce is a widely adopted computing framework for data-intensive applications running on clusters. Traditional parallel XML parsing and indexing approaches are inadequate for processing large-scale XML datasets on clusters; we therefore propose an approach that exploits data parallelism in XML processing using MapReduce in Hadoop. Our solution integrates data storage, labeling, indexing, and parallel queries to process massive amounts of XML data. Specifically, we introduce an SDN labeling algorithm and a distributed hierarchical index built on distributed hash tables (DHTs). More importantly, we design a two-phase MapReduce solution that efficiently addresses labeling, indexing, and query processing on big XML data. The first MapReduce phase applies filtering, labeling, and index-building techniques: each DataNode labels elements in a map function, and a reduce function merges the results and builds the indexes. In the second phase, local XML queries over multiple partitions are performed in parallel using index-table-enabled B-SLCA. Our experimental results demonstrate the efficiency and effectiveness of the proposed parallel XML processing approach on the MapReduce framework.
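The two phases described above can be illustrated with a minimal single-process sketch. It is not the authors' implementation: plain Dewey-style labels stand in for the SDN labeling algorithm, a brute-force smallest-lowest-common-ancestor (SLCA) search stands in for B-SLCA, and ordinary Python functions stand in for Hadoop map and reduce tasks. All function names (`map_label`, `reduce_index`, `slca`) and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import product


def map_label(partition_id, xml_text):
    """Phase 1, map: label every element of one partition and emit
    (keyword, label) pairs. Labels are Dewey-style tuples prefixed with
    the partition id (an assumption, not the paper's SDN scheme)."""
    root = ET.fromstring(xml_text)
    pairs = []
    stack = [(root, (partition_id,))]
    while stack:
        node, label = stack.pop()
        # Index the tag name and the words of the element's leading text.
        for word in [node.tag] + (node.text or "").split():
            pairs.append((word.lower(), label))
        for i, child in enumerate(node):
            stack.append((child, label + (i,)))
    return pairs


def reduce_index(all_pairs):
    """Phase 1, reduce: merge the emitted pairs into keyword -> sorted labels."""
    index = defaultdict(set)
    for keyword, label in all_pairs:
        index[keyword].add(label)
    return {kw: sorted(labels) for kw, labels in index.items()}


def common_prefix(a, b):
    """Longest common prefix of two labels, i.e. their lowest common ancestor."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def slca(index, keywords):
    """Phase 2: brute-force SLCA over the index (stand-in for B-SLCA)."""
    lists = [index.get(k.lower(), []) for k in keywords]
    if not all(lists):
        return []
    lcas = set()
    for combo in product(*lists):
        lca = combo[0]
        for label in combo[1:]:
            lca = common_prefix(lca, label)
        if lca:  # an empty prefix means the matches lie in different partitions
            lcas.add(lca)
    # Keep only the LCAs that have no descendant LCA.
    return sorted(
        l for l in lcas
        if not any(o != l and o[:len(l)] == l for o in lcas)
    )


partitions = {
    0: "<lib><book><title>XML</title><author>Lu</author></book>"
       "<book><title>Hadoop</title><author>Lu</author></book></lib>",
    1: "<lib><book><title>MapReduce</title><author>Song</author></book></lib>",
}
pairs = [p for pid, text in partitions.items() for p in map_label(pid, text)]
index = reduce_index(pairs)
print(slca(index, ["xml", "lu"]))
```

Because each label begins with its partition id, matches from different partitions never share an ancestor in this sketch, which mirrors the idea of answering queries locally per partition. The brute-force search enumerates every combination of matches, so it is suitable only for illustration.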
Cite this article
Song, K., Lu, H. High-performance XML modeling of parallel queries based on MapReduce framework. Cluster Comput 19, 1975–1986 (2016). https://doi.org/10.1007/s10586-016-0628-z
DOI: https://doi.org/10.1007/s10586-016-0628-z