Abstract
Web crawlers collect and index the vast amount of data available online in order to gather specific types of objective data, such as news, that researchers or practitioners need. As big data are increasingly used in a variety of fields and web data grow exponentially each year, the importance of web crawlers is growing as well. Web servers that handle high traffic, such as portal news servers, have safeguards against security threats such as distributed denial-of-service (DDoS) attacks. Because a crawler generates a large amount of traffic to the web server, its behavior closely resembles a DDoS attack, so web servers tend to block crawler activity. A peer-to-peer (P2P) crawler can be used to solve this problem. However, a limitation of the pure P2P crawler is that the entire system is difficult to maintain when network traffic increases or errors occur. To overcome this limitation, we propose a hybrid P2P crawler that collects web data using the cloud service platform provided by Amazon Web Services (AWS). The hybrid P2P networking distributed web crawler using AWS (HP2PNC-AWS) is applied to collecting news on Korea's current smart work lifestyle from three portal sites. On Portal A, where the target server does not block crawling, the HP2PNC-AWS is faster than the general web crawler (GWC) and slightly slower than the server/client distributed web crawler (SC-DWC), although its performance is similar to that of the SC-DWC. On Portals B and C, where the target server blocks crawling, the HP2PNC-AWS outperforms the other methods in both the collection rate and the amount of data collected. These results also confirm that a hybrid P2P networking system can work efficiently in web crawler architectures.
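The abstract does not give implementation details, so the following is only a minimal sketch of the general idea behind a hybrid P2P crawler: a central coordinator hands URLs to peers and reassigns a URL to a different peer when the target server blocks the first one. Every name here (`Coordinator`, `simulated_fetch`, the peer labels, the `news.example` URLs, the retry limit) is a hypothetical stand-in, and the network request is simulated rather than performed; it is not the HP2PNC-AWS implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, FrozenSet, List, Optional, Set, Tuple

# Hypothetical fetch signature: (peer, url) -> page text, or None when the
# target server blocks the requesting peer.
Fetch = Callable[[str, str], Optional[str]]


@dataclass
class Coordinator:
    """Central index node of a hybrid P2P crawler (illustrative only)."""
    peers: List[str]
    fetch: Fetch
    max_attempts: int = 3  # assumed retry limit, not taken from the paper
    results: Dict[str, str] = field(default_factory=dict)
    failed: Set[str] = field(default_factory=set)

    def crawl(self, urls: List[str]) -> Dict[str, str]:
        # Each queue entry is (url, set of peers that already tried it).
        queue: Deque[Tuple[str, FrozenSet[str]]] = deque(
            (u, frozenset()) for u in urls
        )
        turn = 0
        while queue:
            url, tried = queue.popleft()
            candidates = [p for p in self.peers if p not in tried]
            if not candidates or len(tried) >= self.max_attempts:
                self.failed.add(url)
                continue
            # Round-robin over the peers that have not tried this URL yet.
            peer = candidates[turn % len(candidates)]
            turn += 1
            page = self.fetch(peer, url)
            if page is None:
                # Blocked: requeue so a different peer picks the URL up.
                queue.append((url, tried | {peer}))
            else:
                self.results[url] = page
        return self.results


def simulated_fetch(peer: str, url: str) -> Optional[str]:
    """Stand-in for an HTTP request; pretends the server blocks peer-1."""
    if peer == "peer-1":
        return None
    return f"<html>{url} via {peer}</html>"


coordinator = Coordinator(
    peers=["peer-1", "peer-2", "peer-3"], fetch=simulated_fetch
)
pages = coordinator.crawl(
    [f"https://news.example/article/{i}" for i in range(5)]
)
```

In this simulation every URL is eventually collected even though one peer is always blocked, because the coordinator tracks which peers have already tried each URL. A real deployment would replace `simulated_fetch` with actual requests issued from separate machines.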
Acknowledgments
This paper was supported by Konkuk University in 2018.
Additional information
This article is part of the Topical Collection: Special Issue on P2P Computing for Intelligence of Things
Guest Editors: Sunmoon Jo, Jieun Lee, Jungsoo Han, and Supratip Ghose
Cite this article
Kim, YY., Kim, YK., Kim, DS. et al. Implementation of hybrid P2P networking distributed web crawler using AWS for smart work news big data. Peer-to-Peer Netw. Appl. 13, 659–670 (2020). https://doi.org/10.1007/s12083-019-00841-0
DOI: https://doi.org/10.1007/s12083-019-00841-0