review-article

A survey on semi-structured web data manipulations by non-expert users

Published: 01 May 2021

Abstract

Since the emergence of Web 2.0, data has been flowing across the web, through online and offline applications, and across all application domains. Web data (semi-structured data loaded through web browsers and applications communicating via internet protocols such as HTTP), in particular XML-based data, is used for simple commercial information display (e.g., XHTML/HTML in commercial websites), instant messaging (e.g., XMPP for messaging in WhatsApp, Skype, Gtalk, etc.), financial transactions (e.g., CDF3 in e-commerce), medical record processing and storage (e.g., HL7 for electronic medical records), social media (e.g., XHTML/HTML in Facebook, LinkedIn, Google Plus, etc.), and other purposes. The exponential growth of this data in volume and diversity has made web data manipulation (i.e., monitoring, modifying, controlling, etc.) extremely difficult for IT (information technology) experts, computer technicians, and engineers. The difficulty is compounded by the dynamicity of the data, which changes continuously, and by its heterogeneity (e.g., HTML/HTML5, XML, XHTML, RDF, OWL, etc.).
Consequently, the manipulation of web data, and in particular XML data (since XML has become one of the most essential data types used in computer communications), has shifted from the hands of computer scientists and programmers to general computer users in all application domains.
This has brought a new criterion into the web data manipulation research field: web data manipulation by non-experts. In this paper, we study and analyze existing techniques for manipulating semi-structured web data, particularly XML data, from a non-expert point of view, relating them to the traditional manipulation techniques defined in the literature (e.g., filtering, adaptation, data extraction, transformation, access control, and encryption). Web data manipulation techniques for non-experts are categorized under three headings: (i) XML-oriented visual languages dealing with XML data extraction and transformations, (ii) mashups tackling mainly XML restructuring with value manipulations, and (iii) dataflow visual programming languages targeting non-expert manipulations and providing means to visually manipulate scientific data. A full analysis was conducted, allowing existing approaches and techniques to be compared and evaluated and providing an overview of the current requirements in this area.


Published In

Computer Science Review, Volume 40, Issue C, May 2021, 778 pages

Publisher

Elsevier Science Publishers B. V., Netherlands

Author Tags

1. Web data
2. Semi-structured data
3. XML
4. XML manipulation
5. XML control
6. Visual languages
7. Dataflow
8. Mashups
9. Visual language
