Abstract
With the advent of large string datasets in several scientific and business applications, there is a growing need to perform ad-hoc analysis on strings. Currently, strings are stored, managed, and queried using procedural code. This limits users to the operations supported by existing procedural applications and requires manual query planning with limited tuning opportunities. This paper presents StarQL, a generic and declarative query language for strings. StarQL is based on a native string data model, which allows it to support a large variety of string operations and to provide semantic-based query optimization. String analytic queries are too intricate to be solved on one machine. Therefore, we propose a scalable and efficient data structure that allows StarQL implementations to handle large sets of strings and utilize large computing infrastructures. Our evaluation shows that StarQL is able to express the workloads of application-specific tools, such as BLAST and KAT in bioinformatics, and to mine Wikipedia text for interesting patterns using declarative queries. Furthermore, the StarQL query optimizer shows an order of magnitude reduction in query execution time.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Sahli, M., Mansour, E., Kalnis, P. (2017). Querying and Mining Strings Made Easy. In: Cong, G., Peng, WC., Zhang, W., Li, C., Sun, A. (eds) Advanced Data Mining and Applications. ADMA 2017. Lecture Notes in Computer Science, vol 10604. Springer, Cham. https://doi.org/10.1007/978-3-319-69179-4_1
DOI: https://doi.org/10.1007/978-3-319-69179-4_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69178-7
Online ISBN: 978-3-319-69179-4
eBook Packages: Computer Science, Computer Science (R0)