Abstract
Probabilistic approaches to data integration have considerable potential [7]. We view data integration as an iterative process in which understanding of the data gradually increases as the data scientist continuously refines their view of how to deal with newly learned intricacies such as data conflicts. This paper presents a probabilistic approach for integrating data on groupings. We focus on a bio-informatics use case concerning homology. A bio-informatician has a large number of homology data sources to choose from. To enable querying of the combined knowledge contained in these sources, the sources need to be integrated. We validate our approach by integrating three real-world biological databases on homology in three iterations.
References
Altenhoff, A., Dessimoz, C.: Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput. Biol. 5, e1000262 (2009)
Antova, L., Koch, C., Olteanu, D.: \({10^{(10^{6})}}\) worlds and beyond: efficient representation and processing of incomplete information. VLDB J. 18(5), 1021–1040 (2009)
Koonin, E.: Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 39, 309–338 (2005)
Kuzniar, A., Lin, K., He, Y., Nijveen, H., Pongor, S., Leunissen, J.A.M.: ProGMap: an integrated annotation resource for protein orthology. Nucleic Acids Res. 37(suppl. 2), W428–W434 (2009)
Kuzniar, A., van Ham, R., Pongor, S., Leunissen, J.: The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 24, 539–551 (2008)
Louie, B., Detwiler, L., Dalvi, N., Shaker, R., Tarczy-Hornoch, P., Suciu, D.: Incorporating uncertainty metrics into a general-purpose data integration system. In: 19th International Conference on Scientific and Statistical Database Management. SSDBM 2007, p. 19 (2007)
Magnani, M., Montesi, D.: A survey on uncertainty management in data integration. J. Data Inf. Qual. 2(1), 5:1–5:33 (2010)
NCBI Resource Coordinators: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 41(D1), D8–D20 (2013)
Powell, S., Szklarczyk, D., Trachana, K., Roth, A., Kuhn, M., Muller, J., Arnold, R., Rattei, T., Letunic, I., Doerks, T., et al.: eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res. 40, D284–D289 (2011)
van Keulen, M.: Managing uncertainty: the road towards better data interoperability. IT - Inf. Technol. 54(3), 138–146 (2012)
Wanders, B., van Keulen, M., van der Vet, P.E.: Uncertain groupings: probabilistic combination of grouping data. Technical report TR-CTIT-14-12, Centre for Telematics and Information Technology, University of Twente, Enschede (2014)
Widom, J.: Trio: a system for integrated management of data, accuracy, and lineage. Technical report 2004-40, Stanford InfoLab (2004)
Wu, C.H., Nikolskaya, A., Huang, H., Yeh, L.-S.L., Natale, D.A., Vinayaka, C.R., Hu, Z.-Z., Mazumder, R., Kumar, S., Kourtesis, P., Ledley, R.S., Suzek, B.E., Arminski, L., Chen, Y., Zhang, J., Cardenas, J.L., Chung, S., Castro-Alvear, J., Dinkov, G., Barker, W.C.: PIRSF: family classification system at the protein information resource. Nucleic Acids Res. 32(suppl. 1), D112–D114 (2004)
Acknowledgements
We would like to thank the late Tjeerd Boerman for his work on the use case and his initial concept of groupings. We would also like to thank Arnold Kuzniar for his insights and feedback on our use of biological databases and Ivor Wanders for his reviewing and editing assistance.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Wanders, B., van Keulen, M., van der Vet, P. (2015). Uncertain Groupings: Probabilistic Combination of Grouping Data. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe DEXA 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_17
DOI: https://doi.org/10.1007/978-3-319-22849-5_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22848-8
Online ISBN: 978-3-319-22849-5
eBook Packages: Computer Science, Computer Science (R0)