eTuner: tuning schema matching software using synthetic scenarios

Published: 25 January 2007

Abstract

Most recent schema matching systems assemble multiple components, each employing a particular matching technique. The domain user must then tune the system: select the right components to execute and correctly adjust their numerous "knobs" (e.g., thresholds, formula coefficients). Tuning is skill- and time-intensive, but (as we show) without it the matching accuracy is significantly inferior. We describe eTuner, an approach to automatically tune schema matching systems. Given a schema S, we match S against synthetic schemas, for which the ground truth mapping is known, and find a tuning that demonstrably improves the performance of matching S against real schemas. To efficiently search the huge space of tuning configurations, eTuner works sequentially, starting with tuning the lowest level components. To increase the applicability of eTuner, we develop methods to tune a broad range of matching components. While the tuning process is completely automatic, eTuner can also exploit user assistance (whenever available) to further improve the tuning quality. We employed eTuner to tune four recently developed matching systems on several real-world domains. The results show that eTuner produced tuned matching systems that achieve higher accuracy than the same systems tuned with currently available methods.
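The core idea in the abstract, tuning a matcher against synthetic schemas whose ground truth is known, can be illustrated with a toy sketch. This is not the eTuner implementation: the perturbation rules, the single-threshold name matcher, the F1 scoring, and every function name below are assumptions made for illustration only.

```python
import difflib
import random


def perturb_name(name, rng):
    """Return a hypothetical variant of a column name (illustrative rules only)."""
    choice = rng.choice(["abbrev", "strip", "same"])
    if choice == "abbrev":
        # Keep the first three letters of each underscore-separated part.
        return "".join(part[:3] for part in name.split("_"))
    if choice == "strip":
        return name.replace("_", "")
    return name


def make_synthetic(schema, rng, drop_prob=0.3):
    """Derive a synthetic schema from `schema`.

    Because we generate it ourselves, the ground-truth mapping is known.
    Some columns are dropped so that low thresholds produce false matches.
    (Distinct columns could in principle collide after perturbation; this
    toy version does not guard against that.)
    """
    truth = {}
    for col in schema:
        if rng.random() < drop_prob:
            continue
        truth[col] = perturb_name(col, rng)
    return list(truth.values()), truth


def match(source, target, threshold):
    """Toy matcher with one knob: accept the most similar name if its score >= threshold."""
    found = {}
    if not target:
        return found
    for s in source:
        scored = [(difflib.SequenceMatcher(None, s, t).ratio(), t) for t in target]
        score, best = max(scored)
        if score >= threshold:
            found[s] = best
    return found


def f1(found, truth):
    """F1 of predicted matches against the known ground truth."""
    correct = sum(1 for s, t in found.items() if truth.get(s) == t)
    if correct == 0:
        return 0.0
    precision = correct / len(found)
    recall = correct / len(truth)
    return 2 * precision * recall / (precision + recall)


def tune(schema, thresholds, n_synthetic=20, seed=0):
    """Pick the threshold with the best average F1 over synthetic scenarios."""
    rng = random.Random(seed)
    scenarios = [make_synthetic(schema, rng) for _ in range(n_synthetic)]

    def avg_f1(th):
        total = sum(f1(match(schema, tgt, th), truth) for tgt, truth in scenarios)
        return total / len(scenarios)

    return max(thresholds, key=avg_f1)


if __name__ == "__main__":
    schema = ["customer_name", "customer_id", "order_date", "ship_city"]
    print(tune(schema, thresholds=[0.3, 0.5, 0.7, 0.9]))
```

A real matching system has many interacting knobs, which is why the abstract describes a sequential search that starts with the lowest-level components; the exhaustive scan over one threshold here only shows the evaluation loop.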

Published In

The VLDB Journal — The International Journal on Very Large Data Bases  Volume 16, Issue 1
January 2007
162 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Compositional approach
  2. Machine learning
  3. Schema matching
  4. Synthetic schemas
  5. Tuning

Qualifiers

  • Article
