[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article
Free access

Statistical profile estimation in database systems

Published: 01 September 1988 Publication History

Abstract

A statistical profile summarizes the instances of a database. It describes aspects such as the number of tuples, the number of values, the distribution of values, the correlation between value sets, and the distribution of tuples among secondary storage units. Estimation of database profiles is critical in the problems of query optimization, physical database design, and database performance prediction. This paper describes a model of a database of profile, relates this model to estimating the cost of database operations, and surveys methods of estimating profiles. The operators and objects in the model include build profile, estimate profile, and update profile. The estimate operator is classified by the relational algebra operator (select, project, join), the property to be estimated (cardinality, distribution of values, and other parameters), and the underlying method (parametric, nonparametric, and ad-hoc). The accuracy, overhead, and assumptions of methods are discussed in detail. Relevant research in both the database and the statistics disciplines is incorporated in the detailed discussion.

References

[1]
APERS, P. M. G., HEV~ER, A. R., ^NO Y^O, S. B. 1983. Optimization algorithms for distributed queries. IEEE Trans. Softw. Eng. SE-9, 1, 57-68.]]
[2]
ASTRAHAN, M., SCHKOLNICK, M., AND WHANG, K. 1985. Counting unique values of an attribute without sorting. Tech. Rep. RJ 4960, IBM Research Division.]]
[3]
BATORY, D., AND MANNINO, M. 1986. Panel on extensible database systems. In Proceedings of the ACM SIGMOD Conference (Washington, D.C., May). ACM, New York, pp. 187-190.]]
[4]
BERNSTEIN, P., GOODMAN, N., WONG, E., REEVE, C., AND ROTHNIE, J. 1981. Query processing in SDD-1: A system for distributed databases. ACM Trans. Database Syst. 6, 4 (Dec.), 602-625.]]
[5]
BREIMAN, L., MEISEL, W., AND PURCELL, E. 1977. Variable kernel estimates of multivariate densities. Technometrics 19, 135-144.]]
[6]
CACOULLOS, V., 1966. Estimation of a multivariate density. Ann. Inst. Stat. Math. 18, 178-189.]]
[7]
CARDENAS, A. 1975. Analysis and performance of inverted data-base structures. Commun. A CM 18, 5 (May), 253-263.]]
[8]
CERI, S., AND PELAGATTI, G. 1984. Distributed Databases: Principles & Systems. McGraw-Hill, New York.]]
[9]
CHAMBERLIN, D., ASTRAHAN, M., KING, W., LORiE, R., MEHL, J., PRICE, T., SCHKOLNICK, M., GRIFFITHS SELINGER, P., SLUTZ, D., WADE, B., AND YOST, R. 1981. Support for repetitive transactions and ad hoc queries in system R. ACM Trans. Database Syst. 6, i (Mar.), 70-94.]]
[10]
CHEUNG, T. 1982. Estimating block accesses and number of records in file management. Commun. ACM 25, 7 (July), 484-487.]]
[11]
CHRISTODOULAKIS, S. 1981. Estimating selectivities in data bases. Ph.D. dissertation, CSRG-136, Computer Systems Research Group, Univ. of Toronto.]]
[12]
CHRISTODOULAKIS, S. 1983a. Estimating record selectivities. Inf. Syst. 8, 2 105-115.]]
[13]
CHRISTODOULAKIS, $. 1983b. Estimating block transfers and join sizes. In Proceedings of the ACM SIGMOD Conference (May). ACM, New York, pp. 40-54.]]
[14]
CHRISTODOULAKIS, $. 1984a. Estimating block selectivities. Inf. Syst. 9, 1, 69-79.]]
[15]
CHRISTODOULAKIS, $. 1984b. Implications of certain assumptions in database performance evaluation. ACM Trans. Database Syst. 9, 2 (June), 163-186.]]
[16]
CLARK, C., AND SCHKADE, L. 1983. StatisticalAnatysis for Administrative Decisions. South-Western, Cincinnati.]]
[17]
COPELAND, G., KHOSHAFIAN, $., SMITH, M., AND VALDURIEZ, P. 1986. Buffering schemes for permanent data. In Proceedings of the Conference on Data Engineering (COMPDEC) (Los Angeles, Calif., Feb.). IEEE, New York, pp. 214-221.]]
[18]
DEVROYE, L. 1985. A note on the L~ consistency of variable kernel estimates. Ann. Star. 13, 1041-1049.]]
[19]
DEWITT, D., KATZ, R., OLKEN, F., SHAPIRO, L., STONEBRAKER, M., AND WOOD, D. 1984. Implementation techniques for main memory database systems. In Proceedings of the A CM SIG- MOD Conference (Boston, Mass. June). ACM, New York, pp. 1-8.]]
[20]
FEDOROWICZ, J. 1984. Database evaluation using multiple regression techniques. In Proceedings of the ACM SIGMOD Conference (Boston, Mass., June). ACM, New York, pp. 70-76.]]
[21]
FEDOROWICZ, J. 1987. Database performance evaluation in an indexed file environment. A CM Trans. Database Syst. 12, i (Mar.), 85-110.]]
[22]
FINKELSTEIN, $., SCHKOLNICK, M., AND TIBERIO, P. 1988. Physical database design for relational databases. A CM Trans. Database Syst. 13, 1 (Mar.), 91-128.]]
[23]
FRASER, D. 1957. Nonparametric Methods in Statistics. Wiley, New York.]]
[24]
GELENSE, E., AND GARDY, D. 1982. The size of projections of relations satisfying a functional dependency. In Proceedings of the 8th International Conference on Very Large Data Bases (Mexico City). Very Large Data Base Endowment, Saratoga, Calif., pp. 325-333.]]
[25]
HEVNER, A., AND YAO, S. B. 1979. Query processing in distributed database systems. IEEE Trans. Softw. Eng. SE-3, 3.]]
[26]
IJBEMA, A., ANO BLANKEN, H. 1986. Estimating bucket accesses: A practical approach. In Proceedings of the Conference on Data Engineering (COMPDEC) (Los Angeles, Calif., Feb.), pp. 30-37.]]
[27]
JARKE, M., ANO KOCH, J. 1984. Query optimization in database systems. A CM Comput. Surv. 16, 2 (June), 111-152.]]
[28]
JOHNSON, N., AND KOTZ, S. 1970. Distributions in Statistics: Continuous Univariate Distributions, vols. I and 2. Houghton Mifflin, Boston.]]
[29]
KAMEL, N., AND KING, R. 1985. A model of data distribution based on texture analysis. In Proceedings of the ACM SIGMOD Conference (Austin, Tex., May). ACM, New York, pp. 319-325.]]
[30]
KERSCHBERG, L. TING, P. D., AND YAO, S. B. 1982. Query optimization in star computer networks. A CM Trans. Database Syst. 7, 4 (Dec.), 678-711.]]
[31]
KOOI, R. 1980. The optimization of queries in relational databases. Ph.D. dissertation, Case Western Reserve Univ., Cleveland, Ohio.]]
[32]
KOTZ, S., AND JOHNSON, N. 1977. Urn Models and Their Application. Wiley, New York.]]
[33]
KUMAR, A., AND STONEBRAKER, M. 1987. The effect of join selectivities on optimal nesting order. SIGMOD Rec. 16, i (Mar.), 28-41.]]
[34]
LOHMAN, G., MOHAN, C., HAAS, L., LINDSAY, B., SELINGER, P., WILMS, P., AND DANIELS, D. 1985. Query processing in R*. In Query Processing in Database Systems, W. Kim, D. Batory, and D. Reiner, Eds. Springer-Verlag, New York, pp. 31-47.]]
[35]
LUK, W. 1983. On estimating block accesses in database organizations. Commun. A CM 26, 11 (Nov.), 945-947.]]
[36]
MACKERT, L., AND LOHMAN, G. 1985. Index scans using a finite LRU buffer: A validated I/O model. IBM Research Rep. RJ4836, Almaden Research Center, San Jose, Calif.]]
[37]
MACKERT, L., AND LOHMAN, G. 1986a. R* optimizer validation and performance evaluation for local queries. In Proceedings of the A CM SIGMOD Conference (Washington, D.C., May). ACM, New York, pp. 84-95.]]
[38]
MACKERT, L., AND LOHMAN, G. 1986b. R* Optimizer validation and performance evaluation for distributed queries. In Proceedings of the 12th International Conference on Very Large Databases (Kyoto, Japan, Aug.). Morgan Kaufmann Publishers, Inc. (Also IBM Research Report RJ5050, Alamaden Research Center, San Jose, Calif.)]]
[39]
MAHALANOBIS, P. 1936. On the generalized distance in statistics. In Proceedings of the National Institute of Sciences of India 12, pp. 35-49.]]
[40]
MANNINO, M. 1986. Selectivity estimation in unify SQL. Tech. Rep. Dept. of Management Science and Information Systems, Univ. of Texas, Austin.]]
[41]
MANNINO, M., AND RIVERA, A. 1988. An extensible model of selectivity estimation. Inf. Sci.-An International Journal. Special issue on database systems to appear Spring 1988.]]
[42]
MERRETT, T. H., AND OTOO, E. 1979. Distribution models of relations. In Proceedings of the 5th International Conference on Very Large Data Bases (Rio de Janeiro, Brazil, Oct.). ACM, New York, pp. 418-425.]]
[43]
MONTGOMERY, A., D'SOuzA, D., AND LEE, S. 1983. The cost of relational algebraic operations on skewed data: Estimates and experiments. In Information Processing Letters 83. Elsevier North-Holland, New York, pp. 235-241.]]
[44]
MOORE, D., AND YACKEL, J. 1977. Consistency properties of nearest neighbor density function estimates. Ann. Stat. 5, 143-154.]]
[45]
MURALIKRISHNA, M., AND DEWlTT, D. 1988. Equidepth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of the ACM SIGMOD Conference (Chicago, Ill., June). ACM, New York, pp. 28-36.]]
[46]
MUTHUSWAMY, B., AND KERSCHBERG, G. 1985. A DDSM for relational query optimization. Tech. Rep., Univ. of South Carolina, Columbia. Also in Proceedings of the ACM Annual Conference (Denver, Colo., Oct.). ACM, New York.]]
[47]
PIATETSKY-SHAPIRO, G., AND CONNELL, G. 1984. Accurate estimation of the number of tuples satisfying a condition. In Proceedings of the ACM SIGMOD Conference (Boston, Mass., June). ACM, New York, pp. 256-276.]]
[48]
PIATETSKY-SHAPIRO, G. 1985. Estimating the number of distinct attribute values by using sampling. Submitted for publication.]]
[49]
ROWE, N. 1985. Antisampling for estimation: An overview. IEEE Trans. Softw. Eng. 11, 10 (Oct.)., 1081-1091.]]
[50]
SAMSON, W., AND BENDELL, A. 1983. Rank order distributions and secondary key indexing (extended abstract). In Proceedings of the 2nd International Conference on Databases, (Cambridge, England).]]
[51]
SELINGER, P., ASTRAHAN, M., CHAMBERLIN, D., LORIE, R., ANO PRICE, T. 1979. Access path selection in a relational database management system. In Proceedings of the ACM SIGMOD Conference (San Jose, Calif.). ACM, New York, pp. 23-34.]]
[52]
SCHKOLNICK, M., AND TIBERIO, P. 1979. Considerations in developing a design tool for a relational DBMS. In Proceedings of the IEEE COMPSAC Conference (Chicago, Ill., Nov.). IEEE Press, New York, pp. 228-235.]]
[53]
STONEBRAKER, M., RUBENSTEIN, B., AND GUTMANN, A. 1983. Application of abstract data types and abstract indices to CAD databases. In Proceedings of Database Week: Engineering Design Applications (San Jose, Calif., May). pp. 107-113.]]
[54]
STUROES, H. 1926. The choice of class interval. J. Am. Stat. Assoc. 65-66.]]
[55]
TAPIA, R., AND THOMPSON, J. 1978. Nonparametric Probability Density Estimation. John Hopkins University Press, Baltimore, Md.]]
[56]
TEOREY, T. ANO FRY, J. 1982. Design of Database Structures. Prentice-Hall, Englewood, Cliffs, N.J.]]
[57]
UNIFY CORPORATION. 1985. Unix Relational Database Management System--Reference Manual. Release 3.2. Portland, Oreg.]]
[58]
VANDER ZANDER, B., TAYLOR, H., AND BITON, D. 1986. Estimating block accesses when attributes are correlated. In Proceedings of the 12th International Conference on Very Large Databases (Kyoto, Japan, Aug.). Morgan Kaufmann Publishers, Inc., pp. 119-127.]]
[59]
WEOMAN, E. 1983. Density estimation. In Encyclopedia of Statistical Sciences, Vol. 2, S. Kotz and N. Johnson, Eds. Wiley, New York.]]
[60]
WERTZ, W. 1978. Statistical Density Estimation: A Survey. Vandenhoeck and Ruprecht, Gottingen.]]
[61]
WHANG, K., WIEDERHOLD, G., AND SAGALOWICZ, D. 1983. Estimating block accesses in database organizations: A closed noniterative formula. Commun. ACM 26, 11 (Nov.), 940-944.]]
[62]
YAO, S. 1977. Approximating block accesses in database organizations. Commun. A CM 20, 4 (Apr.), 260-261.]]
[63]
ZAHORJAN, J., BELL. B., AND SEVCIK, K. 1983. Estimating block transfers when record access probabilities are non-uniform. Inf. Process. Lett. 16, 5 (June), 249-252.]]
[64]
ZIPF, G. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, Mass.]]

Cited By

View all
  • (2024)Identifying the Root Causes of DBMS SuboptimalityACM Transactions on Database Systems10.1145/363642549:1(1-40)Online publication date: 28-Feb-2024
  • (2023)Characteristic sets profile features: Estimation and application to SPARQL query planningSemantic Web10.3233/SW-22290314:3(491-526)Online publication date: 5-Apr-2023
  • (2023)High-Performance Row Pattern Recognition Using JoinsProceedings of the VLDB Endowment10.14778/3579075.357909016:5(1181-1195)Online publication date: 1-Jan-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Computing Surveys
ACM Computing Surveys  Volume 20, Issue 3
Sept. 1988
68 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/62061
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 1988
Published in CSUR Volume 20, Issue 3

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)162
  • Downloads (Last 6 weeks)17
Reflects downloads up to 11 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Identifying the Root Causes of DBMS SuboptimalityACM Transactions on Database Systems10.1145/363642549:1(1-40)Online publication date: 28-Feb-2024
  • (2023)Characteristic sets profile features: Estimation and application to SPARQL query planningSemantic Web10.3233/SW-22290314:3(491-526)Online publication date: 5-Apr-2023
  • (2023)High-Performance Row Pattern Recognition Using JoinsProceedings of the VLDB Endowment10.14778/3579075.357909016:5(1181-1195)Online publication date: 1-Jan-2023
  • (2023)Dataset Discovery and Exploration: A SurveyACM Computing Surveys10.1145/362652156:4(1-37)Online publication date: 9-Nov-2023
  • (2022)Survey on Learnable Databases: A Machine Learning PerspectiveBig Data Research10.1016/j.bdr.2021.100304(100304)Online publication date: Jan-2022
  • (2022)Data ProfilingEncyclopedia of Big Data Technologies10.1007/978-3-319-63962-8_8-2(1-6)Online publication date: 17-Feb-2022
  • (2021)Index-Accelerated Pattern Matching in Event StoresProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457245(1023-1036)Online publication date: 9-Jun-2021
  • (2021)Join cardinality estimation by combining operator-level deep neural networksInformation Sciences10.1016/j.ins.2020.09.065546(1047-1062)Online publication date: Feb-2021
  • (2020)Anatomy of Metadata for Data CurationJournal of Data and Information Quality10.1145/337192512:3(1-30)Online publication date: 13-Jun-2020
  • (2020)From Relation Algebra to Semi-join Algebra: An Approach to Graph Query OptimizationThe Computer Journal10.1093/comjnl/bxaa03164:5(789-811)Online publication date: 9-May-2020
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media