The problem of disguised missing data

Published: 01 June 2006

Abstract

Missing data is a well-recognized problem in large datasets, widely discussed in the statistics and data analysis literature. Many programming environments provide explicit codes for missing data, but these are not standardized and are not always used. This lack of standardization is one of the leading causes of the subtle problem of disguised missing data, in which unknown, inapplicable, or otherwise nonspecified responses are encoded as valid data values. Following a brief overview of the problem of explicitly coded missing data, this paper discusses sources, consequences, and detection of disguised missing data, including two real-world examples. As the first of these examples illustrates, the consequences of disguised missing data can be quite serious. Detecting it depends on two things: first, recognizing disguised missing data as a possibility, and second, finding a sufficiently informative view of the data to reveal its presence.
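
The abstract does not spell out a detection procedure. As one hedged illustration of what a "sufficiently informative view" might look like, the sketch below tabulates how often each value occurs and flags values that are far more frequent than the rest, since a placeholder such as 0 entered for every unknown response tends to show up as a frequency spike. The function name `frequency_spikes`, the thresholds `min_share` and `ratio`, and the sample ages are all assumptions made for this example; none of them come from the paper.

```python
from collections import Counter
from statistics import median


def frequency_spikes(values, min_share=0.05, ratio=5.0):
    """Return values whose count is unusually high relative to the rest.

    Hypothetical heuristic (not the paper's method): a value is flagged
    when it covers at least `min_share` of all records and occurs at
    least `ratio` times as often as the median distinct value.
    A flag is a prompt for inspection, not proof of missing data.
    """
    counts = Counter(values)
    if not counts:
        return {}
    typical = median(counts.values())  # typical count per distinct value
    total = len(values)
    return {
        value: count
        for value, count in counts.items()
        if count / total >= min_share and count >= ratio * typical
    }


# Made-up ages in which 0 was entered wherever the age was unknown.
ages = [0] * 12 + [23, 31, 35, 35, 42, 47, 51, 58, 58, 64, 70, 77]
spikes = frequency_spikes(ages)
print(spikes)
```

A flagged value still needs domain judgment: a genuine mode (for example, many customers with zero returns) produces the same spike as a disguised missing-value code, which is why the abstract stresses recognizing the possibility first.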

Published In

ACM SIGKDD Explorations Newsletter, Volume 8, Issue 1
June 2006
104 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1147234

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Article

Cited By

  • (2024) A machine learning approach to discrimination of igneous rocks and ore deposits by zircon trace elements. American Mineralogist 109:6 (1129-1142). DOI: 10.2138/am-2022-8899. Online publication date: 1-Jun-2024
  • (2024) Investigating User Estimation of Missing Data in Visual Analysis. Proceedings of the 50th Graphics Interface Conference (1-13). DOI: 10.1145/3670947.3670977. Online publication date: 3-Jun-2024
  • (2024) Quantum mechanics-based missing value estimation framework for industrial data. Expert Systems with Applications 236 (121385). DOI: 10.1016/j.eswa.2023.121385. Online publication date: Feb-2024
  • (2023) Regression with sensor data containing incomplete observations. Proceedings of the 40th International Conference on Machine Learning (15911-15927). DOI: 10.5555/3618408.3619061. Online publication date: 23-Jul-2023
  • (2023) Spatiotemporal Generative Adversarial Imputation Networks: An Approach to Address Missing Data for Wind Turbines. IEEE Transactions on Instrumentation and Measurement 72 (1-8). DOI: 10.1109/TIM.2023.3312493. Online publication date: 2023
  • (2023) Efficient permutation testing of variable importance measures by the example of random forests. Computational Statistics & Data Analysis 181 (107689). DOI: 10.1016/j.csda.2022.107689. Online publication date: May-2023
  • (2023) A review on the significance of body temperature interpretation for early infectious disease diagnosis. Artificial Intelligence Review 56:12 (15449-15494). DOI: 10.1007/s10462-023-10528-x. Online publication date: 19-Jun-2023
  • (2022) Toward Systematic Considerations of Missingness in Visual Analytics. 2022 IEEE Visualization and Visual Analytics (VIS) (110-114). DOI: 10.1109/VIS54862.2022.00031. Online publication date: Oct-2022
  • (2022) The detection algorithm for disguised missing value based on filter-Kmeans. Multimedia Tools and Applications 82:5 (7583-7598). DOI: 10.1007/s11042-022-13421-x. Online publication date: 2-Sep-2022
  • (2022) Uncertainty-bounded reinforcement learning for revenue optimization in air cargo: a prescriptive learning approach. Knowledge and Information Systems 64:9 (2515-2541). DOI: 10.1007/s10115-022-01713-5. Online publication date: 2-Aug-2022
