Abstract
Two major data mining competitions in 2008 posed challenges in medical domains: KDD Cup 2008, which concerned cancer detection from mammography data, and the INFORMS Data Mining Challenge 2008, which dealt with diagnosing pneumonia from patient information in hospital files. Our team won both competitions, and in this paper we share the lessons we learned and the insights we gained. We emphasize the aspects that pertain to the general practice and methodology of medical data mining rather than the specifics of each modeling competition. We concentrate on three topics: information leakage and its effect on competitions and proof-of-concept projects; the consideration of real-life model performance measures in model construction and evaluation; and relational learning approaches to medical data mining tasks.
Additional information
Communicated by R. Bharat Rao and Romer Rosales.
Cite this article
Rosset, S., Perlich, C., Świrszcz, G. et al. Medical data mining: insights from winning two competitions. Data Min Knowl Disc 20, 439–468 (2010). https://doi.org/10.1007/s10618-009-0158-x
DOI: https://doi.org/10.1007/s10618-009-0158-x