Abstract
Genetic programming (GP) has been shown to be a powerful tool for automatic modeling and program induction. It is often used to solve difficult symbolic regression tasks, with many examples in real-world domains. However, the robustness of GP-based approaches has not been substantially studied. In particular, the present work deals with the issue of outliers, data in the training set that represent severe errors in the measuring process. In general, a datum is considered an outlier when it sharply deviates from the true behavior of the system of interest. GP practitioners know that such data points usually bias the search and produce inaccurate models. Therefore, this work presents a hybrid methodology based on the RAndom SAmpling Consensus (RANSAC) algorithm and GP, which we call RANSAC-GP. RANSAC is an approach to deal with outliers in parameter estimation problems, widely used in computer vision and related fields. On the other hand, this work presents the first application of RANSAC to symbolic regression with GP, with impressive results. The proposed algorithm is able to deal with extreme amounts of contamination in the training set, evolving highly accurate models even when the amount of outliers reaches 90%.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
All results presented in this work are based on a standard GP implementation, using a tree representation, subtree crossover and subtree mutation.
- 2.
An extensive discussion of these methods is beyond the scope of this work.
- 3.
For all practitioners, it is intuitively evident that 10% of outliers can have drastic effects in the modeling process.
References
Alfons, A., Croux, C., Gelper, S., et al.: Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Annals Appl. Stat. 7(1), 226–248 (2013)
Derpanis, K.G.: Overview of the RANSAC algorithm. Image Rochester NY 4(1), 2–3 (2010)
Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)
Fortin, F.A., De Rainville, F.M., Gardner, M.A., Parizeau, M., Gagné, C.: DEAP: evolutionary algorithms made easy. J. Mach. Learn. Res. 13, 2171–2175 (2012)
Giloni, A., Padberg, M.: Least trimmed squares regression, least median squares regression, and mathematical programming. Math. Comput. Model. 35(9), 1043–1060 (2002)
Gonçalves, I., Silva, S.: Balancing learning and overfitting in genetic programming with interleaved sampling of training data. In: Krawiec, K., Moraglio, A., Hu, T., Etaner-Uyar, A.Ş., Hu, B. (eds.) EuroGP 2013. LNCS, vol. 7831, pp. 73–84. Springer, Heidelberg (2013). doi:10.1007/978-3-642-37207-0_7
Hast, A., Nysjö, J., Marchetti, A.: Optimal RANSAC-towards a repeatable algorithm for finding the optimal set. J. WSCG 21(1), 21–30 (2013)
Kotanchek, M.E., Vladislavleva, E.Y., Smits, G.F.: Symbolic regression via genetic programming as a discovery engine: insights on outliers and prototypes. In: Riolo, R., O’Reilly, U.-M., McConaghy, T. (eds.) Genetic Programming Theory and Practice VII, pp. 55–72. Springer, Heidelberg (2010)
La Cava, W., Spector, L., Danai, K.: Epsilon-lexicase selection for regression. In: GECCO 2016 Proceedings of the Genetic and Evolutionary Computation Conference 2016, pp. 741–748. ACM, New York (2016)
Lacey, A., Pinitkarn, N., Thacker, N.A.: An evaluation of the performance of RANSAC algorithms for stereo camera calibrarion. In: BMVC, pp. 1–10 (2000)
Martínez, Y., Trujillo, L., Naredo, E., Legrand, P.: A comparison of fitness-case sampling methods for symbolic regression with genetic programming. In: Tantar, A.-A. (ed.) EVOLVE - A Bridge between Probability, Set Oriented Numerics, and Evolutionary Computation V. AISC, vol. 288, pp. 201–212. Springer, Heidelberg (2014). doi:10.1007/978-3-319-07494-8_14
Martnez, Y., Naredo, E., Trujillo, L., Legrand, P., Lpez, U.: A comparison of fitness-case sampling methods for genetic programming. Journal of Experimental and Theoretical Artificial Intelligence (accepted to appear 2016)
McDermott, J., White, D.R., Luke, S., Manzoni, L., Castelli, M., Vanneschi, L., Jaskowski, W., Krawiec, K., Harper, R., De Jong, K., O’Reilly, U.M.: Genetic programming needs better benchmarks. In: GECCO 2012 Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation Conference, pp. 791–798. ACM, New York (2012)
Nunkesser, R., Morell, O.: An evolutionary algorithm for robust regression. Comput. Stat. & Data Anal. 54(12), 3242–3248 (2010)
Pearson, R.K.: Mining imperfect data: dealing with contamination and incomplete records. SIAM (2005)
Rousseeuw, P.J.: Least median of squares regression. J. Am. Stat. Assoc. 79(388), 871–880 (1984)
Spector, L.: Assessment of problem modality by differential performance of lexicase selection in genetic programming: a preliminary report. In: GECCO Companion 2012 Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation Conference Companion, pp. 401–408. ACM (2012)
Tarsha-Kurdi, F., Landes, T., Grussenmeyer, P., et al.: Hough-transform and extended RANSAC algorithms for automatic detection of 3D building roof planes from Lidar data. In: Proceedings of the ISPRS Workshop on Laser Scanning. vol. 36, pp. 407–412 (2007)
Torr, P.H., Zisserman, A.: MLESAC: a new robust estimator with application to estimating image geometry. Comput. Vis. Image Underst. 78(1), 138–156 (2000)
Zuliani, M.: RANSAC for Dummies. Vision Research Lab, University of California, Santa Barbara (2009)
Acknowledgments
First author was supported by CONACYT (México) scholarships No. 573397. This research was partially supported by CONACYT Basic Science Research Project No. 178323, CONACYT Fronteras de la Ciencia 2015-2 No. 944, as well as by FP7- Marie Curie-IRSES 2013 European Commission program with project ACoBSEC with contract No. 612689. Sara Silva acknowledges project PERSEIDS (PTDC/EMS-SIS/0642/2014) and BioISI RD unit, UID/MULTI/04046/2013, funded by FCT/MCTES/PIDDAC, Portugal.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
López, U., Trujillo, L., Martinez, Y., Legrand, P., Naredo, E., Silva, S. (2017). RANSAC-GP: Dealing with Outliers in Symbolic Regression with Genetic Programming. In: McDermott, J., Castelli, M., Sekanina, L., Haasdijk, E., García-Sánchez, P. (eds) Genetic Programming. EuroGP 2017. Lecture Notes in Computer Science(), vol 10196. Springer, Cham. https://doi.org/10.1007/978-3-319-55696-3_8
Download citation
DOI: https://doi.org/10.1007/978-3-319-55696-3_8
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55695-6
Online ISBN: 978-3-319-55696-3
eBook Packages: Computer ScienceComputer Science (R0)