A Generalized Additive Model Combining Principal Component Analysis for PM2.5 Concentration Estimation
Figure 1. Study area and partial basic geographical feature data: (a) PM2.5 monitoring sites and elevation; (b) road traffic; (c) land use/cover; (d) industrial plants; and (e) surface dust.
Figure 2. Framework of the study procedure.
Figure 3. Histogram and description of PM2.5 concentrations and potential predictor variables.
Figure 4. All-subsets regression.
Figure 5. Scatter plots of fitting and validating results for three models. (a) OLS model fitting results; (b) GAM model fitting results; (c) PCA–GAM model fitting results; (d) OLS model validating results; (e) GAM model validating results; and (f) PCA–GAM model validating results.
Figure 6. Results of PCA–GAM analysis. The x-axis represents the frequency of data; the y-axis represents the smooth fitted values. Degrees of freedom for linear or non-linear fits are in the parentheses on the y-axis. The solid line represents the fitted curve of each independent variable, and the blue area represents the 95% confidence interval.
Figure 7. Spatial distributions of estimated PM2.5 concentrations.
Abstract
1. Introduction
2. Experiments
2.1. Study Area and Data Collection
2.2. Methods
2.2.1. Predictor Variable Extraction and Screening
2.2.2. Regression Modelling
- Principal component analysis (PCA) is a mathematical method that transforms a set of inter-correlated variables into an equal number of mutually uncorrelated variables, each of which is a linear combination of the original variables. The new variables are constructed to capture as much of the variance in the original data as possible while remaining mutually orthogonal, or uncorrelated [26]. They are ordered so that the first explains the largest share of the variance in the data, and each subsequent one accounts for the largest share of the variability not explained by its predecessors. In this study, PCA was used to transform the final inter-correlated effective predictor variables into principal components (PCs) in PAST software. Using the PCs instead of the original explanatory variables removed the redundant information among the effective predictor variables by consolidating the information they share.
- To avoid the over-fitting caused by including too many PCs in the regression model, we used all-subsets regression in RStudio to select the optimal subset of PCs. All-subsets regression, one of the most common methods for selecting the final predictors from a large candidate set, tests every possible subset of the potential independent variables [34]. If there are k potential independent variables besides the constant, there are 2^k distinct subsets to test, including the empty set, which corresponds to the mean model [35,36]. Several selection criteria have been proposed, such as the adjusted coefficient of determination (adjusted R²), Mallows' Cp, and the Akaike Information Criterion (AIC) [37]. In this study, the adjusted R² was used as the selection criterion for the optimal subset of PCs.
- After the variables were pre-screened, the GAM was fitted with the gam() function of the “mgcv” package in RStudio. A GAM generalizes multivariate regression by relaxing the assumptions of linearity and normality and replacing regression lines with smooth curves [38]. The linear or non-linear relationships between PM2.5 concentration and its contributing factors were fitted with thin plate regression splines, with the smoothing parameter chosen automatically by the “GCV” method [39]. A term with one degree of freedom indicates that the predictor variable was fitted as a parametric linear term rather than a smooth term. The final regression model presented in this article was chosen so that its AIC value was among the lowest of all candidate models [40]. In addition, a significance test at the 0.05 level was used to check whether each term remaining in the final model was statistically significant [22].
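
As a rough illustration of the all-subsets step described above, the sketch below scores every subset of a few candidate predictors by adjusted R² and keeps the best one. It is written in plain Python rather than R, and the helper names (solve, adjusted_r2, all_subsets), the toy PC1 to PC3 score columns, and the toy response are assumptions made for this example; they are not the study's data or code.

```python
from itertools import combinations


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def adjusted_r2(columns, y):
    """Adjusted R^2 of an OLS fit of y on an intercept plus the given columns."""
    n, p = len(y), len(columns)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = p + 1
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))


def all_subsets(predictors, y):
    """Score all 2^k subsets (including the empty, mean-only model)."""
    names = list(predictors)
    best_names, best_score, n_subsets = (), float("-inf"), 0
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            n_subsets += 1
            score = adjusted_r2([predictors[name] for name in subset], y)
            if score > best_score:
                best_names, best_score = subset, score
    return best_names, best_score, n_subsets


# Toy, mutually orthogonal "PC score" columns (hypothetical values).
pcs = {
    "PC1": [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0],
    "PC2": [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0],
    "PC3": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
noise = [1.0, -1.0, -1.0, 1.0, -1.0, 1.0, 1.0, -1.0]
# The toy response depends on PC1 and PC2 only; PC3 is irrelevant.
y = [50 + 10 * a + 5 * b + e for a, b, e in zip(pcs["PC1"], pcs["PC2"], noise)]

best, score, count = all_subsets(pcs, y)
print(best, round(score, 4), count)  # ('PC1', 'PC2') 0.9889 8
```

With k = 3 candidates the search visits 2^3 = 8 subsets. Adding the irrelevant PC3 leaves the residual sum of squares unchanged but costs a degree of freedom, so the adjusted R² drops and the smaller subset wins. An exhaustive search like this is only practical for modest k, because the number of subsets doubles with each added candidate.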
2.2.3. Model Validation
2.2.4. PM2.5 Concentration Mapping
3. Results
3.1. Descriptive Effective Predictor Variables
3.2. Model Fitting and Validation
3.3. PM2.5 Concentration Mapping
4. Discussion
5. Conclusions
Acknowledgments
Author Contributions
Conflicts of Interest
References
1. Hu, L.; Liu, J.; He, Z. Self-Adaptive Revised Land Use Regression Models for Estimating PM2.5 Concentrations in Beijing, China. Sustainability 2016, 8, 786.
2. Krstic, G. A reanalysis of fine particulate matter air pollution versus life expectancy in the United States. J. Air Waste Manag. Assoc. 2012, 62, 989–991.
3. Silva, R.A.; West, J.J.; Zhang, Y.; Anenberg, S.C.; Lamarque, J.F.; Shindell, D.T.; Collins, W.J.; Dalsoren, S.; Faluvegi, G.; Folberth, G.; et al. Global premature mortality due to anthropogenic outdoor air pollution and the contribution of past climate change. Environ. Res. Lett. 2013, 8, 034005.
4. Lim, J.M.; Jeong, J.H.; Lee, J.H.; Moon, J.H.; Chung, Y.S.; Kim, K.H. The analysis of PM2.5 and associated elements and their indoor/outdoor pollution status in an urban area. Indoor Air 2011, 21, 145–155.
5. Hoek, G.; Krishnan, R.M.; Beelen, R.; Peters, A.; Ostro, B.; Brunekreef, B.; Kaufman, J.D. Long-term air pollution exposure and cardio-respiratory mortality: A review. Environ. Health 2013, 12, 43.
6. Giorgini, P.; Di Giosia, P.; Grassi, D.; Rubenfire, M.; Brook, R.D.; Ferri, C. Air pollution exposure and blood pressure: An updated review of the literature. Curr. Pharm. Des. 2016, 22, 28–51.
7. Pope, C.A.; Burnett, R.T.; Thun, M.J.; Calle, E.E.; Krewski, D.; Ito, K.; Thurston, G.D. Lung Cancer, Cardiopulmonary Mortality, and Long-term Exposure to Fine Particulate Air Pollution. J. Am. Med. Assoc. 2002, 287, 1132–1141.
8. Lakshmanan, A.; Chiu, Y.H.M.; Coull, B.A.; Just, A.C.; Maxwell, S.L.; Schwartz, J.; Gryparis, A.; Kloog, I.; Wright, R.J.; Wright, R.O. Associations between prenatal traffic-related air pollution exposure and birth weight: Modification by sex and maternal pre-pregnancy body mass index. Environ. Res. 2015, 137, 268–277.
9. Ross, Z.; Ito, K.; Johnson, S.; Yee, M.; Pezeshki, G.; Clougherty, J.E.; Savitz, D.; Matte, T. Spatial and temporal estimation of air pollutants in New York City: Exposure assignment for use in a birth outcomes study. Environ. Health 2013, 12.
10. Fang, X.; Zou, B.; Liu, X.; Sternberg, T.; Zhai, L. Satellite-based ground PM2.5 estimation using timely structure adaptive modeling. Remote Sens. Environ. 2016, 186, 152–163.
11. Briggs, D.J.; Collins, S.; Elliott, P.; Fischer, P.; Kingham, S.; Lebret, E.; Pryl, K.; Van Reeuwijk, H.; Smallbone, K.; Van Der Veen, A. Mapping urban air pollution using GIS: A regression-based approach. Int. J. Geogr. Inf. Sci. 1997, 11, 699–718.
12. Meng, X.; Fu, Q.Y.; Ma, Z.W.; Chen, L.; Zou, B.; Zhang, Y.; Xue, W.B.; Wang, J.N.; Wang, D.F.; Kan, H.D.; et al. Estimating ground-level PM10 in a Chinese city by combining satellite data, meteorological information and land use regression model. Environ. Pollut. 2015, 208, 177–184.
13. Jiao, L.M.; Xu, G.; Zhao, S.L.; Dong, T.; Li, J.Y. LUR-based simulation of the spatial distribution of PM2.5 of Wuhan. Geomat. Inf. Sci. Wuhan Univ. 2015, 40, 1088–1094.
14. Zhai, L.; Zou, B.; Fang, X.; Luo, Y.; Wan, N.; Li, S. Land Use Regression Modeling of PM2.5 Concentrations at Optimized Spatial Scales. Atmosphere 2017, 8, 1.
15. Li, J.; Zhai, L.; Sang, H.Y.; Zhang, Y.; Yuan, J. Comparison of different spatial interpolation methods for PM2.5. Sci. Surv. Mapp. 2016, 41, 50–54.
16. Esra, P.; Gunay, S. The Comparison of Partial Least Squares Regression, Principal Component Regression and Ridge Regression with Multiple Linear Regression for Predicting PM10 Concentration Level Based on Meteorological Parameters. J. Data Sci. 2015, 13, 663–692.
17. Vienneau, D.; de Hoogh, K.; Bechle, M.J.; Beelen, R.; van Donkelaar, A.; Martin, R.V.; Millet, D.B.; Hoek, G.; Marshall, J.D. Western European land use regression incorporating satellite and ground-based measurements of NO2 and PM10. Environ. Sci. Technol. 2013, 47, 68–77.
18. Beelen, R.; Hoek, G.; Vienneau, D.; Eeftens, M.; Dimakopoulou, K.; Pedeli, X.; Tsai, M.Y.; Künzli, N.; Schikowski, T.; Marcon, A.; et al. Development of NO2 and NOx land use regression models for estimating air pollution exposure in 36 study areas in Europe—The ESCAPE project. Atmos. Environ. 2013, 72, 10–23.
19. Zou, B.; Wilson, J.G.; Zhan, F.B.; Zeng, Y.; Wu, K. Spatial-temporal Variations of Regional Ambient Sulfur Dioxide Concentration and Source Contribution Analysis. Atmos. Environ. 2011, 45, 4977–4985.
20. Diem, J.E.; Comrie, A.C. Predictive mapping of air pollution involving sparse spatial observations. Environ. Pollut. 2002, 119, 99–117.
21. Hastie, T.; Tibshirani, R. Generalized Additive Models. Stat. Sci. 1986, 1, 297–318.
22. Zou, B.; Chen, J.; Zhai, L.; Fang, X.; Zheng, Z. Satellite Based Mapping of Ground PM2.5 Concentration Using Generalized Additive Modeling. Remote Sens. 2017, 9, 1.
23. Jiao, L.M.; Jin, J.M. Regional PM2.5 Concentration Effect Factors Identification and Correlation Analysis Based on GAM. Environ. Sci. Technol. 2015, 38, 123–128.
24. He, X.; Lin, Z.S. Interactive Effects of the Influencing Factors on the Changes of PM2.5 Concentration Based on GAM Model. Environ. Sci. 2017, 38, 22–32.
25. Ul-Saufie, A.Z.; Yahaya, A.S.; Ramli, N.A.; Rosaida, N.; Hamid, H.A. Future daily PM10 concentrations prediction by combining regression models and feedforward backpropagation models with principle component analysis (PCA). Atmos. Environ. 2013, 77, 621–630.
26. Abdul-Wahab, S.A.; Bakheit, C.S.; Al-Alawi, S.M. Principal component and multiple regression analysis in modelling of ground-level ozone and factors affecting its concentrations. Environ. Model. Softw. 2005, 20, 1263–1271.
27. Vaidya, O.C.; Howell, G.D.; Leger, D.A. Evaluation of the Distribution of Mercury in Lakes in Nova Scotia and Newfoundland. Water Air Soil Pollut. 2000, 117, 353–369.
28. Debarchana, G.; Manson, S.M. Robust Principal Component Analysis and Geographically Weighted Regression Urbanization in the Twin Cities Metropolitan Area of Minnesota. J. Urban Reg. Inf. Syst. Assoc. 2008, 20, 15–25.
29. Zou, B.; Pu, Q.; Bilal, M.; Weng, Q.; Zhai, L.; Nichol, J.E. High-Resolution Satellite Mapping of Fine Particulates Based on Geographically Weighted Regression. IEEE Geosci. Remote Sens. Lett. 2016, 13, 495–499.
30. Zou, B.; Xu, S.; Sternberg, T.; Fang, X. Effect of Land Use and Cover Change on Air Quality in Urban Sprawl. Sustainability 2016, 8, 677.
31. An, F.; Zhai, L.; Sang, H.Y.; Zhang, Y.; Zhou, Y.; Yuan, J. Multiple regression analysis on PM2.5 impact factors based on geographic conditions monitoring data. Sci. Surv. Mapp. 2015, 40, 58–63.
32. Meng, X.; Chen, L.; Cai, J.; Zou, B.; Wu, C.F.; Fu, Q.; Zhang, Y.; Liu, Y.; Kan, H. A land use regression model for estimating the NO2 concentration in Shanghai, China. Environ. Res. 2015, 137, 308–315.
33. Pearson, K. On lines and planes of closest fit to systems of points in space. Philos. Mag. Ser. 6 1901, 2, 559–572.
34. Kabacoff, R.I. R in Action: Data Analysis and Graphics with R, 2nd ed.; Manning Publications Co.: Shelter Island, NY, USA, 2011; pp. 191–195.
35. Bae, J.; Kim, J.T.; Kim, J.H. Subset selection in multiple linear regression: An improved Tabu search. J. Korean Soc. Mar. Eng. 2016, 40, 138–145.
36. Shi, N.; Cao, H.X. The Optimum Climate Forecasting Model Based on All Possible Regressions. J. Nanjing Inst. Meteorol. 1992, 15, 459–566.
37. Draper, N.R.; Smith, H. Applied Regression Analysis, 3rd ed.; John Wiley & Sons: New York, NY, USA, 1998.
38. Ayón, P.; Swartzman, G.; Espinoza, P.; Bertrand, A. Long-term changes in zooplankton size distribution in the Peruvian Humboldt Current System: Conditions favouring sardine or anchovy. Mar. Ecol. Prog. Ser. 2011, 422, 211–222.
39. Chen, C.C.; Wu, C.F.; Yu, H.L.; Chan, C.C.; Cheng, T.J. Spatiotemporal modeling with temporal-invariant variogram subgroups to estimate fine particulate matter PM2.5 concentrations. Atmos. Environ. 2012, 54, 1–8.
40. Liu, Y.; Paciorek, C.J.; Koutrakis, P. Estimating regional spatial and temporal variability of PM2.5 concentrations using satellite data, meteorology, and land use information. Environ. Health Perspect. 2009, 117, 886–892.
41. Rodriguez, J.D.; Perez, A.; Lozano, J.A. Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 569–575.
42. Brown, D.G.; Goovaerts, P.; Burnicki, A.; Li, M.Y. Stochastic Simulation of Land-Cover Change Using Geostatistics and Generalized Additive Models. Photogramm. Eng. Remote Sens. 2002, 68, 1051–1061.
43. Moreno-Torres, J.G.; Saez, J.A.; Herrera, F. Study on the impact of partition-induced dataset shift on k-fold cross-validation. IEEE Trans. Neural Netw. Learn. Syst. 2012, 23, 1304–1312.
44. Chen, L.; Bai, Z.P.; Di, S.; You, Y.; Li, H.M.; Liu, Q. Application of land use regression to simulate ambient air PM10 and NO2 concentration in Tianjin City. China Environ. Sci. 2009, 29, 685–691.
Model | Independent Variables | Adj_R2 | AIC | RMSE (µg/m3) | MPE (%) | MAPE (%) | CV Adj_R2 |
---|---|---|---|---|---|---|---|
OLS | PE, AOD, Cover1_8000m, Cover3_8000m, Cover8_8000m | 0.83 | 563.37 | 8.10 | −1.39 | 8.63 | 0.83 |
GAM | PE, AOD, Cover1_8000m, Cover3_8000m, Cover8_8000m | 0.90 | 528.23 | 5.50 | −0.72 | 5.78 | 0.92 |
PCA–GAM | PC1, PC2, PC4, PC5, PC8, PC17 | 0.94 | 495.52 | 4.08 | −0.39 | 4.10 | 0.92 |
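
The error measures reported in the table can be computed along the following lines. This is a minimal sketch in plain Python: the formulas are the usual textbook definitions, the sign convention for MPE (estimated minus observed) is an assumption because it is not stated here, and the observed and estimated values are made up for illustration rather than taken from the study.

```python
import math


def rmse(observed, estimated):
    """Root mean squared error, in the units of the data (e.g. µg/m3)."""
    n = len(observed)
    return math.sqrt(sum((e - o) ** 2 for o, e in zip(observed, estimated)) / n)


def mpe(observed, estimated):
    """Mean percentage error (%); assumed sign: positive means overestimation."""
    n = len(observed)
    return 100.0 * sum((e - o) / o for o, e in zip(observed, estimated)) / n


def mape(observed, estimated):
    """Mean absolute percentage error (%)."""
    n = len(observed)
    return 100.0 * sum(abs((e - o) / o) for o, e in zip(observed, estimated)) / n


# Hypothetical PM2.5 values (µg/m3), not taken from the study.
observed = [50.0, 80.0, 100.0]
estimated = [55.0, 76.0, 100.0]

print(round(rmse(observed, estimated), 2),
      round(mpe(observed, estimated), 2),
      round(mape(observed, estimated), 2))  # 3.7 1.67 5.0
```

Because positive and negative relative errors cancel in MPE but not in MAPE, a small MPE next to a larger MAPE, as in the table, suggests little systematic bias rather than small individual errors.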
© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Li, S.; Zhai, L.; Zou, B.; Sang, H.; Fang, X. A Generalized Additive Model Combining Principal Component Analysis for PM2.5 Concentration Estimation. ISPRS Int. J. Geo-Inf. 2017, 6, 248. https://doi.org/10.3390/ijgi6080248