Application of the XGBoost Machine Learning Method in PM2.5 Prediction: A Case Study of Shanghai

Jinghui Ma 1,2,3, Zhongqi Yu 2,3, Yuanhao Qu 2,3, Jianming Xu 2,3,4, Yu Cao 2,3

1 Fudan University, Shanghai 200433, China
2 Shanghai Typhoon Institute, Shanghai Meteorological Service, Shanghai 200030, China
3 Shanghai Key Laboratory of Meteorology and Health, Shanghai Meteorological Service, Shanghai 200030, China
4 Anhui Province Key Laboratory of Atmospheric Science and Satellite Remote Sensing, Hefei 230000, China
Received: August 23, 2019
Revised: November 18, 2019
Accepted: November 28, 2019
Cite this article:
Ma, J., Yu, Z., Qu, Y., Xu, J. and Cao, Y. (2020). Application of the XGBoost Machine Learning Method in PM2.5 Prediction: A Case Study of Shanghai. Aerosol Air Qual. Res. 20: 128–138. https://doi.org/10.4209/aaqr.2019.08.0408
ABSTRACT

Air quality forecasting is crucial to reducing air pollution in China, which has detrimental effects on human health. Atmospheric chemical-transport models can provide air pollutant forecasts with high temporal and spatial resolution and are widely used for routine air quality predictions (e.g., 1–3 days in advance). However, the model's performance is limited by uncertainties in the emission inventory and biases in the initial and boundary conditions, as well as deficiencies in the current chemical and physical schemes. As a result, experimentation with several new methods, such as machine learning, is occurring in the field of air quality forecasting. This study combined hourly PM2.5 mass concentration forecasts from an operational air quality numerical prediction system (WRF-Chem) at the Shanghai Meteorological Service (SMS) with comprehensive near-surface measurements of air pollutants and meteorological conditions to develop a machine learning model that estimates the daily PM2.5 mass concentration in Shanghai, China. With correlation coefficients that are higher by 50–100% and a standard deviation that is lower by 14–24 µg m–3, the machine learning model provides significantly better daily forecasting of PM2.5 than the WRF-Chem model. Thus, this research offers a new technique for enhancing air quality forecasting in China.
Keywords:
XGBoost algorithm; PM2.5; WRF-Chem; Machine learning.
INTRODUCTION

Accurate air quality forecasting is important both for responding to severe air pollution and for the self-protection of human health (Bedoui et al., 2016). However, air quality forecasting is complicated, being dominated by meteorological conditions and emission inventories, so large uncertainties remain in current ambient air quality forecasts, which do not yet meet the requirements of air pollution mitigation in China. Two approaches are commonly used to predict ambient air quality: numerical model forecasting and statistical forecasting. Numerical forecast modeling requires detailed emission data, and users need a deep understanding of the transformation mechanisms of the various air pollutants in order to select suitable physical and chemical schemes for the model configuration (Yumimoto and Uno, 2006). However, it is difficult to accurately describe the spatial and temporal variations of urban pollutant emissions and to quantify them completely within the model. To improve the simulation accuracy of air quality models, Xu et al. (2008) found that the use of air pollutant measurements can effectively reduce the bias of the emission data, and developed a new method for estimating air pollution emissions based on a Newtonian relaxation (nudging) technique. Just et al. (2018) demonstrated that a machine learning technique incorporating quality control and spatial features substantially improves satellite-derived AOD for air pollution modeling. Current numerical model predictions still show considerable deviations for specific regions; the main reasons include errors in the predicted synoptic systems, the inability of the model to describe real-time pollution emissions, and errors in the model's own parameterization schemes.

The statistical forecasting method is relatively simple, economical and easy to implement. However, its forecast skill depends on the number of predictor variables and the amount of available data, and the statistical relationship between predictands and predictors changes as the predictors change. Nevertheless, the non-linear regression performance of machine learning-based statistical prediction is superior to that of traditional statistical methods (Chang et al., 2008). Machine learning makes few assumptions about the data, and its results are checked by cross-validation; it dispenses with the classical statistical workflow of assuming a distribution, fitting a mathematical model, testing hypotheses and determining P-values. Predictions based on machine learning algorithms perform well, and the results of cross-validation are readily understood by practitioners. In this respect, the Extreme Gradient Boosting (XGBoost) algorithm has been applied to air quality forecasting. XGBoost is an ensemble learning model introduced by Chen et al. (2016) from the University of Washington (Friedman et al., 2001) and has been widely used in finance (Wang et al., 2018; Yao et al., 2018), industry (Sun et al., 2018), energy (Li et al., 2018; Torres et al., 2018; Zhang et al., 2018), medicine (Torlay et al., 2017; Hong et al., 2018; Shimoda et al., 2018; Taylor et al., 2018; Turki, 2018; Zhong et al., 2018), traffic (Lin et al., 2018) and the internet (Verma et al., 2018; Zhang et al., 2018).
Pan (2018) applied the XGBoost algorithm to predict hourly PM2.5 concentrations in China, compared it with random forest, support vector machine, linear regression and decision tree regression, and showed that XGBoost performed best for air quality forecasting. Shanghai is a megacity in eastern China with a large, dense population, so accurate PM2.5 forecasting there is extremely important. The current WRF-Chem operational model of the Shanghai Meteorological Service (SMS) is a mesoscale, online-coupled atmospheric dynamical-chemical model developed by the National Center for Atmospheric Research, the United States Pacific Northwest National Laboratory, the United States National Oceanic and Atmospheric Administration and other institutions. It has been widely used for air quality prediction in China, but large uncertainties remain in its predictions, and great efforts have been made to improve model capability, including data assimilation and emission inventory adjustment (Bedoui et al., 2016). In this study, a new model for PM2.5 prediction was established using the XGBoost machine learning algorithm and the Lasso linear regression technique (to reduce model over-fitting), based on WRF-Chem outputs together with air pollutant and meteorological observations. Because the new model uses two algorithms (XGBoost and Lasso), it is referred to as "the modified XGBoost model" in this study. Its prediction performance was compared with that of the Lasso model and the WRF-Chem model.

METHODS

Introduction of the WRF-Chem Numerical Model System

The Regional Atmospheric Environmental Modeling System (RAEMS) for eastern China, run by the SMS, is centered at 31.5°N, 118°E with a horizontal resolution of 6 km (Fig. 1). The RADM2 mechanism was used for gas-phase chemistry, and the ISORROPIA dynamic equilibrium inorganic aerosol mechanism and the SORGAM organic aerosol mechanism were used for aerosol chemistry. The physical schemes are presented in Table 1. National Centers for Environmental Prediction (NCEP) Global Forecast System (GFS) data provided the initial and boundary meteorological conditions for the WRF-Chem model, and the previous 24 h prediction was used as the initial chemical condition. Gaseous chemical boundary conditions were based on monthly averages from the global chemical transport model MOZART (Jordan et al., 1997). The MEGAN2 model was used to calculate biogenic emissions. The integration time step was 30 s for meteorology and 60 s for chemistry. The Multi-resolution Emission Inventory for China (MEIC) for 2010, at 0.25° resolution, developed by Tsinghua University, was applied, and, based on the monitoring results of each industry in Shanghai, emissions were allocated hourly with a diurnal profile.

Fig. 1. Coverage area of the atmospheric environment numerical forecast system for eastern China (inside the black box); the red dot marks the location of Shanghai.

Data Introduction

The modeling and evaluation data used in this work cover the period from January 1, 2015, to December 31, 2018. The air pollutant measurements were taken from the National Urban Air Quality Real-Time Release Platform (http://113.108.142.147:20035/emcpubilish/). The meteorological measurements, including hourly atmospheric pressure (P), air temperature (T), relative humidity (RH), precipitation (Prs), wind direction (Wind_D) and wind speed (Wind_S), were obtained from the meteorological bureau's national ground-based observational stations. Both the meteorological and the chemical forecast variables were collected from the RAEMS outputs.

Data Pre-processing

The WRF upper-air forecast data include meteorological variables at five standard pressure levels (500, 700, 850, 925 and 1000 hPa), of which about 5% were missing. Only validated observational and forecast data were used in this study. Because the air quality observational stations and the meteorological observatories are not co-located, each air quality station was paired with the nearest meteorological observatory.
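As an illustration of this pairing and screening step, the sketch below matches each air quality station to its nearest meteorological observatory by great-circle distance and drops invalid rows. The file names and column names are assumptions made for illustration, not the actual datasets used in this study.

```python
import numpy as np
import pandas as pd

# Illustrative station tables; file and column names are assumed.
aq_sites = pd.read_csv("air_quality_stations.csv")   # columns: aq_id, lat, lon
met_sites = pd.read_csv("met_stations.csv")          # columns: met_id, lat, lon

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Pair every air quality station with its nearest meteorological observatory.
pairs = {}
for _, aq in aq_sites.iterrows():
    dist = haversine_km(aq["lat"], aq["lon"], met_sites["lat"], met_sites["lon"])
    pairs[aq["aq_id"]] = met_sites.loc[dist.idxmin(), "met_id"]

# Keep only validated samples (drop rows with missing observations or forecasts).
samples = pd.read_csv("merged_samples.csv").dropna()
```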
The inputs for the training set of the modified XGBoost model were the meteorological observations of the current day, the air quality observations of the previous day, and the 24 h meteorological factors and PM2.5 forecasts output by the WRF-Chem model. The final output of the modified XGBoost model is the predicted PM2.5 value.

XGBoost Model Introduction

The machine learning algorithm used in this study is based on the GBDT (Gradient Boosting Decision Tree), an iterative algorithm in which multiple decision trees are built sequentially and combined to make the final prediction (Friedman et al., 2001). Compared with logistic regression, which is limited to linear problems, the GBDT can be applied to almost all regression problems (linear or non-linear) as well as to binary classification. XGBoost is an efficient GBDT implementation that performs gradient boosting "on steroids" (hence the name "Extreme Gradient Boosting"): it combines software and hardware optimization techniques to yield superior results with fewer computing resources than other methods (Chen et al., 2016). Its main features are as follows.

Parallelization
XGBoost parallelizes the otherwise sequential process of tree building: to reduce run time, the loop order is interchanged by initializing with a global scan of all instances and sorting with parallel threads.

Tree Pruning
XGBoost grows each tree to the specified "max_depth" (rather than stopping at the first split with negative gain) and then prunes it backward. This "depth-first" approach significantly improves computational performance.

Hardware Optimization
The algorithm is designed to make efficient use of hardware resources through cache awareness, allocating internal buffers in each thread to store gradient statistics. Additional enhancements such as "out-of-core" computing exploit available disk space when handling massive data frames that do not fit into memory.

Regularization
XGBoost penalizes more complex models through both LASSO (L1) and Ridge (L2) regularization to avoid over-fitting.

Sparsity Awareness
XGBoost naturally accepts sparse input features by automatically "learning" the best treatment of missing values from the training loss, and it handles the different sparsity patterns in the data efficiently.

Weighted Quantile Sketch
XGBoost employs the distributed weighted quantile sketch algorithm to find optimal split points in weighted datasets effectively.

Cross-validation
The algorithm has a built-in cross-validation method at each iteration, which removes the need to program this search explicitly or to specify the exact number of boosting iterations in a single run.

With respect to machine learning, it is not sufficient to select an appropriate algorithm; it is also necessary to choose the correct configuration of the algorithm for the dataset by tuning its hyper-parameters. There are likewise several other factors to consider when selecting an algorithm, such as computational complexity, applicability and ease of implementation.
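To make the configuration step concrete, the following minimal sketch sets up an XGBoost regressor with L1/L2 regularization and uses the built-in cross-validation to choose the number of boosting rounds. The file name, feature columns and parameter values are illustrative assumptions; the parameters actually used in this study are those obtained by the line search reported in Table 3.

```python
import pandas as pd
import xgboost as xgb

# Illustrative training table: one row per sample, predictor columns
# (observations plus WRF-Chem factors) and the observed PM2.5 target.
data = pd.read_csv("training_samples.csv")
X = data.drop(columns=["pm25_obs"])
y = data["pm25_obs"]
dtrain = xgb.DMatrix(X, label=y)

# Hypothetical hyper-parameters (placeholders, not the tuned values of Table 3).
params = {
    "objective": "reg:squarederror",  # regression on PM2.5 concentration
    "max_depth": 6,                   # depth-first growth, then backward pruning
    "eta": 0.05,                      # learning rate
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "reg_alpha": 0.1,                 # L1 (LASSO) regularization
    "reg_lambda": 1.0,                # L2 (Ridge) regularization
}

# Built-in cross-validation chooses the number of boosting rounds,
# so the exact number of iterations need not be specified in advance.
cv = xgb.cv(params, dtrain, num_boost_round=1000, nfold=5,
            metrics="rmse", early_stopping_rounds=20, seed=42)
model = xgb.train(params, dtrain, num_boost_round=len(cv))
```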
Construction of the Model

In order to reduce the risk of model over-fitting, a Lasso regression model (Tibshirani, 1996) was used to analyze the importance of the forecasting factors, and the 36 most important factors were retained, as shown in Table 2. Parameter details for the model feature selection are given in Table 3. The XGBoost model was then trained with the historical observations and the WRF-Chem prediction factors as inputs; the modeling process is illustrated in Fig. 2.

Selection of Prediction Factors

The candidate features were extracted from the WRF-Chem forecasts of pollutant concentrations and meteorological factors (such as T, RH, P and Wind_S) at the different standard pressure levels. First, the WRF-Chem forecast data were treated directly as basic forecasting factors, which are independent at each level. The distribution of a factor across levels reflects the stability of the atmosphere, which directly affects the vertical diffusion of PM2.5; therefore, the differences of the same factor between levels were used as derived factors to represent the variation of the features in the vertical direction. These factors were used as input variables for the XGBoost model.

In order to implement multi-step prediction with XGBoost, the 24 h data had to be input into the model as one sample, which greatly reduced the number of training samples. If the structure of the model is excessively complicated, over-fitting occurs easily and the generalization of the model becomes insufficient. It was therefore necessary to screen the above features and retain only the most important factors. The Lasso regression method (Tibshirani, 1996) is built on the principle of least squares: its core idea is to regularize the parameters while minimizing the residual sum of squares, controlling the sum of the absolute values of the parameters within an acceptable range by means of the L1 norm. The linear regression equation is given in Eq. (1):

y = Xβ + ε    (1)

where y = (y1, y2, …, yn)T, X = (x1, x2, …, xd), xj = (x1j, x2j, …, xnj)T, j = 1, 2, …, d, and β and ε are the parameters to be estimated and the residual of the model, respectively. The Lasso estimate of the parameters is obtained from Eq. (2):

β̂ = argmin_β {‖y − Xβ‖² + λ‖β‖₁}    (2)

where λ‖β‖₁ is the L1 regularization term, whose function is to limit the range of the parameters and reduce the possibility of model over-fitting. We therefore first fitted the WRF-Chem output using the Lasso regression model. The regularization coefficient λ directly affects the complexity of the model: if λ is too large, too many coefficients are forced to zero and the model under-fits; if λ is too small, the regularization term has little influence and the model over-fits. A reasonable value of λ must therefore be selected.

In this study, 30% of the samples from January 2015 to October 2017 were randomly drawn to form a verification set, and the relationship between the prediction error of the Lasso model and λ was analyzed (Fig. 3). With increasing λ, the model prediction error first decreased and then increased, and the number of non-zero model coefficients decreased. When λ equaled 0.0001, the prediction error was smallest and the number of non-zero coefficients was 36; 0.0001 was therefore selected as the regularization coefficient of the Lasso regression model, and only the 36 factors with non-zero coefficients were retained as prediction factors (Table 2).

Fig. 3. Relationship between the number of non-zero coefficients, the model prediction error and the regularization coefficient λ.
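The selection step can be sketched with scikit-learn as below: the vertical-difference derived factors are built first, and a sweep over λ records the verification error and the number of non-zero coefficients (the quantities plotted in Fig. 3). The column names, file name and λ grid are assumptions for illustration, and scikit-learn's alpha parameter corresponds to λ only up to the library's 1/(2n) scaling of the squared-error term.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Illustrative table of WRF-Chem factors at standard levels plus the observed
# PM2.5 target; all names are assumed for this sketch.
wrf = pd.read_csv("wrfchem_factors.csv")

# Derived factors: differences of the same variable between pressure levels,
# a simple proxy for the vertical structure (atmospheric stability).
for var in ["T", "RH", "Wind_S"]:
    wrf[f"{var}_diff_850_1000"] = wrf[f"{var}_850"] - wrf[f"{var}_1000"]
    wrf[f"{var}_diff_500_850"] = wrf[f"{var}_500"] - wrf[f"{var}_850"]

X = wrf.drop(columns=["pm25_obs"]).values
y = wrf["pm25_obs"].values

# Random 70%/30% split of the 2015-2017 samples for fitting and verification.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Sweep the regularization coefficient, tracking error and non-zero coefficients.
for lam in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]:
    lasso = Lasso(alpha=lam, max_iter=10000).fit(X_tr, y_tr)
    rmse = np.sqrt(np.mean((lasso.predict(X_va) - y_va) ** 2))
    n_nonzero = int(np.sum(lasso.coef_ != 0))
    print(f"lambda={lam:g}  RMSE={rmse:.2f}  non-zero coefficients={n_nonzero}")
```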
Model Training

Of the samples from January 2015 to October 2017, 70% were randomly selected to form the training set, and the remaining 30% were used as the verification set for the parameter adjustment of the prediction model. The data from November 1, 2017, to December 31, 2018, were used as an independent evaluation set for the final prediction of the model. Many critical parameters had to be adjusted in the XGBoost model; a line search was performed for each parameter based on the accuracy of the model on the verification set, and the parameters finally determined are given in Table 3.

Data Evaluation Methods

In order to quantitatively evaluate the prediction accuracy of the model, the mean bias (MB), mean error (ME), root mean square error (RMSE) and correlation coefficient (R) were calculated as in Eqs. (3)–(6):

MB = (1/N) Σ (ŷi − yi)    (3)
ME = (1/N) Σ |ŷi − yi|    (4)
RMSE = [(1/N) Σ (ŷi − yi)²]^(1/2)    (5)
R = cov(y, ŷ) / (σy · σŷ)    (6)

where N is the number of samples, yi and ŷi are the observed and predicted values of sample i, cov denotes the covariance and σ the standard deviation. The residual of the modified XGBoost model prediction relative to the true value is given in Eq. (7):

εi = ŷi − yi    (7)
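A small helper that computes these four statistics for paired observations and predictions is sketched below; the sign convention (predicted minus observed) follows Eqs. (3)–(7).

```python
import numpy as np

def evaluation_metrics(obs, pred):
    """Return MB, ME, RMSE and R for paired observed/predicted values."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    diff = pred - obs
    mb = diff.mean()                    # mean bias, Eq. (3)
    me = np.abs(diff).mean()            # mean error, Eq. (4)
    rmse = np.sqrt((diff ** 2).mean())  # root mean square error, Eq. (5)
    r = np.corrcoef(obs, pred)[0, 1]    # correlation coefficient, Eq. (6)
    return mb, me, rmse, r

# Example with made-up values (µg m-3):
mb, me, rmse, r = evaluation_metrics([35, 60, 80], [40, 55, 70])
print(f"MB={mb:.1f}  ME={me:.1f}  RMSE={rmse:.1f}  R={r:.2f}")
```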
RESULTS AND DISCUSSION

Analysis of PM2.5 Concentration Forecast Results

Based on the constructed modified XGBoost model, the pollution observations at each environmental monitoring station, the meteorological data at the corresponding meteorological station and the WRF-Chem output data were used as the training samples for that station. In order to quantitatively evaluate the prediction accuracy of the model, the 25th percentile, 75th percentile, median, mean, MB, ME, RMSE and R were calculated, and the performance of the modified XGBoost model was compared with that of the Lasso model and the WRF-Chem model.

As shown in Fig. 4, the WRF-Chem prediction fluctuated strongly, and its peak and valley values were in imperfect agreement with the observations. Compared with the WRF-Chem and Lasso models, the PM2.5 predictions of the modified XGBoost model were more consistent with the observations. Fig. 4 also shows that the PM2.5 concentrations predicted by the Lasso, modified XGBoost and WRF-Chem models followed the observed values over the forecast time series; the modified XGBoost model better reflected the variations of the observations over time and, to a certain extent, avoided the false peaks and valleys of the WRF-Chem prediction.

Fig. 4. Comparison between the three model predictions and the observational data: (a) hourly; (b) daily average.

Scatter plots, which directly reflect the linear relationship between simulated and observed values, were used to compare the consistency between the predictions of the three models and the observations (Fig. 5). Compared with the WRF-Chem model, the scatter of the Lasso and modified XGBoost models was concentrated along the diagonal, showing that the two models had a corrective effect on the WRF-Chem prediction.

Fig. 5. Scatter plots of predicted and observational data: (a) WRF-Chem model; (b) Lasso model; (c) modified XGBoost model.

From the Taylor plot (Fig. 6), the R values of the three models were 0.51 (WRF-Chem), 0.73 (Lasso) and 0.77 (modified XGBoost), and the standard deviations were 6.0, 5.6 and 5.0 µg m–3, respectively. When the observed PM2.5 concentration was greater than 50 µg m–3, the R values were 0.40 (WRF-Chem), 0.50 (Lasso) and 0.60 (modified XGBoost), and the standard deviations were 6.7, 5.3 and 5.1 µg m–3, respectively. When the observed PM2.5 concentration was greater than 75 µg m–3, the R values were 0.30 (WRF-Chem), 0.40 (Lasso) and 0.60 (modified XGBoost), and the standard deviations were 7.1, 6.1 and 5.1 µg m–3, respectively. Therefore, over the different PM2.5 concentration ranges, the prediction performance of the modified XGBoost model was preferable to that of the Lasso and WRF-Chem models. Compared with the WRF-Chem model, the modified XGBoost model increased the R value between predictions and observations by 51.0%, 50.0% and 100.0% in the three concentration ranges (the full range, greater than 50 µg m–3 and greater than 75 µg m–3) and reduced the standard deviation by 16.7 µg m–3, 23.9 µg m–3 and 14.1 µg m–3, respectively. For high PM2.5 concentrations, the modified XGBoost model thus had a stronger corrective ability.

Fig. 6. Taylor plot of the Shanghai PM2.5 24 h forecasts by the WRF-Chem, Lasso and modified XGBoost models (R passes the 95% confidence test).

Analysis of Error Sources of PM2.5 Concentration Forecast

To analyze the error sources of the modified XGBoost model, the RMSE of the hourly forecasts over the following 24 h was calculated (Fig. 7). The RMSE of the modified XGBoost model was lower than that of the WRF-Chem and Lasso regression models at every lead time. The WRF-Chem model had a large RMSE for the 24 h predictions and often produced false peak and valley values; the modified XGBoost model substantially reduced both the RMSE of the 24 h prediction and the false peak and valley errors.

Fig. 7. Variation of the forecast RMSE over time.

Correlations between the modified XGBoost model residual and the observed pollution and meteorological factors were also derived (Table 4). The prediction error of the modified XGBoost model for PM2.5 was negatively correlated with the observed PM2.5 values, with an R value of –0.65; the second largest correlation, –0.35, was with PM10. The correlations between the model prediction residual and the meteorological factors were clearly smaller than those with the pollutants.

Fig. 8 shows the true PM2.5 values, the values predicted by the modified XGBoost model and the prediction error from March 20 to March 30, 2018. The modified XGBoost model successfully predicted the PM2.5 peak on March 23, 2018, although the amplitude of its predictions was smaller than that of the observations. The observed PM2.5 concentration first increased and then declined during this period; the modified XGBoost model captured the overall trend well, but the turning point of the concentration was not forecast accurately. On the one hand, the turning point may have been caused by changes in the actual pollution emissions, for which it is difficult for the model to find regularity in the existing data. On the other hand, because part of the input data of the modified XGBoost model comes from WRF-Chem predictions, such as PM25(WRF_chem), SO2(WRF_chem), O3(WRF_chem), Td_850(WRF_chem), Rhu(WRF_chem) and Z_1000(WRF_chem), the prediction errors of WRF-Chem also reduce the forecasting accuracy of the modified XGBoost model.

Fig. 8. Variation of the modified XGBoost model prediction, the observation and the prediction error with time.
PM2.5 Concentration Prediction Evaluation

In order to quantitatively analyze the prediction effects of the three models, evaluation indexes of their prediction results were calculated; the results for observations greater than 50 µg m–3 are shown in Table 5. The RMSE of the modified XGBoost model was 26.1 µg m–3, about 41% lower than that of the WRF-Chem model, and the R value between the modified XGBoost results and the observations reached 0.6, approximately 50% higher than that of the WRF-Chem model. Among the three models, the difference between the mean value of the Lasso model results and the observations was the largest. For the 75th percentile, the modified XGBoost model was the closest to the observations, while the difference between the Lasso model and the observations was the largest, indicating that the modified XGBoost model was more suitable than the other two models at high PM2.5 concentrations. All three models overestimated the PM2.5 concentration: the ME of the modified XGBoost model was the smallest (20.4 µg m–3) and that of the WRF-Chem model was the largest (34.5 µg m–3), while the MB of the modified XGBoost model was the smallest (3.6 µg m–3) and that of the Lasso model was the largest. In addition, among the mean values and medians predicted by the three models, the modified XGBoost forecast was the closest to the observed values, and the modified XGBoost model also had the smallest median deviation, 25th percentile deviation, 75th percentile deviation, mean deviation/observed ratio and ME/observed ratio. In summary, when the observed value was greater than 50 µg m–3, the modified XGBoost model performed the best of the three models.

To test the monthly forecasting performance of the modified XGBoost model, the monthly averaged forecast results from January 1 to December 31, 2018, were compared with the WRF-Chem forecasts (Fig. 9). The 24–48 h forecasts were selected, and their average was taken as the daily mean concentration. The differences between the PM2.5 concentrations predicted by the modified XGBoost model and the observations were between –4.9 and 2.9 µg m–3, smaller than the differences between the WRF-Chem predictions and the observations (–19.3 to 10.7 µg m–3). The monthly average concentrations predicted by the XGBoost model and the WRF-Chem model were both consistent with the peaks and valleys of the observations. However, the values predicted by the two models in February, May, June, September and November were greater than the observations, and the forecast values for the remaining months were smaller than the observations, especially in January and December (the WRF-Chem and modified XGBoost forecast deviations were –14.8 µg m–3, –4.9 µg m–3, –9.77 µg m–3 and –3.7 µg m–3, respectively). Nevertheless, the modified XGBoost model clearly corrected the WRF-Chem forecasts in all seasons, especially in winter.

Fig. 9. Monthly comparison of the monthly mean PM2.5 concentrations predicted by the XGBoost and WRF-Chem models with the observed values.
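As a small sketch of this aggregation step (assuming a table of hourly forecasts with an initialization time, a lead hour and a PM2.5 column; the names are illustrative), the 24–48 h window can be averaged into daily means and then into monthly means as follows:

```python
import pandas as pd

# Illustrative hourly forecast records; file and column names are assumed.
fc = pd.read_csv("pm25_forecasts.csv", parse_dates=["init_time"])

# Keep the 24-48 h lead times; their mean serves as the daily average
# concentration for the day following each initialization.
day2 = fc[(fc["lead_hour"] >= 24) & (fc["lead_hour"] < 48)]
daily = day2.groupby("init_time")["pm25_forecast"].mean()

# Monthly means of the daily averages, for comparison with observations.
monthly = daily.groupby(daily.index.to_period("M")).mean()
print(monthly)
```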
CONCLUSIONS

We developed a modified XGBoost model that combines WRF-Chem forecasts of pollutant concentrations and meteorological conditions (the most important factors, which represent the spatio-temporal characteristics of pollution and meteorology, are listed in Table 2) with the observed variations of these quantities, thereby significantly improving the accuracy of PM2.5 forecasting in Shanghai, China. All of the comprehensive evaluation indicators, including the R and RMSE values, confirmed that the modified XGBoost model provides more accurate predictions of high PM2.5 concentrations (exceeding the standard of 75 µg m–3) than the WRF-Chem model. The modified model also improved on the monthly forecasts of the WRF-Chem model in every season, especially during heavy winter pollution.

Since our study was restricted to Shanghai, China, the representativeness of the modified XGBoost model and the reliability of our conclusions are limited, and it will be necessary to expand the model's scope of application in future research. Furthermore, applying different machine learning algorithms to PM2.5 prediction in Shanghai and conducting a multi-model ensemble prediction would be useful.

ACKNOWLEDGMENTS

This research was supported by the National Key R&D Program (2016YFC0201900), the Shanghai Natural Resources Fund (19ZR1462100), the National Natural Resources Fund (41475040) and the Shanghai Science and Technology Commission (16DZ120460). We sincerely thank the CMA for providing access to hourly surface data. The authors are grateful for the valuable comments and suggestions of the editor and the two anonymous reviewers, which have helped us improve the quality of the paper.