Mengwei Jia1,3, Xinghong Cheng2, Tianliang Zhao3, Chongzhi Yin3, Xiangzhi Zhang4, Xianghua Wu5, Liming Wang5, Renjian Zhang6
1 Joint International Research Laboratory of Atmospheric and Earth System Sciences, School of Atmospheric Sciences, Nanjing University, Nanjing 210023, China
2 State Key Laboratory of Severe Weather, Institute of Atmospheric Composition, Chinese Academy of Meteorological Sciences, Beijing 100081, China
3 Collaborative Innovation Center on Forecast and Evaluation of Meteorological Disasters, Key Laboratory for Aerosol-Cloud-Precipitation of China Meteorological Administration, Nanjing University of Information Science and Technology, Nanjing 210044, China
4 Jiangsu Provincial Environmental Monitoring Center, Nanjing 210029, China
5 School of Mathematics and Statistics, Nanjing University of Information Science and Technology, Nanjing 210044, China
6 Key Laboratory of Regional Climate-Environment Research for Temperate East Asia, Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing 100029, China
Received: May 29, 2019
Revised: June 5, 2019
Accepted: June 26, 2019
Cite this article:
Jia, M., Cheng, X., Zhao, T., Yin, C., Zhang, X., Wu, X., Wang, L. and Zhang, R. (2019). Regional Air Quality Forecast Using a Machine Learning Method and the WRF Model over the Yangtze River Delta, East China. Aerosol Air Qual. Res. 19: 1602-1613. https://doi.org/10.4209/aaqr.2019.05.0275
ABSTRACT
A statistical air quality forecasting method was established to predict 72 h PM2.5 mass concentrations over the Yangtze River Delta (YRD) region of eastern China. The method combines meteorological elements with high spatiotemporal resolution, simulated by the Weather Research and Forecasting (WRF) model, with a back-propagation (BP) neural network. Short-term statistical forecasts of air quality were produced for 25 major cities in the YRD region and validated against the corresponding surface PM2.5 observations. The results indicate that the short-term forecasting system is able to forecast PM2.5 concentrations accurately in the major cities of the YRD region. The average index of agreement (IA) between PM2.5 forecasts and observations in the four seasons ranges from 74% to 77%, and the root mean square error (RMSE) falls between 15.2 µg m–3 and 33.0 µg m–3. Data with PM2.5 concentrations greater than 115 µg m–3 were selected to establish the EXP-Polluted model, which was then used to predict PM2.5 concentrations during heavy haze periods in 2017. Compared with predictions from the EXP-All Time model, which was constructed from the full-year data, the RMSE of PM2.5 forecasts during severe haze periods improved by 44.1%.
Keywords:
Regional air quality forecast; BP Network; WRF model; heavy haze; Yangtze River Delta
Statistical forecasting of air quality uses statistical principles to build a model relating observed concentrations of atmospheric pollutants to meteorological parameters, and then predicts the future temporal and spatial variations of pollutant concentrations. Compared with numerical forecasting, statistical forecasting methods are simple and easy to implement, and they are commonly used for air quality prediction. Statistical forecasting can be divided into two types: conventional and dynamic-statistical. Conventional statistical forecasting involves building a linear or nonlinear statistical model based on historical measurements of pollutants and various meteorological elements from the previous or the same period. For example, Li et al. (2017) used a statistical model that incorporated historical pollutant data, meteorological elements and time stamps into the training model. He et al. (2016) built a statistical model by analyzing the relationship between concentrations of atmospheric pollutants and several factors, including the potential temperature decline rate, stable energy, Froude number and gradient Richardson number. Constructing meteorological conditions for air pollution and studying the feasibility of index forecasting also belong to the conventional statistical forecasting method (Wang et al., 2013; Chen et al., 2015; Yang et al., 2015). The dynamic-statistical forecasting method combines Chemical Transport Models (CTMs) with linear or nonlinear statistical methods to build a statistical model from the observed or modeled concentrations of pollutants and meteorological elements. Because pollutant forecasts from the dynamic-statistical model can be corrected objectively, the forecasting accuracy is significantly improved.
In recent years, many researchers (Huang et al., 2012; Irina et al., 2015; Cheng et al., 2016) have employed BP neural networks, adaptive partial least squares regression, analog ensemble bias correction, Kalman filter techniques and other statistical methods to correct air quality forecasts, and the systematic error can be remarkably reduced. An Artificial Neural Network (ANN) is a computer modeling method that imitates the way the human brain thinks (Edwards et al., 2004). An ANN can perform real-world pattern recognition, associative memory, optimization calculations, etc. As a reliable statistical method, the ANN has been widely used in environmental monitoring because of its high adaptability and nonlinear mapping capabilities (Al-Alawi et al., 2008; Li et al., 2010, 2011; Kaimian et al., 2018; Perez et al., 2018). ANNs were first used in urban air quality monitoring in the 1990s. Boznar et al. (1993) used an ANN to create a multilayer perceptron (MLP) model to predict SO2 concentrations near thermal power plants. Today, the ANN has become a powerful tool for air quality prediction (Kolehmainen et al., 2001; Jiang et al., 2004; Hooyberghs et al., 2005; Grivas et al., 2006; Fernando et al., 2012). According to the network structure, ANN models can be divided into two types: feedforward and feedback. The most representative feedforward type is the BP network (Edwards et al., 2004), which uses an error back-propagation algorithm. It has strong self-learning, self-adaptation and anti-interference characteristics and is suitable for pattern recognition and classification. In this study, a BP neural network is applied to regional air quality forecasting over the YRD region, east China. The Yangtze River Delta (YRD) is a special region with a well-developed economy, dense population and high degree of urbanization.
Severe haze has occurred in China since 2013 (Chen et al., 2017; Wang et al., 2017b; Zhang et al., 2017), especially in North China and the YRD region (Lang et al., 2017; Wang et al., 2017a). The increase in anthropogenic emissions and unfavorable meteorological conditions have caused frequent regional air pollution and greatly restricted the accuracy of air quality forecasts in this region (Jia et al., 2017; Zhang et al., 2008; Parrish et al., 2009). At present, research on statistical forecasting of air quality in the YRD region is limited to a few cities, or uses only historical meteorological observation data in the statistical model. However, the uneven distribution of meteorological and pollutant data leads to errors in regional air quality prediction. In addition, surface observation data cannot include some vertical information that affects air quality, such as boundary layer height. To reduce these forecast errors, we developed a statistical forecasting model that combines the BP neural network with meteorological elements in the atmospheric boundary layer simulated by the WRF model. The model is then used to predict 72 h PM2.5 concentrations in the YRD region, and the forecasting results are evaluated. This paper is organized as follows: Section 2 describes the method and the data used, including the statistical training model and the selection of meteorological factors; Section 3 presents forecast results for the whole year and for heavy haze episodes, together with an evaluation of forecasting accuracy; and the findings are summarized and discussed in Section 4.
The key to predicting air quality in this paper is to establish the relationship between measured concentrations and meteorological elements with the BP neural network model. The forecasting procedure includes the following steps. First, the environmental and meteorological data are prepared.
Second, the dependent and independent variables are selected. The independent variables taken from the WRF model include the following meteorological elements: temperature (T2) at the height of 2 m, which affects the chemical conversion rate of atmospheric pollutants; specific humidity (Q2) at the height of 2 m, which reflects the moisture absorption of atmospheric pollutants; the zonal component (U10), meridional component (V10) and horizontal wind speed (UV10) of the wind field at the height of 10 m, which represent the transport capacity of the near-surface wind field; boundary layer height (PBLH), which reflects the vertical mixing of the atmosphere; and sea level pressure (P), which reflects atmospheric stability. Third, the average PM2.5 concentration over the previous 24 hours and the meteorological elements selected in the previous step are used to train the model. Last, the PM2.5 concentration of the previous 24 hours and the meteorological elements output by WRF are used as the input to the forecast model to predict the next hour's concentration. In this way, the 72 h short-term forecast cycles of PM2.5 are completed continuously.
Hourly PM2.5 mass concentration data observed in 25 cities in the YRD region from January 1, 2013, to December 31, 2017, are used in this paper. The distribution of the 25 observation sites in the YRD region is shown in Fig. 1. Conventional meteorological observation data cannot meet the requirements of refined regional statistical forecasting because they lack information on the vertical structure of the atmospheric boundary layer. The refined meteorological elements at the above-mentioned 25 sites are therefore simulated by the WRF model and used as the independent variables of the statistical air quality forecasting model. These elements include air temperature, specific humidity, air pressure, wind field and planetary boundary layer height.
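The hour-by-hour forecast cycle described in the steps above can be sketched as follows. This is only an illustration under stated assumptions: the function name `rolling_forecast`, the dictionary layout of the WRF output and the placeholder model are hypothetical, and feeding each forecast back into the 24 h window is one plausible reading of how the 72 h cycle is completed, not a detail confirmed by the text.

```python
from typing import Callable, Dict, List, Sequence

# WRF predictor names as listed in the text; the dictionary layout is an assumption.
MET_VARS = ("T2", "Q2", "U10", "V10", "UV10", "PBLH", "P")


def rolling_forecast(
    model: Callable[[Sequence[float]], float],
    pm25_history: Sequence[float],
    wrf_met: Sequence[Dict[str, float]],
    horizon: int = 72,
    window: int = 24,
) -> List[float]:
    """Predict hourly PM2.5 for `horizon` hours, one hour at a time."""
    if len(pm25_history) < window:
        raise ValueError("need at least `window` hours of PM2.5 history")
    if len(wrf_met) < horizon:
        raise ValueError("need WRF output for every forecast hour")

    series = list(pm25_history[-window:])
    forecasts: List[float] = []
    for hour in range(horizon):
        # Predictor 1: mean PM2.5 over the preceding 24 h (observed or already forecast).
        prev_mean = sum(series[-window:]) / window
        # Remaining predictors: WRF meteorological elements valid at the target hour.
        features = [prev_mean] + [wrf_met[hour][name] for name in MET_VARS]
        value = model(features)
        forecasts.append(value)
        # Assumption: the forecast is fed back so that the 24 h window keeps rolling.
        series.append(value)
    return forecasts


# Toy usage with a placeholder model that simply returns the 24 h mean.
history = [50.0] * 24
met = [{name: 0.0 for name in MET_VARS} for _ in range(72)]
result = rolling_forecast(lambda f: f[0], history, met)
print(len(result), result[0], result[-1])
```

In practice the `model` argument would be the trained network, and the predictors would be scaled in the same way as during training.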
The WRF simulation covers the YRD region at a spatial resolution of 10 km × 10 km with 33 vertical layers of varying thickness (between the surface and 10 hPa), using the double-nested simulation technique (Fig. 1) and a temporal output interval of 1 h. The WRF simulations were driven by the FNL/NCEP analysis data, available every 6 h at a spatial resolution of 1° × 1°. The simulation was conducted from 00:00 UTC on January 1, 2013, to 23:00 UTC on December 31, 2017, with a 12-h spin-up time. The following parameterization schemes of physical processes within the WRF model were used in this study: the WRF single-moment 3-class microphysics scheme (Hong et al., 2004), the RRTM longwave (Mlawer et al., 1997) and Dudhia shortwave radiation (Dudhia, 1989) schemes, the revised Monin-Obukhov surface layer scheme (Jimenez et al., 2012), the Kain-Fritsch convective parameterization scheme (Fritsch and Kain, 1993), the thermal diffusion land-surface scheme (Dudhia, 1996), and the Yonsei University (YSU) atmospheric boundary layer scheme (Hong, 2006).
Some studies (Borge et al., 2008; Banks et al., 2016) have shown that the WRF model performs well over plain areas such as the Yangtze River Delta, and that the simulated meteorological fields are basically consistent with the observations. The refined meteorological fields simulated by the mesoscale model were evaluated. Fig. 2 compares air temperature and relative humidity at the height of 2 m and wind speed at the height of 10 m simulated by the WRF model with observations at Shanghai and Nanjing for four months (February, May, August and November). In general, the temporal variations of simulated air temperature, humidity and wind speed are basically consistent with the observations, and the simulated values are close to the observed ones, except for a slight underestimation of air temperature in August and November and an overestimation of wind speed in February.
The stability and accuracy of statistical air quality forecasting depend on the selection of forecasting factors. The meteorological parameters determining atmospheric physical and chemical processes mainly include wind, temperature, humidity, air pressure, boundary layer height, etc. (Giorgi et al., 2007; Zhang et al., 2014; Jia et al., 2016). Significance tests were performed for the correlation coefficients between the hourly meteorological elements and PM2.5 observations during 2013–2016 in the four seasons (Table 1). We select the following meteorological elements, which correlate well with PM2.5, as forecasting factors: wind speed at the height of 10 m, air temperature and specific humidity at the height of 2 m, meridional and zonal wind components at the height of 10 m, sea level pressure and planetary boundary layer height. In addition, to account for the emission and cumulative transport of atmospheric pollutants, the PM2.5 concentrations of the previous 24 h are used as a prediction factor for the next 72 hours' forecasts.
The topology of the BP neural network model consists of input, hidden and output layers, each with a number of nodes (Fig. 3). A complete node contains an adder and an activation function. The adder is a linear combination of the node's inputs:
$$ S_j = \sum_{i} W_{ij} X_i + \theta_j $$
where S_j is the adder output, i indexes the independent variables, j indexes the dependent variables, X is the input vector, which includes the PM2.5 measurements and simulated meteorological elements, W is the weight, and θ is the deviation.
In this study, the neural network toolbox in MATLAB is used to build the statistical model. A single hidden layer structure is selected. The trainscg and purelin functions are utilized as the transfer function of the hidden and output layers. The network is trained with different numbers of nodes in the hidden layer, and 12 nodes are selected because they give the minimum system error.
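To make the network structure above concrete, the following sketch implements a single-hidden-layer network with 12 hidden nodes and a linear output node, trained by error back-propagation. It is not the authors' MATLAB toolbox configuration: the class name `BPNetwork`, the tanh hidden activation, the plain gradient-descent update, the learning rate and the synthetic training data are all assumptions chosen to keep the example small and dependency-free.

```python
import math
import random
from typing import List, Sequence, Tuple


class BPNetwork:
    """Single hidden layer, linear output, trained by error back-propagation."""

    def __init__(self, n_in: int, n_hidden: int = 12, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x: Sequence[float]) -> List[float]:
        # Adder (weighted sum plus bias) followed by an assumed tanh activation.
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x: Sequence[float]) -> float:
        h = self._hidden(x)
        return sum(w * hj for w, hj in zip(self.w2, h)) + self.b2

    def train_step(self, x: Sequence[float], target: float, lr: float = 0.05) -> float:
        """One gradient-descent update on a single sample; returns its loss."""
        h = self._hidden(x)
        err = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2 - target
        for j, hj in enumerate(h):
            # Error propagated back through the output weight and tanh derivative.
            delta = err * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * err * hj
            self.b1[j] -= lr * delta
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * delta * xi
        self.b2 -= lr * err
        return 0.5 * err * err


def mean_squared_error(net: BPNetwork, data: Sequence[Tuple[List[float], float]]) -> float:
    return sum((net.predict(x) - t) ** 2 for x, t in data) / len(data)


# Synthetic data: 8 scaled predictors (previous PM2.5 plus 7 WRF elements)
# and a made-up target, used only to show that training reduces the error.
rng = random.Random(1)
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(200)]
data = [(x, 0.5 * x[0] - 0.3 * x[1]) for x in inputs]

net = BPNetwork(n_in=8)
mse_before = mean_squared_error(net, data)
for _ in range(30):
    for x, t in data:
        net.train_step(x, t)
mse_after = mean_squared_error(net, data)
print(mse_before, mse_after)
```

Choosing the number of hidden nodes, as described in the text, would amount to repeating this training for several values of `n_hidden` and keeping the one with the smallest error on held-out data.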
After training is finished, the model is used to forecast 72 h PM2.5 concentrations. The root mean square error (RMSE), mean absolute error (MAE), correlation coefficient (r) and accuracy rate are employed to evaluate the accuracy of prediction. The RMSE represents the dispersion between forecasts and observations, and the MAE represents the forecasting errors. The index of agreement (IA) (Willmott, 1981; Feng et al., 2015; Oprea et al., 2016) is also used to evaluate the forecasting accuracy. The IA formula is as follows:
$$ \mathrm{IA} = 1 - \frac{\sum_{i=1}^{n} (F_i - O_i)^2}{\sum_{i=1}^{n} \left( |F_i - \bar{O}| + |O_i - \bar{O}| \right)^2} $$
where Fi and Oi are the forecast and observation for day i, respectively; n is the number of samples; and O̅ is the average observation for all days.
The statistical forecast model is used to forecast the 72 h PM2.5 concentrations in 25 cities over the YRD region in 2017. The 25 cities are Shanghai, Nanjing, Suzhou, Nantong, Lianyungang, Xuzhou, Yangzhou, Wuxi, Changzhou, Zhenjiang, Jiangsu Taizhou, Huai'an, Yancheng, Suqian, Hangzhou, Ningbo, Wenzhou, Shaoxing, Huzhou, Jiaxing, Zhejiang Taizhou, Zhoushan, Jinhua, Quzhou and Lishui. Table 2 shows the seasonal average IA and RMSE of PM2.5 forecasts in the 25 cities over the YRD region in 2017. The average IAs in the four seasons range from 74% to 76%, and the RMSEs fall between 15.2 µg m–3 and 33.0 µg m–3. The IA for PM2.5 is lowest in spring, followed by autumn, and highest in winter. Overall, the seasonal average PM2.5 forecasts are consistent with observations, and forecasting errors are small for the whole year. In addition, the agreement between PM2.5 forecasts and observations is good in winter, but forecasting errors are larger than in other seasons because heavy haze occurs frequently.
The monthly variation of the normalized RMSE of PM2.5 forecasts is shown in Fig. 4. Yellow, light gray and dark gray represent the normalized RMSE for the first day (day1), the second day (day2) and the third day (day3), respectively.
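The evaluation statistics used in this section can be computed as in the sketch below. RMSE and MAE follow their standard definitions, and the IA follows Willmott (1981); the function names and the small sample series are illustrative assumptions, not values from the paper.

```python
import math
from typing import Sequence


def rmse(forecast: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean square error between paired forecasts and observations."""
    n = len(observed)
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(forecast, observed)) / n)


def mae(forecast: Sequence[float], observed: Sequence[float]) -> float:
    """Mean absolute error between paired forecasts and observations."""
    n = len(observed)
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / n


def index_of_agreement(forecast: Sequence[float], observed: Sequence[float]) -> float:
    """Willmott's index of agreement; 1 means perfect agreement."""
    mean_obs = sum(observed) / len(observed)
    numerator = sum((f - o) ** 2 for f, o in zip(forecast, observed))
    denominator = sum(
        (abs(f - mean_obs) + abs(o - mean_obs)) ** 2
        for f, o in zip(forecast, observed)
    )
    if denominator == 0.0:
        raise ValueError("IA is undefined when all values equal the observed mean")
    return 1.0 - numerator / denominator


# Made-up hourly values in µg m-3, for illustration only.
obs = [10.0, 20.0, 30.0]
fc = [12.0, 18.0, 33.0]
print(rmse(fc, obs), mae(fc, obs), index_of_agreement(fc, obs))
```

The IA is reported in the text as a percentage, which corresponds to multiplying the value returned here by 100.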
The normalized RMSEs of the first and second days are close, at about 0.38. In most months, the RMSE of day1 is slightly lower than that of day2, while the RMSE of day3 is significantly higher than that of the first two days. The increase of RMSE with forecast time is related to the forecasting deviations of meteorological elements at different lead times. Because the synoptic system is unpredictable to some degree and the accuracy of weather forecasts is strongly affected by perturbations in the initial field, errors in the meteorological elements of the WRF initial field grow with forecast duration. In general, therefore, the forecasting accuracy of meteorological elements on the first day is higher than on the second and third days, and the 24 h forecasting accuracy of PM2.5 is higher than that of the 48 h and 72 h forecasts. The following sections focus mainly on the forecast accuracy for day1.
Fig. 5 shows the monthly variation of the correlation coefficients and normalized RMSE of PM2.5 forecasts on the first day for 25 cities over the YRD region. We calculate the correlation coefficient between the hourly PM2.5 forecasts of day1 and the observations in 25 cities for every month in 2017, and conduct a significance test. The monthly average correlation coefficients for day1 range from 0.40 to 0.59, and all results pass the significance test at the 99% level. The correlation coefficients in January and December are higher than in other months. The normalized RMSE for day1 is lowest in spring, at 0.360, and higher in winter. Thus the winter forecasts correlate well with observations, but the PM2.5 forecasting errors are larger in winter over the Yangtze River Delta. This may be related to the impact of emission sources in surrounding areas.
Influenced by topography, transport and emission, the spatial distribution of PM2.5 forecasts in the YRD region varies significantly. Fig. 6 shows the spatial distribution of the annual average of PM2.5 forecasts and observations for the first day in 25 cities over the YRD region in 2017. The spatial distributions of PM2.5 forecasts and observations are very consistent: both are higher in the north and lower in the south, and both are lower on the coast than inland. The annual averages of PM2.5 forecasts and observations are both largest in Xuzhou, at over 65 µg m–3, and smallest in Ningbo, at less than 30 µg m–3. PM2.5 forecasts at some cities in the northern and central parts of the YRD region are overestimated, which may be caused by transport from surrounding high-emission areas and by forecasting bias in the meteorological elements. Fig. 2 shows that the simulated wind speed is overestimated in winter, spring and autumn. Under the influence of the winter monsoon, air pollutants from high-emission areas such as the North China Plain may be transported to the northern and central parts of the YRD region.
The training model in the above section is established with all-time PM2.5 observations and forecast meteorological elements (hereinafter referred to as the EXP-All Time model). The extremely static and stable weather in eastern China leads to frequent occurrences of heavy haze with high PM2.5 concentrations. Severe haze is difficult to forecast, and both CTMs and statistical models perform poorly in predicting it. In this section, we develop a statistical forecasting model for heavy pollution processes to improve the forecasts during heavy haze. The meteorological elements are classified according to pollution grades (HJ 633-2012, Technical regulation on ambient air quality index). Specifically, forecast meteorological data are selected for the training set when the corresponding PM2.5 concentration is higher than 115 µg m–3.
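The sample selection just described can be sketched as a simple threshold filter. The 115 µg m–3 threshold comes from the text; the `Sample` layout, the function name `split_training_sets` and the toy records are hypothetical, and the sketch assumes that the threshold is applied to the observed PM2.5 value paired with each set of predictors.

```python
from typing import List, Sequence, Tuple

# Threshold in µg m-3 taken from the text.
HEAVY_POLLUTION_THRESHOLD = 115.0

# Assumed sample layout: (predictor values, observed PM2.5 concentration).
Sample = Tuple[List[float], float]


def split_training_sets(
    samples: Sequence[Sample],
    threshold: float = HEAVY_POLLUTION_THRESHOLD,
) -> Tuple[List[Sample], List[Sample]]:
    """Return the full training set and the heavy-pollution subset."""
    all_time = list(samples)
    # Keep only samples whose observed PM2.5 exceeds the threshold.
    polluted = [s for s in all_time if s[1] > threshold]
    return all_time, polluted


# Toy records with made-up predictor values.
records: List[Sample] = [
    ([0.1, 0.2], 40.0),
    ([0.3, 0.1], 115.0),
    ([0.5, 0.9], 160.0),
    ([0.7, 0.4], 246.0),
]
all_time_set, polluted_set = split_training_sets(records)
print(len(all_time_set), len(polluted_set))
```

Each subset would then be used to train its own network, giving the two models that are compared in this section.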
The neural network model constructed with these selected data is called the heavy pollution model (EXP-Polluted model). In this study, the two types of data from 2013 to 2016 are used as the training sets, and the data in 2017 are used as the forecasting set. The PM2.5 concentrations during all pollution events in 2017 are predicted, and the results are compared with those of the EXP-All Time model.
Fig. 7 shows the RMSE values of hourly PM2.5 forecasts using the EXP-All Time and EXP-Polluted models. Forecasting errors in the 25 cities are smaller with the EXP-Polluted model than with the EXP-All Time model, averaging 33.0 µg m–3 and 59.1 µg m–3, respectively, a reduction of 44.1% for EXP-Polluted. The RMSE values of hourly PM2.5 forecasts in all 25 cities using the EXP-Polluted model fall between 20 µg m–3 and 40 µg m–3. Fig. 8 presents scatter plots of hourly PM2.5 forecasts from the two models against observations in Hangzhou, Nanjing and Shanghai in 2017. PM2.5 forecasts using the EXP-All Time model are obviously underestimated. Compared with the EXP-All Time model, the RMSE values of PM2.5 forecasts using the EXP-Polluted model in Hangzhou, Nanjing and Shanghai decrease by 21.2 µg m–3, 19.9 µg m–3 and 34.3 µg m–3, respectively; the MAE decreases by 17.9 µg m–3, 22.4 µg m–3 and 27.3 µg m–3; and the IA increases by 5.0%, 4.4% and 12.1%. Of the three cities, the improvement in Shanghai is the most significant.
Fig. 9 shows the probability distribution function of forecasting deviations using the EXP-All Time and EXP-Polluted models in Hangzhou, Nanjing and Shanghai. Most forecasting biases in the three cities fall between –150 µg m–3 and 0 µg m–3 with the EXP-All Time model, but range from –50 µg m–3 to 50 µg m–3 with the EXP-Polluted model.
The PM2.5 forecasting biases corresponding to the maximum relative frequency in Hangzhou, Nanjing and Shanghai are –39.3 µg m–3, –39.1 µg m–3 and –48.0 µg m–3 with the EXP-All Time model, and 3.4 µg m–3, 6.9 µg m–3 and 20.9 µg m–3 with the EXP-Polluted model, respectively. Compared with the EXP-All Time model, the bias at the maximum relative frequency in Hangzhou, Nanjing and Shanghai using the EXP-Polluted model increases by 42.5 µg m–3, 46.0 µg m–3 and 68.9 µg m–3. This indicates that the forecast error of the EXP-All Time model is far greater than that of the EXP-Polluted model, and that selecting the key meteorological elements is very important for forecasting air quality accurately during heavy haze.
A heavy haze episode took place in Nanjing from December 23 to December 24, 2017, with the peak PM2.5 concentration reaching 246 µg m–3. Fig. 10 shows the temporal variation of PM2.5 forecasts using the EXP-All Time and EXP-Polluted models together with the observations. The temporal variations of PM2.5 forecasts from the two models are similar to the observations. Compared with the EXP-All Time model, the RMSE of PM2.5 forecasts using the EXP-Polluted model decreases from 59.1 µg m–3 to 38.2 µg m–3 and the MAE decreases from 56.0 µg m–3 to 32.2 µg m–3. The forecasts using the EXP-Polluted model are significantly improved. Forecast errors of PM2.5 at the peak moments are still large, which is related to the impacts of emission and transport from surrounding areas.
In this study, a statistical forecasting model using a BP neural network and the WRF model is developed to predict 72 h PM2.5 concentrations in 25 cities over the YRD region in 2017. The 72 h forecasts of refined meteorological elements in the atmospheric boundary layer are produced by the WRF model and used to improve the meteorological conditions in the statistical air quality forecasting model. The forecasts for the whole year and for heavy haze processes are then evaluated using surface measurements in 25 major cities.
The results show that the BP neural network performs well at predicting PM2.5 concentrations over the YRD region, east China. The average IAs for PM2.5 forecasts in the four seasons range from 74% to 77%. In the case of heavy haze, the EXP-All Time model performs poorly, while the EXP-Polluted model still performs well in forecasting PM2.5. The prediction error of the EXP-All Time model is far greater than that of the EXP-Polluted model, and selecting the key meteorological elements is very important for forecasting air quality accurately during heavy haze.
Further investigations are needed to improve the spatial distribution of PM2.5 forecasts at some cities in the northern and central parts of the YRD region and the PM2.5 forecasts at peak moments. We intend to employ a data assimilation method to improve the forecasting of meteorological elements, and an emission-constrained method to improve the emission inventory, which is the key input to the CTMs. We will combine the meteorological and chemical forecasts from the CTMs with surface and upper-level measurements of PM2.5 to improve the statistical forecasting model by considering the impacts of emission. In addition, in future work we will improve the training model by using more data from a given city and from other sites to account for the transport of air pollutants, and we will use different models in combination with synoptic process analysis to improve the forecasts during transitional synoptic processes.
This study was jointly funded by the National Natural Science Foundation of China (41830965; 91744209) and the National Key R & D Program Pilot Projects of China (2016YFC0203304). The authors acknowledge Jiangsu Provincial Environmental Monitoring Center for providing surface PM2.5 observation data. The authors would also like to thank the anonymous reviewers for their constructive suggestions and comments.
INTRODUCTION
DATA AND METHOD
Observation Data and Simulation Data
Fig. 1. Model domain (left) and the locations of the simulation sites in the YRD, China (right).
Fig. 2. Comparison of 2 m temperature, 2 m relative humidity and 10 m wind speed between WRF simulations and observations in (a) winter, (b) spring, (c) summer and (d) autumn in the YRD region.
BP Neural Network Methodology and Statistical Indicators
Fig. 3. Schematic diagram of the BP neural network.
RESULTS
Validation of One-year Forecasts
Fig. 4. Normalized RMSE of PM2.5 forecasts for day1 (the first day, yellow bars), day2 (the second day, light gray bars) and day3 (the third day, dark gray bars).
Fig. 5. Monthly variations of correlation coefficient (red boxes) and normalized RMSE (gray line) for PM2.5 forecasts on day1 (the first day).
Fig. 6. Spatial distributions of the annual average of observation (a) and forecast (b) for day1 (the first day) in 25 cities over the YRD region in 2017.
Validation for Heavy Haze
Fig. 7. The RMSE values of 25 cities for EXP-All Time and EXP-Polluted. Red is EXP-All Time, and gray is EXP-Polluted. The dotted lines are the average values of the 2 experiments.
Fig. 8. PM2.5 forecast results for heavy pollution in (a) Hangzhou, (b) Nanjing and (c) Shanghai. The red and gray dots respectively denote forecasts using the EXP-Polluted and EXP-All Time model.
Fig. 9. Probability distribution function of forecasting bias using the EXP-All Time and EXP-Polluted model in (a) Hangzhou, (b) Nanjing and (c) Shanghai.
Fig. 10. Time series of PM2.5 forecasts using the EXP-All Time model (solid line) and the EXP-Polluted model (dash-dot line), and observations (yellow bars), from December 23 to December 24, 2017, in Nanjing.
CONCLUSION AND DISCUSSION
ACKNOWLEDGEMENT