Prediction of Potentially High PM2.5 Concentrations in Chengdu, China

Daily exposure to high ambient PM2.5 increases the mortality rate and contributes significantly to the burden of disease. In basin-situated cities with high local emissions of air pollutants, meteorological conditions play a crucial role in forming air pollution. One such city is Chengdu, which is located in the Sichuan Basin and serves as the economic, educational, and transportation hub of western China. Particulate matter with an aerodynamic diameter of < 2.5 μm (PM2.5) is the most critical pollutant in this city. Although the annually averaged PM2.5 concentrations declined from 92 to 57 μg m–3 between 2013 and 2017, the city still suffers from haze and smog, with 85 days during 2017 displaying 24-h PM2.5 concentrations > 75 μg m–3. To better understand the influence of meteorological factors on PM2.5 pollution with the goal of easily and reliably predicting the latter, we examined the relationships between the 24-h concentration and a variety of meteorological parameters in Chengdu. We found that the strongest predictors of the PM2.5 concentration were the temperature, precipitation, wind speed, and trajectory direction and distance. Furthermore, although the same-day sea-level pressure (SLP) was a weak predictor, the SLP 5 days in advance performed better. We developed generalized additive models (GAMs) that predicted the PM2.5 concentration as a function of multiple meteorological parameters. One of the GAMs developed in this study exhibited an adjusted correlation coefficient (R2) of 0.73 and captured up to 73.9% of the variance in the daily averaged PM2.5 concentrations. The model performance was improved by using the ΔSLP (i.e., mean pressure difference) for 5 days instead of the SLP, suggesting that ΔSLP5d is a good predictor of high concentration days in Chengdu. This study provides a useful tool for controlling emissions in advance to prevent heavy pollution days and issuing outdoor activity warnings to protect public health.


INTRODUCTION
With rapid economic growth and concomitant increase in anthropogenic emissions during the recent decades, PM 2.5 (i.e., particulate matter (PM) with an aerodynamic diameter of < 2.5 µm) pollution has become a great environmental and health challenge in China (Miao et al., 2019;Zeng et al., 2019). PM 2.5 not only harms human health but also influences climate (Di et al., 2009;Mahowald, 2011;Tao et al., 2014;Bo et al., 2015). In 2017, nearly 3 million deaths worldwide were attributed to PM 2.5 exposure (HEI, 2019). China and India lead the world in the disease burden attributable to PM 2.5 (Lelieveld et al., 2015;Liu et al., 2016;Cohen et al., 2017;Liu et al., 2017). Short-term exposure to high PM 2.5 concentrations is associated with specific health effects, such as respiratory and cardiovascular diseases, and increased mortality (Hoek et al., 2013;Chen et al., 2018). Children, females, the elderly (> 65 years old), and those people with pre-existing health conditions are most sensitive to PM 2.5 pollution (Luo et al., 2018;Lei et al., 2019). In general, the effects of PM 2.5 on daily respiratory and non-accidental deaths are more evident in the cold season than other seasons in China Singh et al., 2019).
In addition to emissions, topography and meteorology also play a significant role in temporal variations in regional air quality (Boznar et al., 1993;Palau et al., 2005;Saide et al., 2011;Hu et al., 2014;Gustin et al., 2015;Miao et al., 2015). Basin landform facilitates the accumulation of air pollutants to high concentrations (Miao et al., 2019). As the most populous basin in China, with a population of over 83 million in 2018 (Sichuan Statistics Bureau, 2018), the Sichuan Basin (SCB) has been identified as one of four most air-polluted regions in China, along with the Beijing-Tianjin-Hebei region, the Yangtze River Delta, and the Pearl River Delta (Lin et al., 2012;Tian et al., 2017). As one of the two megacities in the basin, Chengdu is located in the western boundary of the SCB (Fig. 1), with a population of over 16 million, over 4 million motor vehicles, and intensive industrial and agricultural activities, leading to high anthropogenic emissions . To the west and east of Chengdu are the Longmen-Qionglai Mountains and the Longquan Mountains, respectively. Climate in Chengdu is subtropical humid monsoon climate, with a rainy season from June to August. Previous studies have investigated the impacts of meteorological parameters on air pollution, including ambient air temperature, air pressure, precipitation, winds, and boundary layer structure (Li et al., 2015;Zeng and Zhang, 2017;Miao et al., 2018). Many studies have revealed that PM 2.5 pollution is highest in winter in Chengdu due to unfavorable meteorological conditions for air pollutant to disperse and deposit, such as shallow mixing layer, high relative humidity, and strong thermal inversion (Tao et al., 2013;Miao et al., 2018Miao et al., , 2019. Although the critical role of meteorology in PM pollution in Chengdu has been recognized, a comprehensive analysis of the overall relationship between meteorology and PM 2.5 concentration has not been conducted. There are three major types of methods to forecast air quality, including statistical models (e.g., multiple linear regression (Ul-Saufie et al., 2012;Ng and Awang, 2018), artificial neural networks (Hooyberghs et al., 2005;Grivas and Chaloulakou, 2006;Wang et al., 2014), support vector machines (Lu and Wang, 2005;Wang et al., 2014), hidden Markov models (Sun et al., 2013a, b), grey system model (Lee et al., 2007;Pai et al., 2013), chemical transport models (CTMs; e.g., WRF-Chem and Community Multiscale Air Quality [CMAQ]) (Ryu et al., 2018;Wilkins et al., 2018;Goldberg et al., 2019;Yang et al., 2019), and a combination of statistical models and CTMs (Carmichael et al., 2008;Wang et al., 2016;Xue et al., 2019). Compared with statistical models, CTMs are much more complex and time-consuming and depend on high computing capability and a lot of input data, such as emissions inventories, topography, land use and cover, and meteorological simulations. The performance of CTMs is greatly influenced by the aforementioned input data and the chemical mechanisms used . Statistical models directly fit the patterns derived from the observed data, which are well suited for quantifying the nature of pollutant response to meteorological parameters, and are simpler and might be more accurate compared with chemical transport air quality models (Schlink et al., 2006;Sun et al., 2013;Zhang et al., 2015). The generalized additive model (GAM) is a combination of a generalized linear model and an additive model (Hastie and Tibshirani, 1990) and has been widely used to explore the impacts of meteorological parameters on air quality (Aldrin and Haff, 2005;Asimakopoulos et al., 2012;Tian et al., 2014;Gong et al., 2017). The GAM can be used to understand the mechanisms that cause atmospheric pollution and to forecast air pollutants' concentrations (Schlink et al., 2006;Hübnerová and Michálek, 2014).
In this study, a GAM was used to understand the relationships between 24-h PM 2.5 concentrations and critical meteorological factors for the five years from January 2013 to December 2017 in Chengdu. It is important to understand the mechanisms of air pollution formation before we can provide recommendations for government to effectively control emission and to release early warnings for the public. The main objectives of this study are: (1) to find the key  SI Table S1. meteorological factors affecting PM 2.5 concentrations by using GAM and (2) to investigate whether the GAMs built with meteorological data can be used to predict 24-h PM 2.5 concentrations.

Data Sources and Variables
Hourly PM 2.5 concentrations from January 2013 to December 2017 were obtained from seven national air quality monitoring stations in the urban areas of Chengdu (via the China National Environmental Monitoring Centre: http://www.cnemc.cn/) and one monitoring site at the United States General Consulate in Chengdu. For data quality control, we used an automatic outlier detection method, which was based on the probability of residuals to exclude outliers due to power outages, instrument calibration, and/or other reasons (Wu et al., 2018). The results of the Pearson correlation analysis show that hourly PM 2.5 concentrations at the eight sites were significantly related, with R 2 values ranging from 0.892 to 0.971 and slopes close to 1 (Table S1). Therefore, the hourly PM 2.5 concentrations measured at the eight stations were averaged to calculate citywide daily PM 2.5 concentrations.
Standard hourly surface meteorological data (including air temperature, relative humidity, precipitation, sea-level pressure, and wind speed and direction) measured at the national meteorological station (Wenjiang Station: 30°44′N, 103°51′E, 547.7 m above sea level [a.s.l.]) were obtained from the China Meteorological Administration. Radiosonde meteorological data for Wenjiang Station at 8 a.m. and 8 p.m. local time (LT) were obtained from the website of the University of Wyoming, the United States of America (http://weather.uwyo.edu/upperair/sounding.html). The Hybrid Single-Particle Lagrangian Integrated Trajectory (HYSPLIT) model (version 4.9) was run for each day using the Global Data Assimilation System (GDAS) data (1° × 1°) to calculate 12-h and 24-h backward trajectories, respectively. The trajectories were initialized for times of 10 a.m. and 10 p.m. each day. Detailed descriptions of all the meteorological variables used in this study are provided in Table S2.

GAM Method and Model Development
GAMs are an expanded form of the generalized linear model (GLM). Instead of linear coefficients, GAMs involve a combination of flexible smooth spline functions of the covariates. The advantage of GAMs is that they do not need pre-set parameters and they can automatically select appropriate polynomial fits. GAM techniques have been previously applied to explore complex non-linear correlations with air pollution (Dominici et al., 2002;Camalier et al., 2007;Pearce et al., 2011;Bertaccini et al., 2012;Gong et al., 2017;Gong et al., 2018). We used the package mgcv in the R environment to develop a GAM as implemented by the gam() function (Wood, 2006): where y i is the i th concentration of a given air pollutant, β 0 is the overall mean of the response, s j (x ij ) is the smooth function of i th value of covariate j, n is the total number of covariates, and ε i is the i th residual. The smoothness of each function in the equation is controlled by the number of knots or the effective number of degrees of freedom. As the number of knots increases, the observed PM 2.5 concentrations fit better. However, it will become less smooth and effectively over-fit the data. The default method of cross-validation was used to choose the degree of smoothing (penalization) of k = 10 dimensional regression splines (Wood, 2006). In order to select the key parameters out of a large number of parameters, the Akaike Information Criteria (AIC) was evaluated for each model specification. A variable remained in the final model if the fit yielded a lower AIC (Wood, 2006;Pearce et al., 2011). P-values could approximate confidence for inclusion in GAM, and F-values could be used to identify the strongest meteorological predictors. We chose year, month, and trajectory direction and distance as the basic parameters for the model, then added the other variables one by one to obtain the models with a lower AIC. The trajectory direction and distances on a single day were calculated from 12-h or 24-h LT between the starting and end point. We used logtransformed PM 2.5 concentrations in the model to reflect the log-normal distribution of PM 2.5 data. The model used for 24-h PM 2.5 in Chengdu can be described as: where Year and Mon are year and month, respectively. TEM, TMAX, and TMIN are daily average, maximum, and minimum of air temperature, respectively. RHMIN is daily minimum relative humidity. WS and WD are 24-h wind speed and wind direction. PRC is daily total precipitation. SLP 5d is SLP 5 days earlier. WD8 is wind direction in lowest 1000 m height at 8 a.m. T1Km8 is air temperature in the lowest 1000 m at 8 a.m. RH1Km8 is relative humidity in the lowest 1000 m height at 8 a.m. MLTheta8 is mean mixed layer potential temperature in K at 8 a.m. MLMR20 is mean mixed layer mixing ratio in g kg -1 at 8 p.m. Tradist10a and Tradist22a are the distances between end points (point to point) after 12 hours of transport for a back trajectory initialized at 10 a.m. and 10 p.m., respectively. The R gam.check function was used to assess the GAM's performance. Specifically, the Q-Q plots (sample quantiles against predicted quantiles), scatter plots (residuals against linear predictor), histogram of the residuals, and scatter plots (response against fitted values) were generated. In addition, we randomly selected 10% of the datasets as a crossvalidation dataset and 90% of the data as a training set to build the GAM model, and did model prediction and evaluated model performance.

Inter-annual Changes
The PM 2.5 concentrations and the air quality levels (AQLs) based only on PM 2.5 concentrations in each year from 2013 to 2017 are summarized in Table 1. The annual average PM 2.5 concentrations declined from 92 to 57 µg m -3 from 2013 to 2017. PM 2.5 pollution was the heaviest in 2013, with 24-h PM 2.5 concentrations > 250 µg m -3 (AQL: Grade VI) and > 150-250 µg m -3 (Grade V) on 7 and 49 days, respectively. The number of days with PM 2.5 concentrations < 75 µg m -3 (Grade I and Grade II) increased between 2013 and 2017, but heavy PM 2.5 pollution days (24-h PM 2.5 > 115 µg m -3 ) gradually decreased. The decreased annual average PM 2.5 concentrations reflect the great achievement in air pollution mitigation through relevant policies, regulations, and technologies (Jin et al., 2016;Zeng et al., 2019). However, 24-h PM 2.5 concentrations still exceeded the national ambient air quality standard (NAAQS) (Grade II: 75 µg m -3 ) on 85 days in 2017, thus more efforts should be made to further prevent PM 2.5 pollution in Chengdu.

Seasonal and Diurnal Variations
The diurnal patterns of PM 2.5 concentrations during 2013-2017 in each season are presented in Fig. 2. The hourly averages of PM 2.5 concentrations in the four seasons were in the order: winter (99-126 µg m -3 ) > spring (54-78 µg m -3 ) > autumn (50-66 µg m -3 ) > summer (37-50 µg m -3 ). The diurnal patterns were similar in the four seasons: PM 2.5 concentrations peaked at 10-11 a.m., about 2-3 hours after the morning traffic rush hour. Then, PM 2.5 concentrations decreased and reached a daily minimum at 5-6 p.m. The diurnal patterns probably reflect the impacts from inversion layer and air pollutant emissions. From midnight to 9-11 a.m., the city is affected by an inversion layer, which inhibits air pollutant dispersion. Thus, PM 2.5 concentrations increase slightly during this period. Also, the morning traffic rush contributes to PM 2.5 concentrations. As the air temperature increases, the inversion layer disappears at 10-11 a.m., thus PM 2.5 concentrations start to decrease until 5-6 p.m. In parallel with the formation of a new inversion layer in the evening, air pollutants from the afternoon traffic rush around 6 p.m. and other sources cause PM 2.5 concentration to increase from 6 p.m. (Tang et al., 2015;Tang et al., 2016).  Fig. S1 presents the correlation of 24-h PM 2.5 concentrations with the effect of each meteorological factor. We found that all the selected meteorological factors have a non-linear relationship with PM 2.5 concentrations. Among the 24 meteorological parameters, air temperature, wind speed, relative humidity, precipitation, and ΔSLP (see definition below), mean mixed layer potential temperature, and trajectory direction and distance are good predictors for PM 2.5 in Chengdu, with F-values between 3.533 and 210.939, and P-values < 0.01 in the model fit. PM 2.5 concentrations were positively correlated with daily maximum temperature and the radiosonde air temperature in the bottom layer (< 1,000 m) at 8 a.m. PM 2.5 concentrations were negatively correlated with air temperature, daily minimum relative humidity, wind speed, and precipitation. High wind speed and high daily average temperature favor the dispersion of air pollutants, and high relative humidity and precipitation facilitate PM 2.5 deposition (Fang et al., 2017;Zhao et al., 2018;Lin et al., 2019).

Correlation of 24-h PM2.5 with Meteorological Factors
We performed a correlation analysis on the relationship between sea-level pressure and PM 2.5 concentration (Fig. 3). The results showed that the correlation coefficient between PM 2.5 and the sea-level pressure on the same day is 0.1237. In contrast, the correlation coefficient between PM 2.5 and the  I  38  61  83  76  148  35-75  Grade II  130  175  169  172  132  75-115  Grade III  84  66  74  71  47  115-150  Grade IV  27  31  14  33  15  150-250  Grade V  49  30  25  14  20  250-500  Grade VI  7  2  0  0  3  Missing  30  0   sea-level pressure 5 days earlier is 0.3603. To further explore the relationship between sea-level pressure and PM 2.5 , we selected the 30 highest PM 2.5 concentration days in each year (total 150 days in five years), and then calculated the mean sea-level pressure difference (∆SLP: ± 95% confidence). We define ΔSLP as follows: We conducted a statistical analysis on the relationship between the PM 2.5 concentrations and sea-level pressures for 20 days (10 days in advance and 10 days' lag, respectively) on the highest PM 2.5 concentration days (Fig. 4). Mean ΔSLP 2-10 days in advance were always significantly different from zero, furthermore, the difference was most significant at a lag of 5-7 days, but mean ΔSLP for 1-10 days' lag were not significantly different from zero. In other words, sea-level pressure 2-10 days in advance is a predictor for a high PM 2.5 day in Chengdu, and the strongest predictive power was 5-7 days in advance.
We also chose randomly 30 days in each year (150 days in total) and calculated mean ΔSLP using the same method (Fig. S2). In this case, mean ΔSLP does not show a significant difference on these randomly selected days. These results confirm that sea-level pressure 5-7 days in advance is a strong predictor for elevated PM 2.5 in Chengdu. Lin et al. (2019) found that air pressure 1-2 days earlier showed a higher positive correlation with the current particle concentrations (except in winter) in the city of Lin'an in the Yangtze River Delta region. The average surface wind speed decreases with increasing sea-level pressure, which is not favorable for the horizontal dispersion of pollutants. Moreover, the high sealevel pressure may cause a weakening of the vertical diffusion of PM 2.5 , favoring the formation of heavy PM 2.5 pollution gradually (Ning et al., 2018;Lin et al., 2019). The different lag days of our study are probably due to different distance from the sea, climate, elevation, and terrain characteristics. Further study on the relationship among cities with different topography and distance from the sea might be needed.

GAM Fitting
The GAM results were based on the training dataset, which consisted of random selection of 90% of all days. We compared the results of the two model specifications, as shown in Table 2. The only difference was that Model 1 included SLP as one of the variables whereas Model 2 included ΔSLP for 5 days in advance (SLP 5d ). Model 1 had the adjusted R 2 = 0.70 and the variance interpretation rate was 71.7%, and Model 2 had the adjusted R 2 = 0.73 and the variance interpretation rate was 74.2%. In addition, F-value of ΔSLP was 208.7 in Model 2, which is significantly higher than that in Model 1 (3.5). This indicates that ΔSLP 5 days in advance is a more important meteorological factor for PM 2.5 concentrations in Chengdu compared to sea-level pressure on the same day and demonstrates that Model 2 is superior to Model 1. Fig. 5 shows the observed vs. model fit PM 2.5 concentrations for Chengdu. It demonstrates that the GAM results are unbiased across the full range of PM 2.5 concentrations. Fig. 6 shows   an example of daily results from January 17, 2015, to February 10, 2015. During this period, the trends of the GAM fit values and the observed values are very similar. The square dots represent sea-level pressure of the same day. After the SLP rises to a peak, PM 2.5 would also peak 5 days later.

Validation of the GAM Models
To evaluate the reliability of the GAMs we used, the R language gam.check function was used to verify the fitting results of the model. Fig. 7 shows that the points in the Q-Q plot is approximately a straight line, which indicates that the sample data and predictors were from a log-normal distribution. The frequency of residual distribution near zero is high, indicating that the fitting degree of the model is good. Fitted PM 2.5 concentrations matched well with observed values. All PM 2.5 values on the y-axis in Fig. 7 are on a log scale.
We used cross-validation (CV) to further evaluate the GAM results. The cross-validation dataset was a random selection (10%) of the entire dataset. The training data were used to build GAM Model 2, and then the remaining data were used to predict PM 2.5 concentrations from the meteorological parameters. We compared R 2 of training data and crossvalidation data, and the residuals of predicted and observed values. We found that R 2 and model residuals of the training and CV data have similar resultant values and the average of all residuals are close to zero (Table 3).

CONCLUSIONS
We compared the daily PM 2.5 concentrations at eight monitoring sites in Chengdu against a suite of meteorological parameters and discovered significant non-linear relationships. We used these meteorological parameters to develop a GAM that achieved an R 2 of 0.73 in predicting the daily PM 2.5 concentrations. Residual testing showed that the model is unbiased. We found that the sea-level pressure from 5 days earlier is a good predictor for PM 2.5 in Chengdu, especially on days with high concentrations, enabling the government to modify emission patterns in advance and mitigate the risks of heavy pollution. Because it does not utilize emission source data, our method is easier to use and less expensive than other predictive models, such as WRF-Chem and CMAQ. However, reliable predictive variables are required for estimating the PM 2.5 concentrations. Although forecasts of meteorological variables can generally be obtained on a regular basis, their uncertainty probably affects the total uncertainty. The GAM we have developed is suitable for predicting the daily PM 2.5 concentration in Chengdu, a megacity in a large basin; further studies will examine its suitability for other cities with a similar topography as well as the necessary meteorological parameters for adapting it to cities with a different topography. Our novel finding that the surface pressure can predict the PM 2.5 concentration 5 days later can be used to craft policies for controlling emissions in advance, thus reducing PM 2.5 levels on the worst air quality days and improving public health. backward trajectory data; to the University of Wyoming, USA, for providing radiosonde meteorological data; and to the U.S. Embassy in China for providing PM 2.5 monitoring data at the General Consulate in Chengdu.

SUPPLEMENTARY MATERIAL
Supplementary data associated with this article can be found in the online version at http://www.aaqr.org.