**Hamed Karimian ^{1}^{,2}, Qi Li ^{2}, Chunlin Wu^{2}, Yanlin Qi^{2}, Yuqin Mo^{2}, Gong Chen^{2}^{,4}, Xianfeng Zhang^{2}, Sonali Sachdeva^{3}**

^{1 }School of Architecture, Surveying and Mapping Engineering, Jiangxi University of Science and Technology, Jiangxi 341000, China^{2 }School of Earth and Space Science, Peking University, Beijing 100871, China^{3 }Kavli Institute for Astronomy and Astrophysics, Peking University, Beijing 100871, China^{4 }Laboratory of Ocean Environmental Big Data Analysis and Application, Shenzhen 518055, China

Received:
January 12, 2019

Revised:
April 1, 2019

Accepted:
May 10, 2019

Download Citation:
||https://doi.org/10.4209/aaqr.2018.12.0450

Cite this article:

Karimian, H., Li, Q., Wu, C., Qi, Y., Mo, Y., Chen, G., Zhang, X. and Sachdeva, S. (2019). Evaluation of Different Machine Learning Approaches to Forecasting PM2.5 Mass Concentrations. *Aerosol Air Qual. Res.* 19: 1400-1410. https://doi.org/10.4209/aaqr.2018.12.0450

**HIGHLIGHTS**

- Three ML approaches were implemented and evaluated in PM
_{2.5}forecasts. - Advanced ML model (without neurons) showed better performance than deep FNN.
- Proposed hybrid LSTM is robust in forecasting PM
_{2.5}and air pollution levels. - Temporal dependencies of data should be considered in air pollution predictions.

**ABSTRACT**

With the rapid growth in the availability of data and computational technologies, multiple machine learning frameworks have been proposed for forecasting air pollution. However, the feasibility of these complex approaches has seldom been verified in developing countries, which generally suffer from heavy air pollution. To forecast PM_{2.5} concentrations over different time intervals, we implemented three machine learning approaches: multiple additive regression trees (MART), a deep feedforward neural network (DFNN) and a new hybrid model based on long short-term memory (LSTM). By capturing temporal dependencies in the time series data, the LSTM model achieved the best results, with *RMSE* = 8.91 µg m^{–3} and *MAE* = 6.21 µg m^{–3}. It also explained 80% of the variability (R^{2} = 0.8) in the PM_{2.5} concentrations and predicted 75% of the pollution levels, proving that this methodology can be effective for forecasting and controlling air pollution.

Keywords:
Air pollution; Machine learning; Neural networks; Deep learning; Prediction.

**INTRODUCTION**

As one of the major air pollutants, atmospheric aerosols are groups of solid or liquid particles suspended in the air and come from different sources and in various shapes and sizes. Moreover, the large portion of particulate matter is produced in the lowest layer of the atmosphere. In general, the finer the size of the particulate matter, the deeper it can penetrate inside the respiratory system where adsorption is more efficient. Particles reaching deep inside the lung are deposited by diffusion to the surface of alveoli, and water-soluble components can pass through cell membranes by simple passive diffusion. Previous works have investigated the linkage between exposure to fine particles (i.e., with diameters less than 2.5 µm (PM_{2.5})) and premature mortality (Di *et al.*, 2017; Hung *et al.*, 2018). This highlights the importance of predicting air pollution concentrations so as to aid the generation of appropriate responses.

Methods for predicting air pollution concentrations canbe broadly classified into two major categories: simulation-based and data mining-based methods. Simulation-based method incorporates physical (for generating meteorological and background parameters) and chemical models to simulate emission, transport and chemical transformation of air pollution (Grell *et al.*, 2005; Emmons *et al.*, 2010). However, this method suffers from numerical model uncertainties, and due to the lack of data, the parameterization of aerosol emissions is restricted (Karimian *et al.*, 2016). Data mining-based approach exploits statistical or machine learning techniques to detect patterns between predictors and dependent variables in the time series data. Linear mixed-effects regression (Wang *et al.*, 2017), multiple linear regression (Dimitriou, 2016), geographically weighted regression (Karimian *et al.*, 2017) and land use regression (Huang *et al.*, 2017) are some of the widely used models that have been developed by assuming a linear correlation between explanatory and response variables. However, sometimes the linear assumption in these models may not reflect the direct relation between the explanatory variables and the air pollutant concentrations (Huang *et al.*, 2017). In addition, aerosol optical depth (AOD) is one of the satellite-based products which has been used to produce surface distribution of PM_{2.5}. However, application of AOD for studies that aim to forecast hourly PM_{2.5} concentrations is limited due to low temporal resolution and the issue of missing values (Qi *et al.*, 2019).

In recent years, machine learning methods have demonstrated their feasibility in nonlinear regimes. Thus, their applications in air pollution concentration forecasts are increasing (Peng, 2015). Gardner and Dorling (1998) presented the advantages of artificial neural networks (ANN) in dealing with nonlinear systems. Kukkonen *et al.* (2003) compared the performances of five ANN models, a linear model and a deterministic model to forecast NO_{2} and PM_{10}. They concluded that the ANN-based models perform better, particularly in NO_{2} forecasts. In addition to that, including meteorological data, especially planetary boundary layer height (PBL), can improve the accuracy of predictions (Hooyberghs *et al.*, 2005). Some recent works have proposed hybrid models and have claimed their robust performances in severe pollution scenarios (Feng *et al.*, 2015; Kumar, 2015; Tamas *et al.*, 2016; Perez and Menares, 2018).

Air pollution forecasts have attracted interest from scholars in Iran, as a region which suffers from notable air pollution problems. Kamali *et al.* (2015) proposed a hybrid model including the Kolmogorov–Zurbenko filter and an ANN to forecast PM_{10} concentrations over a station in Tehran. They claimed *R*^{2} = 0.90 between predicted and observed values. Memarianfard and Hatami (2017) presented daily averaged PM_{2.5} forecasts using a three-layer feedforward neural network (FNN) with daily averaged meteorological data (wind speed, relative humidity and temperature) obtained from a meteorological station. The performance of random forest feature selection in predictions of PM_{2.5} was investigated by Shamsoddini *et al.* (2017). The authors claimed that better performance was achieved in comparison to linear regression model and FNN. Jamal and Nabizadeh Nodehi (2017) found improvement in PM_{2.5} and air quality index forecasts (*RMSE* = 21.26 µg m^{–3}) through a combination of decision trees and FNN. However, these predictions have been carried out using historical meteorological data observed from a station without considering the conditions at forecast time, which can have an influence on accuracy of the forecasts. Moreover, although multiple machine learning frameworks have been proposed for air pollution forecasts, it should be noted that most of these models have not been tested in developing countries with significantly higher PM_{2.5} levels and different emission source profiles (Liu, 2013). To the best of our knowledge, considering the effect of temporal dependencies in air pollution data and deep neural networks (DNN) has not been investigated over the study area. In this study, we implement and evaluate three methods to forecast PM_{2.5} concentrations, including a model based on machine learning (no neurons) and two models based on deep neural networks. Details of each of these techniques are presented in the following sections.

**DATA AND METHODS**

Study Area

Study Area

Study Area

Tehran, the capital of Iran (~51.1–51.6°E, 35.6–35.8°N), is located in arid region with a population of 13.3 million (native) plus 10 million (diurnal migration). It is one of the most polluted cities of Iran, where haze scenarios are reported frequently (Kamali *et al.*, 2015).

Ground Level PM_{2.5}

Ground Level PM

Ground Level PM

_{2.5}This study used hourly concentrations of PM_{2.5} provided by Tehran Air Quality Control Company. This data was collected from 9 stations for a period of 4 years from 1 January 2013 to 31 December 2016 (Fig. 1). In order to improve the predictive performance of our models, we have omitted anomalies by performing an initial check. In addition to that, to account for the missing data due to instrumental malfunctions, we applied the interpolation method (before and after mean), as described in Ghasemifard *et al.* (2019). However, to ensure the accuracy of the process, time sequences with more than 3 hours of consecutive missing data were excluded from the interpolation and discarded from the dataset. This was done through partitioning the data in different blocks, where length of the blocks was based on the duration for which forecasting was desired. It was ensured that each block consists of required data. Our models have been trained using 60% of the dataset, and remaining 40% has been used for validation (20%) and testing (20%).

**Fig. 1.** Tehran metropolitan area and the air pollution monitoring stations (green pin).

Meteorological Data

Meteorological Data

Meteorological Data

As described before, meteorological parameters are an important determinant of air pollution concentrations. Due to 700 m of altitude difference between the highest and lowest geographical points of Tehran, the weather conditions vary across the city (Habibi *et al.*, 2017). In consideration of that, improving on previous studies which used ground-based meteorological data from only one station, we collected hourly meteorological data, including temperature (T), PBL, surface-level pressure (P), east and north components of 10-m wind (U, V) and relative humidity (RH) from European Centre for Medium-Range Weather Forecasts (ECMWF) for the period of our study. This center provides high spatial resolution (0.1° × 0.1°) data with 10 days’ forecast. A statistical summary and features of input data are given in Table 1. The values (maximum, minimum, mean and standard deviation) are based on hourly data of all stations. The mean value of PM_{2.5} is 3 times higher than WHO guidelines for the annual average (10 µg m^{–3}). It has been found that usage of explanatory variables which are highly correlated reduces the applicability of models (Feng *et al.*, 2015; Karimian *et al.*, 2017). Variance inflation factor (VIF) detects the severity of redundancy by examining the correlation coefficient (*R*) between each pair of explanatory variables (Eq. (1)). A VIF lying between 5 and 10 indicates high correlation that may be problematic (Akinwande *et al.*, 2015). Table 2 shows the VIF values for different auxiliary variables. It can be seen that meteorological variables are weakly correlated and no redundancy is observed.

Multiple Additive Regression Trees

Multiple Additive Regression Trees

Multiple Additive Regression Trees

Multiple additive regression trees (MART) is a machine learning method proposed by Friedman (Friedman, 2002; Friedman and Meulman, 2003). It is an extension and improvement for classification and regression trees (Beriman *et al.*, 1984) that utilizes stochastic gradient boosting (SGB) to convert a sequence of weak learners into a complex predictor. The idea of SGB came from bagging procedure (Breiman, 1996), in which the author claimed that better results could be obtained by injecting randomness into a model (Friedman, 2002). Therefore, at each iteration (*M*), a subsample of the training set is selected randomly, and a tree partitions the pseudo residuals (*ỹ*) into *J* disjoint regions R* _{JM}* (Eq. (2)), where

*I*is an indicator function (it is 1 if the condition is true and 0 otherwise). Consequently, the final model is composed of small trees ranging from hundreds to thousands, where each of them has brought an improvement to the overall model (Elish and Elish, 2009).

Considering least-squares loss function (*L*(*y*, *F*(*x*)) = (*y* – *F*(*x*))^{2}/2), pseudo-residuals in Eq. (4) can simply be calculated as the difference between ground truth values and corresponding predicted ones. The initial value of *F _{M}*

_{–1 }in the first step of algorithm (

*F*

_{0}) is defined as the average of the target variable in training dataset for the least-squares loss function.

Following the modification proposed by Friedman and Meulman (2003) to conventional boosting, the current prediction, *F _{M}*

_{–1}, is then separately updated by randomly selected subsamples (instead of whole sample), which can be written as:

For least-squares loss function, *γ _{jM}* is the mean of current residuals in

*J*

^{th}terminal node. Hastie

*et al.*(2009) demonstrated that 4 ≤

*J*≤ 8 works well for boosting. In this work, our trees have maximum of 6 terminal nodes. Considering the fact that slowing the learning stage leads to a model with better performance, the shrinkage factor,

*α*(0 <

*α*< 1), regularizes the process and allows more (and different) trees to fit into the residuals. The optimal learning rate can be estimated through cross validation or a testing sample. However, it was found that small values, such as

*α*= 0.1, lead to the better results (Friedman and Meulman, 2003). Table 3 provides the details of our model. To forecast PM

_{2.5}concentrations, our model gets PM

_{2.5}concentration and meteorological data at current time (present) as well as forecasted meteorological data up to a desirable time as predictor variables (Table 1).

Recurrent Neural Network

Recurrent Neural Network

Recurrent Neural Network

Artificial neural networks are a class of machine learning methods in which collections of connected units (neurons) enable the machine to learn patterns of different complex circumstances (responses) for their future predictions. Through proposing different structures of networks (units and functions), there is a large class of ANN models. Feedforward neural networks are one of the widely used ANN patterns which are formed by fully connected layers of neurons (at least three layers). Moreover, these connections follow the same direction, and there is no cycle or loop in their connectivity graph (Fan *et al.*, 2017). According to universal approximation theorem, an FNN with a single hidden layer can learn any function (Cybenko, 1989; Hornik *et al.*, 1989). However, to achieve this ability, the size of hidden layer may need to be unfeasibly large. Empirically, it was found that adding more layers to an FNN can improve the performance of the networks in different tasks (Goodfellow *et al.*, 2016). Therefore, in comparison with a normal FNN, deep neural networks contain more hidden layers or have different structures (connectivity between neurons) and learning methods.

Recurrent neural networks (RNN) are a class of DNN that, through applying cyclic (loop) connections, allow information to persist (a loop allows information to be passed from one step of the network to the next). Thus, they are specialized in time series processing. However, in practice, as the time sequence gets longer, the network forgets to train primary inputs. This issue is called “exploding” or “vanishing” gradients and arises due to the architecture of RNN (Bengio *et al.*, 1994). Considering a loss function, *L*, and a linear activation function, the gradient of *L* for the first hidden state with respect to the weight of input *w _{x}* can be written by chain rule as Eq. (8).

In the above, *w _{x}* is the weight matrix that multiplies with input

*x*in different time steps. It can be inferred that the overall gradient of loss function in the RNN (sum of the error gradient at each time step) contains the exponents of transposed weight matrix

_{t}*w*, which is the weight multiplied against the hidden state. As a result, the gradient will explode (if weights > 1.0) or shrink (if weights < 1.0) exponentially. This makes it difficult for RNN (i.e., it will diverge, be very slow or stop) to learn long dependencies. One solution to overcome this problem is to utilize the long short-term memory (LSTM) model architecture as a special class of RNN (Hochreiter and Schmidhuber, 1997).

_{h}Fig. 2 illustrates the schematic of an LSTM block at time step *t*, which we used in this study. It features four main elements, including the input gate (*i*), forget gate (*f*) and output gate (*o*). The fourth element, which is the key to an LSTM model, is the memory cell (cell state), which can be understood as a straight connection passing through a set of LSTM blocks with some minor linear interactions. Similar to the RNN structure, at each time step, the inputs of an LSTM block are input (*x _{t}*) and the hidden state output (

*h*

_{t}_{–1}) from the prior time step block. In addition to that, the cell state (

*c*

_{t}_{–1}) is the third input of a block. It is worth mentioning that the output of each gate is a vector with similar size to the hidden vector (

*h*). The LSTM version implemented in this paper is similar to the one suggested by Graves (2013). However, to reduce the computational cost of network without malfunctioning (Greff

_{t}*et al.*, 2017), peephole connections were removed.

**Fig. 2.** Schematic of an LSTM block at time step *t*.

As can be seen in Eqs. (9)–(11), for the forget, input and output gates, the nonlinearity is brought through the sigmoid activation function (*σ*). The output values of these gates range between 0 and 1, which allow them to control the flow of data inside and between LSTM blocks. For example, by looking at current inputs and through pointwise multiplication (Eq. (13)), the forget gate decides the portion of the data that should be removed from the previous memory cell. For potential values of memory cell (pc) and hidden state (Eqs. (12) and (14)), the tangent activation function (tan*h*) is used; its outputs are between –1 and 1, and its derivative can sustain for a long range before vanishing. The values of current cell memory (*c _{t}*) and hidden state (that are revealed outside a block) are calculated through element-wise multiplication. This multiplication is carried out against the output of forget gate (

*f*varies in different time steps) rather than the repetitive matrix multiplication of weights in RNN (Eq. (8)). This can be inferred as one of the advantages of LSTM, as it avoids extreme vanishing or exploding of the gradients. It is also recommended to initialize the bias of the forget gate to 1 (Jozefowic

_{t}*et al.*, 2015). These make LSTM-based models feasible to learn dependencies in long sequences and prevent vanishing of the gradients.

To forecast PM_{2.5} concentrations over desirable time spans, we propose two types of DNN: One is based on deep FNN with three hidden layers (DFNN), and the other is a hybrid model comprising two LSTM layers and a DFNN with three hidden layers (Fig. 3). For simplicity, we call this model “LSTM” hereinafter. Our LSTM model gets the meteorological data (current (*t*_{0}) and hourly forecasted (*t*_{1}–*t _{n}*) as input for the two layer LSTM, in which the final output of the second layer is concatenated with current (

*t*

_{0}) PM

_{2.5}concentration and is treated as the input layer for the FNN. The final output of the FNN is forecasted PM

_{2.5}concentration at t

*. To train our models, mini batch root mean square prop (RMSprop) algorithm is employed. It is one of the optimization algorithms based on gradient descent method. The main idea of this algorithm is to speed up the process of gradient descent (Hinton*

_{n}*et al.*, 2012b). As a feature of most sophisticated models, overfitting, which refers to the acceptable performance of a model in training stage and failure over unobserved datasets, may occur. There are several techniques to avoid overfitting, which are known as

*regularization*(Nielsen, 2015). We used the dropout technique, in which the architecture of a network is modified in each iteration by randomly removing hidden units and connections from a network. Through the dropout procedure, since a neuron cannot rely on the presence of other neurons, it is expected to learn more robust features. This improves the performance of different neural networks remarkably (Hinton

*et al.*, 2012a). Early stopping is another technique that prevents overfitting. It stops training stage if the performance of a model on validation dataset fails to improve after an optional number of epochs (here, 100). The details of our proposed DNN models are provided in Table 4. The values of hyper-parameters in this table were selected through trial and error.

**Fig. 3.** Structure of the proposed LSTM model.

RESULTS AND DISCUSSION

RESULTS AND DISCUSSION

To evaluate the performances of our proposed models, root mean square error (*RMSE*) and mean absolute error (*MAE*) were computed, which measure the closeness of the forecasts (*F _{i}*) to the observed values (

*O*) over the test dataset. To analyze the prediction strength of different models, overall coefficient of determination (

_{i}*R*

^{2}) was derived as well. Note that these time intervals were selected for evaluation purpose and models are able to make forecasts over any time intervals.

As can be seen in Table 5, the LSTM model showed better performance, with overall *RMSE* = 9.58 µg m^{–3}, than other models (*RMSE* = 19.53 and 13.32 µg m^{–3} for DFNN and MART, respectively) for different time intervals and over all stations. In contrast to DFNN and MART (where inputs from different times are all fed together to the models), these results demonstrate the importance of sequential feeding in LSTM. As an advanced machine learning method, our proposed MART model performed better than DFNN, and overall, it can explain over 50% of variability in PM_{2.5} concentrations (*R*^{2} = 0.53). We have discussed in Section 2.5 that to attain better results with an FNN, the hidden layer might be unfeasibly large. From our results, it can be inferred that sophisticated machine learning models (without neurons) such as MART give better results than a DFNN in PM_{2.5} predictions. Increase in the length of forecasts leads to better performance of all three models studied here. This is especially seen in the case of our LSTM model (10.41 and 7.43 vs. 8.91 and 6.21 for 12- and 48-h predictions, respectively). Our understanding is that this improvement is caused by the fact that we also consider the meteorological conditions at the time of forecast. In general, models used historical data for training purposes; thus, accuracy of their predictions decreases as the length of prediction time gets longer (Qi *et al.*, 2019). This shows that our methodology is crucial for getting more accurate predictions. We will investigate this further with higher number of parameters involved. As shown in Table 5, the performances of models varied spatially. This may have been caused by data (PM_{2.5}) availability, which differs among stations. It highlights the role of other factors (e.g., emissions or land cover), which should be considered in future studies. Our LSTM model exhibited better performance than other studies, e.g., by Memarianfard and Hatami (2017) and Shamsoddini *et al.* (2017), which achieved *R*^{2} = 0.30 for observed and daily predicted PM_{2.5} and* RMSE* = 18.13 µg m^{–3} for daily averaged PM_{2.5} predictions, respectively.

Fig. 4 shows the scatter plots of observed and the LSTM-forecasted PM_{2.5} concentrations over 12-, 24- and 48-h intervals. There is a reasonable agreement between ground truth and predicted values, and our LSTM model is able to explain 80% (*R*^{2} = 0.8) of the variability in observed PM_{2.5} concentrations. For different time intervals, the slopes of linear regression are less than 1, and intercepts are positive. This depicts the tendency of model to underestimate and overestimate high and low concentrations, respectively.

**Fig. 4.** Scatter plots of the observed and forecasted PM_{2.5} concentrations by the LSTM model for 12-h (left), 24-h (middle) and 48-h (right) intervals.

Since the ultimate goal of PM_{2.5} forecasts in urban areas is to improve air quality and health, we validated the feasibility of our LSTM model (the most predictive model in this study) in air pollution level (APL) estimations. For this purpose, we sampled the data from 21 December 2014 to 27 December 2014. Note that this period was selected randomly and was not included in training or test datasets. The air quality index (AQI) and air pollution levels are reported based on the highest AQI derived from six major air pollutant concentrations. Table 6 illustrates the air quality and air pollution subindex level based on daily averaged PM_{2.5} concentrations (Tamura and Tateishi, 1997). Fig. 5 illustrates the averaged surface distribution of 48-h predicted and observed PM_{2.5} concentrations for the sample period using the inverse distance weighted interpolation as well as corresponding air pollution levels. It is worth mentioning that the PM_{2.5} data for Modares station was missing for this period. Thus, this station is not shown in the figure. Generally speaking, the predicted distribution pattern follows the observed one, with *RMSE* = 10.32 µg m^{–3} and* R*^{2} = 0.76. However, similar to the trend seen with the test dataset, for the stations with high concentrations, our predicted surface shows underestimation tendency. Except for two stations (Aghdasie and Sharif) with air pollution level unhealthy for sensitive groups, our model has the ability to estimate true APL. Consideration of more explanatory variables may give better results, especially for the stations where PM_{2.5} concentrations exceed 140 (µg m^{–3}) in some of the hours during sample period (e.g., Sharif).

**Fig. 5.** Comparison between the forecasted (upper) and the observed (lower) surface distribution of PM_{2.5}. The correspondent air pollution levels are shown on the right.

CONCLUSIONS

CONCLUSIONS

In recent years, researchers have been proposing advanced models for forecasting air pollutant concentrations. In this study, we evaluated three methods employed in PM_{2.5} forecasting by implementing models based on machine learning (MART) and deep neural network (DFNN and LSTM) concepts. The results showed that the LSTM model, which obtained the lowest *RMSE* (8.91 µg m^{–3}) and *MAE* (6.21 µg m^{–3}) values in combination with the highest *R*^{2} (0.8) value for 48-h predictions over the entire study area, outperformed the other two models. Additionally, the LSTM model accurately forecasted 75% of the air pollution levels based on the PM_{2.5} concentrations, demonstrating the importance of sequential feeding in time series modeling. We also found that the advanced machine learning models, such as MART, produced better PM_{2.5} estimates than the DFNN model.

In summary, the LSTM model was able to capture temporal dependencies in time series data, which increased the accuracy of its PM_{2.5} forecasting. Therefore, this methodology can be used to predict the concentrations of different air pollutants. Furthermore, adding explanatory variables in the future will enhance this model’s performance, opening the door to investigating new variables and methods.

ACKNOWLEDGMENTS

ACKNOWLEDGMENTS

The study was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA19030301).