Nearest Neighbour Based Forecast Model for PM10 Forecasting: Individual and Combination Forecasting

Air quality forecasting using the nearest neighbour technique provides an alternative to statistical and neural network models, which need information on predictor variables and an understanding of the underlying patterns in the data. The k-nearest neighbour method of forecasting, which does not assume any linear or nonlinear form of the data, is used in this study to obtain the next-step forecast of PM10 concentrations. Various function approximation techniques, namely the mean, median, linear combination and kernel regression of nearest neighbours, are evaluated. Kernel regression of nearest neighbours outperforms the other individual models, including the benchmark persistence model, for next-step forecasts. As the data may involve both linear and nonlinear patterns, and no individual model can capture both types of patterns, combination forecasting is suggested as an alternative. The forecast errors show that combination forecasting outperforms the individual forecasts, which is expected because it assigns more weight to the model with minimum error. The study is useful when data on the predictor variables that influence air pollutant concentrations are not available. The approach also requires no assumption about the underlying distribution of the data.


INTRODUCTION
Air quality forecasting using statistical techniques has been performed in several studies. Statistical models, including autoregressive models (Zennetti, 1990) and neural networks (Gardner and Dorling, 1998; Baawain and Al-Serihi, 2014), have been used extensively in air quality forecasting. The k-nearest neighbour method of forecasting is a simple machine-learning method that determines the nearest neighbours of an object in question and uses them to estimate the object. The function used to estimate the object may be well defined (e.g., the mean or median) or may itself be estimated. Although the method looks linear in nature, it captures the nonlinear patterns underlying the data. The nearest neighbours also include the inherent nonlinear fluctuations in the data, which makes the estimates accurate and reliable (Yankov et al., 2006). Moreover, the algorithm requires neither an a priori assumption of a model nor data preprocessing. These features make the nearest neighbour method more attractive than traditional statistical techniques, which require knowledge of the correlation structure in the data and the assumption of a parametric distribution.
Before carrying out the estimation, several issues need to be addressed: the number of nearest neighbours, the distance measure used to find them, and the function used for estimation. Too many nearest neighbours may give a biased estimate, and too few may give results with large variance (Yankov et al., 2006). To avoid this, p-fold cross-validation with varying k is suggested to find the optimal k for a particular data set (http://www.cs.sun.ac.za/~kroon/courses/machine_learning/lecture2/kNN-intro_to_ML.pdf). Varying the number of nearest neighbours, however, requires considerable computational effort. One way to improve forecasting performance is to use combination forecasting (Yankov et al., 2006). The Euclidean distance measure is often used to compute nearest neighbours (Dragomir, 2010). The function that approximates the output may be selected by evaluating the performance of several candidates, including the median or mean of the nearest neighbours (Bhulai et al., 2005), linear combination (Atkeson et al., 1996) or kernel regression (Atkeson et al., 1996). Although the best-performing model can be identified through error analysis, these methods do not perform well for all types of patterns. For example, a linear combination of nearest neighbours may work well for linear patterns but not for nonlinear fluctuations. Hence models need to be developed that consider both the linearity and the nonlinearity in the time series (Chelani and Devotta, 2006), which helps improve the forecasting ability of the model. Another problem with the use of an individual model is handling model error (Westerlund et al., 2014). This is addressed with hybrid models, which first obtain estimates of an object and then model the error of estimation. Using the modelled errors, forecasts are generated with minimized prediction error (Chelani and Devotta, 2006). One approach to obtaining the forecast with minimum error is combination forecasting of all the methods (Newbold and Granger, 1974; Makridakis et al., 1982; Clemen, 1989). Combination forecasting is also used in the air quality literature (Perez, 2012; Westerlund et al., 2014), usually as a linear combination of all the models.
Particulate matter of size less than 10 micron (PM10) poses a serious risk to human health because it is inhalable. Several models have been used in the literature to forecast PM10 concentration, including regression, neural networks and support vector machines (Tzima et al., 2007; Sfetsos and Vlachogiannis, 2010). In this study, the nearest neighbour approach is adapted to forecast PM10 concentration at an urban location. Several functions of the nearest neighbours, namely the median, mean, linear combination and kernel regression, are used to obtain next-step forecasts. A combination forecast of the k-nearest neighbour models that use these different function approximation techniques is also obtained and compared with the individual models.

K-NEAREST NEIGHBOUR METHOD OF FORECASTING
Nearest neighbour forecasting models have been found to perform well for predicting complex nonlinear behaviour, owing to the assumption that 'the object to be predicted has close neighbours in the historical set'. With the help of the nearest neighbours, one can predict the object using appropriate estimation techniques. The usual practice is to divide the data observed over a period of time into two groups, of which the first group is used to obtain the estimates of the second group. For a time series x(t) of observations at equal intervals of time t = 1, ..., n, the k-nearest neighbours of an object x(l), where x(l) is the continuation of the time series x(t), are obtained from the distance or norm D between x(l) and the historical values x(t1) (Eq. (1)). The distances D are then ranked in ascending order and the k values of x(t1) with minimum distance are noted. These are the k-nearest neighbours of x(l) in x(t1). An appropriate function f of the k-nearest neighbours gives the estimate of x(l + h) (Eq. (2)), where h = 1 for next-step forecasting and x'k(l) denotes the k-nearest neighbours of x(l).
The function f can be the median, mean, linear combination or a kernel function of the k-nearest neighbours. In the linear combination of the k-nearest neighbours (Eq. (3)), w is the coefficient matrix, determined by the ordinary least squares technique (Atkeson et al., 1996). The kernel regression function f can be obtained using kernels such as the Gaussian, radial basis function, polynomial or uniform kernel. The Gaussian kernel is the most widely used for kernel regression modelling (http://people.revoledu.com/kardi/tutorial/Regression/KernelRegression/KernelRegression.htm). The kernel function is evaluated between the value x to be estimated and the input x'k(l) (Eq. (4)), and the forecasts are then obtained from it (Eq. (5)), where σ is the bandwidth to be selected. Further details of kernel regression are given in Smola and Scholkopf (1998).
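The neighbour search and estimation steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the distance is taken as the absolute difference between scalar values, each neighbour contributes the value that followed it in the history, and the kernel estimate is a Gaussian-weighted average with a single bandwidth. The function name knn_next_step and its parameters are hypothetical, and the linear combination variant is omitted because it needs a separate least squares fit.

```python
import numpy as np

def knn_next_step(history, current, k=8, method="median", bandwidth=10.0):
    """Forecast the value that follows `current` from its k nearest
    neighbours in `history` (hypothetical helper, illustrative only)."""
    history = np.asarray(history, dtype=float)
    candidates = history[:-1]   # every candidate must have a successor
    successors = history[1:]    # value observed one step after each candidate

    # Assumption: absolute difference as the distance between scalar values.
    dist = np.abs(candidates - current)

    # Rank distances in ascending order and keep the k smallest.
    nearest = np.argsort(dist, kind="stable")[:k]
    values = successors[nearest]

    if method == "median":
        return float(np.median(values))
    if method == "mean":
        return float(np.mean(values))
    if method == "kernel":
        # Assumption: Gaussian-weighted average of the neighbours' successors.
        weights = np.exp(-dist[nearest] ** 2 / (2.0 * bandwidth ** 2))
        return float(np.sum(weights * values) / np.sum(weights))
    raise ValueError(f"unknown method: {method}")
```

For example, knn_next_step([10, 20, 30, 40, 50, 60], 31, k=2) selects the neighbours 30 and 40 and returns the median of their successors, 40 and 50. With method="kernel" the closer neighbour receives the larger weight, so the estimate moves towards 40.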

COMBINATION FORECASTING
With the above four functions, a combination forecast can be obtained that draws on the strengths of each individual model's forecasts. The idea is to use, for each case, the forecast of the model with minimum error. For prediction beyond the available data period, i.e., extrapolation, the observed values are not available, so the model with minimum error cannot be chosen. Hence a linear combination of the individual models is used in this study, with coefficients estimated from the available data. Mathematically, let fj, with j = 1, ..., 4 in this study, be the available forecasts for the time series. The linear combination of the j forecasts is given by Eq. (6). There are several methods to estimate the weights b in Eq. (6). One can assign equal weights to all forecasts, but this may not give an appropriate output because it gives equal importance to all the forecasting functions. Weights are also assigned to individual forecasts using approaches such as the inverse of the estimated forecast error variance or the Bayesian information criterion. More details on weight assignment are given in Westerlund et al. (2014). Ordinary least squares (OLS) weight estimation is the simplest and most widely used method for large or moderate data sets and is used in this study; an intercept a is therefore included in Eq. (6). The OLS method is based on the performance of the models for historical observations (Granziera et al., 2013). Larger weights are assigned to the more accurate forecasts. Theoretically, the weights estimated using this method are optimal (Timmermann, 2006).
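The OLS combination described above can be sketched as a regression of the observed values on the individual forecasts plus an intercept. This is an illustrative sketch only: the function names fit_combination and combine are hypothetical, and the forecasts are assumed to be arranged as one column per model.

```python
import numpy as np

def fit_combination(forecasts, observed):
    """Estimate intercept a and weights b by ordinary least squares.

    forecasts: array of shape (n_samples, n_models), one column per model.
    observed:  array of shape (n_samples,).
    """
    F = np.asarray(forecasts, dtype=float)
    y = np.asarray(observed, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept a.
    X = np.column_stack([np.ones(len(y)), F])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[0]), coef[1:]

def combine(forecasts, intercept, weights):
    """Combined forecast: intercept plus weighted sum of model forecasts."""
    return intercept + np.asarray(forecasts, dtype=float) @ weights
```

The coefficients are fitted on the period where observations exist and then applied unchanged to new forecasts, which is what allows the combination to be used when the observed values are not yet available.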

STUDY AREA AND DATA
Nagpur (21°08′N, 79°10′E) is one of the major cities in central India, lying on the Deccan plateau of the Indian Peninsula at a mean altitude of 310 meters above sea level. It has a tropical wet and dry climate, with dry conditions prevailing for most of the year. An annual rainfall of 1205 mm, mostly during June to September, has been recorded. The population of the district rose from 2 million in 2001 to an estimated 4.65 million in 2011 as per the Census of India, which resulted in an increase in the traffic population to approximately 5 lakhs (0.5 million). The area is growing fast, with an increase in infrastructural facilities and in the number of thermal power plants. The distinctive features of the area are its location on the main mineral belt of coal and manganese, its basaltic rock base, and the nearby mining activities and power plants. Owing to the power plants and traffic emissions, particulate matter pollution is increasing in the area. The Maharashtra Pollution Control Board has been monitoring PM10 concentration at four locations across the city since 2005. 24-hourly PM10 data for 2010 to 2013 are considered for modelling (www.mpcb.gov.in/envtdata/envtair.php). Many data gaps were observed at the sites, which compelled the use of data from only one site, Civil Lines, for further analysis. At this site, around 15% of the values were missing during 2010-2013. These gaps were replaced with the preceding values to account for the seasonality in the data. The time series is plotted in Fig. 1.
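Replacing a gap with the preceding value, as described above, amounts to a forward fill. A minimal sketch follows, assuming that missing days are stored as NaN. The function name fill_with_preceding is hypothetical, and a gap at the very start of the series is left unfilled because it has no preceding value.

```python
import numpy as np

def fill_with_preceding(values):
    """Replace each NaN with the most recent non-missing value before it."""
    filled = np.asarray(values, dtype=float).copy()
    for i in range(1, len(filled)):
        if np.isnan(filled[i]):
            # Carry the previous (possibly already filled) value forward.
            filled[i] = filled[i - 1]
    return filled
```

Consecutive gaps all receive the last observed value, so long gaps produce flat stretches in the filled series.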

RESULTS AND DISCUSSION
PM10 concentration during 2010-2013, plotted in Fig. 1, shows stationary behaviour with seasonal oscillations of different cycles. An average of 59.8 ± 24.1 µg/m3 is observed, with a minimum of 12 and a maximum of 216 µg/m3. Around 6% of the values exceed the standard limit of 100 µg/m3. The percentage of exceedance is highest in 2010, followed by 2011. For modelling, the data are divided into two parts: the data for 2010-2012 form the training set and the data for 2013 form the testing set. The data observed during 2010 are then taken as the neighbourhood set for estimating the continuation of the time series. To estimate the PM10 concentration on the first day of 2011, the neighbours are searched for in the neighbourhood set, i.e., in the 2010 data, using the distance matrix. The nearest neighbours are then ranked. As explained earlier, the choice of k is crucial when retaining the nearest neighbours. The p-fold cross-validation technique is used to select the optimum k with minimum forecast error. For this, k is varied from 2 to 30, and only one function, the median of the nearest neighbours, is used to obtain the next-step forecasts. The mean absolute percentage error (MAPE) of the forecasts is the error statistic used to compare them. With an arbitrary value of p = 10, forecasts are obtained for the unseen data set and MAPE is computed. k = 8 gives the minimum MAPE and is therefore used in the rest of the study.
With k = 8, forecasts are obtained for the training and test sets using the median, mean, linear combination (LR) and kernel regression (Kernel) of the nearest neighbours as the function f. The benchmark persistence model (Persist) of time series forecasting is also applied for comparison. For the LR model, the weights specified in Eq. (3) are obtained using OLS as w1 = 0.1941, w2 = 0.1743, w3 = 0.0631, w4 = 0.0835, w5 = 0.1881, w6 = 0.1233, w7 = 0.0363 and w8 = 0.1010. For kernel regression with the Gaussian kernel function, the bandwidths for the 8 nearest neighbours are estimated as σ1 = 9.1993, σ2 = 9.1993, σ3 = 9.5749, σ4 = 9.5749, σ5 = 9.5749, σ6 = 9.5749, σ7 = 9.5749 and σ8 = 9.9363. The results are given in Figs. 2(a)-2(b) for the training and test sets separately. To assess the performance of the models, the error statistics mean absolute percentage error (MAPE), relative error (RE) and Nash-Sutcliffe coefficient of efficiency (CE) are used (Chelani and Devotta, 2006; http://en.wikipedia.org/wiki/Nash%E2%80%93Sutcliffe_model_efficiency_coefficient). For a perfect fit, MAPE and RE should be close to 0, whereas CE should be close to 1.
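Two of the error statistics named above, together with the persistence benchmark, can be sketched as follows. These use the standard textbook definitions of MAPE and the Nash-Sutcliffe coefficient, which are assumed here to match the ones used in the study. RE is omitted because its exact definition is not given in the text. The function names are hypothetical.

```python
import numpy as np

def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed must be non-zero)."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(100.0 * np.mean(np.abs((o - p) / o)))

def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe coefficient of efficiency: 1 for a perfect fit,
    0 when the model is no better than the mean of the observations."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(1.0 - np.sum((o - p) ** 2) / np.sum((o - o.mean()) ** 2))

def persistence_forecast(series):
    """Persistence benchmark: the forecast for step t + 1 is the value at t.
    Returns forecasts aligned with series[1:]."""
    s = np.asarray(series, dtype=float)
    return s[:-1]
```

For a series s, the persistence model is scored with mape(s[1:], persistence_forecast(s)). A negative CE means the forecasts are worse than simply predicting the mean of the observations.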
It can be seen from Table 1 that the correlation between observed and predicted PM10 concentration for the training set is higher for the kernel regression model than for the other models. MAPE and RE are lowest for the kernel regression and median models. High CE values of 0.86 and 0.89 are observed for the kernel regression and median models, respectively. The persistence model outperforms only the LR model in terms of the selected error statistics. For the test set, however, MAPE and RE are lower for kernel regression than for the median and other models, and CE is also higher for the kernel regression model. This suggests that although forecasts obtained by the 'median' function are better for the training set, they are not better than those of the kernel regression model for the test set in terms of forecast error statistics. For the testing set too, the persistence model outperforms only the LR model, as seen from MAPE and RE. However, CE is quite low for the persistence model, suggesting that it cannot capture the magnitude and patterns underlying the time series.
Combination forecasts (termed Comb) are then obtained using Eq. (6). The combination draws on the best forecasts among the selected forecast models by assigning the largest weights to the forecasts nearest to the observations. The weights and intercept are estimated as a = -6.4253, b1 = 0.5337, b2 = 0.0347, b3 = -0.1163 and b4 = 0.6657 for the median, mean, LR and kernel forecast functions, respectively. As can be seen, the largest weight is assigned to kernel regression, followed by the median model. The error statistics of the combination forecasts are given in Tables 1(a)-1(b).
Although the combination model performs better than the individual models, and the kernel model performs best among the individual models, it is desirable to assess the performance of the models for high and low values. High values are taken as PM10 above the standard limit of 100 µg/m3, and low values as PM10 below the 10th percentile, which is 33 µg/m3.
PM10 concentration time series. The models based on nearest neighbours capture the inherent nonlinear fluctuations in the data, which makes the estimates accurate and reliable. Various techniques of function approximation of nearest neighbours for estimating the continuation of the time series are evaluated. Kernel regression approximation performs best for next-step forecasting of the PM10 time series compared with the mean, median and linear regression approximations. The benchmark persistence model also underperforms the kernel regression model. As an individual model cannot capture both the linear and nonlinear fluctuations in the data, combination forecasting is suggested. The improvement of combination forecasting over the best-performing individual model, Kernel, is about 16% in terms of MAPE and 23% in terms of RE. The improvement of combination forecasting over the persistence model is considerably larger, about 66% and 46% in terms of MAPE and RE, respectively. The nearest neighbour model provides forecasts without an understanding of the patterns underlying the time series and does not require additional inputs such as information on meteorology and emissions. The promising performance of the combination forecasting model in terms of low forecast error encourages its use to obtain PM10 forecasts from past data. The study is useful when data on the predictor variables that influence air pollutant concentrations are not available. The developed model can also be applied to predict other pollutants.

Fig. 1. Time series of PM10 concentration during 2010-2013 at an urban site in Nagpur. Horizontal line indicates CPCB standard of 100 µg/m3.
Fig. 3(a). Performance of different nearest neighbour functions for low values in the testing set.

Fig. 3(b). Performance of different nearest neighbour functions for high values in the testing set.

Table 1(a). Performance of nearest neighbour models for PM10 forecasting: training set.

Comb has lower MAPE and RE and higher CE than the individual models for both the training and testing sets (Tables 1(a)-1(b)). The improvement over Kernel, the best-performing individual model, is only about 16% in terms of MAPE and 23% in terms of RE. Over the persistence model, the improvement is about 66% and 46% in terms of MAPE and RE, respectively. Westerlund et al. (2014) also observed that combination forecasting outperformed neural networks in predicting air quality in Bogota. Perez (2012) observed optimal forecasts when combining neural network and nearest neighbour models for PM10 in Santiago, Chile.