An Integrated SSA-ARIMA Approach to Make Multiple Day Ahead Forecasts for the Daily Maximum Ambient O3 Concentration

An integrated approach that combines a signal extraction method SSA (Singular Spectrum Analysis) with ARIMA (Autoregressive Integrated Moving Average) has been introduced to make multiple day ahead forecasts for daily maximum ambient O3 concentrations. The results are compared with the forecasts obtained by more commonly used signal extraction method FFT (Fast Fourier Transforms) in combination with ARIMA. The data from six different European AIRBASE stations have been analyzed to test their skills in one-day ahead to five-day ahead forecasts. In SSA-ARIMA, the SSA has been used to extract the different structural components of the time series. The first few principal components have been selected based on their dominance in the eigenvalues spectrum. Each component has been modelled separately and their sum is called SSA-component. The residuals of the “SSA-component subtracted from the actual data” exhibited stationarity and have been modeled using ARIMA process. In the second approach called by FFT-ARIMA, FFT has been used to model the periodicities exhibited by the time series. The first few predominant frequencies have been selected based on the power-spectrum characteristics of the time series. The residuals of the “FFT-component subtracted from the actual data” were modeled using ARIMA process. The performances of both the SSA-ARIMA and FFT-ARIMA were evaluated for 30 out of sample days for one to five day ahead forecasts. The statistical indicators RMSE (root mean square error) and MAE (mean absolute error) reveal that the SSA-ARIMA consistently outperforms FFT-ARIMA. Moreover, SSA-ARIMA has also been found to perform better especially for multiple day ahead forecasts. The results clearly indicate that the SSA-ARIMA approach can be a potential alternative to make short term air quality forecasting.


INTRODUCTION
The forecasting of ambient O 3 concentration has been a matter of great interest for most of the air quality research works, mainly, because of its critical role in the formation of urban smog and also on account of its direct adverse impact on human health, plants and materials.A number of statistical-time series modelling approaches have been applied by numerous researchers in order to forecast ambient O 3 concentrations.Slini et al. (2002) applied ARIMA (Autoregressive Integrated Moving Average) modelling technique to forecast daily maximum O 3 concentrations in Athens, Greece.Their results show a good index of agreement, accompanied by a weakness in forecasting alarms.Heo and Kim (2004) developed fuzzy expert and neural network systems to forecast daily maximum ozone concentrations at four monitoring stations of Seoul, Korea.The accuracy of their forecast system improved continuously through verification and augmentation.Duen and Liger (2005) also used ARIMA models to predict hourly ozone concentrations in an urban and a rural area of Southeastern Spain.The prediction agrees reasonably well with the measured values at both the sampling points and the predictions at rural areas were found to be better than the urban area.Corani et al. (2005) used feed-forward neural networks, pruned neural networks and lazy learning to assess the air quality prediciton in Milan.Lin and Cobourn (2007) applied Takagi-Sugeno fuzzy system and a nonlinear regression model for ozone predictions in six Kentucky urban areas.They conclude that the performance of the fuzzy models was equivalent to that of nonlinear regression model for all practical purposes.Mintz et al. (2005) applied an automated fuzzy logic method to hourly ozone concentrations during summer months in the city of Edmonton, Canada.The model captured the trends in ozone concentrations and showed a good agreement with the measured data.Chelani and Devotta (2006) used a hybrid autoregressive and nonlinear model to predict nitrogen dioxide concentration at a site in Delhi.Their results indicate hybrid model outperforms the individual linear and nonlinear models.Coman et al. (2008) applied neural networks to two versions of the model "dynamic" versus "static" one for the prediction of ozone concentrations for a 24-h horizon.The results indicated a rather good applicability of these models for a short-term prediction of ozone.Tsi et al. (2009) applied two cost sensitive artificial neural network (ANN) methods to forecast O 3 -episode.They found that the cost sensitive ANN methods perform better than the standard ANN.Kumar and Jain (2009) applied ARIMA modelling approach to forecast daily mean ambient air pollutants (O 3 , CO, NO and NO 2 ) concentration at an urban traffic site of Delhi, India.The study concluded that the one day ahead prediction is reasonably good for all of these pollutants for the study site.Carnevale et al. (2011) applied neural networks and cokriging technique to develop an integrating forecasting system for O 3 and PM 10 concentrations.Further, Singh et al. (2011) applied cokriging based approach to reconstruct air pollution maps for O 3 and PM 10 .
Because of the versatility of the statistical models, they have been successfully applied to identify the sources/causes of the ambient air pollutants (see, e.g., Lu et al. (2002); Lee et al. (2002); Gomez-Sanchis et al. (2006); Lu et al. (2011); Dominick et al. (2012Dominick et al. ( , 2012b)); Li et al. (2013), Latif et al. (2014)).Dominick et al. (2012) applied multivariate analysis such as Hierarchical Agglomerative Cluster Analysis (HACA) to access the spatial patterns, Pricipal Component Analysis (PCA) to determine the major sources of each air pollutant.D' Urso et al. (2013) applied Fuzzy clustering in time series of pollutants to detect the possible information redundancy in the monitoring networks and then, decreasing the number of monitoring stations, to reduce the monitoring costs and then to increase the monitoring efficiency of the networks.Im et al. (2013) analyzed the diurnal, monthly and seasonal variations of O 3 and nitrogen oxides (NO x ) and studied its local and regional impacts.Lu and Wang (2008) proposed standard support vector classification model (S-SVC) in order to make prediction for the high ozone days.Kumar and DeRidder (2010) applied GARCH modelling in association with FFT-ARIMA to forecast ozone episodes at four European stations.The results indicate that it has not only been able to make good forecasts for one-day ahead daily maximum O 3 concentration but it also successfully predicts the possible ozone episodes and at the same time reduces the no. of false alarms.Prakash et al. (2011) applied wavelet-based neural network to predict ambient O 3 , NO, NO 2 , CO, SO 2 and PM 2.5 .The model results demonstrated that a judicious selection of wavelet network design may be employed successfully for air quality forecasting.Moustris et al. (2012) applied Multiple Linear Regression against a forecasting model based on ANN approach.Their results suggest that the ANN model can be used to issue warning for the general population and the sensitive groups.
As evident that the statistical methods in air pollution modelling and forecasting is an active area of research and applied for the policy purpose as well, however, the multiple day ahead forecasting of air pollutants still remains a challenging domain for the forecasters.In most of the above studies, the emphasis is mostly on one day ahead forecasting.The quality of results quickly decline as the prediction horizon increases, i.e., for multiple day ahead forecasts.In the present study, an integrated SSA-ARIMA time series forecasting approach has been applied and compared with FFT-ARIMA approach for six European AIRBASE stations for one to five day ahead forecasting of the daily maximum O 3 concentrations.A time series might consist of deterministic and stochastic components.In SSA-ARIMA approach, the deterministic part of the time series has been modelled using singular spectrum analysis (SSA) while in the second approach it has been modelled using Fast Fourier Transform (FFT).The stochastic part of the time series has been modelled using autoregressive integrated moving average (ARIMA) process in both the cases.The results have been verified using 1-day ahead, 2-day ahead, 3-day ahead, 4-day ahead and 5-day ahead forecasts for the last one month that accounts for 30 out of sample days.It is pertinent to mention that one-day ahead forecast means forecast for the one day ahead of the last day of the measured data.The "out of sample" means test data set.The last 30 days of the time series has been kept as "test data set", while rest of the time series has been used as the "training data-set".
The rest of the article is organized as follows.Section 2 presents the description of the data and the stations used in the study.Singular Spectrum Analysis (SSA), Fast Fourier Transform (FFT) and ARIMA methodology have been described in section 3. Section 4 presents the results and discussion.Section 5 concludes the study.

DATA AND STATIONS
The long term time-series of daily maximum ambient O 3 concentrations from six European AIRBASE stations have been analyzed in the present work.The daily maximum represents the daily one hour maximum at all the stations.Table 1 presents the details of these stations: name, characteristics, source of the data and the period of the data under study.Out of the six stations, two stations are of the "Background rural" characteristics, three are of "Background urban" nature and the one site has been categorized as "traffic urban".The purpose of choosing varied site characteristics is to assess the utility of the new approach across different site-characteristics.

METHODOLOGY
Time series are realizations of underlying data-generating processes over a time span, occurring at regular points in time.As such, time series have identifiable stochastic or deterministic components (Yaffee, 1999).In this work, the deterministic components have been identified, separated and modeled by two different approaches, SSA (Singular Spectrum Analysis) and FFT (Fast Fourier Transform).The stochastic components have been modeled using the stationary stochastic model ARIMA (Autoregressive Moving Average).

SSA (Singular Spectrum Analysis)
SSA is a powerful technique for nonparametric time series analysis (Hassani et al., 2009).SSA decomposes the original time series into the sum of a small number of independent and interpretable components such as a slowly varying trend, oscillatory components and noise.This is followed by a reconstruction of the original series.In recent times, SSA has found its applications in many different areas from physics and meteorology to economics and finance (e.g., Ghil et al., 2002;Hassani et al., 2009;Hassani et al., 2010;Kumar and Jain, 2010).
Four basic steps can be identified in SSA: (I) Embedding the sampled time series in a vector space of dimension M also known as window length; (II) Computing the M × M lagcovariance matrix C D of the data; (III) Diagonalizing C D ; and (IV) Recovering the time series.The procedure described in this section has been adapted after Elsner and Tsonis (1996).
(I) Embedding the sampled time series in a vector space of dimension M; One-dimensional time series (x 1 , x 2 , …, x N ) is represented as multidimensional time series as follows: its dimension is known as window length (M).The matrix X is called trajectory matrix.

(II) Computing the M × M lag-covariance matrix C D of the data;
Broomhead and King (1986) proposed Alternatively, Vautard and Ghil (1989) suggested 2) is a Toeplitz matrix, meaning that all elements along each of diagonals are the same.
(III) Diagonalizing C D For an orthonormal matrix, the inverse of the matrix is equal to its transpose.Since C D is real and symmetric, there exists a diagonalizing matrix E whose columns are orthonormal and a diagonal matrix Λ (Press et al., 2002) such that which is called the spectral decomposition of C D .Also Λ is a diagonal matrix whose nonnegative entries are the eigenvalues of C D .Using the definition of C D we have The matrix XE is the trajectory matrix projected onto the basis E. The components of X aligned along the basis E are uncorrelated since E is composed of orthogonal vectors E k called the singular vectors of X.The diagonal matrix Λ consists of ordered values 0 ≤ λ 1 ≤ λ 2 ≤…≤ λ M whose square roots are called the singular values that are referred to collectively as the singular spectrum.These terms give SSA its name.
(IV) Recovering the time series: The Principal Components (PCs) of original time series can be calculated as follows: for i = 1, 2, …, N and k = 1, 2, …, M where represents the jth component of the kth eigenvector.
To get back to the original time series we convolve these principal components with their associated eigenvectors, which amount to where i = 1, 2, 3, …, N and j = 1, 2, 3, ..., M The original time series can be filtered this way by using only a selection of the possible principal components.
In this work, principal components have been selected on the basis of their significance based upon eigenvalues and subjective evaluation, i.e., whether it represents a systematic pattern or noise (Kumar and Jain, 2010b).As long as the eigenvalue of a component is greater than 1 and itstime series appear to represent a systematic pattern, the component has been retained for forecasting purpose.The reconstructed series for each of the selected components have been forecasted individually based upon its pattern (trend or oscillatory components).The sum of all the selected reconstructed series together is called as SSAcomponent of the time series.
SSA is also capable of dealing in short-term time series (Golyandina et al., 2001), though in this work, long time series has been taken into account so as to make a better comparison with the FFT-ARIMA.

FFT (Fast Fourier Transform)
FFT is a variant of Discrete Fourier Transform (DFT) with only difference that FFT is computationally faster (Press et al., 2002).Computationally, DFT is of the order of O(N 2 ) while FFT is of the order of O(N.log 2 N).If (x 0 , x 1 , x 2 , …, x N-1 ) denote a time series, its DFT H n is given by which can be inverted by inverse Fourier transform as follows: Eq. ( 11) is periodic in n with period N. Thus, the frequency range varies from -1/2 to 1/2 at discrete interval n/N.Now the periodogram estimates of the power spectrum at different frequencies are given as (Press et al., 2002) where f k (=k/N) is defined only for the zero and positive frequencies (also Fourier transform (1) is symmetric, i.e., H k and H -k have the same value).In the present study, we've plotted power vs. period.The inspection of power spectrum helps us to identify and select the frequencies (periods) which are supposedly dominant in the time series.The dominance of a particular frequency in the time series has been decided by inspecting the power spectrum of the time series.The peak of the power spectrum identify the most dominant frequency.The next peak identifies the next dominant frequency.Those frequencies have been retained which has a clear peak and the period is less than 1/3rd of the size of time series, so that, the viability of the period can be verified.After taking only the selected number of dominant frequencies for a particular time series, we construct the Fourier transform and take inverse of it.In the present study, the time series obtained after inverse Fourier transform under selected frequencies has been called as FFT component of the timeseries (Kumar and DeRidder, 2010).

ARIMA Modelling
After subtracting the deterministic components (either SSA-component or the FFT-component) from the original time series, the residuals part is often found to follow a stationary process.Thus, the stationary stochastic model ARMA (Autoregressive Moving Average) is applied to forecast the residual time series.A time series {x t ; t = 0, ± 1, ± 2, ...} is ARMA (p, q) if it is covariance stationary and can be represented as x t = ϕ 1 x t-1 + ... + ϕ P x t-P + w t + θ 1 w t-1 + ... + θ q w t-q ; (14) (Shumway and Stoffer, 2006) with ϕ P ≠ 0, θ q ≠ 0, and σ w 2 > 0. The parameters p and q are called the autoregressive [AR(p)] and the moving average [MA(q)] orders, respectively.When a time series doesn't appear covariance stationary, the differencing procedure may be applied to make it stationary.Then, the ARMA(p, q) model can be applied to the stationary differenced time series and model so constructed is called ARIMA(p, d, q) model where d denotes the order of differencing (Brockwell and Davis, 2002;Shumway and Stoffer, 2006).An inspection of autocorrelation function (ACF) and partial autocorrelation function (PACF) helps in identifying the orders AR(p) and MA(q).In addition, more objectively defined criterions such as Akaike information criterion (AIC), Hannon-Quinn Information Criterion (HIC) and Bayesian Information Criterion (BIC) can also be used to identify the correct orders p and q (Brockwell and Davis, 2002;Kumar and Jain, 2009).The parameters ϕ and θ have been estimated using maximum likelihood method (Brockwell and Davis, 2002) in the present study.More details about the way ARIMA model applied above can be found in Kumar and Jain (2009).

RESULTS AND DISCUSSION
The data for long term time series of daily maximum O 3 concentrarions have been analysed for six European stations as detailed in Table 1.The forecasts analysis is mainly focussed for the spring and summer season as the observed O 3 concentrations are higher and of more concern for this season.Each of the time series has been divided into training data-set and validation data-set.In each of the case, last one month (30 days) data has been kept completely out of model fitting procedure to treat them as "out of sample" or the validation data-set.Forecasts results have been analysed for one-day ahead, two-day ahead, three-day ahead, fourday ahead and five-day ahead forecasts.
Fig. 1 presents the time series of daily maximum O 3 concentrations at the six selected stations.Similar techniques have been applied to all the time series although model fitting parameters differ in each case.Therefore, for sake of brevity and clarity, it has been decided to present figures for each step for one time series only, while model fitting parameters for all the time series have been presented in Table 2.

FFT-ARIMA Modelling
The daily maximum O 3 concentration time series (Fig. 1) at each station is clearly marked by seasonal cycles.The annual periodicity is quite distinct in each of the case.These periodicities can readily be extracted using FFT method.The FFT has been applied on each of the time series to extract these periodicities.Fig. 2 presents the power vs. period plot for the daily maximum O 3 concentration time series at London (Brent).The predominant frequencies are marked.The first three predominant frequencies have been chosen to reconstruct the periodic (FFT) component of the time series.Fig. 5 shows the reconstructed FFT-  1. component alongwith the original time series.In the next step, FFT-component has been subtracted from the original time series and the resulting residual time series has been analysed.The left panel of the Fig. 6 presents the FFTresiduals time series.A visual inspection as well as unit root test reveals that this residual time series can be considered stationary.Thus, no differencing of time series is required here.The lower two panels of Fig. 6 also present the ACF and PACF pattern for the same time series.After a detailed examination of ACF and PACF pattern and the criterions like AIC, BIC and HIC, AR(1) model (Table 2) has been selected to represent the residual time series and to make forecasts in future.
FFT-components and FFT-residuals have been forecasted separately for one-day ahead, two-day ahead, three-day ahead, four-day ahead and five-day ahead for the last 30 "out of sample" days (validation data set).Final forecasts have been made by combining these individual forecasts.Fig. 7 presents the time series of validation data set together with FFT-ARIMA forecasts.Figs. 8 and 9 present the RMSE and MAE for these out of sample forecasts.

SSA-ARIMA Modelling
SSA (Singular Spectrum analysis) can decompose a time series into the sum of a small number of independent and interpretable components such as a slowly varying trend, oscillatory components and noise.In SSA, one of the important parameters that need to be decided is the "windowlength (L)".The idea behind selecting an appropriate window length is that possible variations or periodicity can be covered by using this length of time series because SSA captures mainly the patterns in a time series that are contained within the window length.Therefore, it can't be set too low, at the same time, it can't be put arbitrarily large as then lag-covariance matrix might lose its statistical significance.Hence, care has been taken to select an appropriate window length that can be exploited by SSA to capture the periodicity and trend present in the time series.The chosen window for all the stations has been shown in Table 2.After selection of appropriate window length, principal components have been calculated and then each of the components has been reconstructed into a separate time series.First 19 reconstructed components for the time series at London (Brent) have been shown in Fig. 3. Based upon the eigen-values and the exhibited pattern of reconstructed component time series, first six components have been selected.Each of the components has been fit separately based upon their individual pattern.In general, the first component is the trend component and it has been modeled as linear trend (Fig. 4) for the London (Brent).Other components exhibit somewat oscillatory pattern with varying amplitues and periods, therefore, they've been modeled using a combination of sine and cosine functions just like Fourier series (Fig. 4).The fits were accompolished using nonlinear least square method (Press et al., 2002).Each of these components has been forecasted separately using their corresponding fit-functions.The sum total of all the components together is called SSA component.The time series of SSA-component at London (Brent) has also Table 2. Model fitting parameters at different stations: AR denotes the autoregressive and MA denotes the moving average part of ARIMA model.The parameter p refers to the order of AR while q refers to the order of MA.The parameter d denotes the order of differencing required to make the time series stationary.series.The ARIMA parameters p and q has been decided based upon the examination of ACF, PACF pattern and the criterions AIC, BIC and HIC.The selected ARIMA model parameters have been presented in Table 2. Thereafter, one-day to five-day ahead forecasts have been made using the fitted ARIMA (p, d, q).
As a final step, forecasts made by SSA-components fitting and the ARIMA fitting of the SSA-residuals have been summed up to make the final forecasts for one to five day ahead for the 30 out of sample days.

Comparison of SSA-ARIMA and FFT-ARIMA Forecasting
The step-wise procedures described above have been followed for each of the six stations.Although model parameters differ in each case but the basic approach essentially remains the same.Thus, applying both SSA-ARIMA and FFT-ARIMA at all the six stations, the oneday to five-day ahead forecasts have been obtained for 30 out of sample days in each case.Fig. 7 presents the daily maximum O 3 time series of observed, FFT-ARIMA and SSA-ARIMA one-day ahead forecasts with their corresponding 95% confidence intervals at all the six stations for the last 30 days which have been treated as "out of sample" in the study.A close examination of both the forecasts pattern suggests that the SSA-ARIMA one-day ahead forecasts are not only close to the observations but its 95% confidence interval ( the standard deviation) is also less as compared to FFT-ARIMA forecasts.This suggests that SSA-ARIMA forecasts has greater reliabilty than the FFT-ARIMA.To evaluate the model forecasts further, the error analysis for both the approaches has been carried out.The error statistics RMSE and MAE have been calculated for all the forecasts at all the six stations.The RMSE and MAE are standard and reliable tool to evaluate and compare the different models performances (Willmott, 1981(Willmott, , 1982;;Thunis et al., 2012).Figs. 8 and 9 present the RMSE and MAE for one-day to five-day ahead forecasts for 30 out of sample days obtained by applying these two different approaches at all the six stations.A RMSE comparison at London (Brent) clearly shows that SSA-ARIMA forecasts have outperformed the FFT-ARIMA.It should be noted that RMSE for SSA-ARIMA gets lowered by 13.2%.24.4%, 30.7%, 34.9% and 38.2% for one-day, two-day, three-day, four-day and five-day ahead forecasts, respectively.Similar trend of decrease in MAE error can also be observed (Fig. 9).An inspection of RMSE errors at all the stations (Fig. 8) clearly implies that SSA-ARIMA always has lower RMSE values.It should be noted that RMSE decrease for SSA-ARIMA gets more and more significant as one goes from one-day ahead forecasts to five-day ahead forecasts (e.g., from 13.2% to 38.2% decrease at London (Brent)).This trend is also distinctly visible in MAE pattern.Thus, the error analysis clearly suggests that SSA-ARIMA outperforms FFT-ARIMA especially for multiple day ahead forecasts.

CONCLUSION
This study introduces an integrated SSA-ARIMA approach to forecast daily maximum ambient O 3 concentrations that  has been applied at six European monitoring stations.These stations consist of "background urban", "traffic urban" and "background rural" characteristics.At all the stations, the results obtained by SSA-ARIMA have been compared with the FFT-ARIMA approach.The out of sample forecasts error analysis using the metrics RMSE and MAE clearly reveals that SSA-ARIMA has performed consistently better at all the stations.More strikingly, SSA-ARIMA performed much better for multiple day ahead forecasts.The capability of SSA to extract signals from the time series, whether it be trend or oscillatory component, offers more practical insight into the pattern of the time series and makes it easier to model them.The SSA is a highly capable method and can be applied to extract the signals from any time seires, therefore, the application of this method can easily be extended to forecast other pollutants such as NO 2 , PM 10 , VOCs.The results of this study suggest that the SSA can be a powerful method to extract signals from the air pollutants time series and the SSA-ARIMA approach can be a very good potential alternative to make short term air quality forecasting.

Fig. 2 .
Fig. 2. Power vs. period diagram for the daily maximum O 3 concentration time-series at London (Brent).

Fig. 3 .
Fig. 3. First 18 resconstructed SSA-components using a window length 1095 at London (Brent).The x-axes has the number of days from 1-May-96 to 31-Aug-2007.The y-axes denotes the values of the components extracted from the daily maximum O 3 concentration (µg/m 3 ) time series.

Fig. 4 .
Fig. 4. Curve-fitting for the first six selected SSA-components for the daily maximum O 3 concentration time series at London (Brent).

Fig. 5 .
Fig. 5. Plots of original time series (in blue), fitted FFT-components (in red) and fitted SSA-components (in green) for the daily maximum O 3 concentration time series at London (Brent).

Fig. 6 .
Fig. 6.Left: Plot of FFT residuals at London (Brent) (top panel).Bottom two panels show the ACF and PACF pattern for the FFT-resuduals time series.Right: Plot of SSA-residuals at London (Brent) (top panel).Bottom two panels show the ACF and PACF pattern for the SSA-residuals time series.

Fig. 7 .
Fig.7.The daily maximum O 3 concentration time series of observed, FFT-ARIMA forecasts and SSA-ARIMA forecasts with their corresponding 95% confidence intervals at all the six stations for the last 30 days that have been treated as "out of sample" in the study.

Table 1 .
Details of the study-stations.