Hourly Ozone and PM 2.5 Prediction Using Meteorological Data – Alternatives for Cities with Limited Pollutant Information

Using statistical models, the average hourly ozone (O 3 ) concentration was predicted from seven meteorological variables (Pearson correlation coefficient, R = 0.87–0.90), with solar radiation and temperature being the most important predictors. This can serve to predict O 3 for cities with real time meteorological data but no pollutant sensing capability. Incorporating other pollutants (PM 2.5 , SO 2 , and CO) into the models did not significantly improve O 3 prediction (R = 0.91–0.94). Predictions were also made for PM 2.5 , but results could not reflect its peaks and outliers resulting from local sources. Here we make a comparative analysis of three different statistical predictor models: (1) Multiple Linear Regression (MLR), (2) Support Vector Regression (SVR), and (3) Artificial Neuronal Networks (ANNs) to forecast hourly O 3 and PM 2.5 concentrations in a mid-sized Andean city (Manizales, Colombia). The study also analyzes the effect of using different sets of predictor variables: (1) Spearman coefficients higher than ± 0.3, (2) variables with loadings higher than ± 0.3 from a principal component analysis (PCA), (3) only meteorological variables, and (4) all available variables. In terms of the O 3 forecast, the best model was obtained using ANNs with all the available variables as predictors. The methodology could serve other researchers for implementing statistical forecasting models in their regions with limited pollutant information.


INTRODUCTION
Understanding the patterns of the complex relationships between meteorology and air pollution is of great interest in air quality prediction (Zeng et al., 2020). Many of the studies predict the daily concentrations of pollutants (Weizhen et al., 2014;Franceschi et al., 2018;Mehdipour et al., 2018;Sihag et al., 2019). However, understanding the hourly variation and their relationship with the meteorological variables (e.g., solar radiation and temperature) would be important for studying the emission patterns and prediction (Sekar et al., 2015a;Franceschi et al., 2018;Cuesta et al., 2020). The option of using only meteorological variables to forecast pollutant concentrations could provide concentrations of pollutants in areas where only meteorological variables are monitored, or as an alternative when air pollutant measurements fail; indeed, according to the World Meteorological Organization (WMO) (2020), the measurement of meteorological variables has better geographic coverage around the world.
Numerical (deterministic) and statistical approaches are the two most common methods for air quality forecasting. The deterministic approach involves the use of chemical transport models (CTMs), which simulate dynamics of the atmosphere from the mathematical representation of from meteorological predictors. On the contrary, PM2.5 predictions are most sensitive to variations in SO2 and CO concentrations, nonetheless, the models follow the average trend of PM2.5 concentrations but are unable to forecast the peaks of concentrations associate with particular emission scenarios.
The statistical exploratory techniques, such as spearman correlations and PCA reduce the complexity of the models and improve accuracy. Furthermore, the results show the sensitivity of the different regression techniques to changes in the predictor variables and assess which combination of predictors-regression model offers better performance. This study can guide the development of future statistical forecasting models, which can be applied to deal with missing data in emerging countries with low budgets in air quality management, and to developed forecasting tools in areas that could not afford a more robust system, such as deterministic forecasting models. Fig. 1 shows the methodology framework proposed to develop the statistical forecasting models for O3 and PM2.5 concentrations (hourly scale) in Manizales, Colombia. First, the initial preparation of the raw air quality and meteorological data was performed, including the deletion of outliers and a min-max normalization (Wilks, 2019). The next step consisted of the selection of the predictor variables used in the models. For this purpose, four different methodologies were employed: (1) variables that exhibit Spearman coefficients higher than ±0.3, in relation to O3 and PM2.5, (2) variables with loadings higher than ±0.3 and related to O3 and PM2.5 concentrations, during a principal component analysis (PCA), (3) only meteorological variables, and (4) all available variables.

METHODS
Each group of predictor variables was evaluated separately as inputs for the statistical models using three regression techniques: (1) MLR, (2) SVR, and (3) ANNs. All models underwent a training, validation, and testing stage. Hence, a partition of the original dataset was required for training, validation, and test, as specified in Fig. 1. Each model, with its corresponding input variables, was developed and tested for forecasting an entire month (January 1, 2020, to January 30, 2020) of hourly O3 and PM2.5 concentrations. Model predictions were compared against ground observations to assess the predictive capabilities of the different models. Finally, the models that offered the best performance were used to predict an additional month of O3 and PM2.5 concentrations (May 1, 2020, to May 31, 2020 to evaluate the models over a different chronological period, with different emission scenarios (e.g., first month of COVID-19 mobility restrictions in the city).
The comparisons were made base on different performance statistics such as Mean Bias (MB), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Normalized Root Mean Squared Error (NRMSE), and Pearson correlation coefficient (R). See the definition of this metrics in Equations S1 to S5.

Study Area and Raw Data Collection
Fig. S1 shows the study area covering the urban area of Manizales, Colombia. It also shows the location of monitoring stations and an orographic illustration of airflows patterns (Cuesta et al., 2020). The air masses around the city flow upslope during the day due to convective heat and buoyancy forces, and downslope at night (CORPOCALDAS and UNAL, 2020;Cuesta et al., 2020). Manizales, Caldas-Colombia is a medium-sized Andean city located on the western slope of the Cordillera Central (2150 m.a.s.l), 28 km away from the Nevado del Ruiz volcano (5321 m.a.s.l), which is located to the Southeast of the city. By 2018, the city had a population of 400434 inhabitants (DANE, 2019), in an urban area of 54 km 2 . The city is characterized by tropical weather with low average wind speed (< 2 m s -1 ), high relative humidity (> 60%), low variation of temperature (12-24°C), atmospheric pressure around 785-795 h Pa, maximum solar radiation around ~1300 W m -2 and total annual rainfall around 1755 mm.
Air quality five-minute records were obtained from a central station of the city (5°4'6.5''N, 75°31'1.7''W, 2155 m.a.s.l. See flag "GOB" in Fig. S1). The nearby area is influenced by street cannons, high circulation of private and public vehicles, and a near industrial source (food industry). According to the last local emission inventory (base year 2017), on-road vehicle emissions were the most significant source of pollution in the city, contributing to more than 80% of emissions of all criteria pollutants; only for the case of SO2 this contribution was 12%. (CORPOCALDAS and UNAL, 2019).
On the other hand, meteorological five-minute records were extracted from a nearby meteorological station (~2 km) (5°3'46.6"N, 75°30'2.1"W, 2183 m.a.s.l. See the point called "HOS" in Fig. S1). The variable records showed stability during the whole period of analysis. The information was downloaded from The Caldas Environmental Data and Indicators Center (CDIAC), a public online platform (http://cdiac.manizales.unal.edu.co). The meteorological variables used were temperature (T), solar radiation (SR), relative humidity (RH), precipitations (Ppt), wind speed (ws), the meridional (wd_i) and zonal (wd_j) components of the wind direction, estimated from its measurement in degrees. Finally, air quality and meteorological data records were collected from October 1, 2019, to January 30, 2020, and May 1, 2020, to May 31, 2020. A full description of the periods of time used in each stage of the process can be consulted in Table S1.

Data Preparation
The raw data were averaged to an hourly time scale only if over 75% of the five-minute records were captured during each hour. Fig. 1 shows the total number of hourly average records obtained during the study period. For air quality data, the values were converted to µg m -3 units at standard conditions (25°C and 1.013 h Pa). The atypical values were deleted following a univariate approach, in which observations laying outside 3 times the interquartile range (IQR) were removed from the dataset.
Finally, the dataset was scaled using a min-max normalization to guarantee that all the variables studied vary in a range from 0 to 1, following the approach of other similar studies (Voukantsis et al., 2011;Franceschi et al., 2018). The normalization helps to compare all data on a similar scale and get the same initial weight during the mathematical analysis (Wilks, 2019).

Selection of Predictor Variables
Different statistical techniques were applied to the dataset to explore the relationship among air quality and meteorological variables, and thus, define which of these variables are suitable to be used as predictors in subsequent regression models. The methods applied were the following:

Spearman coefficient
A preliminary analysis of the concentration histograms (Fig. S2) shows that pollutants presented a non-normal distribution, and even the transformation of data by using a base 10 logarithm did not guarantee a normal distribution. For this reason, the Spearman coefficient was used to evaluate the strength and direction of relationships between two sets of non-normal distributed variables. Its value ranges from -1 (a perfect negative correlation between variables) to +1 (a perfect positive correlation) (Schober et al., 2018).
Spearman coefficients were estimated for all variables in the design dataset, to establish the relationships among meteorology and pollutant concentrations. In this study, variables that exhibited coefficients greater than ± 0.3 in relation to O3 and PM2.5 concentrations were selected as predictor variables for the subsequent regression models. Correlation coefficients lower than ± 0.3 were considered weak or negligible correlations.

Principal component analysis (PCA)
PCA is a technique for reducing the dimensionality of large datasets, increasing interpretability, and minimizing the loss of information (Jollife and Cadima, 2016). In addition to reducing data dimension, PCA has been applied in other air quality studies for identification of inter-correlations within pollutants and meteorological variables. This technique could support the recognition of the most dominant parameters and mechanisms affecting pollutants concentration (Voukantsis et al., 2011;Binaku and Schmeling, 2017;Sharma et al., 2017).
A PCA analysis was applied to the centered (mean = 0) and scaled (standard deviation = 1) data set, for obtaining the different principal components (PCs) with their respective values of variable loadings, eigenvalues, and the represented variance of the original dataset. The number of PCs to retain was defined using Kaiser's rule, according to which PCs with eigenvalues lower than one should be discarded since they do not add significant information (Wilks, 2019).
Only the variables with a loading greater than ±0.3 in the retained PCs, related to O3 and PM2.5, were considered for interpretation and selected as candidates to become the predictor variables, as suggested by Binaku and Schmeling (2017). Since smaller values of loadings represent insignificant relationships or are affected by noise in the data.

Use of all meteorological variables
For comparison purposes, a third alternative was implemented using meteorological variables as predictors. The measurement of these variables has better geographic coverage and are presented as an option in areas where only meteorological variables are monitored.

Use of all available variables
Finally, the fourth alternative represents the scenario in which all variables are available to be used, which include both air pollutants and meteorological variables. This scenario is proposed as a complement to the variables pre-selected using Spearman correlations and PCA. These methodologies might not identify nonlinear representative interactions between variables. Therefore, by using all possible predictor variables, it is possible to have a point of comparison to judge the performance of the models developed with limited features.

Statistical Forecasting Models
The first step consisted of performing several sensitivity analyses to determine the optimal parameter for the non-linear regression techniques. Then, using the appropriate parameters, the models were trained and validated using a 10-fold cross-validation procedure on the design data (October 1, 2019, to December 31, 2019). This procedure aimed to improve accuracy and avoid over-fitting of the models, as suggested by Weizhen et al. (2014). At the end of this process, the model that presented the lowest RMSE on the validation set was selected as the best model for each regression technique. The regression techniques that were implemented and further details regarding its structure definitions are given below.

Multiple linear regression (MLR)
MLR was used to determine the correlation between one dependent variable and two or more independent variables having a cause-effect relationship. Then, a linear equation that represents those relations is formulated to make predictions (Uyanık and Güler, 2013). This regression technique has been used in different air-quality-related studies (Vlachogianni et al., 2011). This technique was applied to have a baseline comparison with the other non-linear methods tested.

Support vector regression (SVR)
SVR is a machine learning algorithm specially developed for pattern recognition and classification (Luna et al., 2014). The model consists of several support vectors and non-linear model coefficients, requiring parameters of insensitivity zone (ε) and penalty (c). Both parameters determine a tradeoff between the training error and the model complexity (Kecman, 2001).
SVR was applied using a radial basis function (RBF) kernel. In order to define the values of ε and c parameters, a grid search was made, varying ε from 0 to 1 (in 0.01 increment) and c from 0 to 100 (0.1 increments). The best parameters were chosen based on which data pair offered the lowest RMSE value on the cross-validation procedure.

Artificial neural networks (ANNs)
ANNs are a machine learning algorithm which simulates the function of a biological nervous system. This method is useful in a wide variety of applications, including the control of complex non-linear systems, optimization, system identification, and pattern recognition (Jiang et al., 2017). The basic structure of an ANN consists of an input layer, hidden layers, and an output layer. Every layer is composed of a certain number of neurons.
A feedforward backpropagation neural network with a linear activation function was developed for this study. The problem was limited to a single hidden layer and the number of neurons within the hidden layer was selected through a sensitivity analysis, in which different models were developed varying the number of neurons from 2 to 10. Based on RMSE values, the elbow method was used to define the optimal number of neurons.

Models Application and Evaluation Criteria
All the combinations of predictor variables and regression models, that were selected after the training and validation stage, were implemented to forecast O3 and PM2.5 hourly concentrations for a whole month (January 1, 2020, to January 30, 2020). The results obtained when applying the different models were compared against ground measurement with different performance measures. For this purpose, the following measures were used: R coefficients, MB, RMSE. NRMSE and MAE. The definitions of these metrics can be found in equations S1 to S5.
The best models for forecasting O3 and PM2.5 concentrations were used to forecast an additional month (May 1, 2020, to May 31, 2020) of hourly concentrations, to test the ability of the models in a different period. This exercise could indicate if the regression techniques applied are suitable for forecasting O3 and PM2.5 concentrations in Manizales. Furthermore, the results could indicate if the procedures undertook for selecting features as input parameters were useful. Table S2 shows the statistics summary of hourly average measured concentration for the four pollutants and some meteorological variables during the study period (October 01, 2019, to January 30, 2020) in Manizales. Overall, the daily mean concentration of CO, O3, PM2.5, and SO2 are 839.3 µg m -3 , 15.2 µg m -3 , 12.6 µg m -3 and 4.3 µg m -3 , respectively. Fig. 2 shows the measured hourly concentration time series plot, boxplot, and hourly average of pollutants during the study period. The pollutant concentrations showed a constant pattern around the whole period, with concentrations ranging from 399-1618 µg m -3 , 4-37 µg m -3 , 4-21 µg m -3 and 2-8 µg m -3 for CO, O3, PM2.5 and SO2 respectively, according to the 5 th and 95 th percentile. The variability of concentrations could be explained by local effects, influenced by urban emission patterns and daily meteorological forcing (Cuesta et al., 2020).

Statistics Summary of Ground Measurements for the Period of Analysis
Daily patterns showed similarities of diurnal variability for CO, PM2.5, and SO2 which generally exhibited a flat "W" shape, influenced by the traffic rush-hour, and suggesting that in Manizales these pollutant concentrations are influenced mainly by vehicular emissions. These patterns are concordant with the local emission inventory of the city (CORPOCALDAS and UNAL, 2019).
On the contrary, the diurnal variability of O3 exhibited a flat "U" shape. This pattern is caused by the solar radiation intensity that catalyzes photochemical reactions from hydrocarbons and nitrogen oxides precursor in the air (Khedairia and Khadir, 2012). The night peak (observed around 3 AM) can be influenced by factors such as horizontal transport of cold air from the mountain to the city. Furthermore, as the night progresses the atmosphere becomes more stable and contributes to O3 accumulation. Finally, in the nighttime and early morning, human activities are reduced. Consequently, the emission of NOx drops, stabilizing the chemistry of the atmosphere generating no net production or destruction of O3 (Yusoff et al., 2019).
On the other hand, the meteorological parameters exhibited a low variability during the period of analysis with small standard deviations. The local weather is characterized by two sessional periods of high (Mar. to May. and Sep. to Nov.) and low (Dec. to Feb. and Jun. to Aug.) rainfall (CORPOCALDAS and UNAL, 2020).

Spearman coefficient
Results from the Spearman coefficient are presented in Table 1(a). The best features to forecast PM2.5 are CO and SO2 concentrations; and the best features to forecast O3 are SO2, CO, T, RH, SR, ws, wd_i, and wd_j (see Table 2).
PM2.5 concentrations presented a direct relationship with pollutants CO and SO2 with coefficients higher than +0.3. This suggests the effect of emissions from on-road sources near the monitoring station. Furthermore, PM2.5 did not exhibit a significant correlation with any meteorological variable, as coefficients were lower than ±0.3. This indicates that PM2.5 dynamics are mostly affected by direct emissions and not by meteorological drivers.
O3 concentrations presented an indirect relationship with other criteria pollutants such as CO  Spearman coefficients X X X X X X X X 0.15 1.4 5 PCA X X X X X 0.19 2 4 Meteorology X X X X X X X 0.35 1.2 4 All available variables X X X X X X X X X X 0.12 1.2 5 PM2.5 Spearman coefficients X X 0.77 1.5 3 (1 b ) PCA X X X X 0.52 1.2 3 (1 b ) Meteorology X X X X X X X 0.6 1.5 4 (1 b ) All available variables X X X X X X X X X X 0.45 4 5 (1 b ) a X symbol correspond to selected variable as a predictor. b An additional model was developed using only one neuron in the hidden layer to use as a reference. and SO2, with coefficients lower than -0.45. These pollutants, as well as NO, are commonly related to vehicular combustion. High values of CO and SO2 represent high levels of NO which react to consume O3 (Voukantsis et al., 2011). Other authors have observed similar patterns, in which the primary pollutants such as CO intervene in the photochemical reactions of precursors, decreasing the total amount of O3 (Sharma et al., 2017).
On the other hand, O3 showed a direct relationship with air temperature (0.61), solar radiation (0.65), and wind speed (0.39). Solar radiation catalyzes the photochemical reactions, and the air temperature enhances the production of biological volatile organic compounds (Leung et al., 2018). These meteorological variables were also related to the increase of the mixing layer, and consequently, the increase in the dispersion of precursors to be transformed in O3 by photochemical reactions (Cuesta et al., 2020). Finally, the negative effect of relative humidity and components of wind direction with O3 stands out, as these parameters are related to wind patterns, cloud cover, and rainfall.

Principal component analysis (PCA)
According to the PCA results (Table 1(b)), the best input variables to forecast PM2.5 are SO2, CO, T, and RH, whereas the best features to forecast O3 are SO2, T, RH, SR, and wd_i (see Table 2). It is noteworthy that through PCA, two new features are suggested for forecasting PM2.5, compared to the Spearman correlation approach (T and RH). In the case of O3, some features were removed (CO, ws, Ppt, wd_j). This result suggests that different significant relationships can be found depending on the statistical approach selected.
A total of 10 PCs were obtained during PCA, from which the first four ones (PC1-PC4) were retained for interpretation due to having eigenvalues greater or close to one, to avoid loss of significant information. Particularly, these four PCs represent 76.6% of the total variation of the original data. Table 1(b) shows the PCs that were retained and the loadings of the original variables within each PC. As in the Spearman correlations approach, the loadings greater than ±0.3 are highlighted in boldface, since they represent the most significant parameters and interactions within each PC.
The first PC (PC1) corresponds to 41.9% of the total data variations. This PC consists mainly of positive contributions of O3, T, and SR, and negative contributions of SO2, RH, and the wd_i. PC1 clearly illustrates the influence of O3 formation mechanisms; a higher downward solar radiation promotes the formation of O3. Similarly, greater temperatures enhance O3 formation, by promoting the propagation of the radicals involved in the formation reactions (Voukantsis et al., 2011;Sharma et al., 2017). The interactions among O3 and SO2 become an important factor in data variation due to the O3 consumption by liquid-phase oxidation reactions with SO2 (Abdul-Wahab et al., 2005).
The second PC (PC2) represents 17.3% of the total data variation. It is composed primarily of positive contributions of PM2.5, CO, SO2, and temperature; and negative contributions of relative humidity. PC2 illustrates the processes that boost the sink and consumption of SO2 and PM2.5 in the atmosphere. The relative humidity is a factor related to precipitation which decreases the concentration of PM2.5 by scavenging effects. Moreover, relative humidity is associated with the humidity of particles, the consequent growth of the mass, and faster deposition (Leung et al., 2018).
The positive correlation exhibited between PM2.5 and temperature can be attributed to the fact that higher temperatures enhance the formation of secondary aerosols, due to faster SO2 and volatile organic compounds (VOC) oxidations reactions . Finally, the positive correlation among PM2.5, CO, and SO2 could be attributed to the fact that these pollutants are mostly emitted by a common source, which for the case of Manizales corresponds to the internal combustion vehicles.
For the third and four PCs (PC3 and PC4) which correspond to 9.3% and 8.0% of the data variance respectively, no relationships between O3 and PM2.5 with other variables were displayed, so these PCs did not add significant information for the scope of this study. To sum up, only PC1 and PC2 convey key information about different relationships between the variables of interest.

Only meteorological variables
In this scenario, T, RH, SR, Ppt, ws, wd_i, and wd_j are used as predictor variables for forecasting O3 and PM2.5 hourly concentrations. This exercise is implemented as a complement to the methods described previously (Spearman and PCA); in order to assess the possibility for developing air pollutant forecasting models based exclusively on meteorological predictors.

All available variables
PM2.5, SO2, CO, T, RH, SR, Ppt, ws, wd_i, and wd_j were used as predictors to forecast O3 concentrations, whereas O3, SO2, CO, T, RH, SR, Ppt, ws, wd_i, and wd_j were used as predictors to forecast PM2.5 concentrations. This method is implemented to provide a comparison to the models developed using limited predictor variables, in order to identify the accuracy. Table 2 presents the optimized parameters of the non-linear regression techniques for each set of predictor variables for O3 and PM2.5. For SVR, ε values obtained ranged from 0.12 to 0.77, and c values range from 1.2 to 4.0. On the other hand, for ANNs, the number of neurons in the hidden layer ranges from 3 to 5. In the ANNs tuning, a larger number of neurons involves more computational time and, sometimes, a tradeoff between computing time and accuracy is required. Furthermore, a larger number of neurons not necessarily will be better, models can be over or under fitted which always needs to be checked (Madhiarasan and Deepa, 2017). Table S3 shows the RMSE, on the training and validation set, for the best models obtained during the cross-validation procedure. Table 3 shows the performance statistics obtained on the test set, using the different regression techniques and sets of predictor variables. In general, it was found that all the models developed to predict O3 concentrations present a satisfactory performance since all of them exhibit R values between 0.84 and 0.94, evidencing a strong linear relationship between the observed and modeled values. It was also found that the SVR model with only meteorological predictors, tends to slightly overestimate the predicted values (MB = 0.16). On other hand, all other models show a tendency to underestimate O3 concentrations with MB between -0.48 and -3.08. Concerning RMSE, the models present values that vary between 4.63 and 6.35 µg m -3 , which corresponds to normalized values of 0.40-0.54; these values reveal a good ability of the models to predict ozone hourly concentrations. Fig. 3(a) shows the Taylor diagram that summarizes the performance statistics obtained in the O3 regression models. This graph shows the models with greater performance. The best results are obtained when using ANNs, and the best model was obtained when using ANNs and all the available variables as predictors. This model presents the highest R (0.94), a normalized standard deviation close to the original data, and the lowest RMSE (4.63).

Evaluation of the Models and Comparison
Although the best model performance was achieved when using all variables as predictors, the . This shows the applicability of the methods for selecting features, as it is possible to obtained simpler models, that require fewer variables without significatively affecting the forecasting accuracy. The models developed from exclusively meteorological variables as predictors to forecast O3, present slightly lower performance, but still satisfactory, in contrast with those obtained when using all variables as predictors, as shown by the values of R and NRMSE (ANNs-Meteorological variables: 0.90-0.46, ANNs-All variables: 0.94-0.40). Therefore, using only meteorological variables as predictors is also a good choice to forecast O3 hourly concentrations if other pollutants cannot be measured. This is noteworthy, as meteorological stations are more extensively used compared to air quality stations. Therefore, it would be possible to establish a statistical forecasting model to indicate O3 concentrations in areas with few information of pollutants. The model could be developed (trained and validated) from a reduced number of O3 and meteorological measurements in the area of interest, through a monitoring campaign, and continue to be operated using only the continuous measurement of meteorological variables.
The fact that adequate estimations of O3 concentrations could be made even when the values of the pollutants that interact in chemical reactions to form or consume O3 are unknown, suggests that the main drives of O3 concentrations in the city of Manizales are variations of the different meteorological variables, with a minimum impact in variations of pollutant concentrations of precursors. Similar results were obtained by Zeng et al. (2017), which identified that T and RH are Volume 21 | Issue 9 | 200471  Table 2. the dominant factors for O3 variability in the surface layer of the atmosphere. In addition, it was previously discussed in this study that solar radiation, and indirectly temperature, affect the rates of O3 production by enhancing the photochemical formation of this pollutant.
On the other hand, for the PM2.5 forecasting models, using only meteorological variables as predictors is not an adequate strategy, as all developed models exhibited low R values ranging between 0.16 and 0.27. Moreover, these models present high RMSE values (5.90-6.39), corresponding to NRMSE values from 1.03 to 1.17, which indicate a low prediction ability. This suggests that PM2.5 concentrations are not strongly influenced by meteorological drivers. Instead, its variations are mainly dependent on emission patterns. indicating that PM2.5 concentrations in the area are caused mainly by primary emissions.
Alternatively, the other PM2.5 developed models present R values that vary between 0.36 and 0.51, showing a medium-strength linear correlation between observed and modeled values. They also have lower deviations with MB ranging between -0.99 and 1.04.
The best performance statistics for PM2.5 are obtained using the selected predictors from Spearman correlations, with the three regression techniques providing similar results. Overall, the lowest RMSE (4.85) was obtained with the SVR-Spearman model. The predictor variables, in this case, are only pollutants concentrations of SO2 and CO. This situation shows that the variations in PM2.5 are highly related to SO2 and CO variations, as explained by the Spearman correlations and the PCA analysis; these are primary pollutants and are directly related with a common source: vehicle emissions. These results show that the methods for selecting features to use as predictors (Spearman and PCA), in the case of O3 for example, are also useful for developing PM2.5 forecasting models. As it is possible to identify the most relevant predictor variables and exclude features that could create noise in the data or lead to a more complex model with a reduced ability of generalization.
In fact, it is striking that the performance of predicting PM2.5 decreases when adding more variables, such as the case of the models using all the variables as predictors. These results are caused by the major complexity of the models, leading to overtraining. Therefore, these models presented a better performance on training sets but lost the ability to generalize in different situations on the test data. As previously discussed, PM2.5 concentrations have stronger associations with emission patterns throughout the day and do not exhibit a strong sensitivity to the variation of meteorological conditions. Additionally, when observing Fig. 3(b), by comparing regression techniques, the worst performances are obtained when using ANNs. This could be caused by the risk minimization principle applied by ANNs. As discussed by Mogollón-Sotelo et al. (2021), that causes ANNs to better fit the training data. However, the training data has noise, then affecting the generalization ability of the models.
The lower results obtained when trying to predict PM2.5 compared to O3 might be associated with the complex pattern of this pollutant, which makes it necessary to include other types of predictors in the models to improve the estimates (e.g., Concentration of PM10 and NO2, PBL, temporal and spatial disaggregated emission inventories).
Figs. S3 and S4 present the scatterplots that relate the observed concentrations and the forecasted values of O3 and PM2.5, respectively. In the case of O3, the data present a good linear relationship throughout the range of studied concentrations, a fact that is also supported by the statistics summarized in Table 3.
In contrast, when observing the PM2.5 scatterplots, the linear relationship is not so clear. It is also striking that all models presented difficulties in representing the peaks of maximum concentration since, in general terms, these models are not capable of predicting concentrations greater than 20 µg m -3 . The episodes in which concentrations exceeded 20 µg m -3 can be attributed to specific situations, which are not derived from variations in the meteorology or changes in the concentration of CO and SO2. For instance, higher values of PM2.5 could be attributed to unusual scenarios in which emissions increase, such as an activity related to the industries near the air quality monitoring station or a possible natural ash emission from the nearby volcano. Finally, it is noteworthy that the ANNs-Spearman models for PM2.5 exhibit predictions with negative concentration values, which have no physical meaning; these values can be explained due to overtraining of the neural networks. Fig. 4(a) shows the hourly average profile of O3 during the test period for the observed and modeled values. Similarly, Fig. 5 presents the time series of hourly concentrations for the period from January 6 to January 13. The figures showed that all the models accurately represent maximum concentration peaks around noon and early-morning.
Similarly, Fig. 4(b) shows the hourly average profile of PM2.5 for the observed and modeled values, and Fig. 6 presents the hourly time series, again for the period from January 6 to 13. Figures showed that the pattern of PM2.5 is well represented in most of the models, except for the one developed using ANNs-PCA, in which the decrease in concentrations during non-peak  Table 2. vehicle hours is largely underestimated. The one developed using MLR-Meteorology, which does not follow the desired profile during any period of the day.
In addition, Fig. 6 showed that none of the models can forecast the maximum concentration peaks, and the negative values estimated through the ANNs models are once again evident. For the sake of comparison, the ANN technique was also developed using only one neuron in the hidden layer, (denoted as ANN(1) in Fig. 6). This design aimed to evaluate the possible overtraining effect on the neural networks when using a larger number of neurons. It was found that the performance achieved when using a single neuron in the hidden layer is superior to that of the neural network with a higher number of neurons, and it is similar to the one obtained in the models developed with the MLR technique. To sum up, for the case of PM2.5, the simplest models offer the best predictions.
Other studies have developed statistical forecast models for hourly PM2.5 concentrations. For instance, Franchesci et al. (2018) used ANNs to forecast PM2.5 concentrations at two points in the city of Bogota, Colombia, obtaining performance statistics such as an R of 0.51-0.68 and RMSE  Table 2. of 5.79-7.87 µg m -3 . Similarly, Sekar et al. (2015a), developed different forecast models for the city of Delhi, India, obtaining performance statistics such as R of 0.47-0.89 and RMSE of 56.20-103.48 µg m -3 . Compared to these results, the models obtained through the present study failed to obtain R as high as these studies, however, the RMSE values were improved. On the other hand, Sekar et al. (2015b) also studied the forecast of hourly concentrations of O3 in Delhi, India, obtaining performance statistics such as an R of 0.51-0.82 and RMSE of 18.31-25.51 µg m -3 . In this case, the performance statistics obtained in this study were superior, indicating models with enhanced predictive capabilities.  Table 2.
It should be noted that the comparisons made with the models developed in other regions are merely indicative, since different predictors, statistical models, and different procedures were used in each study.

Case Study: Evaluation of the Best Models in a Different Period
To assess the generalization ability of those models that presented the best performance on test set 1 (Jan 1 to Jan 30, 2020), a new test was defined and named as set 2 (May 1 to May 31, 2020). Those sets are defined over a different chronological period, with different meteorological patterns and emission scenarios. For instance, test set 2 was dominated by low emission patterns due to the pandemic scenario where mobility restrictions were adopted. Moreover, May was wetter (83.4%) than January (77.1%) and have the lowest solar radiation (269 W m -2 compared to 364.8 W m -2 ). Table 4 shows the performance statistics obtained on test set 2. For the O3 case (ANNs-All variables), predictions present a satisfactory performance with an R value of 0.9, similar to the performance obtained for test set 1. This configuration underestimates O3 concentrations with MB value of -1.39, and the RMSE (4.41) and NRMSE (0.47) ratified the good ability of the model to predict O3 hourly concentration even when the scenario is different. These results suggest that ANNs model with all variables as predictors are suitable to provide initial concentration estimates and even predict or fulfill O3 hourly concentrations on short periods of interest.
The inclusion of all variables in the model improves the prediction accuracy due to the compensation on variables by the different phenomena of the atmosphere. For instance, Abdul-Wahab and Al-Alawi (2002) exposed that meteorology has a contribution in ANNs model in a range of 33.1 to 40.6%, and the inclusion of pollutants such as SO2 and CO can improve the model performance between 6 to 10 percent due to it is participation in the photochemical formation of O3. Other authors such as Zhen et al. (2017) include physical phenomena such as stratospheric/ tropospheric exchanges, dominated by T and RH as a variable to improve the model performance.
For the case of PM2.5 (SVM-Spearman predictor), predictions present unsatisfactory performance with an R value of 0.28 and high RMSE (4.91), corresponding to a normalized value of 4.66. Moreover, the configuration selected overestimate PM2.5 concentrations with MB value of 2.33. These results indicated the low predictive ability of the model. Similar overestimated concentrations with SVM models were obtained by Mogollón-Sotelo et al. (2021) in Bogota, Colombia, where SVM errors were related to the quality of data, such as the quality of training and validation data and its relationship with testing data. Moreover, these models have a limited ability to represent the sudden changes, mainly affected by emission patterns and generating an accumulative error in forecast values of the model.
The main explanation for the lower results of the model on test set 2 could be attributed to the fact that PM2.5 concentrations in the study area are mainly emitted by primary emissions. These emissions change drastically for test set 2, due to the mobility restrictions adopted as a response to the pandemic scenario, causing a reduction of emissions (Corpocaldas and UNAL, 2020).

CONCLUSIONS
Results obtained show that the linear and nonlinear models represent the patterns of hourly O3 concentrations, offering the best performance when using the ANNs regression technique in O3 forecasting. The best model for O3 prediction involved the use of all available variables. Models with variables selected by Spearman coefficients and PCA offer similar performance. Hence, these techniques identified the key predictor variables, reducing the number of variables monitored, without compromising forecasting accuracy. On the other hand, by using only meteorological variables as predictors it is possible to obtain an adequate estimate of O3 concentrations with R values higher than 0.8. These results suggest that, for the case of Manizales, variation in O3 concentrations is mostly dominated by meteorological variables such as temperature and solar radiation, and is less sensitive to variations of the measured pollutants (PM2.5, SO2, and CO). This suggests that it could be possible to develop a forecasting model for O3 concentrations in areas where only meteorological stations are present.
On the other hand, forecasting PM2.5 with meteorological and primary pollutant data was less successful compared to O3. However, the models developed here show correlation coefficients similar to those obtained in other studies and better RMSE values. The models developed using ANNs were less effective for PM2.5 forecasting, which is associated with a possible overtraining of the neural network reducing the models generalization capacity. In contrast, the models developed using MLR and SVR, show better performances.
Regarding the selection of predictor variables, the use of only meteorological variables does not represent the variation of hourly concentrations of PM2.5, because this pollutant is associated with on-road sources emissions, and thus emission patterns dominated the variations of PM2.5 concentration in Manizales. The best PM2.5 model used only SO2 and CO concentrations in Manizales as predictors, as indicated by the Spearman correlation analysis. This suggests emissions from common vehicular sources. Additional meteorological variables did not improve model forecasting of PM2.5. Finally, none of the models could predict concentrations higher than 20 µg m -3 ; hence, the peak PM2.5 concentrations were underestimated.
The models developed have potential for making statistical forecasts of O3, demonstrating strong associations to meteorology, which can vary according to the study area. In addition, the use of hourly data, instead of daily averages, allows studying the changes in these relationships as a function of the period of the day, and its corresponding changes in emission patterns.
The methodology developed indicates that it is possible to select a subset of variables that are suitable for regression models by means of a preliminary exploratory analysis, such as the computation of the Spearman coefficients or through principal component analysis. The methodology proposed in this study might serve to other urban regions for implementing air quality forecasting models, not only as a tool for a better comprehension of pollutant dynamics, but also for contributing with air quality management.

DISCLAIMER
The authors declare no conflict of interest and no competing financial relationships that could influence data and results reported in this paper.