Assessment of Machine Learning Algorithms in Short-term Forecasting of PM10 and PM2.5 Concentrations in Selected Polish Agglomerations

Air pollution continues to have a significant impact on Europeans living in urban areas, and episodes of elevated PMx are responsible for a large number of premature deaths (mostly due to heart disease and stroke) each year. According to the annual EEA reports, Poland is one of the most polluted countries in Europe, experiencing high PMx concentrations during winter that mostly result from large emissions combined with unfavourable weather conditions and environmental features. Thus, in addition to implementing municipal mitigation strategies, alerting residents to pollution episodes through accurate PMx forecasting is necessary. This research aimed to assess the feasibility of short-term PMx forecasting via machine learning (ML) and to subsequently identify the primary meteorological covariates. The data comprised 10 years of hourly winter PM10 and PM2.5 concentrations measured at 11 urban air quality monitoring stations, including background, traffic, and industrial sites, in four large Polish agglomerations, viz., Poznań, Kraków, Łódź, and Gdańsk, which cover areas with high population density and diverse environments extending from the Baltic Sea coast (Tricity) through the lowlands (Poznań and Łódź) to the highlands (Kraków). We tested four ML models: AIC-based stepwise regression, two tree-based algorithms (random forests and XGBoost), and neural networks. Using cross-validation, we found that XGBoost performed the best, followed by random forests and neural networks, with stepwise regression performing the worst. The same ranking was apparent in the binary forecasts of threshold exceedances derived from the regression output. Overall, our results confirm the high applicability of ML to short-term air quality prediction with the perfect prog approach.


INTRODUCTION
Particulate matter (PM) is a term widely used for a mixture of organic and inorganic substances suspended in the air. The most commonly measured fractions are those with a diameter of less than 10 µm (PM10) and less than 2.5 µm (PM2.5), with the latter being more harmful to human health. PM can penetrate the human respiratory system, and the smallest particles can even pass through the respiratory barrier and enter the circulatory system, spreading throughout the body (Pinkerton et al., 2000; Feng et al., 2016). PM10 and PM2.5 have adverse health effects that include acute and chronic respiratory illnesses such as pneumonia, chronic bronchitis, and asthma; they may also aggravate coughing or breathing difficulties. These small particles are also responsible for cardiovascular diseases such as coronary heart disease, congestive heart failure, stroke, and other cardiac disorders. To forecast these pollutants, we compared the results from (1) linear regression with stepwise selection (stepAIC), (2) random forests (RFs), (3) XGBoost regression trees (XGBoost), and (4) neural networks (nnets). Regression models are limited by certain statistical assumptions, such as the independence of observations or their distribution. More advanced machine learning methods overcome these limitations via data analysis with the use of self-learning statistical models (Hu et al., 2017). Machine learning uses computational statistical methods to find the best models that can learn from historical data and make predictions on that basis. Here, we used the daily means of PM2.5 and PM10 concentration levels, as well as atmospheric conditions, in four agglomerations: Kraków, Łódź, Poznań, and Tricity. These cities represent different locations, topography, and climatic conditions. The 1-day forecasts were prepared for wintertime, as this season tends to stand out in terms of very high pollution levels.
In general, there are two main approaches to predicting PM concentrations: the use of numerical models and the application of statistical methods. Awe et al. (2019) identified high uncertainties in real-life emissions from residential units, which are a primary carrier of PM emissions. Furthermore, it is difficult to estimate the exact amount of illegal waste combustion in residential units and landfills. It is also challenging to obtain access to PM emission data from individual industrial plants or to assess the pollution input from older vehicles and diesel cars that commonly lack particulate filters. The lack of precise emission inventory data (Dore et al., 2014) and of initial air conditions is a significant disadvantage for dynamical models, which need reliable chemical and dispersion model outputs. Such a situation results in poor initial conditions for those dynamical models, making them flawed from the start and, as a result, rendering their output error-prone. Additionally, dynamical models need vast computational resources, which makes modelling time-consuming. For that reason, preparing forecasts or issuing alerts on time is difficult if not impossible, especially in the case of nowcasting or short-term forecasts.
In this study, we decided to apply machine learning methods based on statistical assumptions. These models are becoming increasingly reliable and less time-consuming to calibrate. We use the perfect prog methods (Benestad et al., 2008) not only to make PM concentration forecasts but also to determine which variables are the most decisive for actual air quality. The perfect prog techniques do not take into account errors driven by weather forecasts. It needs to be emphasised that we investigated PM concentration forecasts only for the next 24 hours (1 day ahead). This approach is entirely appropriate, since most up-to-date numerical short-term weather prediction models provide very reliable data assimilation and yield negligible forecast error within such a temporal horizon (Kendzierski et al., 2018).

Study Area
Four large urban areas (Kraków, Łódź, Poznań, and Tricity), each with more than 500,000 inhabitants and a population density above 2000 persons km⁻² (except Tricity, ca. 1800 persons km⁻²), were taken into consideration. These cities are located in various parts of Poland (Fig. 1) and are characterised by diverse topography, emission sources, and climate. Tricity, the polycentric urban agglomeration of Gdańsk, Gdynia, and Sopot, is located on the coast of Gdańsk Bay and hemmed in from the north by the Baltic Sea. Poznań is the fifth-largest city in Poland, located mainly on the plateau areas of the Wielkopolskie Lakeland at an elevation of 50-154 m ASL. Łódź is the third-largest city, located in central Poland on the edge of the Łódź Hills, with an elevation ranging from 161.8 to 278.5 m ASL. Kraków is the second most populous city in Poland. It is located in the southern part of the country in the Vistula River valley (extending from west to east in the region of Lesser Poland). The city borders the hills of the Kraków-Częstochowa Upland to the north and the Carpathian Foothills to the south, with an elevation ranging from about 200 to 370 m ASL.

Air Quality Data
Daily and hourly ground-level measurements of PM2.5 and PM10 for winter in Poznań, Łódź, and Kraków were taken from the voivodeship air pollution networks which, during the investigated period (2006-2016), belonged to the Polish Voivodeship Environment Protection Inspectorates. The hourly data for Tricity were obtained from the Agency of Regional Air Quality Monitoring in the Gdańsk Metropolitan Area (ARMAAG) (Jędruszkiewicz et al., 2017). The 11 urban stations (2-4 stations per investigated area) represent the official measurements for background, traffic, and industrial areas, as shown in Table 1. The locations of the air quality monitoring sites and meteorological stations are shown in Fig. 1.

Predictor Variables
Air quality in the winter season in Polish cities is usually shaped by complicated interactions of meteorological and human-related factors (Jędruszkiewicz et al., 2017). With that in mind, both groups of predictors were taken into account to create the short-term ML-based forecasting models. The meteorological data (planetary boundary layer height, near-surface air temperature, dew point, wind speed and direction, wind gust, sea level pressure, sunshine duration, total and low-level cloudiness, and total precipitation) were acquired from the nearest meteorological stations maintained by the Institute of Meteorology and Water Management – National Research Institute, which adheres to the measurement standards of the World Meteorological Organization. Due to a lack of atmospheric vertical profile data near the analysed locations, we decided to use the nearest grid point of the ERA-Interim reanalysis (Haylock et al., 2008; Dee et al., 2011) as data representing the dynamics of near-surface air mixing. For all meteorological predictors, the daily mean, minimum, and maximum values were computed (where applicable). For the wind parameters, the vector components (zonal and meridional) were calculated to account for the direction of air mass advection.
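For illustration, the wind decomposition can be sketched in a few lines (a minimal Python example; the function name and the meteorological convention that direction is reported in degrees the wind blows from are assumptions of this sketch, not the study's actual code):

```python
import math

def wind_components(speed, direction_deg):
    """Split wind speed and meteorological direction (degrees the wind
    blows FROM) into zonal (u, positive eastward) and meridional
    (v, positive northward) components."""
    rad = math.radians(direction_deg)
    u = -speed * math.sin(rad)  # westerly wind (270 deg) gives positive u
    v = -speed * math.cos(rad)  # southerly wind (180 deg) gives positive v
    return u, v
```

Daily means of u and v can then be averaged like any other predictor, preserving the advection direction that a scalar wind speed would lose.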
Although meteorological parameters are the most relevant in shaping air quality conditions, especially in wintertime, they usually only emphasise the effect of human-related emissions. Both phenomena are often correlated (e.g., a drop in air temperature increases emissions from heating systems). However, source inventories, especially for small heating installations, are usually far from perfect, which introduces extra uncertainty for all air quality modelling systems (Dore et al., 2014; Kryza et al., 2018). Moreover, some crucial parameters that could potentially have been used for a better estimation of human activity, such as traffic anomalies or industrial production, are either not available on a daily scale or not reliable enough. Therefore, this kind of information normally has to be parameterised to be correctly estimated by statistical models. In this study, we captured these factors by incorporating information about the day of the week and whether the current and previous days were holidays or workdays.
The last group of predictor variables comprised the PM10 and PM2.5 daily means from the previous three days at a given location. Earlier studies have shown that the autocorrelation signal may diminish the importance of other factors if there is no evident change in other parameters. The 1-day autocorrelation for PM10 in the winter season may reach a Pearson correlation coefficient of 0.65 and remains statistically significant even for the 3-day-lagged daily mean PM10 concentration.
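A sketch of how the lagged predictors and the lag correlation can be derived from a daily series (plain Python; the function names are ours, and in the study these steps were part of the standard preprocessing rather than this exact code):

```python
def lagged_features(series, max_lag=3):
    """Turn a daily PM series into rows of [lag-1, lag-2, lag-3, target];
    the first max_lag days are dropped because their lags are missing."""
    return [[series[t - k] for k in range(1, max_lag + 1)] + [series[t]]
            for t in range(max_lag, len(series))]

def pearson(x, y):
    """Pearson correlation coefficient, used to gauge the strength of the
    autocorrelation signal (e.g., ~0.65 at lag 1 for winter PM10)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5
```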
In total, 32 predictor variables served to create the machine learning models. These comprised 26 meteorological parameters, 3 parameters related to human activity, and 3 autocorrelation signals (Table 2). All investigated predictor variables were initially tested with the Boruta algorithm (Kursa and Rudnicki, 2010) and were confirmed to exert more impact on the predicted parameter than random noise at all investigated monitoring stations.

Machine Learning Models
A few commonly applied statistical machine learning methods, including linear regression with stepwise selection, random forests, Extreme Gradient Boosting regression trees, and multilayer perceptron neural networks, were tested and evaluated against the measured PM10 and PM2.5 values, separately for each station. Repeated leave-group-out cross-validation (LGOCV; also known as Monte Carlo cross-validation) was used to prevent model overfitting and to provide the best set of hyperparameter values. A randomly selected (without replacement) fraction containing 66% of the data formed the training set, which was used to create the model; the model was then tested on the remaining data points for a chosen station (i.e., the LGOCV approach was applied separately for every station). This process was repeated five times, generating new training and test partitions for each newly evaluated model. The final regression model for every ML method and station was the one with the lowest root-mean-square error (RMSE) among all performed iterations.
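The cross-validation procedure described above can be sketched as follows (a simplified Python version of repeated leave-group-out CV; in the study this was handled by caret's LGOCV, so the function below and its arguments are illustrative assumptions):

```python
import random

def lgocv(data, fit, predict, n_repeats=5, train_frac=0.66, seed=1):
    """Repeated leave-group-out CV: each repeat draws a fresh 66% training
    split without replacement, fits a model, and scores RMSE on the
    remaining 34%; the repeat with the lowest RMSE identifies the kept model."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rmses = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        cut = int(train_frac * len(idx))
        train = [data[i] for i in idx[:cut]]
        test = [data[i] for i in idx[cut:]]
        model = fit(train)
        sq_errs = [(predict(model, x) - y) ** 2 for x, y in test]
        rmses.append((sum(sq_errs) / len(sq_errs)) ** 0.5)
    return rmses
```

For example, with a trivial "model" that always predicts the training mean, `lgocv` returns five RMSE values whose minimum selects the retained model.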
The selected variables were scaled and transformed according to the Yeo-Johnson transformation to fit the methodological assumptions of the chosen models. In the case of the neural network-based model, min-max scaling was performed (where applicable). In the case of multiple regression, the selection of the best predictors for the optimised model was conducted with the stepwise Akaike Information Criterion (AIC) algorithm (Akaike, 1973). It is similar to ordinary stepwise multiple regression, except that the AIC value, rather than the F-statistic, determines the selection of the best combination of features. All calculations were carried out using the R programming language (R Core Team, 2019) as well as the machine learning packages included in the caret framework (Kuhn et al., 2019).
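The two preprocessing steps can be sketched as follows (Python; the Yeo-Johnson lambda appears here as a fixed argument, whereas in practice caret's preProcess estimates it from the data, so that is an illustrative assumption):

```python
import math

def yeo_johnson(y, lam):
    """Yeo-Johnson transform of a single value for a given lambda;
    unlike Box-Cox, it also handles negative values."""
    if y >= 0:
        return math.log1p(y) if lam == 0 else ((y + 1) ** lam - 1) / lam
    return -math.log1p(-y) if lam == 2 else -(((-y + 1) ** (2 - lam) - 1) / (2 - lam))

def min_max(values):
    """Min-max scaling to [0, 1], as applied to the neural-network inputs."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]
```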
Two of the selected ML models represent decision tree-based algorithms. The first is the random forest approach, a non-parametric machine learning algorithm that can predict with high accuracy (Liu et al., 2018). An RF consists of a collection of tree-structured classifiers, each grown from randomly and independently selected vectors with the same distribution, which vote for the most popular class (Breiman, 2001). RF models have been successfully used to predict PM2.5 regardless of the climatic zone (Hu et al., 2017; Joharestani et al., 2019). The RF hyperparameters used in this study vary between models, as calibration was done separately for each station. The general rule was to create up to 500 trees in every cross-validation procedure, with the number of splitting variables at each tree node set to the square root of the number of predictors, while maximising the explained variance (Kuhn et al., 2019).
The other decision tree algorithm represented here is the Extreme Gradient Boosting framework, based on the gradient boosting technique. The main difference between XGBoost and RFs is how the trees are built (XGBoost builds one tree at a time, whereas in RFs the trees are independent). In the case of XGBoost, there is also an adapted loss function to improve weak learners and combine the obtained results in a stage-wise manner. Therefore, XGBoost tends to outperform RFs only if the classes are relatively balanced, there is not much noise in the data, and the tuned parameters are carefully set up. The last issue was particularly challenging in our study, as the default settings tend to overfit the final model easily. However, the hyperparameters finally obtained via grid search were similar to those presented in other studies dealing with particulate matter concentrations (Joharestani et al., 2019), with a slightly smaller n_rounds parameter (≈ 150) and a higher eta (= 0.3).
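The stage-wise character of boosting, as opposed to the independent trees of an RF, can be illustrated with a toy example in which the "weak learner" is simply the mean residual (a didactic sketch with our own function name and an eta matching the tuned value, not the XGBoost implementation):

```python
def boost_means(y, n_rounds=150, eta=0.3):
    """Toy additive boosting: each round fits a constant weak learner to
    the residuals of the current ensemble and adds eta times that fit,
    so the prediction improves sequentially rather than by averaging
    independent trees as in an RF."""
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        step = sum(residuals) / len(residuals)  # weak learner: mean residual
        pred = [p + eta * step for p in pred]
    return pred
```

With eta = 0.3, the mean residual shrinks by a factor of 0.7 per round, so after 150 rounds the ensemble has converged on the target mean; real gradient boosting replaces the constant learner with a regression tree fitted to the residuals.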
Another possible approach to modelling nonlinear systems is the use of artificial neural networks (ANNs), which are commonly considered the most advanced AI tool for the analysis of multidimensional systems. Here, we used a multilayer perceptron with two hidden layers, yielding a total of four layers. Resilient backpropagation (Riedmiller and Braun, 1993; Riedmiller, 1994) was selected as the error propagation method. We tested multiple network architectures in an attempt to find the optimum number of neurons in the hidden layers. The first layer size equals the number of independent variables, whereas the last layer is a single neuron providing the final information on the daily average PM concentration, given that the regression approach was applied. The structure of the hidden layers was subject to tuning: for the second layer, we tested sizes from 10 to 20, and for the third, from 3 to 7 neurons. As with the other ML schemes, we used the LGOCV cross-validation scheme to assess the network architectures.
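The architecture search described above amounts to enumerating a small grid (a Python sketch; the function name is ours, and using 32 inputs mirrors the predictor count reported earlier):

```python
from itertools import product

def architecture_grid(n_inputs, layer2_sizes=range(10, 21), layer3_sizes=range(3, 8)):
    """Enumerate the tested four-layer MLP architectures: an input layer
    sized by the predictors, two tunable hidden layers (10-20 and 3-7
    neurons), and a single output neuron for the daily PM mean."""
    return [(n_inputs, h2, h3, 1) for h2, h3 in product(layer2_sizes, layer3_sizes)]
```

With 32 predictors this yields 55 candidate architectures, each of which would then be scored via LGOCV.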

Regression
The models' performance for the regression approach was characterised using Pearson's correlation (r), the coefficient of determination (R²), the root-mean-square error (RMSE), and the mean absolute error (MAE), following the formulas presented by Wilks (2008). R² gives the fraction of the dependent variable's variance explained by the model. RMSE reflects the average difference between predicted and observed values (the average model error), whereas MAE is the mean absolute difference between them. Additionally, the standard deviation (SD) of the model's performance statistics was calculated to account for variance inflation problems that often occur when applying statistical modelling techniques (Maraun, 2013). The verification statistics for the Pearson correlation coefficients, RMSEs, and SDs were additionally presented in the form of Taylor diagrams (Taylor, 2001), which allowed us to address different aspects of the models' accuracy on a single chart.
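The four regression scores can be computed directly from paired series (a self-contained Python sketch following the standard definitions; the function name is ours):

```python
def regression_scores(obs, fcst):
    """Pearson r, R^2, RMSE, and MAE for paired observed/forecast series."""
    n = len(obs)
    mo, mf = sum(obs) / n, sum(fcst) / n
    cov = sum((o - mo) * (f - mf) for o, f in zip(obs, fcst))
    vo = sum((o - mo) ** 2 for o in obs)
    vf = sum((f - mf) ** 2 for f in fcst)
    r = cov / (vo * vf) ** 0.5
    rmse = (sum((o - f) ** 2 for o, f in zip(obs, fcst)) / n) ** 0.5
    mae = sum(abs(o - f) for o, f in zip(obs, fcst)) / n
    return {"r": r, "R2": r * r, "RMSE": rmse, "MAE": mae}
```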

Binary events
The numerical values of forecasted PM10 and PM2.5 concentrations obtained via the regression approach were classified into binary form according to the daily threshold limits set by the Ambient Air Quality Directive (EU, 2008): 50 µg m⁻³ for PM10 and 25 µg m⁻³ for PM2.5. The scheme of the contingency table adopted in this study follows Wilks's convention (2008), with observed and forecasted events split into four categories: (a) correct hits, (b) false alarms, (c) misses, and (d) correct rejections.
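Converting the regression output into the contingency counts can be sketched as follows (Python; the function name and the strict ">" comparison against the limit value are our assumptions):

```python
def contingency(obs, fcst, threshold):
    """Count hits (a), false alarms (b), misses (c), and correct
    rejections (d) for exceedances of a daily limit value, following
    Wilks's contingency-table convention."""
    a = b = c = d = 0
    for o, f in zip(obs, fcst):
        if f > threshold and o > threshold:
            a += 1  # forecast exceedance, exceedance observed
        elif f > threshold:
            b += 1  # forecast exceedance, none observed
        elif o > threshold:
            c += 1  # exceedance missed by the forecast
        else:
            d += 1  # correctly forecast non-exceedance
    return a, b, c, d
```

For PM10, `threshold` would be 50 (µg m⁻³); for PM2.5, 25.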
Based on the created contingency tables, five verification measures were calculated to assess the robustness of the models (Table 3) for days that exceeded the allowable PMx concentration values. The most fundamental ones were chosen, including the probability of detection (POD), proportion correct (PC), false alarm ratio (FAR), and success ratio (SR). Table 3 presents the formulas and ranges of the selected verification measures used in this study, according to Jolliffe and Stephenson (2012). Based on the FAR and SR values, performance diagrams (Roebber, 2009) were created to summarise model performance for these two complementary indices.
In some of the analysed locations, binary events tend to be heavily unbalanced. For example, in Kraków there is a clear predominance of situations with bad air quality, which makes it relatively easy for a forecaster to issue the correct binary forecast (assuming predominance of typical "bad air quality" conditions). The opposite situation, with a relatively low share of bad air quality days as in Tricity, also eases binary forecasting. Therefore, the Heidke Skill Score (HSS) was calculated in order to account for this kind of situation. The HSS returns the fractional improvement over random chance, which makes it a more robust single measure of model performance (Jolliffe and Stephenson, 2012). All formulas for binary events used in this study and their interpretation are presented in Table 3.
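From the four contingency counts, the verification measures follow directly (a Python sketch using the standard formulas; the HSS is written in its (PC - E)/(1 - E) form, where E is the proportion correct expected by chance):

```python
def skill_scores(a, b, c, d):
    """POD, FAR, SR, PC, and HSS from hits (a), false alarms (b),
    misses (c), and correct rejections (d)."""
    n = a + b + c + d
    pod = a / (a + c)   # probability of detection
    far = b / (a + b)   # false alarm ratio
    pc = (a + d) / n    # proportion correct
    # hits + rejections expected from random forecasts with the same marginals
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n
    hss = (a + d - expected) / (n - expected)
    return {"POD": pod, "FAR": far, "SR": 1 - far, "PC": pc, "HSS": hss}
```

A perfect forecast gives HSS = 1, while a forecast no better than chance gives HSS = 0, which is why HSS remains informative even when exceedance days dominate (Kraków) or are rare (Tricity).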

PM10 and PM2.5 Concentration Variability
There are clear seasonal, weekly, and daily cycles visible in the PM2.5 and PM10 concentrations in the Polish agglomerations, related to changes in atmospheric conditions and human-related emissions. These regularities are also confirmed in the analysed data (Fig. 2). The courses of PM2.5 and PM10 concentrations are very similar regardless of the analysed location, with PM10 concentrations generally higher and the PM2.5-to-PM10 ratio changing with season. On average, the lowest concentration of PM10 was found in Tricity (25 µg m⁻³), followed by Łódź and Poznań (32-35 µg m⁻³), with the highest in Kraków (64-68 µg m⁻³). The PM2.5-to-PM10 ratio is lowest in the summer months, when PM2.5 contributes 45-65% of PM10 concentrations, rising to 70-90% in the winter months. Wintertime also featured the most unfavourable air conditions (Table 1). In the weekly cycle, the highest values were observed from Wednesday to Friday. In contrast, significantly better air quality is typical for weekends on account of lower traffic and industrial activity and of people leaving the city (Juda-Rezler, 2006); at weekends, air quality improves by about 10-20 µg m⁻³. The autocorrelation signal also plays an essential role in the weekly cycle, which is visible in the lower particulate matter values on Mondays compared with the other working days.
At the sub-daily scale, two maxima (morning and late evening/night) and two minima (around 3:00-5:00 a.m. and early afternoon) were distinguished. This is related mainly to the diurnal cycles of air temperature and planetary boundary layer (PBL) height, as well as to the daily cycle of human activity.
From the standpoint of human health, the most important aspect is when the threshold limits set by the Ambient Air Quality Directive (EU, 2008) were exceeded (50 µg m⁻³ for PM10 and 25 µg m⁻³ for PM2.5). The monthly ratio of days when the limits set for PM10 and PM2.5 were exceeded at each station is presented in Fig. 3. Generally, PM10 limit exceedances occurred mainly from October to April-May (the heating season): in Tricity on about 10-30% of days, in Łódź and Poznań on about 10-40%, and in Kraków on 50-90%. However, these values are highly variable from year to year owing to meteorological conditions and station locations. The situation is quite similar for PM2.5 concentrations, but with a significantly higher number of exceedance days due to the lower threshold value, as presented in Fig. 3.

The Variable Importance of Predictors
In our study, we opted to use the meteorological parameters with the greatest impact on high PMx concentrations according to previous studies (Jędruszkiewicz et al., 2017). In the cold half of the year, low temperature (below +5°C), a thin PBL (i.e., below 300-500 m), a stable atmosphere, anticyclonic weather, and low wind speed (below 2 m s⁻¹) played the most important roles in the accumulation of pollutants. In terms of Pearson's correlation coefficient, the highest absolute values against PM10 concentrations were obtained for the daily means of PBL height (from r = -0.76 in Poznań to r = -0.65 in Gdańsk), air temperature (from r = -0.56 in Poznań to r = -0.43 in Kraków), and wind speed (from r = -0.62 in Łódź to r = -0.42 in Gdańsk). Other meteorological parameters were below an absolute Pearson correlation of 0.50. The presence of high-pressure systems in the winter months, along with the relatively low dynamics characteristic of such a sea-level pressure pattern, leads to situations in which high PMx concentrations tend to persist for a few days in a row. This also allows us to include the autocorrelation signal of previous PMx concentrations and treat it as a starting point for a model prediction. The autocorrelation signal reaches Pearson correlation coefficients of up to 0.60-0.70, which confirms that it is often a crucial factor affecting air quality in European cities (Draghicescu and Ignaccolo, 2005; Holst et al., 2008).
The main factors that favour high PMx concentrations, previously confirmed in other studies on Polish cities, were also the most significant ones for the created ML models. In almost all cases, the most important determinants were the average height of the PBL and the 1-day-lagged PMx concentration, followed by the mean air temperature and wind speed (Fig. 4). There is also a clearly visible tendency for most models to benefit mainly from the PBL and autocorrelation predictors, with XGBoost being an exception, as it assigns the 1-day-lagged autocorrelation signal only about 50% of the importance given to changes in PBL height. It is also worth mentioning that the decision tree methods tend to focus on only the 4-5 main predictors, which carry more than 25% of the importance signal, and tend to ignore most of the other factors, whereas the variable importance for the nnet and stepAIC methods changes smoothly across all predictors. This is probably related to the high cross-correlation between variables and the diminishing significance of other parameters in decision tree models: even predictors capable of adding value to the output are eliminated due to redundancy or partial signal overlap with other variables. This can be seen in the example of wind gusts (which are potentially captured by changes in the other wind components) or precipitation (causing wet scavenging), which is strongly correlated with simultaneous changes in, for example, dew point and air temperature. It is worth noting that the human-related factors, such as (non-)working day or day of the week, were considered unimportant in most cases, even if they were sometimes ranked higher than, for instance, the occurrence of precipitation.

Machine Learning-driven Short-term Forecasts
As mentioned above, the models' quality assessment involved a twofold approach. The first utilised the pure regression approach, in which we attempted to forecast PM10 and PM2.5 concentrations. The second involved a categorical forecast in which administratively regulated thresholds were applied to identify PMx concentration exceedances, yielding confusion matrices for each city and model, which in turn allowed for the calculation of quality metrics.

Regression Forecast
The results clearly show a significant and consistent improvement depending on the type of algorithm used (Fig. 5). The sequence of models (from worst to best) is stepAIC, nnets, RFs, and finally XGBoost. What may be striking is the relatively poor performance of neural networks compared to the decision tree methods (i.e., RFs and XGBoost). It would seem that in this case the size of the learning sample might be a problem, as neural networks usually work best when provided with more data. The shift in quality between the RF and XGBoost algorithms is small; in the case of PM10 it is mainly indicated by better variance reproduction and only a slight improvement in RMSE (Table 4). In the case of PM2.5 forecasts, in all agglomerations except Kraków, the difference between RFs and XGBoost is apparent for both RMSE and variance reproduction. The correlation coefficients between the observed and forecasted series of PMx are generally very high. The lowest values (Tables 4 and 5) occur for the stepAIC algorithm and vary between 0.77 and 0.86 (PM10) and between 0.76 and 0.86 (PM2.5). Considering the overall abilities of this algorithm, those values can be regarded as satisfactory.
Much higher values of correlation coefficients occur for the other three applied algorithms. In nearly all cases, values exceed 0.9, which indicates a high ability to represent the variability of the modelled variable. In both cases (PM10 and PM2.5), the XGBoost algorithm provided the highest correlation values (ca. 0.98). A high concordance in variability for both RF and XGBoost algorithms is further confirmed by the RMSE values (Tables 4 and 5).
As was the case for the correlation coefficients, the poorest values occur for the stepAIC algorithm, for which the average deviation of the forecast usually exceeds 10 µg m⁻³, with values as high as 32.8 µg m⁻³ for PM10 in Kraków. The quality of the forecast via the neural network algorithm is better, but compared to RFs and XGBoost it is still not satisfactory. RMSE values for the neural network algorithm range from 9.3 µg m⁻³ to 25.5 µg m⁻³ (PM10) and from 6.3 µg m⁻³ to 15.5 µg m⁻³ (PM2.5). The RF and XGBoost models performed fairly well, with comparable RMSE values ranging from 4.0 µg m⁻³ (XGBoost, Łódź) to 13.5 µg m⁻³ (RFs, Kraków) for PM10. In the case of PM2.5, the differences in RMSE between those two methods are negligible and do not exceed 2 µg m⁻³; the RMSE itself varies from 2.2 µg m⁻³ (XGBoost, Gdańsk) to 9.6 µg m⁻³ (RFs, Kraków). Looking at the differences between agglomerations, Kraków stands out negatively, with RMSE values usually more than twice as high as those in the other cities.
Scatter plots of observed and forecasted PMx concentrations (Fig. 6) provide additional insight into the abilities of the algorithms across the whole range of predicted values. The stepAIC method is the least capable forecaster of high PM10 concentrations. This is especially visible for the higher concentrations, where the underestimation of PM10 may locally reach up to 200 µg m⁻³ (Kraków). For the other cities, the scale of underestimation of the highest concentration levels with stepAIC is lower and usually does not exceed 100 µg m⁻³. For lower concentrations, this algorithm shows much higher over- and underestimation compared to the other methods. This is to be expected, as linear methods usually perform poorly when modelling highly non-linear processes. The other ML algorithms utilised in this research are more consistent with the linear shape of the observed vs. forecasted scatter plots. The RF forecasts tend to underestimate higher PM10 concentrations, although this usually does not exceed 50 µg m⁻³. Neural networks, despite showing a considerably higher spread across the range of PM10 concentrations, seem to be more consistent for the higher levels of PM10 than RFs. The XGBoost algorithm shows the highest consistency: the spread is minimal across the whole range of observed PM10 concentrations, and a minute underestimation is visible only in the case of Poznań.

Binary Forecast
The second approach, a categorical forecast of the threshold exceedance for PMx concentrations, yields similar outcomes regarding the quality ranking of the ML algorithms applied to PM short-term forecasting. In all cases, the worst performance is for the stepAIC algorithm, then neural networks, and finally RFs and XGBoost, with quality metrics at relatively similar levels (Figs. 7 and 8). What seems striking is the apparent variability between agglomerations. In the case of Tricity in northern Poland, there is a rather significant shift in the forecast model metrics. For example, the Success Ratio (defined as 1 - FAR) changes from 0.59 (nnets) to 0.95 (XGBoost). A similar increase is apparent for POD: from 0.50 (stepAIC) to 0.81 (RFs). The least improvement in categorical PM10 forecasts occurs in Kraków, where all the ML algorithms perform quite well, with POD over 0.9 and the Success Ratio near or over 0.9. In the case of PM2.5, a similar pattern emerges. However, in Poznań and Łódź the improvement appears for the Success Ratio metric (from ca. 0.75 to 0.97), with the POD at relatively similar levels. In Kraków, the differences in the quality metrics are minimal, and all the models perform very well in terms of POD and SR.

DISCUSSION AND CONCLUSIONS
Unsurprisingly, air quality forecasts for Polish agglomerations have garnered increasing public attention-to the point of recently becoming a common element of national and local weather forecasting services during the winter-as they warn residents of PM10 and PM2.5 exceedances and the consequent restrictions, e.g., bans on burning in fireplaces (when other heating systems are available), limits on outdoor time for children, and temporarily free public transport. In most cases, these decision-making systems rely on numerical weather prediction models coupled with a chemistry module and often suffer from imperfect emission inventory input on a local scale (Dore et al., 2014). Hence, this study applied the perfect prog approach to machine learning models for short-term air quality forecasting (Gutierrez et al., 2019).
Of the ML models we tested, the decision tree algorithms-RFs and XGBoost-unquestionably exhibited the best performance in simulating PM regardless of the analysed location or particle diameter. According to the administrative threshold, the probability of detecting days with poor air quality varied from 0.80 in Tricity to 0.98 in Kraków (Tables 4 and 5), whereas the false alarm ratio varied from 0.05 to 0.11 for these algorithms. The more informative Heidke Skill Score, which factors the frequency of situations with bad or good air quality into its calculation, was relatively stable, with values of 0.83-0.88.
Depending on the location of the sampling station, the regression models yielded Pearson correlation coefficients of 0.97-0.99 and RMSE values of 4.0-13.5 for PM10 and 2.2-9.6 for PM2.5. However, the corresponding statistics for the nnet models were significantly lower (r = 0.89-0.94), which we attribute to the relatively small sample size, an assumption that was confirmed during the cross-validation tests, as models trained on relatively small datasets frequently display overfitting. Multiple linear regression using stepwise AIC achieved the worst results in all of the analysed cases, with r ranging from 0.76 (PM2.5 in Tricity) to 0.87 (PM10 in Kraków) and RMSE ranging from 8.0 (PM2.5 in Tricity) to 32.8 (PM10 in Kraków) (Tables 4 and 5). As illustrated in the Taylor diagrams (Fig. 5), all of the models exhibited pronounced variance flattening: the reconstructed standard deviation values were always lower than the observed ones, a persistent issue in statistical modelling (Maraun, 2013).
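The continuous verification statistics reported here, and the variance-flattening diagnostic visualised on a Taylor diagram, can be computed directly from paired observed and forecast series. A minimal sketch using the standard definitions (the function name and toy series are ours, not from the study's data):

```python
import math

def continuous_scores(obs, pred):
    """Pearson correlation, RMSE, and the ratio of the forecast to the
    observed standard deviation; a ratio below 1 indicates the variance
    flattening seen on a Taylor diagram."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_pred = sum(pred) / n
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(obs, pred)) / n
    sd_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs) / n)
    sd_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in pred) / n)
    r = cov / (sd_obs * sd_pred)
    rmse = math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / n)
    return r, rmse, sd_pred / sd_obs
```

Note that a forecast can correlate almost perfectly with the observations while still underestimating their spread, which is exactly the combination (high r, normalised standard deviation below 1) we report above.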
Nevertheless, in our opinion, the normalised standard deviation of the forecasted time series, despite being approximately 10-15% lower than the observed values, substantiates the high applicability of the XGBoost, RF, and nnet ensemble. Specifically, for two reasons, our regression models achieve better verification scores than those reported in other studies on PMx forecasting with ML (e.g., Nidzgorska-Lencewicz, 2018; Joharestani et al., 2019).
First, most previous studies used instantaneous or hourly values as model input, whereas we targeted daily averages. Not unexpectedly, predicting weather conditions or anthropogenic emissions related to routine activity is significantly easier on a daily than on a subdaily scale. Additionally, PMx concentrations typically exhibit seasonal consistency, with rather gradual day-to-day shifts, and the sharp, sporadic changes tend to be closely tied to meteorological events.
Second, we applied perfect prog, which is based on a quality-controlled observational and reanalysis dataset. The models benefitted primarily from the PBL height and the 1-day-lagged air quality data and, to a lesser degree, from the temperature and wind data (Fig. 4). This may prove challenging in real-time application, especially with regard to the PBL, since the typical RMSE for this parameter in a 1-day forecast is approximately 250 m (Lavers et al., 2019). Furthermore, the RMSE values for the hourly air temperature and wind speed are approximately 2.3°C and 1.5 m s-1, respectively (Kendzierski et al., 2018), although they are significantly better for daily aggregates (e.g., < 1°C for the daily mean of the near-surface air temperature). The variable importance of precipitation is very low; therefore, wet scavenging processes, even though precipitation is often considered the most challenging variable in weather forecasting models, have a negligible effect on the ML models' predictions.
The perfect prog approach we used assumes a constant relationship between the predictors and the predictands, as diagnosed from the training dataset (Benestad et al., 2008). Therefore, one must keep in mind that flawed meteorological data, changes to the PMx measurement devices or the stations' locations, or any other factors that decrease the homogeneity of the input data-including long-term strategies that reduce anthropogenic emissions, such as thermal modernisation or clean transportation programs-may negatively affect the predictive accuracy. Although these issues may need to be considered in real-world air quality forecasting, they were not detected during our period of analysis, nor did they impact the models in a statistically significant way.
The tested machine learning models have proven to be a useful tool for forecasting short-term changes in air quality in the selected Polish agglomerations and can thus be utilised in air quality management as a component of a city's warning system. The current state of ML modelling offers cost- and computationally effective solutions that can be coupled with numerical weather prediction systems in a real-time hybrid approach without requiring a detailed emission inventory. In the future, we plan to extend the time scale of the predictions to at least three days.