Assessment and Comparison of Multi-annual Size Profiles of Particulate Matter Monitored at an Urban-industrial Site by an Optical Particle Counter with a Chemometric Approach

The size of airborne particles is a key air quality parameter that is related to their composition, transport properties and effects on human health and the environment. Optical particle counters (OPCs) are increasingly used to dynamically characterize the size of ambient air particles. Monitoring campaigns lasting several months or even years generate millions of individual data values that must be effectively processed to extract information. Data mining algorithms as SelfOrganizing Map (SOM) can support exploratory data analysis and pattern recognition in aerosol science. The use of SOMs, which offer powerful visualization features using 2D maps, allows us to interpret a large amount of data while avoiding any loss of information on variability from pre-treatments, such as compacting data recorded every minute to hourly or daily means. In the present study, we processed the data collected with an OPC during a long-term monitoring campaign (almost 3 years) conducted near residential buildings positioned very close to a steel plant and used them to assess and compare particulate matter (PM) profiles. About 12 million individual recorded values in total were handled. The current approach enabled us to identify four main PM profiles, follow their variation over time, and relate the differences to changes in the plant management and processes. Furthermore, it is potentially broadly applicable in high-frequency, long-term air quality monitoring campaigns employing different types of instruments to characterize the particle size and chemical composition of both PM and gases.


INTRODUCTION
Air particulate matter (PM) is a key air quality parameter which can be related to composition, exposure and effects on humans (WHO, 2013;Chen et al., 2016;Yang et al., 2019) and it can be characterized with several different metrics. PM can be vehicle of pollutants such as heavy metals and/or persistent organic pollutants (POPs) that can be transferred to biota (Kodnik et al., 2015;Fellet et al., 2016;Popek et al., 2017;Kłos et al., 2018). It is renown that PM dimensions are a key parameter which is related to the ability of PM to retain and transport specific pollutants (Bernardoni et al., 2017;Barbas et al., 2018;Giungato et al., 2018;Shen et al., 2019).
Optical particle counters (OPCs) are increasingly used for ambient air characterization in the particle size low micrometer range. They are easy to use and can be run with little on-site maintenance for long-term monitoring. In OPCs the light is scattered by sampled PM providing information about particle number and dimensions. Moreover, nowadays there is a growing interest in assessing the reliability of low-cost OPCs aiming to integrate the existing air quality regulatory networks (Bulot et al., 2019;Feinberg et al., 2019).
The most part of commercially available OPCs are able to record data per minute, or minute fraction. Thus a monitoring campaign lasting several months or years produces millions of single values to be elaborated for extracting information and knowledge to reach the specific aim of the study (city air quality assessment, hot spot industrial monitoring, comparison between different instruments, etc.).
In scientific literature there are several examples of multivariate analysis approaches for elaborating OPC data recorded in ambient air. Khan et al. (2015) present a study regarding minutely resolved data recorded in a 25-day monitoring campaign in a semi-urban tropical environment.
In Dominick et al. (2015) a study on a 3-month survey recording data per minute during a northeast monsoon in Malaysia is discussed. Agudelo-Castañeda et al. (2016) monitored PM size classes near a highway recording data every 15 min for 2 years. All the above mentioned studies perform principal component analysis (PCA) on hourly averages and show the resulting PC loadings by a table, no graphs showing PC scores are present. Etzion and Broady (2018) study concerns a 5-month monitoring in an urban neighborhood in which minutely resolved OPC data are elaborated by wavelet analysis. Ranalli et al. (2016) propose the use of functional exploratory data analysis of data collected every 6 seconds for 45 days in an urban site, but with a rather long computing time (several hours on a general-purpose desktop computer).
To our knowledge studies involving ambient air OPC data elaboration by neural networks are not present in literature.
In a recent publication we proposed a chemometric approach based on Self-Organizing Map (SOM) algorithm to elaborate OPC data recorded during a 3-month survey to assess the difference of PM profiles perceived at two different residential sites positioned near an industrial site (Licen et al., 2019). The Self-Organizing Map is an artificial neural network algorithm for unsupervised pattern recognition (Kohonen, 2001). SOMs have been applied to analyze ambient air data collected with monthly, weekly or irregular frequency (Astel et al., 2013;Zhong et al., 2017;Romanić et al., 2018;de Oliveira et al., 2019), handling few thousands of data points. A much challenging task is represented by data with higher frequency, as it is the case of "quasi-real-time" air monitoring; first case studies have been proposed considering few months for odor nuisance characterization by electronic noses (Licen et al., 2018a, b). In several studies, to simplify data handling, the data are reduced to hourly or daily means, discarding information on intra-hour variability. The use of SOM algorithm allows instead to elaborate a high number of data without omitting information on data variability associated with data pre-treatments. Respect to PCA also possible non-linear relationships between variables can be detected, the data noise is reduced by the intrinsic properties of SOM algorithm and, additionally, SOM provides relevant visualization features using 2D maps (Kohonen, 2001) whereas in the presence of a very high number of samples visualizing PCA scores in a graph with a clear appearance is rather complex. Moreover, the approach permits to relate the model outcomes with data recorded by institutional air quality monitoring stations, providing meteorological data and pollutant concentrations by normed methods.
In the present study we describe the OPC data elaboration for PM profile assessment and comparison of a long monitoring (nearly 3 years) carried out near residential buildings positioned very close to a steel plant. About 12 million recorded single values have been handled on the whole.

Site
The optical particle counter was positioned near dwellings ( Fig. 1, A) which are distant approximatively 200 m from an integral cycle steel plant in the city of Trieste (NE Italy). The main renown sources of particulate matter of the integrated cycle steel plant are the blast furnace, the raw material storage areas, and the coke oven batteries (Aries et al., 2007;Boscolo and Padoano, 2011;Boscolo et al., 2019). In the past years, studies were conducted on the site to assess impacts on civil dwellings by carcinogenic pollutants and odor emissions (Cozzi et al., 2010;Licen et al., 2016;Licen et al., 2018c).

Wind Direction Data
Hourly wind direction data representative of the meteorological conditions of the city have been collected from a synoptic weather station (Molo Fratelli Bandiera (MFB)) in an open position by the sea 2 km from the plant, and they are available from the regional environmental protection agency (ARPA-FVG) website (OSMER). These data have not been used to build the models described in the study, thus we will refer to them in Sec. 3 as wind "external data."

PM10 Hourly Data
The hourly PM 10 concentration data (µg m -3 ; measured by normalized method EN 12341) collected at Site B ( Fig. 1) positioned at the fence of the steel plant about 200 m from Sampling Site A, were obtained from the ARPA-FVG website (ARPA; ARPA-FVG, 2016). These data have not been used to build the models described in the study, thus we will refer to them in Sec. 3 as PM 10 "external data."

Self-Organizing Map
The SOM unsupervised algorithm works with no need of a priori data grouping or classification. A Self-Organizing Map is organized as a bi-dimensional array of neurons (or units), which are vectors of scalars related to the experimental variable data, that in the present study are the particle count responses of the channels. The array dimensions and starting values of the neurons are set according to heuristic rules proposed by . During the training, the model "learns" from the data, adjusting the neuron values according to the Euclidean distance from the experimental vectors. In this way a new set of vectors (neurons) which still represent the variability of the processed data is obtained. The neuron vectors show simple geometric relationships (distances) and have the same number of variables as the experimental data. This algorithm is able to handle also non-linear relationships between variables and reduces data noise.
The neurons can be represented as hexagons stuck together in a bi-dimensional map allowing visual exploration of the data (Vesanto, 1999). In the present specific application, the neuron vectors represent recurrent PM profiles perceived at the monitoring site. According to Vesanto (1999) and Himberg et al. (2001) the Self-Organizing Map can be explored to observe the distribution of the values of every single modeled experimental variable on the map (heatmap), showing how each one of the original variables relates to the others in the SOM. Moreover, the number (hits) of experimental vectors which are represented by every single map unit can be evaluated.
The recurrent profiles can be further grouped using, e.g., a k-means clustering algorithm  on SOM neurons data set. The second-level clusterization can be represented on the SOM map by a color code to identify regions with relatively homogeneous properties. In this application the clusters represent air type profiles observed at the receptor.
Moreover, as shown in Licen et al. (2019) the results of the model and the cluster assignment can be coupled with other data collected on site by other instruments matching them by date/time, allowing cluster characterization.

Calculations
SOM building and k-means clustering were performed in the Matlab 6.5 (MathWorks, Inc.) computing environment, implementing the SOM toolbox (2000). Algorithm results exploration and 2D map visualization of the outputs were performed using in-house scripts in R software environment (R Core Team, 2016) implemented by the openair package (Carslaw and Ropkins, 2012). In addition, wind direction data and pollutant data were elaborated in R environment. The site map was prepared in R environment implemented by the ggplot2 package (Wickham, 2016) and ggmap package (Kahle and Wickham, 2013). Circlize package (Gu et al., 2014) has been used to produce Fig. 4. The trajectory animation presented in the supplementary material was prepared in R environment implemented by the animation package (Xie, 2013).

Monitoring Results Overview
In the present paragraph we illustrate an overview of results obtained during the 3 years' monitoring, using basic statistics applied to OPC channel counts. The results are shown in Table 1 where mean, median, max and min for each PM dimension split by year are reported.
The variation of PM channel counts for each year by hour of the day, day of the week and month, obtained using timeVariation function present in openair package are reported in the supplementary material.

Self-Organizing Maps Evaluation
An independent SOM model has been evaluated for each year. For all the three models the algorithm parameters were set accordingly to . The number of neurons was initially determined as 5 times the square root of the number of experimental vectors and divided by 4. The square root of the ratio between the two largest eigenvalues of the data set was evaluated and it was used to set the ratio between the map side lengths. Finally, the actual side lengths of the map were set to have their product as close as possible to the initially evaluated number of map units. We obtained SOM maps with a 39 × 21, a 37 × 22 and a 37 × 21 lattice for year 2014, 2015 and 2016 respectively.
In the left part of Fig. 3 the distribution of the modeled experimental variables on the map (heatmaps) are shown for each year. Four gray hues characterize quartiles of a single variable. Black color represents values higher than the 95 th percentile.
It can be seen that for all years the upper part of the heatmaps shows lower number of counts for all the variables. For year 2014 higher number of counts for small PM dimensions are present in the lower left corner of the map while higher counts for large PM dimensions are present in the lower right corner. Heatmaps for year 2016 show a similar behavior while heatmaps for year 2015 seem to be a "mirror" with respect to the other years (Fig. 2).

SOM Maps Clusterization
SOM algorithm allowed to reduce the starting data sets to approximatively 800 recurrent profiles (units) for each year. In order to group the units representing akin PM size profiles, a k-means clustering algorithm  was operated on the matrix composed by the vectors characterizing the SOM units. The algorithm was iterated for 200 epochs for a range of clusters from 2 to 10. In order to select the best clusterization we used the Davies-Bouldin index (Davies and Bouldin, 1979) which is function of the ratio of the within-cluster scatter to the between-cluster separation. The number of clusters for which the corresponding DB index was lower was 4 for each SOM model. The grouping of the units by cluster is reported in the right part of Fig. 2. It can be seen that the larger part of the maps is enclosed in Cluster 1, while Clusters 2-4 are located in the lower part of the maps.

Cluster Characterization with External Data
Aiming to relate the clusters to wind direction and PM10 concentration recorded by the regional EPA (see Sec. 2.3 and 2.4), we built a series of box plots grouping the external data variables by clusters. Every experimental vector recorded by OPC was assigned to the more similar unit, in terms of Euclidean distance, which is called "best matching unit" (BMU). As every unit is assigned to a specific cluster, the cluster value is in this way associated to all the experimental vectors assigned to that specific unit. Recording "date-time" was the only common variable between the OPC data and the external data, thus we were able to assign a cluster value to the "external data" recorded at the same "date-time" to OPC data. At the end of this process the "external-data" were grouped by cluster and represented as box plots, which are shown in Fig. 3 for PM10 and in Fig. 4 for wind direction. In Fig. 4 the interquartile ranges of box plots are laid on a wind rose. Each wind rose represents the extent of the interquartile range separately by year and corresponding to the same cluster. In Fig. 3 it can be noticed that Cluster 1 is neatly related to low values of PM 10 , whereas the other clusters are related to higher values of PM 10 . In particular, Cluster 3 is related with the highest median values in year 2014 and 2016, while in year 2015 an inversion in the behavior of Cluster 3 and 4 can be observed.
Considering the interquartile ranges of wind direction reported in Fig. 4, a similar extent can be observed along the years for each cluster. Moreover, comparing the wind direction with the relative positioning of the monitoring site and the plant reported in Fig. 1, it can be noticed that Cluster 1 is related with wind blowing from the city, conversely the other clusters are associated with wind blowing from the plant.
It is possible to explore and compare the cluster particulate matter profiles representing the variables, as modeled in the SOM units, by box plots. Each box plot is generated gathering together all the values of a variable assumed by the units classified in the specific cluster. The PM profiles are shown in Fig. 5, the values are normalized by variable and the box plot whiskers are omitted for the sake of the readability of the figure. In the supplementary material it is reported another possible representation of Fig. 5 in which a box plot of each modeled variable split by cluster is shown without normalization of the channel counts.
The Cluster 1 to 3 profiles show a similar behavior among    the 3 years, with Cluster 1 displaying low values of counts for all the PM sizes, Cluster 2 moderately higher values from PM07 to PM3 and Cluster 3 high values from PM03 to PM07.
Cluster 4 shows a similar behavior for years 2014 and 2016, with the latter displaying absolute lower values with respect to the former year. The behavior of Cluster 4 in year 2015 shows not very high values of PM3 to PM10 with respect to the trend of the other years. This evidence can be explained considering that in 2015 a change in the plant management and process occurred. In particular, a revamping of the blast furnace has been operated thus the casting activity of the plant has been stopped during some periods of the year to allow makeover operations. As the blast furnace is a renowned source of coarse PM fraction (Boscolo and Padoano, 2011) the effect of the abovementioned changes can be seen in year 2015 were the PM coarse fraction in Cluster 4 is lower and in year 2016, during which, after the setting up of the blast furnace activity, the profile of Cluster 4 shows a similar behavior as in year 2014 but with lower absolute values.

Cluster Frequency Evaluation
Every experimental vector is associated to a best matching unit as described in Sec. 3.3, thus, considering the units grouped into the same cluster, we can count the experimental vectors associated to that cluster and eventually evaluate the recurrence of that cluster during the monitoring period in terms of percentage frequency. In Table 2 the frequency for each cluster for each year is shown.
Clusters show similar frequencies among the year, except for Cluster 3. Going into detail and exploring the frequency distribution per cluster for every month of the year (see tables reported in Appendix A) it can be noticed that Cluster 3 shows higher frequency in autumn and winter for years 2014 and 2015. Thus, considering that in 2016 the monitoring ended in October, it is possible that difference in frequency with respect to the previous years is due to the lower number of experimental data recorded in the autumn season.

PM Profiles' Temporal Evolution
The temporal evolution of the PM cluster profiles can be represented by stacked bar graphs in which each day is represented by a stacked bar built according to the percentage of each cluster assigned to the experimental data. An example regarding the period January-March 2014 is reported in Fig. 6.
The temporal evolution can be followed at a finer detail representing the trajectory (sequence of the BMU assigned to each experimental sample). In the supplementary material we present a graphical animation of 1-day trajectory as well as the minutely normalized PM profiles plotted in the same graph as the profile of the assigned cluster. In this way the cluster profiles can be used as a "reference" to interpret the single recorded experimental sample.

CONCLUSIONS
The approach we presented allowed us to extract variable profiles from millions of individual recorded data and to support model outcomes using data recorded with other  . 6. Example of daily stacked bar graph (Jan-Mar 2014) built according to the percentage of each cluster assigned to the experimental data for each day (Y-axis = percentage, ND = not determined (experimental data not recorded)).
instruments. Three independent Self-Organizing Maps constructed from data collected during years 2014, 2015 and 2016 showed overall coherence, highlighting typical PM size patterns identified as clusters of recurrent profiles recorded annually at the monitoring site. The variability between the 3-year models was associated with i) the wind direction, which identifies the sources of PM affecting the site, and ii) independent PM10 data measured with a beta gauge PM monitor. Slight changes in the PM size profiles between the years were attributed to changes in the management of the nearby steel plant. Our method is potentially broadly applicable in high-frequency, long-term air quality monitoring campaigns employing different types of instruments to characterize the particle size and chemical composition of both PM and gases.

SUPPLEMENTARY MATERIAL
Supplementary data associated with this article can be found in the online version at http://www.aaqr.org.