Sabina Licen 1, Sergio Cozzutto2, Pierluigi Barbieri1 1 Department of Chemical and Pharmaceutical Sciences, University of Trieste, 34127 Trieste, Italy
2 ARCO SolutionS s.r.l., spin-off company of the Department of Chemical and Pharmaceutical Sciences, University of Trieste, 34127 Trieste, Italy
Received:
August 27, 2019
Revised:
January 10, 2020
Accepted:
February 6, 2020
Download Citation:
||https://doi.org/10.4209/aaqr.2019.08.0414
Licen, S., Cozzutto, S. and Barbieri, P. (2020). Assessment and Comparison of Multi-annual Size Profiles of Particulate Matter Monitored at an Urban-industrial Site by an Optical Particle Counter with a Chemometric Approach. Aerosol Air Qual. Res. 20: 800-809. https://doi.org/10.4209/aaqr.2019.08.0414
Cite this article:
Air particulate matter dimensions are a key air quality parameter which can be related to composition, transport properties, exposure and effects on humans and the environment. Optical particle counters are increasingly used for dynamic ambient air particle matter size characterization. Monitoring campaigns lasting several months or years produces millions of single values to be elaborated, requiring effective data treatment procedures for extracting information and knowledge. Data mining algorithms as Self-Organizing Map (SOM) can support exploratory data analysis and pattern recognition in aerosol science. The use of SOM allows to elaborate a high number of data, with powerful visualization features using 2D maps, avoiding to lose information on data variability with data pre-treatments, such as compacting minute data to hourly or daily means. In the present study we describe the optical particle counter data elaboration for particulate matter profile assessment and comparison of a long monitoring time (nearly three years) carried out near residential buildings positioned very close to a steel plant. About twelve millions recorded single values have been handled on the whole. The approach allowed to identify four main particulate matter profiles and follow their variation during the years relating the differences with changes in the plant management and process. The possible applications of the present approach are broad in the field of air quality high frequency long monitoring campaigns with different types of instruments for size and compositional characterization of both particulate matter and gases.HIGHLIGHTS
ABSTRACT
Keywords:
Ambient air; Pollution; Self-Organizing Map; Pattern recognition.
Air particulate matter (PM) is a key air quality parameter which can be related to composition, exposure and effects on humans (WHO, 2013; Chen et al., 2016; Yang et al., 2019) and it can be characterized with several different metrics. PM can be vehicle of pollutants such as heavy metals and/or persistent organic pollutants (POPs) that can be transferred to biota (Kodnik et al., 2015; Fellet et al., 2016; Popek et al., 2017; Kłos et al., 2018). It is renown that PM dimensions are a key parameter which is related to the ability of PM to retain and transport specific pollutants (Bernardoni et al., 2017; Barbas et al., 2018; Giungato et al., 2018; Shen et al., 2019). Optical particle counters (OPCs) are increasingly used for ambient air characterization in the particle size low micrometer range. They are easy to use and can be run with little on-site maintenance for long-term monitoring. In OPCs the light is scattered by sampled PM providing information about particle number and dimensions. Moreover, nowadays there is a growing interest in assessing the reliability of low-cost OPCs aiming to integrate the existing air quality regulatory networks (Bulot et al., 2019; Feinberg et al., 2019). The most part of commercially available OPCs are able to record data per minute, or minute fraction. Thus a monitoring campaign lasting several months or years produces millions of single values to be elaborated for extracting information and knowledge to reach the specific aim of the study (city air quality assessment, hot spot industrial monitoring, comparison between different instruments, etc.). In scientific literature there are several examples of multivariate analysis approaches for elaborating OPC data recorded in ambient air. Khan et al. (2015) present a study regarding minutely resolved data recorded in a 25-day monitoring campaign in a semi-urban tropical environment. In Dominick et al. (2015) a study on a 3-month survey recording data per minute during a northeast monsoon in Malaysia is discussed. Agudelo-Castañeda et al. (2016) monitored PM size classes near a highway recording data every 15 min for 2 years. All the above mentioned studies perform principal component analysis (PCA) on hourly averages and show the resulting PC loadings by a table, no graphs showing PC scores are present. Etzion and Broady (2018) study concerns a 5-month monitoring in an urban neighborhood in which minutely resolved OPC data are elaborated by wavelet analysis. Ranalli et al. (2016) propose the use of functional exploratory data analysis of data collected every 6 seconds for 45 days in an urban site, but with a rather long computing time (several hours on a general-purpose desktop computer). To our knowledge studies involving ambient air OPC data elaboration by neural networks are not present in literature. In a recent publication we proposed a chemometric approach based on Self-Organizing Map (SOM) algorithm to elaborate OPC data recorded during a 3-month survey to assess the difference of PM profiles perceived at two different residential sites positioned near an industrial site (Licen et al., 2019). The Self-Organizing Map is an artificial neural network algorithm for unsupervised pattern recognition (Kohonen, 2001). SOMs have been applied to analyze ambient air data collected with monthly, weekly or irregular frequency (Astel et al., 2013; Zhong et al., 2017; Romanić et al., 2018; de Oliveira et al., 2019), handling few thousands of data points. A much challenging task is represented by data with higher frequency, as it is the case of “quasi-real-time” air monitoring; first case studies have been proposed considering few months for odor nuisance characterization by electronic noses (Licen et al., 2018a, b). In several studies, to simplify data handling, the data are reduced to hourly or daily means, discarding information on intra-hour variability. The use of SOM algorithm allows instead to elaborate a high number of data without omitting information on data variability associated with data pre-treatments. Respect to PCA also possible non-linear relationships between variables can be detected, the data noise is reduced by the intrinsic properties of SOM algorithm and, additionally, SOM provides relevant visualization features using 2D maps (Kohonen, 2001) whereas in the presence of a very high number of samples visualizing PCA scores in a graph with a clear appearance is rather complex. Moreover, the approach permits to relate the model outcomes with data recorded by institutional air quality monitoring stations, providing meteorological data and pollutant concentrations by normed methods. In the present study we describe the OPC data elaboration for PM profile assessment and comparison of a long monitoring (nearly 3 years) carried out near residential buildings positioned very close to a steel plant. About 12 million recorded single values have been handled on the whole. The optical particle counter was positioned near dwellings (Fig. 1, A) which are distant approximatively 200 m from an integral cycle steel plant in the city of Trieste (NE Italy). The main renown sources of particulate matter of the integrated cycle steel plant are the blast furnace, the raw material storage areas, and the coke oven batteries (Aries et al., 2007; Boscolo and Padoano, 2011; Boscolo et al., 2019). In the past years, studies were conducted on the site to assess impacts on civil dwellings by carcinogenic pollutants and odor emissions (Cozzi et al., 2010; Licen et al., 2016; Licen et al., 2018c). The time- and size-resolved monitoring of particulate matter has been performed by an OPC with eight channels (Model 212 Eight Channel Particle Counter; Met One Instruments, Inc., Rowlett, Texas, USA). The instrument permits to discriminate eight size ranges (< 0.3, > 0.3–0.5, > 0.5–0.7, > 0.7–1.0, > 1.0–2.0, > 2.0–3.0, > 3.0–5.0, > 5.0–10.0 µm). Channels will be named from now on in the text respectively as PM03, PM05, PM07, PM1, PM2, PM3, PM5 and PM10. The monitoring period lasted nearly 3 years starting from January 2014 and ending in October 2016. Data count per minute for each channel has been provided with air sampling at 1 L min–1. Hourly wind direction data representative of the meteorological conditions of the city have been collected from a synoptic weather station (Molo Fratelli Bandiera (MFB)) in an open position by the sea 2 km from the plant, and they are available from the regional environmental protection agency (ARPA-FVG) website (OSMER). These data have not been used to build the models described in the study, thus we will refer to them in Sec. 3 as wind “external data.” The hourly PM10 concentration data (µg m–3; measured by normalized method EN 12341) collected at Site B (Fig. 1) positioned at the fence of the steel plant about 200 m from Sampling Site A, were obtained from the ARPA-FVG website (ARPA; ARPA-FVG, 2016). These data have not been used to build the models described in the study, thus we will refer to them in Sec. 3 as PM10 “external data.” The SOM unsupervised algorithm works with no need of a priori data grouping or classification. A Self-Organizing Map is organized as a bi-dimensional array of neurons (or units), which are vectors of scalars related to the experimental variable data, that in the present study are the particle count responses of the channels. The array dimensions and starting values of the neurons are set according to heuristic rules proposed by Vesanto et al. (2000). During the training, the model “learns” from the data, adjusting the neuron values according to the Euclidean distance from the experimental vectors. In this way a new set of vectors (neurons) which still represent the variability of the processed data is obtained. The neuron vectors show simple geometric relationships (distances) and have the same number of variables as the experimental data. This algorithm is able to handle also non-linear relationships between variables and reduces data noise. Details on the SOM algorithm and its general analytical features can be found elsewhere (Zupan et al., 1997; Himberg et al., 2001; Kohonen, 2001; Ballabio and Vasighi, 2012). The neurons can be represented as hexagons stuck together in a bi-dimensional map allowing visual exploration of the data (Vesanto, 1999). In the present specific application, the neuron vectors represent recurrent PM profiles perceived at the monitoring site. According to Vesanto (1999) and Himberg et al. (2001) the Self-Organizing Map can be explored to observe the distribution of the values of every single modeled experimental variable on the map (heatmap), showing how each one of the original variables relates to the others in the SOM. Moreover, the number (hits) of experimental vectors which are represented by every single map unit can be evaluated. The recurrent profiles can be further grouped using, e.g., a k-means clustering algorithm (Vesanto and Alhoniemi, 2000) on SOM neurons data set. The second-level clusterization can be represented on the SOM map by a color code to identify regions with relatively homogeneous properties. In this application the clusters represent air type profiles observed at the receptor. Moreover, as shown in Licen et al. (2019) the results of the model and the cluster assignment can be coupled with other data collected on site by other instruments matching them by date/time, allowing cluster characterization. SOM building and k-means clustering were performed in the Matlab 6.5 (MathWorks, Inc.) computing environment, implementing the SOM toolbox (2000). Algorithm results exploration and 2D map visualization of the outputs were performed using in-house scripts in R software environment (R Core Team, 2016) implemented by the openair package (Carslaw and Ropkins, 2012). In addition, wind direction data and pollutant data were elaborated in R environment. The site map was prepared in R environment implemented by the ggplot2 package (Wickham, 2016) and ggmap package (Kahle and Wickham, 2013). Circlize package (Gu et al., 2014) has been used to produce Fig. 4. The trajectory animation presented in the supplementary material was prepared in R environment implemented by the animation package (Xie, 2013). In the present paragraph we illustrate an overview of results obtained during the 3 years’ monitoring, using basic statistics applied to OPC channel counts. The results are shown in Table 1 where mean, median, max and min for each PM dimension split by year are reported. The variation of PM channel counts for each year by hour of the day, day of the week and month, obtained using timeVariation function present in openair package are reported in the supplementary material. An independent SOM model has been evaluated for each year. For all the three models the algorithm parameters were set accordingly to Vesanto et al. (2000). The number of neurons was initially determined as 5 times the square root of the number of experimental vectors and divided by 4. The square root of the ratio between the two largest eigenvalues of the data set was evaluated and it was used to set the ratio between the map side lengths. Finally, the actual side lengths of the map were set to have their product as close as possible to the initially evaluated number of map units. We obtained SOM maps with a 39 × 21, a 37 × 22 and a 37 × 21 lattice for year 2014, 2015 and 2016 respectively. In the left part of Fig. 3 the distribution of the modeled experimental variables on the map (heatmaps) are shown for each year. Four gray hues characterize quartiles of a single variable. Black color represents values higher than the 95th percentile. It can be seen that for all years the upper part of the heatmaps shows lower number of counts for all the variables. For year 2014 higher number of counts for small PM dimensions are present in the lower left corner of the map while higher counts for large PM dimensions are present in the lower right corner. Heatmaps for year 2016 show a similar behavior while heatmaps for year 2015 seem to be a “mirror” with respect to the other years (Fig. 2). SOM algorithm allowed to reduce the starting data sets to approximatively 800 recurrent profiles (units) for each year. In order to group the units representing akin PM size profiles, a k-means clustering algorithm (Vesanto and Alhoniemi, 2000) was operated on the matrix composed by the vectors characterizing the SOM units. The algorithm was iterated for 200 epochs for a range of clusters from 2 to 10. In order to select the best clusterization we used the Davies-Bouldin index (Davies and Bouldin, 1979) which is function of the ratio of the within-cluster scatter to the between-cluster separation. The number of clusters for which the corresponding DB index was lower was 4 for each SOM model. The grouping of the units by cluster is reported in the right part of Fig. 2. It can be seen that the larger part of the maps is enclosed in Cluster 1, while Clusters 2–4 are located in the lower part of the maps. Aiming to relate the clusters to wind direction and PM10 concentration recorded by the regional EPA (see Sec. 2.3 and 2.4), we built a series of box plots grouping the external data variables by clusters. Every experimental vector recorded by OPC was assigned to the more similar unit, in terms of Euclidean distance, which is called “best matching unit” (BMU). As every unit is assigned to a specific cluster, the cluster value is in this way associated to all the experimental vectors assigned to that specific unit. Recording “date-time” was the only common variable between the OPC data and the external data, thus we were able to assign a cluster value to the “external data” recorded at the same “date-time” to OPC data. At the end of this process the “external-data” were grouped by cluster and represented as box plots, which are shown in Fig. 3 for PM10 and in Fig. 4 for wind direction. In Fig. 4 the interquartile ranges of box plots are laid on a wind rose. Each wind rose represents the extent of the interquartile range separately by year and corresponding to the same cluster. In Fig. 3 it can be noticed that Cluster 1 is neatly related to low values of PM10, whereas the other clusters are related to higher values of PM10. In particular, Cluster 3 is related with the highest median values in year 2014 and 2016, while in year 2015 an inversion in the behavior of Cluster 3 and 4 can be observed. Considering the interquartile ranges of wind direction reported in Fig. 4, a similar extent can be observed along the years for each cluster. Moreover, comparing the wind direction with the relative positioning of the monitoring site and the plant reported in Fig. 1, it can be noticed that Cluster 1 is related with wind blowing from the city, conversely the other clusters are associated with wind blowing from the plant. It is possible to explore and compare the cluster particulate matter profiles representing the variables, as modeled in the SOM units, by box plots. Each box plot is generated gathering together all the values of a variable assumed by the units classified in the specific cluster. The PM profiles are shown in Fig. 5, the values are normalized by variable and the box plot whiskers are omitted for the sake of the readability of the figure. In the supplementary material it is reported another possible representation of Fig. 5 in which a box plot of each modeled variable split by cluster is shown without normalization of the channel counts. The Cluster 1 to 3 profiles show a similar behavior among the 3 years, with Cluster 1 displaying low values of counts for all the PM sizes, Cluster 2 moderately higher values from PM07 to PM3 and Cluster 3 high values from PM03 to PM07. Cluster 4 shows a similar behavior for years 2014 and 2016, with the latter displaying absolute lower values with respect to the former year. The behavior of Cluster 4 in year 2015 shows not very high values of PM3 to PM10 with respect to the trend of the other years. This evidence can be explained considering that in 2015 a change in the plant management and process occurred. In particular, a revamping of the blast furnace has been operated thus the casting activity of the plant has been stopped during some periods of the year to allow makeover operations. As the blast furnace is a renowned source of coarse PM fraction (Boscolo and Padoano, 2011) the effect of the abovementioned changes can be seen in year 2015 were the PM coarse fraction in Cluster 4 is lower and in year 2016, during which, after the setting up of the blast furnace activity, the profile of Cluster 4 shows a similar behavior as in year 2014 but with lower absolute values. Every experimental vector is associated to a best matching unit as described in Sec. 3.3, thus, considering the units grouped into the same cluster, we can count the experimental vectors associated to that cluster and eventually evaluate the recurrence of that cluster during the monitoring period in terms of percentage frequency. In Table 2 the frequency for each cluster for each year is shown. Clusters show similar frequencies among the year, except for Cluster 3. Going into detail and exploring the frequency distribution per cluster for every month of the year (see tables reported in Appendix A) it can be noticed that Cluster 3 shows higher frequency in autumn and winter for years 2014 and 2015. Thus, considering that in 2016 the monitoring ended in October, it is possible that difference in frequency with respect to the previous years is due to the lower number of experimental data recorded in the autumn season. The temporal evolution of the PM cluster profiles can be represented by stacked bar graphs in which each day is represented by a stacked bar built according to the percentage of each cluster assigned to the experimental data. An example regarding the period January–March 2014 is reported in Fig. 6. The temporal evolution can be followed at a finer detail representing the trajectory (sequence of the BMU assigned to each experimental sample). In the supplementary material we present a graphical animation of 1-day trajectory as well as the minutely normalized PM profiles plotted in the same graph as the profile of the assigned cluster. In this way the cluster profiles can be used as a “reference” to interpret the single recorded experimental sample. The approach we presented allowed us to extract variable profiles from millions of individual recorded data and to support model outcomes using data recorded with other instruments. Three independent Self-Organizing Maps constructed from data collected during years 2014, 2015 and 2016 showed overall coherence, highlighting typical PM size patterns identified as clusters of recurrent profiles recorded annually at the monitoring site. The variability between the 3-year models was associated with i) the wind direction, which identifies the sources of PM affecting the site, and ii) independent PM10 data measured with a beta gauge PM monitor. Slight changes in the PM size profiles between the years were attributed to changes in the management of the nearby steel plant. Our method is potentially broadly applicable in high-frequency, long-term air quality monitoring campaigns employing different types of instruments to characterize the particle size and chemical composition of both PM and gases.INTRODUCTION
METHODS
SiteFig. 1. Map of the area of interest. The location of Trieste in NE Italy is shown on the upper right corner of the figure. The sampling site (A), the regional environmental protection agency sampling site (B), the boundary of the steel plant (black line), the position of the coke oven batteries (c.o.), the position of the blast furnace (b.f.), the raw material storage areas (white line polygons; c.s.a. = carbon storage area, m.s.a. = mineral storage area) are highlighted.
Optical Particle Counter Monitoring
Wind Direction Data
PM10 Hourly Data
Self-Organizing Map
Calculations
RESULTS AND DISCUSSION
Monitoring Results Overview
Self-Organizing Maps EvaluationFig. 2. Heatmaps for each SOM model: (a) year 2014, (c) year 2015, (e) year 2016. The filling of the hexagons represents the basic statistics (quartiles = gray scale, upper outliers > 95th percentile = black) of each modeled variable. SOM map for each model: (b) year 2014, (d) year 2015, (f) year 2016. The filling of the hexagons represents the cluster split of the units. The cluster numbers are positioned at the units which are the centroids of the clusters respectively.
Fig. 3. Box plots of PM10 data, recorded by the regional environmental protection agency at B site, split by cluster for each year: (a) year 2014, (b) year 2015, (c) year 2016. The box represents the interquartile range; the thick black line represents the median value; the dashed whiskers extend to the data extremes.
SOM Maps Clusterization
Cluster Characterization with External DataFig. 4. Interquartile range of wind direction data, recorded by the regional environmental protection agency at MFB site, split by cluster and year: (a) Cluster 1, (b) Cluster 2, (c) Cluster 3, (d) Cluster 4. The thick black line represents the median value.
Fig. 5. Particulate matter profiles of units split by clusters for each year represented by box plots.
Cluster Frequency Evaluation
PM Profiles’ Temporal EvolutionFig. 6. Example of daily stacked bar graph (Jan–Mar 2014) built according to the percentage of each cluster assigned to the experimental data for each day (Y-axis = percentage, ND = not determined (experimental data not recorded)).
CONCLUSIONS
REFERENCES