Department of Environmental Engineering, National Ilan University, Yilan 26047, Taiwan
Received: November 2, 2019
Revised: November 23, 2019
Accepted: November 24, 2019
Cite this article:
Chang, I.C. (2019). Identifying Leading Nodes of PM2.5 Monitoring Network in Taiwan with Big Data-oriented Social Network Analysis. Aerosol Air Qual. Res. 19: 2844-2864. https://doi.org/10.4209/aaqr.2019.11.0554
ABSTRACT
The Taiwan Environmental Protection Administration (TEPA) currently regulates six types of air pollutants based on the AQI. Among these, the three most prone to exceeding the standard are PM2.5, PM10, and O3, in that order. PM2.5 pollution episodes in Taiwan mainly occur in winter and spring, when the northeast monsoon prevails. In addition to local pollution sources, transboundary air pollution affects Taiwan. Analyzing existing air quality (AQ) monitoring data from a big data (BD)-oriented perspective not only reduces the simulation, calculation, and verification resources required by AQ models but also provides real-time insight into the causal relationships between AQ and important parameters of meteorology, pollution sources, and regions. This study integrates the BD-oriented Social Network Analysis (SNA) approach and data visualization tools to analyze the event co-occurrence and spatial correlation characteristics of AQ monitoring stations under two severe PM2.5 pollution scenarios: (1) the Z-value of the PM2.5 daily average concentration is higher than 1.65 (Scenario I), and (2) the daily average concentration of PM2.5 exceeds TEPA’s AQ standard (Scenario II). The aim is to identify the regional leading nodes suitable for each pollution scenario. Furthermore, Principal Component Analysis (PCA) and time series data are employed to verify the spatial-temporal representativeness of these leading nodes, which can support real-time AQ management decision-making and early warning of transboundary pollution in the future. This study contributes to the application of a discrete data-driven approach (SNA) and a continuous data-driven approach (PCA) to an ambient AQ monitoring network, which together explain the regional high-pollution characteristics of PM2.5 in Scenarios I and II. The results of this study are consistent with previous relevant findings in Taiwan.
Keywords:
PM2.5 pollution scenarios; Transboundary air pollution episodes; Social Network Analysis; Betweenness centrality.
The health threats caused by PM2.5, such as allergic rhinitis, asthma, severe chronic diseases, and premature death (Franklin et al., 2007; Kloog et al., 2013; Kim et al., 2018), are critical issues around the world. Pope et al. (2002) observed that long-term exposure to PM2.5 increases the morbidity rates of general diseases, cardiopulmonary diseases, and lung cancer by 4%, 6%, and 8%, respectively, for every 10 µg m–3 increase in PM2.5. Xie et al. (2015) analyzed three years (2010–2012) of epidemiological data collected in the Beijing area and concluded that PM2.5 affects the morbidity and mortality of angina in Beijing. Hadei et al. (2017) applied an AQ model to assess potential public health problems resulting from exposure to air pollution from 2013 to 2016 and reported that PM2.5 had the greatest short-term health impact of ambient air pollution in the Tehran area. Sahu and Kota (2017) found that PM2.5 was the primary air pollutant in the Indian capital from 2011 to 2014, with a daily exceedance ratio of around 85%. Zhao et al. (2018) indicated that the Kaifeng area in China experienced its worst ambient AQ during the winters of 2015 to 2017. Thus, to protect human health, the WHO (2005) set the recommended annual and daily average standards for PM2.5 at 10 and 25 µg m–3, respectively. As in many countries, Taiwan’s PM2.5 pollution level often exceeds WHO standards. Episodes and scenarios related to air pollution are widely used as essential elements in formulating regulations on air pollution, AQ simulation, and air quality management (AQM). However, simulating secondary air pollutants and related topics requires considerable time and computational resources to process the complex physical and chemical reactions involved. Further, the three-dimensional grid AQ simulation model entails high resource costs, making it unsuitable for identifying a pollution episode in real time. 
Yet the simulation approaches mentioned above can be simplified through the use of appropriate air pollution episodes. Conventional screening of air pollution episodes and scenarios begins with classifying meteorological patterns and preparing relevant statistics. However, meteorological classifications or AQ models alone are not feasible for instantly selecting appropriate and statistically representative pollution scenarios. As a result, some typical statistical approaches have been applied to the screening of air pollution episodes using AQ monitoring data (Meyer et al., 1997). Fang et al. (2017) analyzed the concentration characteristics of PM2.5 during haze periods in the Changchun area in China and found high Pearson correlation coefficients between PM2.5 and CO, both of which were affected by pollutant emissions and meteorological conditions. They also indicated that on haze days the wind speed, temperature, and pressure are low and the relative humidity (RH) is relatively high, and that stable weather during haze days makes the pollution heavier. Multivariate statistical analysis approaches, especially Principal Component Analysis (PCA) and Cluster Analysis (CA), have also been applied (Kuebler et al., 2002; Beaver and Palazoğlu, 2006). Zhang et al. (2014) used PCA and a nonparametric T2 control chart to predict episodes of over-standard ozone concentrations. Sun et al. (2015) predicted over-standard ozone episodes by adopting Generalized Linear Mixed Effects Models (GLMMs) and also used BD-oriented Machine Learning (ML) classification models with lower prediction errors. Currently, data-oriented and BD-oriented approaches and their applications have drawn the attention of academic researchers and business professionals. However, much potentially useful value remains hidden in large-scale data. 
For instance, the large volumes of AQ monitoring data provided as official open data might reflect critical pollution patterns and relations. Though researchers have attempted to screen for representative pollution episodes or explore possible classifications of pollution patterns using AQ datasets from monitoring stations, they appear to have been unable to identify stations (nodes) that serve as representative indicators of regional pollution patterns. A social network is defined as a set of social entities, such as people, organizations, or countries, with some pattern of relationship between them (Qing et al., 2019; Oliveira and Gama, 2012). In social network analysis (SNA), graphs of the network are used to interpret patterns of social ties, where the nodes represent social entities and the lines represent social interactions between nodes. To provide real-time AQ warnings and evaluate the effectiveness of regional AQ management strategies, it is necessary to identify leading AQ monitoring stations (nodes) that are representative of their air pollution regions. Thus, the aim of this study is to analyze the spatial-temporal characteristics of two severe PM2.5 pollution scenarios and, by harnessing the BD-oriented SNA approach, identify the leading nodes affected by transboundary pollution in the ambient AQ monitoring network (TAQMN) deployed by TEPA. In addition, the PCA approach is used to compare the results. 
In the field of AQM research, the PCA approach is an effective and objective data analysis tool that is often used to simplify variables (Dai et al., 2015; Iodice et al., 2016; Mari et al., 2016; Yao et al., 2016; Chen et al., 2017), identify sources of pollution (Song et al., 2006; Viana et al., 2006; Shi et al., 2009; Deka et al., 2014; Huang et al., 2015; Luo et al., 2015; Gao et al., 2016; Arhami et al., 2017; Widiana et al., 2017), classify meteorological patterns (Cheng and Lam, 2000), and evaluate model diagnostics (Dominick, 2012; Eder et al., 2014; Li and Wen, 2014). To summarize, PCA has two advantages: (1) it simplifies the relevant variables of air pollution, which saves analysis effort; and (2) it is commonly used for screening pollution episodes and offers more objective statistical representativeness than meteorological classification. For example, Yu (2013) and Yu et al. (2005, 2006) adopted PCA to screen O3 and PM10 pollution episodes in Taiwan and also analyzed the AQ and meteorological characteristics of these episodes. Big data (BD) is a revolutionary phenomenon, widely explored in scientific and business research. Several definitions of BD have been proposed, among which attributive, comparative, and architectural definitions play a significant role in shaping how BD is viewed (Hu et al., 2014; Pankaj et al., 2015; De Mauro et al., 2016; Makrufa and Aybeniz, 2016). In sum, regardless of the amount of data, any new understanding obtained from data, and any new data value created by useful technologies and tools, may be regarded as the product of a BD-oriented approach. Lausch et al. (2015) suggested that the BD-oriented method of building air pollution receptor models can provide real-time predictions and analyses for a large amount of data and perform functions such as correlation (association) analysis, prediction, regression, segmentation (classification), and clustering (grouping). Vardoulakis et al.
(2003) argued that important factors affecting urban PM2.5 include meteorology, traffic flow, personnel mobility, road network structure, and the receptor points of concern. Therefore, the BD-oriented method may be used to analyze these factors simultaneously. Zheng et al. (2013) adopted BD-oriented analysis techniques, such as Artificial Neural Network (ANN) and Conditional Random Forest (CRF), to combine the historical records of AQ stations with real-time AQ data, and then archived the parameters of the important PM2.5 factors mentioned by Vardoulakis et al. (2003) to estimate real-time urban PM2.5. Karner et al. (2010) collected roadside monitoring data from 37 research papers published from 1983 to 2007, including more than 600 measurements of air pollutant concentrations at different distances from roads, and used the BD-oriented method to analyze the impact of roads on air pollution concentrations. Their results indicated that pollutant concentrations attenuate to background values at road distances ranging from 80 to 600 m, and that concentrations at a road distance of 150 m dropped by at least 50%. SNA aims to understand networks and their participants and has two main focuses: the actors, who form the nodes, and their relationships in a specific social context (Cachia, 2008; Stevan et al., 2019). Zhang and Peng (2017) indicated that SNA-related research is a hot topic that has been applied to substantive problems in many subjects and disciplines in recent years, in particular in the field of management science. SNA is not a formal theory in sociology. Instead, it is a strategy for investigating social structures through networks and graph theory. Typically, SNA is both a concept and a method for studying social structures, organizational systems, interpersonal relationships, and group interactions. 
The SNA methodology quantifies and analyzes complex relationships, revealing their structures and interpreting their phenomena using visual graphs (i.e., sociograms) and quantitative measures of social networks (Brass et al., 2004; Charles, 2011; Grunspan et al., 2014; Anthony, 2017). Furthermore, SNA can be regarded as a discrete data-driven analytical technology useful in BD-oriented approaches for examining hidden and valuable rules and knowledge through the interactive relationships of nodes and ties in a social network. SNA centrality analysis is a family of concepts for characterizing the structural importance of a node’s position in a network. A node-level centrality measure describes the relationship between the observed node and its surroundings. Degree, closeness, and betweenness centrality are the most crucial and well-known centrality measures and are widely used in SNA. Node-level centrality measures may be further standardized as dimensionless or percentage values so that they are not affected by the size of the network (Freeman et al., 1991; Borgatti, 2005; Rong, 2013). According to Borgatti and Everett (2006), Freeman (2004), and Nieminen (1973, 1974), the three well-known SNA centrality measures are defined as follows: (1) The degree centrality of a node is the number of ties the node has (in graph-theoretical terminology, the number of edges adjacent to the node). It is used to count the links of each node. (2) The closeness centrality of a node is equal to the total distance (in the graph) of this node from all other nodes. It is used to measure the proximity of a certain node to the others in the network. (3) Betweenness centrality may be defined loosely as the number of times a node must pass through a given node to reach another node. It is thus the number of shortest paths that pass through a given node. 
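The three centrality definitions above can be illustrated with a minimal Python sketch. The five-node network in `EDGES` is hypothetical and not taken from this study; the function names are illustrative only, ties are treated as undirected and unweighted, and the standardization divisor (g − 1)(g − 2)/2 is assumed to be the usual maximum for an undirected g-node network.

```python
from collections import deque

# Hypothetical undirected, unweighted network (not from the study):
# a path A-B-C-D with an extra leaf E attached to C.
EDGES = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]


def build_adjacency(edges):
    """Return {node: set of neighbours} for an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs(adj, source):
    """Breadth-first search from source.

    Returns the visit order, geodesic distances, the number of shortest
    paths from source to each node, and shortest-path predecessors.
    """
    dist = {source: 0}
    sigma = {n: 0 for n in adj}
    sigma[source] = 1
    preds = {n: [] for n in adj}
    order = []
    queue = deque([source])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    return order, dist, sigma, preds


def degree_centrality(adj):
    """Number of ties of each node."""
    return {n: len(nbrs) for n, nbrs in adj.items()}


def closeness_total_distance(adj):
    """Total geodesic distance from each node to all reachable nodes."""
    return {n: sum(bfs(adj, n)[1].values()) for n in adj}


def betweenness_centrality(adj, standardized=False):
    """Sum over node pairs of the fraction of shortest paths through a node."""
    cb = {n: 0.0 for n in adj}
    for s in adj:
        order, _, sigma, preds = bfs(adj, s)
        delta = {n: 0.0 for n in adj}
        for w in reversed(order):  # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                cb[w] += delta[w]
    g = len(adj)
    for n in cb:
        cb[n] /= 2.0  # each unordered pair was counted from both endpoints
        if standardized and g > 2:
            cb[n] /= (g - 1) * (g - 2) / 2.0  # assumed undirected maximum
    return cb


if __name__ == "__main__":
    adj = build_adjacency(EDGES)
    print(degree_centrality(adj))
    print(closeness_total_distance(adj))
    print(betweenness_centrality(adj, standardized=True))
```

In this toy network node C lies on the most shortest paths, so it has the highest betweenness even though closeness (as a total distance) is smallest there; this is the sense in which betweenness picks out brokers.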
Evelien and Ronald (2002) and Faust (1997) deemed that betweenness centrality measures the ability of a node to broker the delivery of messages between any two other nodes. In recent years, SNA has been widely applied in the fields of information management and social science for characteristics classification and identification (Manski, 2013; Hasan and Ukkusuri, 2014; Suthaharan, 2014; García-Palomares et al., 2015; Amerini et al., 2017; Kim and Hastak, 2018). Li et al. (2018) applied SNA to provide an up-to-date bibliometric view of current life cycle assessment (LCA). However, SNA has been less used in environmental management, especially in AQM research in Taiwan. For that reason, the application of SNA in this study represents an innovative research application in the field of AQM. Since this study addresses the crucial characteristics of nodes in a social network, the betweenness centrality measure, which has been successfully applied in many disciplines (Gómez et al., 2013; Wrzus et al., 2013; Grunspan et al., 2014; Tian et al., 2019), plays a key role in this study. Accordingly, we use it to identify the leading nodes in our social network structure. In sum, betweenness centrality is a useful measure of the extent to which the network is dominated by a few central nodes. The research processes and methods of this study are shown in Fig. 1. They are divided into six sub-phases sequentially based on a systematic CRISP-DM (Cross-industry standard process for data mining) cycle process of the BD-oriented methodology (Shafique et al., 2014). The phases shown in Fig. 1 include: (1) Project understanding: what this study expects to gain from the data, including the objective, relevant references, appropriate methods, scope, and limitations of the research; (2) Data understanding: The data understanding phase of the CRISP-DM involves scrutinizing the data available for analysis. 
This step is critical in avoiding unexpected problems during the next phase, data preparation, which is typically the longest part of a study or project; (3) Data preparation: this phase is one of the most important and often most time-consuming aspects of data analysis. In fact, it is estimated that data processing usually takes 50–70% of a project's time and effort; (4) Modeling: determining the most appropriate analytical methods, tools, or models for analyzing the data based on the previous phases of project and data understanding; (5) Evaluation: at this stage, the assessment of whether the project’s outcomes meet the research requirements or expectations is formalized. This step requires a clear understanding of the stated research objectives and of the key points to be interpreted or verified; (6) Deployment: the process of using or discussing the new insights and findings to advance understanding and application in the field. In the data understanding phase of this study, information on the synoptic weather pattern and PM2.5 monitoring data are collected by ETL (Extract-Transform-Load) rules from the open data cloud service of TEPA. After data collection, it is necessary to prepare and process the data for further research. Data preparation involves cleaning, summarizing, scaling, and normalizing the data. The processed data are subsequently imported into the SNA and PCA modeling and analysis phase. The results are validated and analyzed using data visualization tools. With these tools, the relationships between paired nodes in the network are identified through the number of connections (i.e., tie-strength). Further, the map of betweenness centrality measures is used to determine the characteristics of the leading nodes and to differentiate the pollution areas they cover or link. 
Furthermore, the duration lag and effect related to PM2.5 transboundary pollution are also identified from a set of organized data by using the data visualization approach. The AQ monitoring network in Taiwan was planned by TEPA in 1990 and officially began operating in September 1993. To date, there are 76 AQ monitoring stations (Fig. 2), comprising 63 general stations, 4 industrial stations, 2 national park stations (one of which is also a general station), 4 background stations (two of which are also general stations), and 6 traffic stations. Since August 2005, PM2.5 monitoring has been added to various stations in order to observe and understand the characteristics of PM2.5 pollution. On May 14, 2012, the PM2.5 concentration threshold in the AQ standard was amended to a 24-hour (daily) average of 35 µg m–3. TEPA currently provides the aforementioned AQ monitoring data as open data, so that users can download and use the data themselves. This study therefore collected the hourly PM2.5 monitoring data from this source according to the defined critical air pollutant and spatial-temporal scope. This study selected the concentration conditions of pollution episodes in the following two severe PM2.5 pollution scenarios. (1) The daily average concentration of PM2.5 at any station has a high Z-value (> 1.65). This indicates that the PM2.5 concentration at the station is in roughly the top 5% of its own distribution (the one-sided 95% level). According to historical AQ monitoring data from Taiwan, this phenomenon is most likely to be driven by transboundary air pollution. (2) The daily average concentration of PM2.5 exceeds the standard (35 µg m–3) regulated by TEPA. Keeping concentrations below the AQ standard has always been an objective of the authorities. 
Therefore, the number of days exceeding the standard limit can be used as a performance evaluation indicator for an AQM strategy. According to the defined temporal scope, AQ monitoring data were collected for the period from 2015 to 2017, and the daily (24-hour) average concentration of PM2.5 at each AQ monitoring station was taken as the sample value. The daily PM2.5 concentration data can be treated as a collection of 76 time series vectors (76 stations × 1096 days). Subsequently, these monitoring data were normalized by using the following equation (Eq. (1)):

$$ Z_{ik} = \frac{C_{ik} - \mu_i}{S_i} \qquad (1) $$

where Zik is the Z-value of the PM2.5 concentration at station i on the kth day; Cik is the PM2.5 concentration at station i on the kth day; µi is the average PM2.5 concentration at station i; and Si is the standard deviation of the PM2.5 concentration at station i. The conditions Zik > 1.65 and Cik > 35 µg m–3 on the kth day were designated as the concentration criteria defining Scenarios I and II of this study, respectively. The methodological procedure of SNA includes determining the cooperation among nodes and analyzing hidden relationships (Podobnik and Lovrek, 2015; de-Marcos et al., 2016; Erfanmanesh and Hosseini, 2016; Schlattmann, 2017). The SNA methods in this study are summarized and explained as follows. Before analyzing the SNA centrality measure and constructing the sociogram, a 2-mode affiliation matrix (i.e., sociomatrix) (1096 days × 76 stations) with binary entries is constructed for each scenario once the PM2.5 concentration criteria of both scenarios are confirmed. In SNA, 2-mode data refers to data recording ties between two different sets of entities. The best known example of a 2-mode SNA is the study of class and race by Davis, Gardner, and Gardner (DGG) published in their classic book Deep South (Davis, 1941). Table 1(a) presents a simplified example. 
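As a minimal sketch of the normalization and scenario criteria described above, the following Python fragment computes station-wise Z-values and builds the binary day-by-station (2-mode) matrix for each scenario. The station names and concentrations in `daily` are hypothetical, the function names are illustrative, and the use of the sample standard deviation is an assumption, since the text does not say whether the sample or population form was used.

```python
from statistics import NormalDist, mean, stdev

Z_THRESHOLD = 1.65      # Scenario I criterion stated in the text
STANDARD_UG_M3 = 35.0   # Scenario II criterion (TEPA 24-hour standard)

# Hypothetical daily mean PM2.5 values (ug m^-3), two stations, six days.
daily = {
    "ST_A": [12.0, 15.0, 14.0, 13.0, 60.0, 16.0],
    "ST_B": [30.0, 36.0, 33.0, 40.0, 38.0, 31.0],
}


def z_scores(values):
    """Station-wise Z-values: (C_ik - mu_i) / S_i."""
    mu = mean(values)
    s = stdev(values)  # assumption: sample standard deviation
    return [(c - mu) / s for c in values]


def scenario_one(values):
    """1 where the daily Z-value exceeds 1.65, else 0."""
    return [1 if z > Z_THRESHOLD else 0 for z in z_scores(values)]


def scenario_two(values):
    """1 where the daily mean exceeds 35 ug m^-3, else 0."""
    return [1 if c > STANDARD_UG_M3 else 0 for c in values]


def two_mode_matrix(series_by_station, flag):
    """Binary matrix with one row per day and one column per station."""
    stations = sorted(series_by_station)
    flags = {st: flag(series_by_station[st]) for st in stations}
    n_days = len(flags[stations[0]])
    rows = [[flags[st][k] for st in stations] for k in range(n_days)]
    return stations, rows


if __name__ == "__main__":
    # Under a normal distribution, Z > 1.65 is roughly the upper 5% tail.
    print(round(NormalDist().cdf(Z_THRESHOLD), 4))
    for name, flag in (("Scenario I", scenario_one), ("Scenario II", scenario_two)):
        stations, rows = two_mode_matrix(daily, flag)
        print(name, stations, rows)
```

The two criteria can flag different days at the same station: a station with chronically high concentrations may exceed 35 µg m–3 often without ever reaching a Z-value of 1.65, which is why the two scenarios yield different matrices.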
We construct a 2-mode day-by-station SNA matrix X with 4 rows (days d1, d2, d3, and d4) and 5 columns (stations A, B, C, D, and E) in which element xij = 1 if a pollution episode tie was observed between day i and station j, and xij = 0 otherwise. For example, the last row has values 1, 0, 0, 0, 0, indicating that day 4 (d4) has a tie only with the first station (A). By contrast, the third row of the table shows that day 3 (d3) has ties with stations B and E. Most social networks are conceived of as relations among a set of nodes and are therefore represented as a 1-mode matrix or a simple graph or digraph. Typically, an SNA 2-mode (node-by-event) matrix M with binary entries can be converted to a 1-mode (node-by-node) valued square matrix by the matrix product $MM^{T}$ (Borgatti, 2009). Table 1(b) shows a simplified 1-mode (station-by-station) valued square matrix in which element xij indicates the total number of days, out of the 4 days, on which episodes appear simultaneously at stations i and j. We obtain it from Table 1(a) (the 2-mode day-by-station matrix X) by the matrix product $X^{T}X$. For example, the first row has the values 2, 0, 1, 0, 0, indicating that station A has a total of 2 episodes in the 4 days and that it shares one simultaneous pollution episode with station C only. The elements of an SNA 1-mode square matrix are used as tie-weights or tie-strengths between pairs of nodes. Consequently, the value of each element in this 1-mode matrix can be regarded as the total number of days (i.e., tie-strength) on which PM2.5 pollution episodes occurred simultaneously at a pair of AQ monitoring stations over the three years (2015 to 2017). Typically, the SNA sociogram is used to visualize the tie relationships and topology of all nodes in an SNA network after the 1-mode matrix has been developed. As illustrated in Fig.
3, Table 1(b) is a simplified example demonstrating the construction of an SNA sociogram from a 1-mode matrix. The diagonal elements of Table 1(b) are not taken into account when drawing an SNA sociogram. The digits labeled in Fig. 3 are tie-strengths or tie-weights between pairs of nodes; in this study, the tie-strength is the number of days on which PM2.5 episodes appear simultaneously at a pair of AQ monitoring stations (nodes). Liu et al. (2008) indicated that betweenness centrality is the sum of the shortest paths of a given node and is a measurement of a node's ability to mediate and control resources. Nodes with high betweenness centrality are key to knowledge transfer, influencing both the content and fluidity of information dissemination. High betweenness centrality indicates that the node is a broker for the delivery of messages between other nodes in an SNA network. According to Freeman (2004) and Borgatti et al. (2002), the node-level betweenness centrality measure of node i, denoted as CB(ni), can be obtained from the mathematical expression in Eq. (2). It indicates the number of shortest paths (geodesic distance or steps) that pass through a given node and is usually used to gauge the extent to which a node facilitates the flow of information in an SNA network.

$$ C_B(n_i) = \sum_{j<k} \frac{g_{jk}(n_i)}{g_{jk}} \qquad (2) $$

where gjk is the number of shortest paths from node j to node k, and gjk(ni) is the number of shortest paths from node j to node k passing through node i. In addition, the standardized node-level betweenness centrality measure, denoted as CB*(ni), is obtained by dividing Eq. (2) by the maximum possible value of the node-level betweenness centrality in a g-node SNA network, as written in Eq. (3):

$$ C_B^{*}(n_i) = \frac{C_B(n_i)}{(g-1)(g-2)/2} \qquad (3) $$

According to Yu et al. (2006), the relationships between the standardized Z-value and the unrotated principal component value are shown in the following Eqs.
(4) and (5): where Lij is the factor loading of the jth principal component of station i; and Pjk is the component score of the kth variable (i.e., the kth AQ station) in the jth principal component. where λj is the eigenvalue of the jth principal component and also represents the variance of the jth principal component. The principal components are uncorrelated linear combinations of the original variables that successively maximize variance (Jolliffe and Cadima, 2016). Therefore, the first principal component explains the maximum variation in the concentration field. The factor loading of PCA is used not only to represent the correlation between the principal component and each station, but also to understand the spatial characteristics of PM2.5 pollution episodes in this study. In the field of BD, data visualization tools and technologies are essential in analyzing massive amounts of information to make data-driven decisions. A core skill in data science is the presentation of quantitative information and data in graphical form, turning large and small datasets into visuals that are easier for the human brain to understand and process. Therefore, data visualization tools are used in this study to determine the relationships between nodes and to examine the duration lag and effect related to transboundary pollution of PM2.5. Based on the concentration conditions of Scenario I and the established SNA 1-mode square matrix for TEPA’s 76 AQ monitoring stations (nodes), this study performed an SNA betweenness centrality analysis and drew a sociogram (Fig. 4(a)). As shown in Fig. 4(a), the offshore island stations ST_073 (Matsu) and ST_075 (Magong) in Scenario I are connected through simultaneous PM2.5 pollution episodes (Z-value > 1.65) to stations on the main island of Taiwan. 
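The 1-mode square matrix used in this analysis comes from the 2-mode day-by-station matrix through the product XᵀX described in the methods. The sketch below shows that conversion on a 4-day, 5-station toy matrix. Rows d3 and d4 follow the values quoted in the text for Table 1(a); rows d1 and d2 are assumed here, chosen only so that the first row of the result matches the quoted values 2, 0, 1, 0, 0. The function names are illustrative.

```python
STATIONS = ["A", "B", "C", "D", "E"]

# 2-mode day-by-station matrix. Rows d3 and d4 are as quoted in the text;
# rows d1 and d2 are assumptions made for this illustration.
X = [
    [1, 0, 1, 0, 0],  # d1 (assumed)
    [0, 1, 0, 1, 0],  # d2 (assumed)
    [0, 1, 0, 0, 1],  # d3: ties with stations B and E
    [1, 0, 0, 0, 0],  # d4: tie with station A only
]


def one_mode(x):
    """Station-by-station co-occurrence counts, i.e. the product X^T X."""
    n = len(x[0])
    return [
        [sum(row[i] * row[j] for row in x) for j in range(n)]
        for i in range(n)
    ]


def sociogram_edges(matrix, labels):
    """Off-diagonal, non-zero entries as {(node, node): tie-strength}."""
    edges = {}
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if matrix[i][j] > 0:
                edges[(labels[i], labels[j])] = matrix[i][j]
    return edges


if __name__ == "__main__":
    m = one_mode(X)
    for label, row in zip(STATIONS, m):
        print(label, row)
    print(sociogram_edges(m, STATIONS))
```

The diagonal of the result holds each station's own episode count and is dropped when the sociogram edges are listed, in line with the treatment of Table 1(b). For the full network the same product would be taken over the 1096 × 76 matrix of each scenario.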
Consequently, according to the results of this analysis, the essential and regional leading nodes of Scenario I are determined with the help of visualization tools and summarized in Table 2. Based on the concentration conditions of Scenario II and the established SNA 1-mode square matrix of the 76 AQ monitoring stations (nodes), this study likewise performed an SNA betweenness centrality analysis and drew a sociogram (Fig. 6(a)). As shown in Fig. 6(a), station ST_074 (Kinmen) is connected to the PM2.5 pollution episodes of Scenario II on the main island of Taiwan, in particular in the areas south of Taichung in the western part of Taiwan. Therefore, according to these results, the essential and regional leading nodes of Scenario II are determined and summarized in Table 3. PCA was performed on the daily (24-hour) PM2.5 concentrations of the 76 AQ monitoring stations managed by TEPA within the defined temporal scope of this study (2015–2017). As shown in Table 4 and Fig. 7, among the top five principal components, the fifth explained only 1.66% of the variance, and its maximum factor loading was 0.299, which indicates a low correlation that does not merit discussion. As a result, the fifth and subsequent principal components could be ignored, and four important principal components were identified and retained in this study. Additionally, ST_073 and ST_075 are generally regarded as AQ background stations for the main island of Taiwan. In the past, when transboundary air pollution occurred, these two stations often recorded high pollutant concentrations. The geographic locations of stations ST_073 and ST_075 are quite far from Taiwan’s main island. The PCA results of this study indicated that the PC1 factor loading values for stations ST_073 and ST_075 are about 0.47, while the PC2 values are 0.47 and 0.56, respectively. 
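For readers who want to see how an explained-variance share and a set of factor loadings such as those reported above can arise, the following standard-library sketch extracts only the first principal component of a small correlation matrix by power iteration. This is an illustrative substitute, not the procedure used in the study; the three station series are hypothetical, and the loading is taken as the eigenvector entry scaled by the square root of the eigenvalue, which is the usual convention for correlation-matrix PCA.

```python
from math import sqrt
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix; each column is one station's series."""
    n = len(columns[0])
    z = []
    for col in columns:
        mu, s = mean(col), pstdev(col)  # a constant series would give s = 0
        z.append([(v - mu) / s for v in col])
    p = len(z)
    return [
        [sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(p)]
        for i in range(p)
    ]


def leading_component(r, iterations=500):
    """Leading eigenvalue and unit eigenvector of r by power iteration."""
    p = len(r)
    vec = [1.0 / sqrt(p)] * p  # uniform start; suits positively correlated data
    for _ in range(iterations):
        nxt = [sum(r[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(
        vec[i] * sum(r[i][j] * vec[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue, vec


def first_component(columns):
    """Explained-variance share and factor loadings of PC1."""
    r = correlation_matrix(columns)
    eigenvalue, vec = leading_component(r)
    share = eigenvalue / len(r)  # trace of a correlation matrix equals p
    loadings = [sqrt(eigenvalue) * v for v in vec]
    return share, loadings


if __name__ == "__main__":
    # Hypothetical daily PM2.5 series for three stations.
    series = [
        [20.0, 35.0, 50.0, 30.0, 25.0],
        [22.0, 33.0, 55.0, 28.0, 27.0],
        [40.0, 38.0, 45.0, 41.0, 36.0],
    ]
    share, loadings = first_component(series)
    print(round(share, 3), [round(v, 3) for v in loadings])
```

Each loading is the correlation between a station's standardized series and the component score, which is why a maximum loading near 0.3 is read as a weak association.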
This shows a leading correlation of transboundary air pollution between stations ST_073 and ST_075 and the western part of the main island of Taiwan. Hence, this conclusion is consistent with the SNA results of this study. Table 5(a) shows the proportions of the synoptic weather patterns at the offshore station ST_073 during the PM2.5 pollution episodes of Scenario I (Z-value > 1.65). The top four patterns are strong northeast monsoon (31.68%), standard northeast monsoon (22.36%), weak northeast monsoon (11.8%), and high pressure reflux (8.7%). Similarly, Table 5(b) indicates that strong northeast monsoon (23.31%), high pressure reflux (14.66%), standard northeast monsoon (13.91%), and weak northeast monsoon (13.53%) are the top four synoptic weather patterns at the offshore station ST_074 during the PM2.5 pollution episodes of Scenario II (daily average concentration exceeding 35 µg m–3). Figs. 8(a) and 8(b) use Web-charts to show the connections between these four main weather patterns and the seasons in which they occurred at the two offshore leading nodes identified in Scenario I and Scenario II, respectively (tie-strengths in a Web-chart are compared in relative terms). Thus, the strong northeast monsoon, the standard northeast monsoon, the weak northeast monsoon, and the high pressure reflux are the top four weather patterns common to the two scenarios of PM2.5 pollution episodes defined in this study, often occurring in the winter and spring seasons when the northeast monsoon prevails. To understand the impact of transboundary air pollution of PM2.5 in Taiwan, this study adopted the external PM2.5 pollution episode of October 29, 2017, as reported by TEPA, for further analysis. The synoptic weather pattern of that day was a strong northeast monsoon in winter. 
On that day (2017/10/29) at the offshore island station ST_073, the average hourly concentration of PM2.5 at 13:00 was about 20 µg m–3, and the concentration remained below 30 µg m–3 between 14:00 and 17:00. However, the station was affected by transboundary air pollution from 18:00 onward. The average concentration began to increase and exceeded 30 µg m–3, reaching 35 µg m–3 at 19:00, and increased hourly to its highest value of approximately 91 µg m–3 at 03:00 the next day (2017/10/30). It then dropped to 21 µg m–3 at 16:00. As a result, the duration of the impact of the transboundary air pollution on ST_073 may be estimated at approximately 22 hours, from 18:00 on 2017/10/29 to 16:00 the following day (2017/10/30), with a maximum concentration of approximately 91 µg m–3. According to TEPA monitoring data, on 2017/10/29 at station ST_003 (Wanli), the station on the main island of Taiwan nearest to Matsu (108 km south of Matsu), the average hourly concentration of PM2.5 was about 27 µg m–3 at 13:00 and less than 30 µg m–3 between 14:00 and 19:00. However, it began to be affected by transboundary air pollution at 20:00. The average concentration began to increase and exceeded 30 µg m–3, reaching 45 µg m–3 at 21:00, and climbed hourly to its highest value of about 73 µg m–3 at 23:00. It then fell hourly to 23 µg m–3 at 06:00 the following day (2017/10/30). To interpret how stations in different regions of the main island of Taiwan were affected by this PM2.5 transboundary pollution episode, this study employed a PM2.5 concentration contour plot of time (X-axis) against distance south of Matsu (Y-axis). As shown in Fig. 9, station ST_003 is affected by the PM2.5 transboundary pollution episode about 2 hours later than station ST_073. 
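The duration and lag figures quoted above follow directly from the reported timestamps. A minimal sketch of that arithmetic, using only the onset and end times stated in the text (the dictionary layout and function name are illustrative):

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# Onset and end times as quoted in the text for the 2017/10/29 episode.
EVENTS = {
    "ST_073": {"onset": "2017/10/29 18:00", "end": "2017/10/30 16:00"},
    "ST_003": {"onset": "2017/10/29 20:00"},
}


def hours_between(start, end):
    """Elapsed hours between two timestamps given in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600.0


# Duration of the episode at the offshore station ST_073 (Matsu).
duration_073 = hours_between(EVENTS["ST_073"]["onset"], EVENTS["ST_073"]["end"])

# Lag of the onset at ST_003 (Wanli) relative to ST_073.
lag_003 = hours_between(EVENTS["ST_073"]["onset"], EVENTS["ST_003"]["onset"])

if __name__ == "__main__":
    print(duration_073, lag_003)
```

The same subtraction applied to the onset hour at each inland leading node would give the regional lag intervals that the contour plot is used to read off.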
Accordingly, we estimate that the areas from northern Taiwan to the central and Yunlin-Chiayi-Tainan regions (around 100 km, 200 km, and 300 km from Matsu, respectively) will be affected by PM2.5 transboundary pollution episodes approximately 2 to 6 hours later than Matsu. Because the Kaohsiung-Pingtung area in the south has a higher level of PM2.5 concentration, the impact and duration lag of the transboundary pollution episodes are not easily identified there. Based on these results, we estimate the degree (interval) of duration lag caused by PM2.5 transboundary pollution episodes at the leading nodes identified in the different AQ management areas, in order to provide recommendations for generating real-time warnings in the future. The results and discussion of this section are summarized as follows. Scenario I: Figs. 9 and 10(a) show that station ST_073 (Matsu) is a useful offshore leading node linked to the PM2.5 pollution episodes of Scenario I in the northern region of the main island of Taiwan. Furthermore, the northern region is estimated to be affected by PM2.5 transboundary pollution episodes approximately 2 hours later than Matsu, particularly when the northeast monsoon prevails in spring and winter. Station ST_001 (Keelung) is a feasible inland leading node in northern Taiwan, while station ST_064 (Yilan) is a leading node in the Hualien-Taitung region. Stations ST_034 (Xianxi) and ST_038 (Lunbei) are identified as useful inland leading nodes in the central (Taichung-Changhua-Nantou and Hsinchu-Miaoli areas) and Yunlin-Chiayi-Tainan regions, respectively. In addition, these two regions are estimated to be affected by PM2.5 transboundary pollution episodes around 4 and 6 hours later than Matsu, respectively. Station ST_051 (Daliao) is a feasible leading node in the Kaohsiung-Pingtung region.
Nevertheless, the possible interval of the duration lag between this region and the PM2.5 transboundary pollution episodes at station ST_073 is difficult to determine from Fig. 9. Scenario II: Fig. 10(b) indicates that station ST_074 (Kinmen) is a useful offshore leading node linked to the PM2.5 pollution episodes of Scenario II in the areas south of stations ST_030 (Dali) and ST_031 (Zhongming) in western Taiwan. Stations ST_068 (Zhushan), ST_038 (Lunbei), and ST_054 (Zuoying) are appropriate inland leading nodes in Taichung-Changhua-Nantou, Yunlin-Chiayi-Tainan, and Kaohsiung-Pingtung, respectively. However, because the daily PM2.5 concentration rarely exceeds TEPA’s AQ standard value of 35 µg m–3 in the areas north of stations ST_030 and ST_031 in the western and eastern parts of the main island of Taiwan, no characteristics of PM2.5 pollution episodes or leading nodes could be identified or screened there in this scenario. In addition, the proportion of PM2.5 pollution episodes exceeding the AQ regulatory standard value at offshore station ST_073 (approximately 14.7%) is lower than that at station ST_074 (approximately 24.3%) within the study period (2015–2017). Consequently, offshore station ST_073 could not establish a stronger linkage of PM2.5 pollution episodes in this scenario with most stations on the main island of Taiwan. Similarly, according to Fig. 9, stations ST_074, ST_068, ST_038, and other stations are located about 200 km south of Matsu; if PM2.5 transboundary pollution episodes occur at station ST_073, they are estimated to affect the areas south of the central region in the western part of Taiwan about 4 to 6 hours later, particularly when the northeast monsoon prevails in spring and winter. Yet station ST_054 in the Kaohsiung-Pingtung region faces the same problem as in Scenario I.
It is difficult to clearly determine the possible interval of the duration lag in the area affected by the PM2.5 transboundary pollution episodes occurring at station ST_073. The PCA results revealed that the high factor loadings of the first principal component (PC1) were concentrated in the area north of Hsinchu and Yilan counties, consistent with the areas linked by the leading node group in the northern region of SNA Scenario I (few pollution episodes appeared in this region in Scenario II). The high factor loadings of the second principal component (PC2) were concentrated in the western area south of Chiayi county, consistent with the areas linked by the leading node groups of the Yunlin-Chiayi-Tainan and Kaohsiung-Pingtung regions in SNA Scenarios I and II. The high factor loadings of the third principal component (PC3) were concentrated in Taichung city and in Changhua and Yunlin counties, consistent with the areas linked by the leading node group of the central region in SNA Scenario I. Except for the areas north of Miaoli county, PC3 and the areas linked by the leading node group of the central region in SNA Scenario II may be regarded as consistent. The high factor loadings of the fourth principal component (PC4) were concentrated in Hualien and Taitung counties, consistent with the areas linked by the leading node group of the Hualien-Taitung region in SNA Scenario I (few pollution episodes appeared in this region in Scenario II). As a result, the SNA results for these two scenarios can be reasonably matched with the analytical results of the PCA approach, especially in Scenario I. In sum, the analytical results of Scenario I indicate that the offshore leading node, station ST_073 (Matsu), links simultaneously with the PM2.5 pollution episodes of the northern region on the main island of Taiwan.
Stations ST_001 (Keelung), ST_064 (Yilan), ST_034 (Xianxi), ST_038 (Lunbei), and ST_051 (Daliao) can each be identified as inland leading nodes of PM2.5 pollution episodes in particular AQ zones of the TAQMN in Scenario I. In addition, the analytical results of Scenario II show that the offshore leading node, station ST_074 (Kinmen), links simultaneously with the PM2.5 pollution episodes in the areas south of Taichung in the western part of Taiwan. Stations ST_068 (Zhushan), ST_038 (Lunbei), and ST_054 (Zuoying) are each identified as inland leading nodes linking the PM2.5 pollution episodes of Scenario II occurring in the central and southern AQ zones of the TAQMN. The areas from northern Taiwan to the central and Yunlin-Chiayi-Tainan regions are estimated to be affected by PM2.5 transboundary pollution episodes approximately 2 to 6 hours later than Matsu. This study has shown the BD-oriented, discrete data-driven SNA approach to be a reasonable alternative means for determining PM2.5 leading nodes and regional pollution classification in the TAQMN. For Scenarios I and II in the Yunlin-Chiayi-Tainan area, the PM2.5 concentration level of station ST_038 (Lunbei) has a high leading association with the PM2.5 pollution episodes of other nearby stations. Hence, it is a useful leading node for both scenarios in the region. Stations ST_034 (Xianxi) and ST_068 (Zhushan) in the Taichung-Changhua-Nantou region are identified as inland leading nodes in Scenarios I and II, respectively. Scenario I represents PM2.5 pollution episodes in which transboundary pollutants reach Taiwan through the strong or standard northeast monsoon. Since there are no obvious fixed or mobile pollution sources close to station ST_034, it is a reasonable leading node for Scenario I. Scenario II represents a PM2.5 concentration level higher than the AQ regulatory value.
Since station ST_068 is located at the site of major fixed pollution sources and leeward of mobile pollution sources in the central region of Taiwan, its PM2.5 concentration easily exceeds the regulatory standard. For that reason, it is a rational leading node for Scenario II. We adopted the continuous data-driven PCA method to analyze the daily PM2.5 concentrations for comparison with the SNA results. Our findings showed that the PCA regional pollution classification was consistent with the SNA regional classification in Scenario I and, for the area south of Taichung, in Scenario II. Although the PCA method can explain the spatial characteristics of specific air pollution scenarios, it cannot capture the spatial-temporal leading characteristics among AQ monitoring stations. The SNA method, by contrast, can identify leading nodes during the period when the contaminated areas are influenced by PM2.5 transboundary pollution episodes and helps us understand the strength of the spatial-temporal correlation between AQ monitoring stations. Consequently, big data-oriented SNA could supply effective, visualized information on leading nodes for identifying possible sources, distinguishing the spatial correlation of monitoring stations, and evaluating the effectiveness of source control strategies for distinct PM2.5 pollution scenarios. This study suggests that other novel BD-oriented methods (such as deep learning for time series analysis) or visualization techniques could be introduced to understand the time series and spatial pollution characteristics of pollutants. In addition, this study suggests that appropriate scenarios (such as domestically contaminated scenarios or clean and pollution-free scenarios) be selected in the future for follow-up research and analysis to further comprehend and estimate the possible interval of duration lag in the Kaohsiung-Pingtung region affected by PM2.5 transboundary pollution observed at Matsu.
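The PCA comparison described above groups stations by their factor loadings. The following is a minimal sketch of that idea, assuming NumPy is available and using purely synthetic data: two hypothetical groups of stations that share a common daily pattern. The station groups, signals, and noise level are all assumptions made for illustration, and the rotation step applied to the principal components in this study is omitted.

```python
import numpy as np

# Synthetic daily series (hypothetical, not TEPA data): 200 days, 5 stations.
days = np.arange(200)
signal_a = np.sin(2 * np.pi * 4 * days / 200)   # pattern shared by group A
signal_b = np.sin(2 * np.pi * 10 * days / 200)  # independent pattern, group B
rng = np.random.default_rng(0)
noise = 0.1 * rng.standard_normal((200, 5))

# Columns 0-2 are group A stations, columns 3-4 are group B stations.
data = np.column_stack([signal_a] * 3 + [signal_b] * 2) + noise

# PCA on the correlation matrix of the station series.
corr = np.corrcoef(data, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)          # eigh returns ascending order
order = np.argsort(eigvals)[::-1]                # sort components by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Unrotated factor loadings: correlation of each station with each component.
loadings = eigvecs * np.sqrt(eigvals)
explained = eigvals / eigvals.sum()

print(np.round(np.abs(loadings[:, :2]), 2))
print(np.round(explained[:2], 2))
```

In this toy case the group A stations load heavily on the first component and the group B stations on the second, which is the kind of regional grouping that is compared with the SNA leading node groups.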
We express our gratitude to the Ministry of Science and Technology of Taiwan (MOST 107-2622-E-197-010-CC3) for funding this study.
INTRODUCTION
LITERATURE REVIEW
Multivariate Analysis on Air Pollution
The BD-oriented Approach and Analysis on Air Pollution
Social Network Analysis
METHODS
Research Procedure
Fig. 1. Diagram of research procedures and methods.
Data Understanding: Ambient AQ Monitoring Data
Fig. 2. Locations of the 76 TEPA’s monitoring stations.
Data Preparation: Data Definitions and Concentration Criteria for both PM2.5 Pollution Scenarios
Modeling and Evaluation: SNA Approach
(1) Generate a 2-mode affiliated matrix with binary expression for each pollution scenario
(2) Convert to a 1-mode quantitative square matrix for each scenario
(3) Establish the SNA sociogram between nodes
Fig. 3. Schematic representation of building up a SNA sociogram.
(4) SNA betweenness centrality analysis
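Steps (1) to (4) above can be sketched as follows. This is a minimal, standard-library illustration rather than the procedure or software used in this study: the station labels, the episode matrix, the tie threshold `min_ties`, and the use of unweighted shortest-path betweenness (Brandes' algorithm) are all assumptions made for the example.

```python
from collections import deque

# Hypothetical station labels and episode days (not TEPA data).
stations = ["S1", "S2", "S3", "S4"]

# Step 1: 2-mode binary matrix, one row per episode day, one column per station.
episodes = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]


def co_occurrence(matrix):
    """Step 2: 1-mode station-by-station matrix of shared episode days."""
    n = len(matrix[0])
    return [[sum(row[i] * row[j] for row in matrix) for j in range(n)]
            for i in range(n)]


def build_sociogram(counts, names, min_ties):
    """Step 3: link two stations that share at least min_ties episode days."""
    adj = {name: [] for name in names}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if counts[i][j] >= min_ties:
                adj[names[i]].append(names[j])
                adj[names[j]].append(names[i])
    return adj


def betweenness(adj):
    """Step 4: unnormalized betweenness centrality on an undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}   # -1 marks unvisited nodes
        sigma[s], dist[s] = 1, 0
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies in reverse order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice


counts = co_occurrence(episodes)
sociogram = build_sociogram(counts, stations, min_ties=2)  # assumed threshold
scores = betweenness(sociogram)
leader = max(scores, key=scores.get)
print(leader, scores)
```

In this toy network the station that co-occurs with every other station sits on all shortest paths between them, so it receives the highest betweenness and would be read as the leading node.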
Modeling and Evaluation: PCA Approach
Modeling and Evaluation: Data Visualization
RESULTS AND DISCUSSIONS
Evaluation and Deployment: SNA Approach
Fig. 4. The sociogram of SNA betweenness centrality analysis and tie-strength contours of leading nodes for scenario I.
Fig. 6. The sociogram of SNA betweenness centrality analysis and tie-strength contour of station ST_074 (Kinmen) for scenario II.
Evaluation and Deployment: PCA Approach
Fig. 7. The loading factor contours of rotated principal components.
Evaluation and Deployment: The Synoptic Weather Patterns of the Offshore Leading Nodes
Evaluation and Deployment: Analysis on the Duration Lag and Effect Related to Transboundary Air Pollution
Fig. 8. The Web-charts of weather conditions and seasons (Source: Soong et al., 2005).
Fig. 9. Analysis of the duration lag and effect related to PM2.5 transboundary air pollution (Source: TEPA’s AQ monitoring data).
Evaluation and Deployment: A Short Summary
(1) SNA
Fig. 10. Locations of leading nodes and groups for scenario I and II.
(2) PCA
(3) Summary
CONCLUSIONS
This study used PM2.5 concentrations from TEPA’s 76 ambient AQ monitoring stations between 2015 and 2017, together with synoptic weather patterns, as data sources. The BD-oriented SNA and PCA methods were used to analyze two severe PM2.5 pollution scenarios selected by this study. Scenario I represents days on which a station’s daily PM2.5 concentration reaches a high value relative to its own three-year record (Z-value > 1.65, corresponding to the 95% level). Since the concentration distribution and baseline concentration differ for each station, the Z-values of the PM2.5 concentration for some stations are higher than 1.65 even though the concentration does not exceed TEPA’s regulatory standard value (35 µg m–3). Most of the PM2.5 pollution episodes of Scenario I appear in winter and spring, when strong or standard northeast monsoons carry transboundary pollutants to Taiwan. Therefore, Scenario I is a useful and meaningful measure for zoning emergency responses in areas affected by transboundary air pollution and for protecting residents’ health. Scenario II covers days on which the daily PM2.5 concentration is higher than TEPA’s regulatory standard. The causes of such exceedances include not only transboundary air pollution episodes brought by strong and standard northeast monsoons but also severe air pollution episodes caused by nearby and local emission sources and by weather conditions unfavorable to diffusion. Consequently, Scenario II is useful and meaningful as a measure for zoning improvement for periods when air pollution exceeds the TEPA standard.
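The difference between the two scenario criteria can be illustrated with a short sketch. It assumes that the Z-value is computed per station as (daily value minus station mean) divided by the station standard deviation, here the population standard deviation; the two daily series are hypothetical values chosen for illustration, not TEPA measurements.

```python
from statistics import mean, pstdev

Z_THRESHOLD = 1.65  # Scenario I criterion from the text
STANDARD = 35.0     # Scenario II criterion (ug/m3) from the text


def scenario_days(daily):
    """Return the day indices flagged under Scenario I and Scenario II."""
    mu, sigma = mean(daily), pstdev(daily)  # per-station statistics (assumed)
    scenario_1 = [i for i, x in enumerate(daily) if (x - mu) / sigma > Z_THRESHOLD]
    scenario_2 = [i for i, x in enumerate(daily) if x > STANDARD]
    return scenario_1, scenario_2


# Hypothetical station with a low baseline and one unusually high day.
low_baseline = [10, 12, 11, 9, 13, 10, 12, 11, 10, 30]
# Hypothetical station with a high baseline that hovers around the standard.
high_baseline = [30, 32, 36, 31, 37, 33, 29, 37, 34, 31]

print(scenario_days(low_baseline))
print(scenario_days(high_baseline))
```

The low-baseline station has a day that is extreme relative to its own record yet below 35 µg m–3, so it is flagged only under Scenario I, whereas the high-baseline station exceeds the standard on several days without any of them being statistically unusual for that station.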
ACKNOWLEDGMENTS