Nur Fariha Syaqina Zulkepli , Mohd Salmi Md Noorani, Fatimah Abdul Razak, Munira Ismail, Mohd Almie Alias School of Mathematical Sciences, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia
Received:
October 8, 2018
Revised:
May 1, 2019
Accepted:
May 28, 2019
Download Citation:
||https://doi.org/10.4209/aaqr.2018.08.0315
Zulkepli, N.F.S., Noorani, M.S.M., Razak, F.A., Ismail, M. and Alias, M.A. (2019). Topological Characterization of Haze Episodes Using Persistent Homology. Aerosol Air Qual. Res. 19: 1614-1624. https://doi.org/10.4209/aaqr.2018.08.0315
Cite this article:
Haze is one of the major environmental issues that have continuously vexed countries worldwide, including Malaysia, for the last three decades. Therefore, this study aims to investigate the differences between the topological features of months with and those without haze episodes observed at air quality monitoring stations located in the areas of Jerantut, Klang, Petaling Jaya and Shah Alam. We employ persistent homology, which is a method of topological data analysis (TDA) that focuses on connected components and holes in the data, to characterize the local particulate matter (PM10). The summary statistics reveal drastic changes in the lifetimes of the topological data from every station during haze episodes, highlighting the possibility of developing an early detection system for haze based on our approach.HIGHLIGHTS
ABSTRACT
Keywords:
Haze; Particulate matter; Persistent homology; Time delay embedding; Topological data analysis.
Haze phenomenon is one of the major environmental issues that occur continuously in countries across the world, Malaysia included (Wen et al., 2016). The country is also afflicted, with the Klang Valley region being one of the areas that are particularly affected by such phenomenon (Latif et al., 2018). The severity of this issue can be attributed to the large-scale forest and plantation fires that occur in Indonesia’s Sumatra Island. Nevertheless, it is worsened further by local emissions from domestic industries, vehicle usage, and open burning activities (Afroz et al., 2003; Wen et al., 2016). In Malaysia, the chronological history of haze episodes can be underlined with the severe incidents recorded in the year 2005, 2013, 2014 and 2015 accordingly (DOE, 2018a). This issue has continually emerged year after year in Malaysia and the neighboring countries of Brunei and Singapore, whereby the prolonged duration of the problem has spurred the study into investigating haze episodes. Particulate matter (PM10) are small particles floating around in the air in the form of smoke, dirt, and dust that originate from factories, vehicles and farming activities (Schwartz et al., 1996; Payus et al., 2013). During haze episodes, PM10 is typically highlighted as a major air pollutant in the Southeast Asian regions, particularly the Klang Valley, Malaysia (Afroz et al., 2003; Azmi et al., 2010). Acknowledged as a harmful pollutant, inhalation of the material leads to diminishing lung function and causes various respiratory diseases, especially acute exacerbation of asthma (Schwartz et al., 1996). According to the Malaysian Ambient Air Quality Guidelines (MAAQG), an average of PM10 concentration over 24-hours that exceeds 150 µg m–3 is considered to be unhealthy for human health. Therefore, a standard technique used to identify haze phenomenon is by observing the concentration of PM10 whereby haze emergency is declared when it reaches the emergency level (> 500 µg m–3) (DOE, 2018b). Several studies have explored into a comparison of PM10 concentration with MAAQG standards, undertaken by Azmi et al. (2010), Abdullah et al. (2012), Ling et al. (2010) and Rahman et al. (2015) respectively. They have consequently concluded that the concentration in the air quality monitoring stations located in Klang Valley exceeded the acceptable level recommended by the MAAQG during haze episodes. Various differences have also been revealed between the concentration of PM10 according to the locations of air quality monitoring stations, encompassing rural, urban and industrial areas respectively. Moreover, Azmi et al. (2010) and Abdullah et al. (2012) have also revealed that urban areas logged higher PM10 concentrations compared to rural areas. A study by Yusof et al. (2010) has provided further analysis using statistical models, lognormal and Weibull distributions to investigate the relationship between PM10 concentration and monsoon season. Similarly, other methods like chemometric analysis (Azid et al., 2015), fuzzy comprehensive evaluation method (Zhao et al., 2010) and chaotic approach (Hamid and Noorani, 2014) have also been utilized in assessing data on air pollutants. Furthermore, haze detection can also be found to be fast gaining scholarly traction. Previously, works focusing on the topic were geared towards analyzing satellite imagery data using remote sensing methods (Makarau, 2014). Meanwhile, recent research is specifically related to haze periods and focused on analyzing particulate matter concentration to study its morphology during haze episodes (Zeb et al., 2018). Yu et al. (2018) have also contributed by investigating human health risk in an indoor and outdoor environment that is exposed to the phenomenon, whereas another study has looked into mitigating severe urban haze episodes (Sharma and Balasubramanian, 2018). It should be noted that this literature has analyzed PM10 in a quantitative manner, whereas to the best of our knowledge, no effort has been expended on its qualitative aspects. Therefore, this paper is addressing the research gap by providing an analysis of the qualitative aspects and structures of PM10, particularly on their topological features via persistent homology. Persistent homology is a relatively new method that is robust under perturbations of input data, independent of coordinates and dimensions alike, and offers a solid representation of qualitative features of the input data (Otter et al., 2017). These particular features of the data are captured accordingly as specific parameters change. As the approach is fundamentally based on the mathematical field of topology, the qualitative features in question encompass topological features like connected components, holes, voids and more. The qualitative approach of persistent homology is also particularly useful in handling complexity of data due to noise, high dimensionality or incomplete structure. This poses great challenges to researchers dealing with real-world data since data cleaning is required in order to remove the noise, which might result in information lost during the process (Jorquera et al., 2000; Elangasinghe et al., 2014). By contrast, persistent homology retains all information from data. The noise that may occur in multiple scale levels is filtered out by persistent homology and significant features are captured (Ghrist, 2008). Furthermore, its robustness has led other researchers to explore the method further in various fields, such as Gidea and Katz (2018)’s effort on the capability of the holes (1-dimensional features) to detect financial crisis in stock market data. Meanwhile, Emrani et al. (2014) have investigated wheeze signal detection via the persistency of topological features between wheeze and non-wheeze signals. Moreover, the summary statistics of topological features undertaken by Mittal and Gupta (2017) has shown that persistent homology is usable in early detection of bifurcations and chaos in complex systems. The explorations on persistent homology have also been tackled in various fields, encompassing the classification of breast cancer (Dewoskin et al., 2010), viral evolution (Chan et al., 2013) and protein structure (Gameiro et al., 2015). A good and comprehensive review regarding the current applications of persistent homology may be sourced from Otter et al. (2017). To the best of our knowledge, there is a negligible amount of research output on environmental issues by persistent homology to date. Therefore, this study is conducted to fill the research gap by exploring the effectiveness of persistent homology in characterizing and detecting one of the pressing environmental issues, namely haze phenomenon. This paper intends to investigate haze episodes using persistent homology by analyzing the topological features of PM10 data in Malaysia. The proposed technique has been applied to daily average PM10 data from four air quality monitoring stations of Jerantut, Klang, Petaling Jaya and Shah Alam from the year 2000 until 2015. The objective of this study is to investigate topological features for months with and without haze episodes, thus rendering the data on PM10 being partitioned according to the respective month. The topological features are then extracted for connected components (0-dimensional features) and holes (1-dimensional features), with the resulting information represented in persistence diagrams. Next, the summary statistics of the topological features, like an average of all lifetimes of the connected components and maximum lifetimes of all holes, are calculated for each persistence diagram. In this paper, the content has been organized accordingly, whereby the subsequent section consists of a discussion on data involved in this study. Then, the data transformation processes and the method of persistent homology will be explained in next two sections. Extensive summary statistics of topological features will be introduced in the section before its resulting outcomes are discussed. The daily average data for PM10 encompassing the year 2000 to 2015 for four air quality monitoring stations (i.e., Jerantut, Klang, Petaling Jaya, and Shah Alam) has been obtained from the Department of Environment (DOE), Malaysia. Klang and Shah Alam are categorized as urban areas, while Petaling Jaya and Jerantut are categorized as respectively of the industrial and rural areas, accordingly (DOE, 2018c). Due to its rural location, Jerantut, in particular, has been chosen as the background station for comparative purpose (Azmi et al., 2010; Banan et al., 2013). Missing data has been treated with the mean substitution method (Pigott, 2001). The MAAQG standards are necessary to represent the safety level that causes no adverse health effects to human. Therefore, this study has adhered to the safe level as recommended by MAAQG to act as the benchmark for air quality level of each station. Fig. 1 shows the time series of the daily average for PM10 from 1 January 2000 until 31 December 2015, and the MAAQG for 24-hour average PM10 at 150 µg m–3. The presence of several peaks (rectangles) is indicative of the values of PM10 concentration exceeding the MAAQG during haze episodes that have occurred in August 2005, June 2013, March 2014, September 2015, and October 2015 accordingly (DOE, 2018a). Thus, these months have been highlighted as the main focus of this research. It is worth noting that based on the chronology of haze episodes (DOE, 2018a), the selected months have been reported of experiencing severe haze. The descriptive statistics of the months are shown in Table 1 accordingly, with Klang, Petaling Jaya and Shah Alam stations showing higher concentrations of PM10 compared to Jerantut station during the haze episodes. This is attributable to the different locations of the air quality monitoring stations, with Klang and Shah Alam being located in an urban area, Petaling Jaya in an industrial area and Jerantut (background station) in a rural area (DOE, 2018c). Prior to the implementation of persistent homology, the time series data sets must be transformed into point cloud data, which is achieved via the Takens method. The higher dimensional data sets will allow us to look into higher dimensional topological features, such as holes and voids. Basically topological features are features that are invariant after deformations such as stretching, splitting and cutting (Hatcher, 2002; Edelsbrunner and Harer, 2010; Ghrist, 2014). The idea of using a combination of Takens method and persistent homology is discussed in the work of Perea and Harer (2015) and references therein. The Takens method (Takens, 1981) stated that a time series x0, x1, …, xn–1 can be reconstructed in a phase space of dimension m, where each point in the phase space is given by the vector xn(m,τ) = xn, xn+τ, …, xn+(m–1)τ, τ is the time delay and m is the embedding dimension. The two parameters τ and m require careful selection to ensure clear extraction of the desired topological features. In this study, a comparison made between the months has indicated that the different settings of time delay and embedding dimension will affect the results. The two parameters have been fixed as τ = 1 and m = 3. The time delay is chosen as trivial time delay, whereas the embedding dimension chosen is 3 as the 1-dimensional topological features are best visualized in three-dimensional space throughout the study. This decision is supported by Umeda (2017), whose work has fixed both parameters as τ = 1 and m = 3, while others like Khasawneh and Munch (2014) and Khasawneh et al. (2018) chose m = 3 to visualize the 1-dimensional features. Similarly, previous studies by Pereira and Mello (2015) and Maletić et al. (2016) have also fixed m, whereas in other areas of research (not related to persistent homology), Sivakumar (2003) and Sivakumar (2002) each have chosen τ = 1 to achieve better results for their respective research. Computations of simplicial complexes are compulsory in extracting topological features from point cloud data. The 0-simplices represent vertices or points, 1-simplices represent edges or lines, 2-simplices represent triangles and 3-simplices represent tetrahedra, and so on. A simplicial complex is built by a combination of these simplices (see Fig. 2) accordingly, whereby its complex formation is from data points and thus indicating the dependency of topological features on a scaling parameter (filtration value), ε. Fig. 3(a) in particular shows an example of a simplicial complex formation. The formation commenced at ε = 0, where the four 0-simplices formed represents four points in a point cloud. As the value ε increases to 0.5, four circles are formed with each 0-simplex acting as the center and ε as the radius of the circles. The circles keep growing as the value ε increases until any pair of the circles intersects each other. From this intersection, a 1-simplex is formed by connecting two 0-simplices by a line (edge). As seen in Fig. 3(a), four 1-simplices have been formed at ε = 1.4, and the stage is marked with the formation of a simplicial complex via the combination of the 0-simplices and 1-simplices. The ε value then further increased to 2, with more circles intersecting each other and resulting in the appearance of two 2-simplices (triangles). At this stage, a new simplicial complex is formed (see Fig. 3(a)). It should be noted that the simplicial complex at ε = 1.4 is contained in the simplicial complex at ε = 2, which is also known as filtered simplicial complexes. In this work, the constructions of simplicial complexes are done using Vietoris-Rips simplicial complex, or otherwise also known as Rips complex. Rips complex is a set of k-simplices, such that the distance of any two points in k-simplices is less than or equal to 2ε (Hatcher, 2002; Edelsbrunner and Harer, 2010; Ghrist, 2014). The birth and death points of topological features are captured depending on the formation of simplicial complexes and subsequently recorded in a diagram known as barcode (Fig. 3(b)). Each feature is represented by a horizontal line in the barcode, with the left endpoint of the line being the birth point and the right endpoint of the line considered as the death point. One can recognize any noise in the barcode by looking at the short lines, while the significant features are signified by the long lines (Ghrist, 2008). Furthermore, the black lines in the barcode represent the connected components, whereas the red line represents the hole (see Fig. 3(b)). Moreover, persistence diagram (Fig. 3(c)) is yet another representation of barcode that serves to summarize the birth and death of topological features in R2. The x-axis and y-axis of the persistence diagram is the representation of the birth and death points respectively for each of the topological features. Besides, a point (b, d) with multiplicity q in a persistence diagram represents q features that have same birth, b, and death, d, points. Another feature that persists longer through the filtration stage is located way above the diagonal line in the persistence diagram. Additionally, the persistency of a feature is measured by the difference between the death and birth points, d – b, and subsequently known as the lifetime of the feature (Otter et al., 2017). Fig. 3 shows an example of connected components (0-dimensional) and hole (1-dimensional) extracted based on the formation of simplicial complexes. At the starting point, four connected components have appeared with a birth value at ε = 0, and as the filtration value increases the features have remained until 1-simplices appeared at ε = 1.4. It should be noted that the four connected components have collapsed into one at ε = 1.4. Interestingly, a new feature has appeared at this stage, represented by a hole appearance. As the values of ε have continued to increase to 2, two 2-simplices are subsequently formed and close the hole. This causes the death of the hole and renders only one connected component to persist by the end of the filtration. Persistence diagram in Fig. 3(c) is another representation of these topological features, simplifying the barcode in Fig. 3(b). A persistence diagram consists of k-dimensional features with 0-dimensional features representing the connected components, 1-dimensional features representing holes, and 2-dimensional features representing voids etc. Meanwhile, a persistence diagram ωm consists of n features, ωmi = (bi, di), with bi and di (i = 1, 2, …, n) indicating their birth and death points respectively. All of the features are summarized using summary statistics and described below: The first summary statistic is the sum of all lifetimes (Eq. (1); Pereira and Mello, 2015) of k-dimensional features. If the value of the sum is close to 0, the persistence diagram has practically short-lived features for each particular dimension, k. The second summary statistic is the average of all lifetimes (Pereira and Mello, 2015; Mittal and Gupta, 2017) of k-dimensional features, as described in Eq. (2). A small value of average indicates that for a particular dimension, k the data set has mostly short-lived features and vice versa. Next, the third summary statistic is the maximum of all lifetimes (Pereira and Mello, 2015; Mittal and Gupta, 2017) of k-dimensional features. For each dimension k, the value of the maximum lifetimes of all features is defined as: which will determine the most significant k-dimensional feature. In this study, the summary statistics of 0-dimensional and 1-dimensional features have been calculated using Eqs. (1)–(3). The changes of values in each summary statistics have been observed to track the evolution of the topological features for months with and without haze accordingly. Remark. We acknowledge that according to Cohen-Steiner et al. (2007) and Cohen-Steiner et al. (2010), the summary statistic, sumk, is stable whereas the stability of the avgk is still in doubt. However, there may be merit in taking into account avgk as considered by Pereira and Mello (2015) and Mittal and Gupta (2017) but in general it should be done with care. Based on our observations using our data sets, small variation of the input data leads to small variation of avgk values. This study has analyzed the time series of daily average for PM10 that is partitioned according to month for four air quality monitoring stations located in Jerantut, Klang, Petaling Jaya, and Shah Alam for 16 years (2000–2015). The time delay embedding with time delay τ = 1 and embedding dimension m = 3 have been applied for each month of data on PM10, allowing each month to generate a point cloud data. Persistent homology is then applied to each point cloud with maximum filtration value, εmax = 700. The evolution of topological features based on Rips complexes is next observed in the range of filtration value, ε = [0, 700], starting from the simplest form which is under-approximation until it displays over-estimation (i.e., one large Rips complex). The computation of persistent homology in this study has been completed using the R-package TDA (Fasy et al., 2017). All extracted topological features have been represented in persistence diagrams, following which the summary statistics for each persistence diagrams have been calculated. Fig. 4 has displayed a comparison between persistence diagrams generated in August 2005 where severe haze has been reported (DOE, 2018a), in contrast with December 2005 where no haze has occurred. The point clouds generated using time delay embedding with respect to the persistence diagrams are also shown in Fig. 4. The month of December has been selected due to the lesser likelihood for haze to occur during the north-east monsoon season (November–March), as Malaysia receives more rainfall during the period and increases the removal rate of PM10 (Yusof et al., 2010). From the persistence diagrams shown in Fig. 4, the topological features (i.e., black dots represent connected components, red triangles represent holes) for a month with haze episode recorded (i.e., August 2005) is spread away from the origin. In contrast, a month without haze (i.e., December 2005) has shown that the features accumulate close to the origin. For each station, a hole is present among the holes (red triangles) as in Fig. 4(a), which is located the furthest from the origin and has the highest value of birth and death points compared to other holes. It should be noted the corresponding feature in Jerantut is not too far from the origin compared to other stations, implying that the stations in Klang, Shah Alam, and Petaling Jaya have experienced haze of higher severity compared to Jerantut. Summary statistics are calculated to summarize the persistence diagrams, whereby the resulting outcomes are shown in Figs. 5 and 6. An observation of the various peaks in Fig. 5(a) has revealed drastic increments of the sum of all lifetimes, sum0, for connected components (0-dimensional features) during haze episodes (rectangles). This is indicative of the connected components persisting longer during haze as the filtration value varied compared to the normal months. This is further elucidated by the average of all lifetimes, avg0, as shown in Fig. 5(b), which suggests that the persistence diagrams for months with haze consist of mostly long-lived features, hence producing the peaks. Figs. 5(a) and 5(b) also show the sum and average of the lifetimes of connected components extracted in the background station, Jerantut being the lowest compared to other stations. It should be noted that the maximum of all lifetimes are not calculated for connected components as the formation of simplicial complexes (Rips complexes) for PM10 persists until it becomes a large simplicial complex with birth, b = 0, and death, d = 700 (since maximum filtration value, εmax = 700), resulting in the same values of maximum for all lifetimes (max0 = 700) according to all selected months. Furthermore, the summary statistics for holes (1-dimensional features) are shown in Fig. 6 and revealed drastic changes in the values of the sum, average and maximum of all lifetimes denoted as sum1, avg1 and max1 respectively (see Figs. 6(a)–6c)) during haze episodes (rectangles). As seen in Fig. 6(c), the peaks (in rectangles) are the values of maximum of the lifetimes of all holes which imply that there is a hole with its lifetime being higher compared to other holes in the month that experienced severe haze. Hence, Figs. 5 and 6 have revealed the behavior of the topological features extracted by persistent homology to be consistent with real haze phenomena. When the haze occurred, persistent homology has identified this occurrence by showing the strong rising of topological features lifetimes. The rectangles in Figs. 5 and 6 indicate the particular month recorded with severe haze in the selected years, with a clear and drastic increase in the values of summary statistics in the month. These characteristics have also been shown by each station involved, substantiating the consistency of the topological characterization for months with and without haze accordingly. Overall, the values (i.e., summation, average, maximum) of all lifetimes for the topological features of the background station (Jerantut) are the lowest compared with the others. Thus, it is clear that the topological features extracted via persistent homology are capable of distinguishing the months with severe haze and without haze. While the existing methods (Azmi et al., 2010; Ling et al., 2010; Abdullah et al., 2012; Rahman et al., 2015) have served to directly quantify PM10 concentration without providing a qualitative understanding, this paper has successfully filled in research gap by revealing the changes of topological features seen between months with and without haze accordingly. This study uses persistent homology, a method of topological data analysis (TDA), to evaluate the topological features embedded within PM10 concentration data. The data collected from all four stations involved in this study (Klang, Petaling Jaya, Shah Alam and Jerantut) display changes in topology for the months during which severe haze episodes were recorded. Although existing methods are capable of directly quantifying and analyzing PM10 concentration data in order to identify the months affected by haze episodes, this study proposes a new qualitative approach based on topology, which complements other methods by providing robust qualitative structures that filter noise without sacrificing data. The summary statistics of the topological features (viz., connected components and holes) indicate a significant increase in the sum, average and maximum values for the lifetimes of these features during haze episodes for all four stations. Furthermore, the lowest values are exhibited by the summary statistics for Jerantut, characterizing this location as a background site due to its low level of air pollution. Overall, our approach accurately identifies haze episodes by extracting topological features and analyzing their summary statistics, thereby demonstrating its applicability as the basis of an early warning system. We particularly believe that early signals can be detected by refining data analysis, and the progressive efforts spearheaded by this work are making waves in this area. We also recommend that future research address the as-yet undetermined stability of the summary statistic avgk. The authors would like to express their utmost gratitude to Universiti Kebangsaan Malaysia for the Research University Grant (DIP-2017-011, GGPM-2016-001), Ministry of Education Malaysia Research Grant FRGS/1/2017/STG06/UKM/01/1, and to Department of the Environment (DOE) for providing the air quality data. We also thank the reviewers for their helpful and critical comments.INTRODUCTION
DATA
Fig. 1. Time series of daily average of PM10 for (a) Jerantut, (b) Klang, (c) Petaling Jaya and (d) Shah Alam air quality monitoring stations from 1 January 2000 until 31 December 2015.
TIME DELAY EMBEDDING
PERSISTENT HOMOLOGY
Fig. 2. k-simplices for 0 ≤ k ≤ 3.
Fig. 3. (a) Formation of simplicial complexes with respect to the filtration values, ε. (b) Barcode and (c) persistence diagram for the formation of simplicial complex illustrated in (a). In (b) and (c), the black lines and black dots represent connected components in (a), while red line and red triangle correspond to hole in (a).
SUMMARY STATISTICS OF TOPOLOGICAL FEATURES
RESULTS
Fig. 4. Point clouds and persistence diagrams for the months with and without haze, (a) August 2005 and (b) December 2005, respectively, for Jerantut, Klang, Petaling Jaya and Shah Alam stations (left to right). The black dots and red triangles in persistence diagrams represent connected components and holes.
Fig. 5. (a) sum0 of all lifetimes and (b) avg0 of all lifetimes for four air quality monitoring stations, Jerantut, Klang, Petaling Jaya and Shah Alam. The rectangles indicate the months with severe haze episodes, namely August 2005, June 2013, March 2014 and September and October 2015 (left to right).
Fig. 6. (a) sum1 of all lifetimes, (b) avg1 of all lifetimes and (c) max1 of all lifetimes for four air quality monitoring stations, Jerantut, Klang, Petaling Jaya and Shah Alam.
CONCLUSIONS
ACKNOWLEDGMENTS