Machine Learning Classification Model to Label Sources Derived from Factor Analysis Receptor Models for Source Apportionment

Factor analysis (FA) receptor models are widely used for source apportionment (SA) due to their ability to extract the source contributions and profiles from the data. However, source identification and labelling involve subjective, time-consuming manual interpretation. This raises a barrier to the development of a real-time SA process. In this study, a machine learning (ML) classification algorithm, k-nearest neighbour (kNN), is applied to the source profiles obtained from the United States Environmental Protection Agency's (U.S. EPA) SPECIATE database.


INTRODUCTION
Source apportionment (SA) is the practice of identifying emission sources and their contributions to develop control strategies for effective air quality management (Karagulian and Belis, 2012; Bove et al., 2014; Hopke, 2016). Receptor models apportion the ambient concentrations to their respective sources. Chemical mass balance (CMB) is the most suitable receptor model if source profiles are available. However, in the absence of source profiles, multivariate factor analysis (FA) models like positive matrix factorization (PMF), principal component analysis (PCA), and UNMIX are the alternatives (Viana et al., 2008; Hopke and Cohen, 2011). FA models extract the source contributions and profiles from the data itself (Viana et al., 2008; Hopke and Cohen, 2011; Hopke, 2016), which is why they are the most widely used receptor models (Karagulian et al., 2015; Hopke et al., 2020). The derived profiles must be assigned to sources based on the literature or compared with measured source profiles, which is time-consuming and involves manual interpretation. It is the most subjective and least quantifiable step in applying FA receptor models (Reff et al., 2007). Hopke et al. (2020) reported 741 global particulate matter (PM) apportionments from 414 papers published during 2014-2019, of which only 89 applied CMB while the others applied FA receptor models; 539 cases applied PMF, the most utilized receptor model. A review of these studies shows that the average (± SD) gap between data collection and result publication is 4 (± 2) years, while the average delay for FA receptor model studies is 4 (± 3) years. Only 38 (6%) studies reported results for FA receptor models within a year; 166 (25%) and 149 (23%) studies reported results after 2 and 3 years, respectively. A few studies took almost 8-12 years to publish results, by which time the results had become largely irrelevant.
This raises a barrier to developing a real-time SA process to track pollution sources and their emissions in real time. Real-time SA is essential for managing air quality, as much of the information from dynamic source activity and episodic events is suppressed when samples are collected over long periods during the study period. Improved time resolution in identifying pollution sources can help regulatory agencies take immediate action and reduce emissions at the source to improve air quality. Long-term online measurements coupled with a receptor model could be considered for real-time SA (Rai et al., 2020; Lalchandani et al., 2021; Prakash et al., 2021; Yang et al., 2022). In these studies, real-time chemical speciation instruments are used to obtain high-resolution (30-min or 1-h) chemical characterization of the aerosols. However, there is still a gap of almost 1-2 years in obtaining the results of these studies. This indicates that the process of applying the receptor model and interpreting the results needs to be streamlined to achieve real-time SA. Pernigotti and Belis (2018) developed DeltaSA, a tool to assign a factor to a source in FA receptor models based on the similarity between a given factor and source chemical profiles from public databases. Similarly, Liao et al. (2022) developed a numerical method to identify apportioned factor profiles by integrating distance- and probability-based profile matching approaches. It is also essential to have models which can be integrated easily into the receptor models. The application of machine learning (ML) algorithms can be a potential solution to facilitate source identification. Automating the labelling of profiles derived from FA receptor models using ML can immensely reduce the time taken and the uncertainty in results due to modeler bias. This will also streamline the identification and labelling of source profiles.
Classification models are the most appropriate choice for the problem because of their ability to designate new samples with labels based on previous data. To our knowledge, no approaches have been developed to apply ML algorithms for linking factors to the appropriate sources.
The objectives of this study are as follows: (a) investigate the application of ML classification algorithms to label profiles with appropriate sources based on the SPECIATE database; (b) validate the model on source profiles available in the literature.

Data
The United States Environmental Protection Agency's (U.S. EPA) SPECIATE is a repository of organic gas and particulate matter (PM) category-specific emission speciation profiles of air pollution sources. SPECIATE 5.1, the latest version, includes 6,746 PM, gas, and other profiles. These emission source profiles are used to provide input to CMB receptor models, to verify profiles derived from ambient measurements by multivariate receptor models (e.g., factor analysis and positive matrix factorization), and to interpret ambient measurement data (Simon et al., 2010; U.S. EPA, 2015; Bray et al., 2019; U.S. EPA, 2019). Data used to create these profiles come from various sources, including peer-reviewed journal articles and emissions testing conducted primarily by the EPA (Bray et al., 2019). Further details about the SPECIATE database are available in Simon et al. (2010), Bray et al. (2019) and U.S. EPA (2019). In this study, only PM2.5 source profiles are used for model development. 1731 PM2.5 source profiles were collected from the SPECIATE repository (U.S. EPA, 2015) and were grouped into five major categories (count in parentheses), namely biomass burning (325), coal combustion (108), dust (431), industrial (312) and traffic (555). The details of the SPECIATE profiles used in this study and the respective assigned sources are provided in Table S1 of the Supporting Information. The uncertainties associated with the species in the source profiles, as provided by SPECIATE, are also considered in this study.
The data were randomly split in a 70/30 ratio into train and test sets of 1211 and 520 profiles, respectively. The source-wise train and test samples (count in parentheses) are biomass burning (219; 106), coal combustion (71; 37), dust (299; 132), industrial (224; 88) and traffic (398; 157). The split approximately maintains the 70/30 ratio both for the whole dataset and within each source. The list of elements and composite profiles of the five categories is provided in Table S2 of the Supporting Information.
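The per-source 70/30 split described above can be sketched as follows; the `profiles_by_source` dictionary and its toy contents are hypothetical placeholders, not the study's actual code or the 1,731 SPECIATE profiles.

```python
# Sketch of a stratified 70/30 split: each source's profiles are
# shuffled and cut separately so the ratio holds within every class.
import random

def stratified_split(profiles_by_source, train_frac=0.7, seed=42):
    """Split each source's profiles so ~train_frac land in the train set."""
    rng = random.Random(seed)
    train, test = [], []
    for source, profiles in profiles_by_source.items():
        shuffled = profiles[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_frac)
        train += [(source, p) for p in shuffled[:cut]]
        test += [(source, p) for p in shuffled[cut:]]
    return train, test

# Toy data standing in for the per-source profile lists
data = {"dust": list(range(10)), "traffic": list(range(20))}
train, test = stratified_split(data)
```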

Methodology
The PM2.5 source profiles collected from the SPECIATE database were fed into the ML model as input data, and the trained model was validated on the test dataset. The trained model can label new factors derived from FA receptor models by assigning a source to each factor. The modeling framework implemented in this study is presented in Fig. 1. The k-nearest neighbour (kNN) classification algorithm is applied to develop the model. kNN is a memory-based algorithm that assumes similar things exist nearby. kNN classifies a new sample using the majority vote of the k closest samples from the memorized data (Hastie et al., 2009; Kuhn and Johnson, 2013; Sammut and Webb, 2011; Yang, 2019). The kNN classification algorithm follows a five-step process: (a) select a distance metric; (b) select the number of nearest neighbours (k < n); (c) compute the distance from the other data points to the desired point; (d) sort the points in increasing order of distance; (e) compute the majority vote (for classification) or the average response (for regression) of the k nearest neighbours (Hastie et al., 2009; Kuhn and Johnson, 2013; Sammut and Webb, 2011; Yang, 2019). In kNN, the data itself is effectively the model: it makes no assumptions about relationships among the data and predicts outputs based on the training data alone. kNN is simple, efficient and flexible in its speed and scalability, and since the training data in this problem remain fixed and all predictions are made from them, kNN is well suited to this problem (Sammut and Webb, 2011; Winters-Miner et al., 2015). The methods for assigning a factor to a source reported in Pernigotti and Belis (2018) and Liao et al. (2022) apply either distance-based indicators or a combined approach of distance-based proximity measures and a probability-based matching algorithm (naïve Bayes classifier, NBC). kNN likewise applies a distance metric to compute the distance from the other data points to the desired point.
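The five steps above can be illustrated with a minimal from-scratch sketch (Euclidean distance, uniform weights, majority vote); the study itself uses scikit-learn's KNeighborsClassifier rather than this toy code, and the 2-D points below are invented for illustration.

```python
# Minimal from-scratch kNN classifier following the five listed steps.
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    # (a) distance metric: Euclidean; (b) k is chosen by the caller
    # (c) compute the distance from every training point to the query
    dists = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    # (d) sort the points in increasing order of distance
    dists.sort(key=lambda d: d[0])
    # (e) majority vote among the k nearest neighbours' labels
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D "profiles": two dust-like and three traffic-like points
X = [(0.0, 0.1), (0.1, 0.0), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
y = ["dust", "dust", "traffic", "traffic", "traffic"]
label = knn_predict(X, y, query=(0.05, 0.05), k=3)  # -> "dust"
```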
kNN can utilize various distance metrics such as Euclidean, Manhattan and Minkowski. It can also use either uniform or distance-based weights: with uniform weights, all the neighbouring points are weighted equally, while with distance-based weights, closer neighbours have a greater influence than points that are far away. kNN then classifies using a majority vote among the k neighbours (Hastie et al., 2009; Kuhn and Johnson, 2013; Sammut and Webb, 2011; Yang, 2019). Another significant difference between kNN and NBC (applied by Liao et al. (2022)) is that NBC is a parametric model while kNN is non-parametric. NBC uses a fixed number of parameters for model building and makes strong assumptions about the data. In contrast, kNN is flexible in the number of parameters used to build the model and makes fewer assumptions about the data. Also, NBC requires the underlying probability distributions of the categories to obtain acceptable results and works well only if the decision boundary is linear, elliptic, or parabolic, while kNN requires no such information and is often successful where the decision boundary is irregular (Hastie et al., 2009; Kuhn and Johnson, 2013; Sammut and Webb, 2011; Yang, 2019). Further details of the kNN algorithm are available in the literature (Hastie et al., 2009; Kuhn and Johnson, 2013; Sammut and Webb, 2011; Yang, 2019). The KNeighborsClassifier module of the scikit-learn Python library is used in this study for model development (Scikit-learn, 2011). An ML model should be evaluated on samples not used while building the model to obtain an unbiased sense of model effectiveness (Kuhn and Johnson, 2013). For this reason, the data are divided into train and test sets. The train data are used to develop the model, while the test data are used for evaluating the model's performance.
With the test data set aside, the train data can be further divided into train and validation sets to measure performance on unseen data and select a good hypothesis (Kuhn and Johnson, 2013; Russell and Norvig, 2018). Alternatively, resampling methods such as cross-validation can be used to assess model performance using the training set alone (Kuhn and Johnson, 2013). Hyperparameter tuning using grid search with 10-fold cross-validation was conducted to determine the value of k in the kNN model. Based on the tuning results, the kNN model was trained with k = 5 (number of neighbours). The model takes around 6 seconds to train and provides results on the test and validation datasets in approximately 3 seconds.
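The tuning step can be sketched with scikit-learn's GridSearchCV as below; the synthetic dataset and the candidate grid are illustrative stand-ins for the SPECIATE training matrix, not the study's actual inputs (the study obtained k = 5).

```python
# Grid search over the number of neighbours with 10-fold
# cross-validation, mirroring the tuning procedure described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic multi-class data standing in for the source-profile matrix
X, y = make_classification(n_samples=200, n_features=10,
                           n_classes=3, n_informative=5, random_state=0)

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [3, 5, 7, 9, 11]},
                    cv=10)
grid.fit(X, y)
best_k = grid.best_params_["n_neighbors"]
```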
The performance of classification models can be quantified by precision, recall, F1 score and accuracy. Precision is the ability of the classifier not to label as positive a sample that is negative, while recall is its ability to find all the positive samples. Precision and recall can be calculated by Eqs. (1) and (2), where TP, FP, TN and FN are true positives (number of positive instances correctly classified), false positives (number of negative instances incorrectly classified), true negatives (number of negative instances correctly classified) and false negatives (number of positive instances misclassified), respectively. The F1 score is the weighted harmonic mean of precision and recall and can be calculated by Eq. (3).

Precision = TP / (TP + FP)    (1)

Recall = TP / (TP + FN)    (2)

F1 score = 2 × (Precision × Recall) / (Precision + Recall)    (3)

The F1 score varies from 0 (worst) to 1 (best) (Sammut and Webb, 2011). The accuracy of the model is calculated by Eq. (4).

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)
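Eqs. (1)-(4) can be written out directly from the counts; in the example, the biomass burning figures are taken from the Table 1 discussion (86 correct out of 101 assigned labels and 106 true samples), so FP = 15 and FN = 20.

```python
# Eqs. (1)-(4) implemented from the TP/FP/TN/FN counts defined above.
def precision(tp, fp):
    return tp / (tp + fp)                    # Eq. (1)

def recall(tp, fn):
    return tp / (tp + fn)                    # Eq. (2)

def f1_score(p, r):
    return 2 * p * r / (p + r)               # Eq. (3): harmonic mean

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)   # Eq. (4)

# Biomass burning on the test set (counts from the Table 1 discussion)
p = precision(86, 15)   # 86 of 101 assigned labels correct, ~0.851
r = recall(86, 20)      # 86 of 106 true samples found, ~0.811
f = f1_score(p, r)
```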

Model Development
kNN is applied to the train data for model development and to the test data for validation. The train and test accuracy of the model are 0.85 and 0.79, respectively. The precision, recall and F1 score for the five major sources and the overall average on the test data are presented in Fig. 2. It is observed that the sample size significantly affects the model performance. The performance for the traffic source is the best as it has the highest number of samples, followed by the dust and industrial sources. The precision of coal combustion is high, but the recall is low, reducing the F1 score. The confusion matrix presented in Table 1 explains the model's performance in classifying source-wise samples. Of the 106 biomass burning samples in the test data, 86 are correctly labelled; the model assigns the biomass burning label to 101 samples in total. Similarly, only 19 and 111 of the 37 coal combustion and 132 dust samples, respectively, are assigned to the correct class. The low performance of the model for coal combustion is due to the lack of enough training samples to develop a robust representation of the source. For the traffic source, 138 of 157 samples are assigned to the correct class by the model. Also, the true and predicted label counts are equal for the traffic source, which is not the case for the other sources. This illustrates the effect of sufficient training data. The details of source-wise samples assigned to each class are available in Table 1.
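The counts quoted above translate directly into per-class recall; the dictionary below uses only the diagonal counts and row totals stated in the text (industrial is omitted because its counts are not quoted here).

```python
# Per-class recall = correctly assigned samples / true samples,
# using the Table 1 counts quoted in the discussion above.
true_totals = {"biomass burning": 106, "coal combustion": 37,
               "dust": 132, "traffic": 157}
correct = {"biomass burning": 86, "coal combustion": 19,
           "dust": 111, "traffic": 138}

class_recall = {s: correct[s] / true_totals[s] for s in true_totals}
# coal combustion recall is ~0.51, consistent with its low F1 score
```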

Validation
The model developed above is applied to source profiles obtained through measurements for Delhi, India (Prakash et al., 2021) and through the receptor model for Cincinnati, USA (Sahu et al., 2011), which have been used for source apportionment of PM2.5 in the literature. The masses and the uncertainties associated with the species were used for prediction. The objective is to predict the labels of the sources and compare them with those provided in the literature.
The comparison of profiles for similarity indicates that IND is highly correlated with BB (0.91), GV (0.96), DV (0.92) and CC (0.76). Similarly, CC also has a high correlation with BB (0.96), GV (0.68) and DV (0.63). Other profiles also show similarities, as presented in Fig. 4. This demonstrates that even measured profiles are similar to one another, affecting the model performance, as the sources are not well separated from each other.
The probability matrix shows the probability of each sample being assigned to a specific class. The probability matrix for the Delhi dataset is presented in Table 2. BB is assigned to the correct class, biomass burning, with a probability of 0.6. RD, SD and CD were also allocated to the correct class, dust, with probabilities of 0.4, 0.6, and 0.6, respectively. GV (0.8) and DV (0.8) are allotted to the traffic source. However, IND is mislabelled as a traffic source, for two reasons. First, the number of traffic source samples is higher than that of any other source in the training data, causing a bias in the model. Second, as discussed above, IND, DV and GV are dominated by the same chemical species, OC (50.31%, 55.11%, 76.14%) and EC (17.75%, 38.90%, 16.29%), and the IND source profile shows substantial similarity with DV (0.92) and GV (0.96), which can confuse the model.
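With k = 5 and uniform weights, a predicted class probability is simply the vote fraction among the five nearest neighbours, which would explain why the entries of Table 2 fall on multiples of 0.2; this is our reading rather than a statement from the paper, and the neighbour labels below are hypothetical.

```python
# Class probability as the vote fraction among the k nearest
# neighbours (uniform weights), as scikit-learn's predict_proba
# computes it for KNeighborsClassifier.
from collections import Counter

def vote_probabilities(neighbour_labels):
    k = len(neighbour_labels)
    return {lab: n / k for lab, n in Counter(neighbour_labels).items()}

# e.g. a sample whose 5 neighbours split 3/1/1 across sources
probs = vote_probabilities(["biomass burning", "biomass burning",
                            "biomass burning", "traffic", "dust"])
# probs["biomass burning"] == 0.6, the granularity seen in Table 2
```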

Model derived source profiles
The PMF receptor model-derived source profiles for Cincinnati are taken from Sahu et al. (2011). Six major sources are selected from this study for validation, viz., soil dust (SD), metal processing (MP), biomass burning (BB), coal combustion (CC), gasoline vehicle (GV), and diesel vehicle (DV). Si (36.41%), Al (14.17%), Ca (6.32%) and Fe (10.83%) characterize SD, while MP has a relatively high concentration of Cu, Zn, Fe, and Pb. CC has the highest concentration of S (58%), while BB has an abundance of K. GV and DV have high OC and EC concentrations. The detailed elemental composition of the six sources is presented in Fig. 5.
The similarity between the source profiles for Cincinnati is presented in Fig. 6. DV and GV show a strong correlation with each other (0.81) and with MP (0.65, 0.74). BB exhibits a good correlation with CC (0.43) and a stronger one with MP (0.57). This indicates that, like the measured profiles, the model-derived profiles are similar to one another, which affects the model performance. The probability matrix for the Cincinnati dataset, presented in Table 3, assigns DV and GV to the correct class, the traffic source. CC, BB and SD are also assigned to the correct sources, but MP is assigned to the traffic source, possibly for the same reasons discussed above for the measured profiles.
The ML approach presented in this study has its strengths and limitations. Its strengths are the absence of human intervention and subjectivity, an efficient automated process, and easy coupling with receptor models for verification of results. Its limitations are the requirement of large training datasets and the presence of model bias. However, model bias differs from human bias: it does not depend on the modeler and can be reduced over time with more training data.
This study attempted to identify the issues with FA receptor models that are a barrier to real-time SA. An ML-based classification model is developed to automatically label the sources derived from FA receptor models. The model's validation on measured and receptor model-derived source profiles provides evidence that the ML model's performance is within an acceptable range. The performance can be further improved by balancing the number of samples for each source in the training data. This model can act as an additional layer of verification for the results of FA receptor models and is a small step towards automation of the process for real-time SA. That said, we acknowledge that no secondary sources are used in prediction, due to a lack of training data, and that only PM2.5 source profiles are used. The same framework can be transferred to other pollutant sources with relevant data.

CONCLUSION
This study implemented k-nearest neighbour (kNN), a classification machine learning (ML) algorithm, for the automatic labelling of profiles derived from factor analysis (FA) receptor models based on the SPECIATE database. The train and test scores of the model are 0.85 and 0.79, respectively. The overall weighted average precision, recall and F1 score is 0.79. The performance of the model during validation exhibits acceptable results. The application of ML models for source profile labelling will reduce the time taken and the subjectivity in results due to modeler bias. This process can act as an additional layer of verification for the results of FA receptor models. The application of this methodology advances the process towards real-time source apportionment.