Evaluation of a Modified Receptor Model for Solving Multiple Time Resolution Equations : A Simulation Study

This study was conducted to evaluate the performance of an improved source apportionment model that is suitable for incorporating data with multiple time resolutions. This evaluation was achieved by using synthetic data sets that simulated environmental concentrations of volatile organic compounds (VOCs) and fine particulate matter (PM2.5) from the five following sources: petroleum refinery, vehicle exhaust, industrial coating, coal combustion, and natural gas. Hourly VOCs and speciated PM2.5 data were simulated for a one-week period. The PM2.5 data were further averaged every twelve hours to generate data sets with mixed temporal resolutions. The Multilinear Engine program was applied to resolve the source profiles and contributions. A series of sensitivity analyses was conducted to examine how uncertainties in the profile variation, measurement error, and source collinearity affected the model performance. The resolved factor profiles closely matched the input profiles, and the measurement error had a larger impact on the modeling results than the profile variation. In the most comprehensive data set that contained all three types of uncertainty, the R values between the input and retrieved source contributions were between 0.87 and 0.95. The estimated percentage contributions were also comparable with the input ones, demonstrating the applicability and validity of this improved model.


INTRODUCTION
Volatile organic compounds (VOCs) and fine particulate matter (PM 2.5 ) have been found to be associated with adverse health effects such as asthma induction and exacerbations, respiratory symptoms, and impaired lung function (Seagrave et al., 2006;Wichmann et al., 2009;Stanek et al., 2011a, b).Thus, it is helpful to identify the contributions of air pollutants from different sources and evaluate the source-specific health impacts (Heal et al., 2012;Lall et al., 2012) in order to design effective control strategies.To achieve this analysis, source apportionment models, such as chemical mass balance (CMB), absolute principal component scores (APCS), and positive matrix factorization (PMF) models, can be used to quantify the contribution of a particular source at a receptor site (Cooper and Watson, 1980;Viana et al., 2008a;Wang et al., 2008;Gugamsetty et al., 2012;Huang et al., 2012).Based on the mass conservation concept, the receptor model is expressed by the following mass balance equation: where x ij represents the concentration of measured species j in sample i, g ip is the contribution of source p to sample i, f pj is the mass fraction of species j in source profile p, and e ij is the residual of the model fit for the measured concentration.The multivariate models require less previous knowledge regarding the emission source profile than the CMB model.Therefore, when complete emission profiles are not available, the PMF model (one of the advanced multivariate models that simultaneously retrieve g and f from x) can be applied (Viana et al., 2008a, b).
To date, few studies have considered combining both gaseous and particulate species into one source apportionment model.Previously, the Multilinear Engine (ME), a more flexible program for solving the PMF problem, was used to quantify the contributions of different sources to PM 2.5 and VOC mixtures (Wu et al., 2007).Nevertheless, different sampling time schedules for VOCs and PM 2.5 resulted in difficulties for conventional receptor modeling.For example, the PM 2.5 composition data are commonly obtained by analyzing time-integrated filter samples.However, the hourly VOCs data can be monitored semicontinuously with automated monitors.For conventional receptor models, it is necessary to average the high-resolution data over longer sampling periods to match the low-resolution data.To prevent decreasing sample size and the loss of valuable information from the high-resolution data (Lioy et al., 1989), a more flexible model for solving multiple time resolution equations is needed.A previous study (Zhou et al., 2004) used field data to demonstrate that a modified ME program is suitable for this purpose.
This study was conducted to evaluate the performance of an improved source apportionment model that is suitable for incorporating data with multiple time resolutions.This assessment was achieved by using synthetic data as in several previous studies (Miller et al., 2002;Brinkman et al., 2006;Christensen et al., 2006;Lingwall and Christensen, 2007;Hemann et al., 2009;Vedal et al., 2009;Habre et al., 2011).

Data Simulation
A primary data set that represented the hourly measurements in a week was created from five sources (i.e., petroleum refinery, vehicle exhaust, industrial coating, coal combustion, and natural gas) based on the assumption that the receptor site is located near an industrial complex (Buzcu and Fraser, 2006;Leuchner and Rappenglück, 2010).The hourly contribution matrix G (i.e., g ip in Eq. ( 1)) was generated by using average contributions (Table S1 in the Supporting Information) that were available in the literature (Zheng et al., 2002;Lee et al., 2003;Kim et al., 2005;Marmur et al., 2005;Alastuey et al., 2006;Hopke et al., 2006;Kulkarni et al., 2007;Song et al., 2008;Yuan et al., 2009;Chan et al., 2011;Guo et al., 2011).Next, hourly fluctuations were added by assuming random variability and a lognormal distribution (Lingwall and Christensen, 2007) for each source, except for vehicle exhaust.Time series data of carbon monoxide (a classic tracer for traffic) collected at an air quality monitoring station in an urban area were used to approximate the diurnal and weekly vehicle exhaust variations.The factor profile matrix F (i.e., f pj in Eq. ( 1)) was derived from the US EPA SPECIATE database (Lingwall and Christensen, 2007;Simon et al., 2010), which included 18 VOCs and 17 PM 2.5 species (Fig. S1 in the Supporting Information).The species included in the simulated data set were determined according to source profile availability and proportion of values below the method detection limits (MDLs).The MDLs were developed based on the analytical techniques that were used to determine VOCs (Gas Chromatography) and elements (Xray Fluorescence).For organic carbon, elemental carbon, nitrate, and sulfate, the MDLs were obtained from Hemann et al. (2009).After establishing G and F, the observation matrix X was calculated by using the following equation: The temporal resolution for PM 2.5 data was intentionally reduced by averaging the associated source contributions at twelve-hour intervals to generate a mixed temporal resolution data set (D Pri ).If a simulated concentration value was below the MDL for species j, it was replaced by half of the MDL value.After the creation of the simulated data, the uncertainties of the input measurements were processed as follows: 1) the uncertainties for concentrations below the MDL were set to 5/6 of the MDL value and 2) the uncertainties for the other concentrations were determined by using the following equation (Miller et al., 2002): A series of sensitivity analyses was conducted to examine how the profile variation, the presence of measurement error, or the source collinearity affects the accuracy and precision of the modeling results (Table 1).Briefly, random noise was added to the source profiles (D FVar ) to reflect source profile variability by using Eq. ( 4) (Brinkman et al., 2006).
where CV (= 0.4) represents the coefficient of variation and z pj is the random number that was sampled from the standard normal distribution for each species in each profile.In a previous study, the global CV average ranging   (Brinkman et al., 2006).The measurement errors (D MErr ) were added in a similar manner with a CV of 0.1 (Miller et al., 2002).Furthermore, a data set that combined profile variation and measurement error (D Comb ) was generated accordingly.
In the above data sets, the contributions of each source were assumed to be independent of each other (Table 2).We further simulated another series of data sets with correlated source contributions.The correlated primary data set (D C Pri ) was simulated by increasing the cross-correlations of the temporal patterns between the industrial coating and coal combustion sources and between the vehicle exhaust and natural gas sources (Table 2).The associated data sets with profile variation (D C FVar ), measurement error (D C MErr ), or a combination of these two (D C Comb ) were generated consequently by following the above mentioned procedures.Thirty simulation runs were conducted for each scenario in the sensitivity analyses and the total number of tested data sets was 270.

Model Implementation
The mass balance equation for the modified ME model that is suitable for multiple time resolution data is expressed as follows (Zhou et al., 2004): where x sj represents the concentration of measured species j in sample s and t s1 and t s2 are the start and end times, respectively, as described by the number of time units.In this study, the time unit is set to one hour based on the VOCs sampling interval.In Eq. ( 5), g ip is the contribution of source p during the time units for sample s, f pj is the mass fraction of species j in source profile p, and e sj is the residual for species j in sample s.
In this study, the unmeasured mass of PM 2.5 was included and defined as the mth species as follows: where (PM 2.5 ) s is the measured PM 2.5 mass concentration of the sth sample.This inclusion eliminated the need for the post-hoc regression process (Larson et al., 2006;Wu et al., 2007).It also constrains the sum of all PM 2.5 species mass fractions in each source to equal 100% based on the following auxiliary equation: where δ is set to 0.02 in this study.

Data Analysis
The agreement between the simulated and modeled contributions and profiles was evaluated for each data set by calculating the average absolute error (AAE) (Christensen and Gunst, 2004;Christensen et al., 2006;Lingwall and Christensen, 2007).Unlike other model performance statistics, such as the root mean square error (RMSE), the AAE gives the amplitude of average error without being pulled by the large one (Javitz et al., 1988;Willmott and Matsuura, 2005).The AAE of contribution (AAE g ) was defined as follows: 1 1 where n is the number of samples and ĝ ip is the modeled contribution of source p to sample i.The AAE of profile (AAE f ) was obtained in a similar way.In addition, the coefficient of determination (R 2 ) of the contribution estimates for each source was calculated to evaluate the variance of the model fits (Brinkman et al., 2006).

Primary Analysis
For the hourly distributed primary data set, all R 2 values were greater than 0.99, which indicated excellent agreement between the retrieved and simulated matrices (Table 3, D Hour ).In addition, the percentage contributions that were reconstructed from the hourly distributed primary data set matched well with the input ones.After reducing the time resolution for the PM 2.5 data, the R 2 values decreased slightly, but remained above 0.95 (Table 3, D Pri ).Nevertheless, the percentage contributions had higher deviation than those from the hourly distributed data set, especially for vehicle exhaust/VOC.This result potentially occurred because no major species had a mass fraction greater than 15% in the vehicle exhaust/VOC profile (Fig. S1).This relatively nonunique profile could lead to poor separation from other sources (Brinkman et al., 2006).To test this hypothesis, the mass fractions of ethylene and isopentane in the VOC source profile of vehicle exhaust were increased to more than 20% and re-applied to the run with the lowest percentage contribution estimate (22% in Table 3) for vehicle exhaust/ VOC.In this case, a more accurate value (29%) was obtained, which suggested this hypothesis was valid.

Sensitivity Analysis
Compared with the primary data set (D Pri ), the AAE did not change much after adding profile variation (D FVar ), although the variability was larger (Fig. 1).In contrast, the AAE significantly increased with the presence of measurement error (D MErr and D Comb ).These results suggest that the bias of the modeling results from measurement error is greater than that resulting from profile variation.The explanation is that the measurement error greatly influenced the residual term e sj in Eq. ( 5), thus affecting the solutions of the minimization algorithm.
The R 2 values between the retrieved and simulated contributions in the primary data set are given in Fig. 2 (D Pri ) by source.Most R 2 values were greater than 0.8.Only minor influences resulted from profile variation (D FVar ).However, the R 2 values decreased for all sources when measurement error was considered (D MErr and D Comb ), especially for the coal combustion source.In the input data, coal combustion was assumed to contribute only 6% to the VOC mass concentration.Because receptor models may not fully resolve sources with low mass contributions (Ulbrich et al., 2009), this issue was investigated by doubling the average coal combustion contribution to VOC (6% to 12%).Increasing this percentage to 12% improved the median R 2 value from 0.525 to 0.778.This result indicated that the measurement error has a larger impact on modeling results when the percentage contribution of a source is smaller (Miller et al., 2002;Brinkman et al., 2006;Ulbrich et al., 2009).
A similar tendency was observed for each scenario in the correlated data sets when compared with the noncorrelated data sets (Fig. 1).The minor deviation that resulted from the collinearity of the source contribution was unexpected.Fig. 2 shows that greater R 2 values were obtained for coal combustion in the correlated data sets than in the non-correlated data sets (D C Comb vs. D Comb ).This result contradicts those of previous studies, in which the correlation between the simulated and predicted contributions decreased as collinearity increased (Brinkman et al., 2006;Vedal et al., 2009;Habre et al., 2011).This contrasting result was explored by generating an additional data set (D C2 Comb ) with modified correlation structures among the simulated sources.The coal combustion in the D C2 Comb data set was simulated as cross correlating with natural gas, which also had a low source contribution (Table S2), instead of correlating with the high contributor (industrial coating) in the D C Comb data set.It was assumed that the other three sources were independent of each other.A lower median (range) R 2 of 0.536 (0.004-0.891) was observed in the D C2 Comb data set relative to the D Comb and D C Comb data sets (R 2 of 0.642 (0.019-0.897) and 0.871 (0.644-0.951), respectively).These observations showed that better results indeed were obtained in scenarios that did not have source collinearity between two low contributors (R 2 : D Comb > D C2 Comb ).On the other hand, source collinearity may reduce the disadvantage (R 2 : D C Comb > D Comb ) of a source having small percentage contribution (e.g., coal combustion) when it is correlated with another source having high percentage contribution (e.g., industrial coating in D C Comb ).We further simulated a data set (D C 3S ) in which three sources (vehicle exhaust, coal combustion, and natural gas) were cross-correlated.For the high contributor of vehicle exhaust, decreased R 2 (0.915) and percentage contributions (19% for VOC and 23% for PM 2.5 ) were obtained, compared to the ones in the D Comb data set (Table 3).On the other hand, for the low contributors of coal combustion and natural gas, their R 2 increased to 0.784 and 0.953, respectively, although their percentage contribution estimates also increased.
Source profile collinearity was investigated by replacing natural gas in D Comb with diesel exhaust which has a profile highly correlated with that of vehicle exhaust (Chin and Batterman, 2012).A lower R 2 of 0.516 and higher percentage contributions (15% for VOC and 11% for PM 2.5 ) were observed for diesel exhaust, compared to those of natural gas in the original D Comb data set (Table 3, D FCor ).The problem of profile collinearity might be solved by including additional species to reduce similarity between profiles (Chin and Batterman, 2012).These above analyses indicated a trade-off between high and low contributors when they correlated with each other.The contribution estimates of the high contributors may be underestimated while those of the low contributors may be overestimated.
The percentage contributions in the most comprehensive data set, D C Comb , which combine profile variation, measurement error, and source collinearity, are given in Table 3.The modeled percentage contributions are comparable with the input contributions.A comparison between the simulated and retrieved source profiles is shown in Fig. 3.The error bars represent the standard deviations of each species for 30 runs.Good agreement was obtained between the input source profiles and the model estimates.It should be noted that the simulated data cannot completely illustrate the complexities of the real world.Nevertheless, results based on synthetic data are valuable for providing insights regarding the limitations and abilities of the improved receptor model.
One additional issue to be discussed is determining the appropriate number of factors in the model.The above analysis assumed that the number of factors was known (i.e., 5 in this study).In the real world, the number of sources is generally unknown.Thus, we applied the model to the D C comb data set and varied the number of factors between 3 to 7. Next, we observed changes in the maximum individual column mean (IM) and the maximum individual column standard deviation (IS) of the scaled residual matrix versus the number of factors in the model (Lee et al., 1999;Brinkman et al., 2006;Heo et al., 2009).Because previous studies suggested that Q values depend greatly on assigned uncertainties, the common evaluation method of comparing the obtained and theoretical Q values was not used (Paatero and Tapper, 1993;Paatero, 2007).As shown in Fig. 4, the optimal number of factors should be between four and six based on the judging criterion of having an apparent decrease in IM and IS.This information usually provides initial guidance regarding the possible range of numbers of factors rather than pointing toward a single number of factors in the final solution (Lee et al., 2003;Brinkman et al., 2006).Previous studies showed that the optimal fit should be determined based on the most physically meaningful solutions (Lee et al., 2003;Kim et al., 2005;Paatero, 2007;Ulbrich et al., 2009;Yuan et al., 2009).In the four-factor solution of the D C Comb data set, n-heptane, toluene, and m,p-xylene had large residuals that resulted from the combination of industrial coating and coal combustion sources.In the six-factor solution, two factors with profiles resembling vehicle exhaust were obtained.However, one of the factors was not interpretable because its profile lacked important i-pentane and organic carbon species.Consequently, a five-factor solution was the most reasonable.

CONCLUSIONS
In this study, a modified source apportionment model was D MErr Add measurement error to D Pri D Comb Add profile variation and measurement error to D Pri D C Pri Primary data set with correlated source contribution D variation and measurement error to D C Pri between 0.11 and 0.41 was used

Fig. 1 .
Fig. 1.Distribution of the AAE results for the source contributions and profiles in different data sets.

Fig. 2 .
Fig. 2. Distribution of the coefficient of determination (R 2 ) for the source contributions in different data sets.

Fig. 3 .
Fig. 3. Comparison of the simulated (black bar with error as the mean ± SD) and retrieved (gray bar with error as the mean ± SD) source profiles in the D C Comb data set.

Fig. 4 .
Fig. 4. Determination of the number of factors by using the maximum individual column mean (IM, filled circles) and the maximum individual column standard deviation (IS, gray bars) from scaled residuals.

Table 1 .
Simulated data sets and sensitivity analysis scenarios.

Table 2 .
Correlation coefficients between source contributions in the primary data sets.

Table 3 .
R 2 values of the estimated source contributions and the modeled percentage contributions (PC, in %) in different data sets.