What Influences Low-cost Sensor Data Calibration? - A Systematic Assessment of Algorithms, Duration, and Predictor Selection

The low-cost sensor has changed the air quality monitoring paradigm with the capacity for efficient network expansion and community engagement. The surge in its use has sparked new research interests in understanding its data quality. Many studies have employed field calibration to improve sensor agreement with co-located reference monitors. Yet, studies that systematically examine the performance of different calibration techniques are limited in scope and depth. This study comprehensively assessed ten widely used data techniques, namely AdaBoost, Bayesian ridge, gradient tree boosting, K-nearest neighbors, Lasso, multivariable linear regression, neural network, random forest, ridge regression, and support vector machine. We compared their performance using a standardized baseline dataset and their responses to various parameter combinations. We further assessed the training sample size effect to understand the optimal duration of field calibration for achieving good accuracy. Finally, we tested different predictor combinations to address whether the inclusion of more predictors will lead to better performance. Using baseline data, the neural network achieved the best performance, followed by the four regression-based methods, showing very consistent and stable performance. While confirming that the latest research tendency is deep learning, regression is still a viable option for studies with limited effort in parameter tuning and method selection, especially considering its computational efficiency and simplicity. The sample size effect is most evident when the sample size drops below 30%, which is equivalent to six weeks of continuously collected hourly data. Although algorithms react differently to the number of predictors, their performance was typically boosted by adding more predictors, especially the particle count and humidity. 
Our study not only describes an approach of sophisticated data-driven calibration for practical applications, but also provides insights into the compounding impacts of parameters, samples, and predictors in algorithm performance.


INTRODUCTION
Air pollution is one of the leading global mortality risk factors (Apte et al., 2017; Liang and Gong, 2020). Even at low concentrations, fine particulate matter with aerodynamic diameters smaller than 2.5 µm (PM2.5) is significantly associated with an increased health hazard (Bell et al., 2011) and adverse social-environmental effects (Sager, 2019). Increasing evidence shows that socio-economically disadvantaged communities suffer more from higher levels of air pollution (Colmer et al., 2020; Gray et al., 2013; Peled, 2011). There is a critical need to characterize the spatial-temporal patterns of PM2.5 at the granular level to better estimate and mitigate those risks at the individual or community level.
A paradigm shift in granular-level air monitoring is the growing usage of low-cost sensors (LCSs) to supplement conventional sparsely located regulatory stations (Mao et al., 2019; Snyder et al., 2013).
PurpleAir (PA) sensors are equipped with two laser scattering particle counters (Plantower PMS5003) that report independently at approximately 120 s intervals. The Plantower sensors use a fan to draw air through an inlet past the laser, producing scattered light that is detected by a photodiode. A proprietary algorithm developed by Plantower converts the amount of light scatter detected into particle sizes, and then converts particle count (µm dl⁻¹) into mass concentration (µg m⁻³). Because the indoor and outdoor conversion options are different, only data calculated using the outdoor conversion method were used. The mass concentrations for PM1, PM2.5, and PM10 are reported, each averaged over the two channels. If the outdoor particle values reported for the two channels drift apart, the PurpleAir system will downgrade one of the channels and exclude it from the data average. Raw particle counts are also reported in six size bins covering 300 nm to 10 µm, namely particles with diameters greater than 0.3 µm, 0.5 µm, 1.0 µm, 2.5 µm, 5.0 µm, and 10 µm. PA sensors also use a Bosch BME280 sensor to estimate relative humidity (RH), temperature, dew point, and pressure. Data transmission and storage are handled by a Wi-Fi module for real-time transmission and a built-in SD card that serves as a backup during internet disconnection.

Reference instrument and calibration system
Reference instruments typically refer to federal reference methods (FRMs) and federal equivalent methods (FEMs) that are used to assess compliance with the National Ambient Air Quality Standards (NAAQS) in the U.S. (U.S. EPA, 2011), or similar sampling technologies in other countries (Cao et al., 2013). FRMs and FEMs commonly use more sophisticated and regularly maintained technologies for particle mass measurement, such as direct gravimetric methods, beta attenuation, and oscillating microbalance methods (Schmidt-Ott and Ristovski, 2003). Despite their gold standard role in air quality monitoring, their implementation and operational costs are high. For instance, it costs approximately $50 million per year to maintain the U.S. national ambient air quality monitoring system (U.S. GAO, 2020). In addition, site selection is primarily based on population density, with less consideration of other factors such as social inequality (Watson et al., 1997).

Data Collection and Cleaning
Here, we employed a US-wide PurpleAir correction dataset from previous EPA work (Barkjohn et al., 2021) so that the results are generic enough to avoid location-specific biases. Part of the collocation data was obtained from sensor calibration experiments operated by air monitoring agencies. Another portion came from privately owned sensors that are within 30 m of an active EPA Air Quality System site reporting PM2.5 and whose identities have been confirmed by a local air monitoring agency. A thorough data cleaning was performed to ensure data quality, following these steps (Fig. 1): 1) One Iowa dataset that constituted 55% of the entire collocated dataset was thinned from 10,907 to 3,762 data points to better balance the datasets among the states and to avoid building a final model that is Iowa dependent. All high-concentration data (≥ 25 µg m⁻³) were retained and low-concentration data were randomly drawn; 2) A 90% completeness threshold was applied to the data so that the daily averages are truly representative; 3) Extremely high and low values in PM2.5, temperature (> 540°C), and RH (> 100%) collected by PA were removed; and 4) Each PA unit has two identical Plantower sensors (referred to as channels hereafter), and the agreement between the data collected from both channels can indicate potential data outliers. We first calculated the absolute and percentage differences between the two PA channels using their 24-hour averages. The percentage difference is the absolute difference divided by the average of the two channel readings. The percentage difference was used to handle channel disagreement in high-concentration scenarios that cannot be captured by the absolute difference. Records that had an absolute difference of 5 µg m⁻³ or that fell outside two standard deviations of the entire percent difference dataset were removed.
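The channel-agreement step (step 4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the cleaning code used in the study: the function name `filter_channel_disagreement`, the inclusive 5 µg m⁻³ cut-off, the reading of "two standard deviations" as a band around the mean percentage difference, and the sample values are all hypothetical.

```python
from statistics import mean, stdev


def filter_channel_disagreement(records, abs_limit=5.0, n_sd=2.0):
    """Drop 24-hour records whose two channels disagree.

    records: list of (channel_a, channel_b) daily mean PM2.5 values (ug/m3).
    Assumption: a record is dropped when its absolute difference is at least
    `abs_limit`, or when its percentage difference lies more than `n_sd`
    standard deviations away from the mean percentage difference.
    """
    abs_diff = [abs(a - b) for a, b in records]
    # Percentage difference = absolute difference / mean of the two channels.
    pct_diff = [
        d / ((a + b) / 2) if (a + b) > 0 else 0.0
        for d, (a, b) in zip(abs_diff, records)
    ]
    center, spread = mean(pct_diff), stdev(pct_diff)
    kept = []
    for rec, d, p in zip(records, abs_diff, pct_diff):
        if d >= abs_limit:
            continue  # fails the absolute-difference check
        if abs(p - center) > n_sd * spread:
            continue  # fails the percentage-difference check
        kept.append(rec)
    return kept


# Made-up daily channel averages, for illustration only.
sample = [(10, 10.5), (20, 21), (8, 8.2), (15, 15.5),
          (12, 12.4), (50, 58), (3, 1)]
kept = filter_channel_disagreement(sample)
```

In this made-up sample, the pair (50, 58) is removed by the absolute check and the pair (3, 1) by the percentage check, which shows why the two criteria complement each other at opposite ends of the concentration range.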
Because no Texas site was included in the national dataset, we supplemented it with the field calibration data that we collected at the Texas Commission on Environmental Quality (TCEQ) Denton Airport South station (EPA site number: 481210034, Lat: 33.2190759, Long: -91.19962841). From April 12, 2020 to September 17, 2020, four PA sensors were placed close (< 5 m) to an FEM regulatory instrument (BAM). To reduce data redundancy, we picked only one sensor, the one with R² > 0.9 between its two channels and with the highest agreement with the other units during the same deployment period. After that, the same data download and cleaning procedures were applied.
The final dataset contains 50 PA sensors located in 16 states across 39 sites (Fig. 2), with a total of 12,705 records. California and Iowa have 19 sensors and account for almost 60% of the total number of data records. The longest data collection period was 833 days and the shortest was only two days. Thirty-eight sites contribute over 100 records, which is equivalent to approximately a three-month period of data collection (Fig. S1). Overall, the PA sensors are in good agreement with the reference data but tend to overestimate the ambient PM2.5 level (Fig. 2). Using linear regression, the mean R² between the PA and reference data across all sites is 0.88, with the highest agreement at 0.996 and the lowest at 0.468. Detailed site information and a data summary can be found in the supplementary file. One example of the time-series comparison between the PA and reference data collected at the Texas site is displayed in Fig. S2.

Data Experiments
Testing the effects of different algorithms and parameter combinations
We tested ten widely applied and openly accessible machine learning algorithms that can be roughly divided into four groups: regression-based, distance-based, network-based, and ensemble (Table 1).
Fig. 2. The map shows the collocation sites. The color symbol indicates the total number of PA data records and the size symbol indicates the total number of collocation sites for each state. The scatter plot shows the relationship between the mean PM2.5 concentration reported by the PA sensors and their corresponding reference station during the calibration period. The marks are labeled by site. Color represents the R² between those two PM2.5 values. The box plot displays the five-number summary of the mean PA data.
Regression-based algorithms. As one of the earliest methods tested, multivariate linear regression (MLR) models one response variable as a linear function of a set of explanatory variables. In LCS calibration studies, the readouts of the reference instrument are the response variable and the LCS data are the main explanatory variable. Other influencing factors, including environmental or mechanical ones (e.g., temperature, RH, sensor age), have also been widely used under the assumption that all factors respond linearly to the reference data. Ordinary least squares is often used by default in MLR to estimate the coefficients by minimizing the sum of the squared residuals. The final selected US-wide correction model for PA sensors adopted the MLR form (Formula (1); Barkjohn et al., 2021). Ridge, Bayesian ridge, and Lasso are all extensions of MLR with an additional regularization term that aims to reduce model complexity. Ridge regression adds a tunable L2 norm penalty term (the sum of squares of the coefficients) to the optimization. Alpha is the parameter that balances the minimization of the residual sum of squares and the magnitude of the coefficients. The model complexity tends to decrease as the alpha value increases. An optimal alpha provides a trade-off between significant overfitting at low alpha values and underfitting at high alpha values. Bayesian ridge regression expresses the regularization in probabilistic terms. The model is estimated by iteratively maximizing the marginal log-likelihood of the observations (Pedregosa et al., 2011). Lasso performs L1 regularization by adding a penalty on the sum of the absolute values of the coefficients to the optimization. Its alpha works similarly to that of ridge regression. Support vector machine (SVM) regression finds the best-fit line as the hyperplane that contains the maximum number of points within its margin. SVM uses kernel functions, including linear, polynomial, and Gaussian radial basis function (RBF) kernels, to map a low-dimensional data space into a higher-dimensional space in which data points can be better separated.
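The role of alpha in ridge regression can be illustrated with a short sketch. It assumes the standard closed-form ridge solution with an unpenalized intercept rather than the scikit-learn implementation used in the study; the function name `ridge_fit`, the synthetic predictors, and the alpha values are hypothetical.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge fit: minimise ||y - Xw - b||^2 + alpha * ||w||^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    # Centring the data keeps the intercept b out of the penalty.
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


# Synthetic data (an assumption for illustration, not the study's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.1, 0.0]) + 5.0 + rng.normal(scale=0.5, size=200)

# The coefficient norm shrinks as alpha grows; alpha = 0 reduces to OLS.
alphas = (0.0, 10.0, 1000.0)
norms = [float(np.linalg.norm(ridge_fit(X, y, a)[0])) for a in alphas]
```

With alpha set to zero the fit coincides with ordinary least squares, and larger alpha values pull the coefficients toward zero, which is the complexity reduction described in the text.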
Distance-based algorithm. K-nearest neighbors (KNN) is a distance-based method that predicts the value of a new data point as the mean of the values of its nearest neighbors. K is the number of nearest neighbors. The neighbors can be weighted in two ways: uniform weighting treats all neighbors equally, whereas distance-based weighting assigns higher weights to closer neighbors.
Network-based algorithms. Neural network (NN) is relatively new but attractive to users because of its superior performance (Okafor et al., 2020; Yamamoto et al., 2017). One previous study reported a 10% increase in R² from MLR to NN, with the improvement attributable to the ability of NN to capture the data variation (Mahajan and Kumar, 2020). An NN is an architecture consisting of highly interconnected processing units (neurons) that are organized in layers. The weights of the neurons are tuned and optimized through a supervised learning process.
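The two KNN weighting schemes described in the distance-based paragraph can be compared with a small pure-Python sketch. It is an illustration only, not the scikit-learn regressor used in the study; the function name `knn_predict` and the toy sensor and reference values are hypothetical.

```python
import math


def knn_predict(train_X, train_y, query, k=3, weights="uniform"):
    """Predict a value as the (weighted) mean of the k nearest neighbours."""
    # Sort the training points by Euclidean distance and keep the closest k.
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    if weights == "uniform":
        return sum(y for _, y in nearest) / len(nearest)
    if weights == "distance":
        # An exact match would give an infinite weight, so return it directly.
        exact = [y for d, y in nearest if d == 0.0]
        if exact:
            return sum(exact) / len(exact)
        inverse = [(1.0 / d, y) for d, y in nearest]
        return sum(w * y for w, y in inverse) / sum(w for w, _ in inverse)
    raise ValueError("weights must be 'uniform' or 'distance'")


# Toy example: sensor PM2.5 readings (predictor) vs. reference values.
train_X = [(10.0,), (12.0,), (20.0,)]
train_y = [8.0, 9.0, 15.0]
uniform_pred = knn_predict(train_X, train_y, (11.0,), k=3, weights="uniform")
distance_pred = knn_predict(train_X, train_y, (11.0,), k=3, weights="distance")
```

Here the distant neighbor at 20 µg m⁻³ pulls the uniform prediction upward, while distance weighting discounts it, so the two schemes can diverge noticeably when neighbors are unevenly spaced.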
Ensemble methods. Ensemble techniques typically combine many weaker learners to create a strong learner. AdaBoost belongs to the first generation of boosting algorithms; another successful ensemble example is the random forest, which builds decision trees independently and combines the results at the end. Both methods have a main parameter, the number of estimators or trees, that controls the structure. Generally, a larger number of estimators can lead to better performance but longer training time, and the accuracy plateaus beyond a certain number of estimators. Gradient boosting differs by building one tree at a time and combining results along the way in a forward stage-wise fashion. A larger number of boosting stages usually results in better performance. The subsample fraction is the fraction of samples used to fit the individual base learners. A fraction less than one may lead to a reduction of variance and an increase in bias.
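A parameter sweep over the number of boosting stages and the subsample fraction might look like the sketch below. It uses scikit-learn's `GradientBoostingRegressor`, but the synthetic data, the coefficients that generate it, and the parameter values are assumptions for illustration and do not reproduce the grid tested in the study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for collocation data (arbitrary, illustrative values).
rng = np.random.default_rng(42)
pa_pm25 = rng.uniform(0, 60, size=600)   # hypothetical sensor PM2.5
rh = rng.uniform(20, 95, size=600)       # hypothetical relative humidity
reference = 0.5 * pa_pm25 - 0.05 * rh + 5 + rng.normal(scale=1.5, size=600)

X = np.column_stack([pa_pm25, rh])
X_train, X_test, y_train, y_test = train_test_split(
    X, reference, test_size=0.1, random_state=0
)

# Record the test-set R2 for each parameter combination.
scores = {}
for n_estimators in (5, 50, 200):
    for subsample in (0.5, 1.0):
        model = GradientBoostingRegressor(
            n_estimators=n_estimators, subsample=subsample, random_state=0
        )
        model.fit(X_train, y_train)
        scores[(n_estimators, subsample)] = model.score(X_test, y_test)
```

Comparing the entries of `scores` across `n_estimators` for a fixed `subsample` gives a quick view of where the accuracy starts to plateau.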
Since there is no gold standard for choosing the optimal parameters, we tested a range of parameters that are recommended by the algorithm documentation or close to the default values in the source library, scikit-learn (Pedregosa et al., 2011). Python 3.9.7 was used to implement those algorithms. All code is available at: https://github.com/unt-geo/Calibration

Sample size effect
A critical question in data-driven techniques is how much training data is needed to achieve a specific performance goal. In the context of LCS field calibration, we aim to answer two questions: 1) As training data grows, will performance continue to improve? 2) Does the sample size effect vary by algorithm?
A proper test of the sample size effect requires a dataset that is balanced in both geography and size. Otherwise, the assessment may be misleading. To reduce the bias, we first adjusted the whole daily average dataset by selecting all 38 sites with more than 100 days of data and then randomly selecting 100 data points from each site. The resulting dataset of 3,800 records was used to conduct the sample size experiment and was randomly split into 90% for the training set and 10% for the test set. The training data were used to fit the model and the test data were used to provide an unbiased evaluation of the fitted model. We further prepared training datasets at different sample sizes. Specifically, we constructed 10 sets of training samples with 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100% of the entire dataset in the order of data collection time.
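The sample size experiment can be sketched as follows. This is an illustration under several assumptions: the data are synthetic, row order stands in for collection time, the percentages are taken relative to the training set, and ordinary least squares stands in for the ten algorithms. The helper names `r2` and `design` are hypothetical.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def design(A):
    """Add an intercept column to a predictor matrix."""
    return np.column_stack([np.ones(len(A)), A])


# Synthetic stand-in for the 3,800-record balanced dataset.
rng = np.random.default_rng(1)
n = 3800
X = np.column_stack([rng.uniform(0, 60, n), rng.uniform(20, 95, n)])
y = 0.5 * X[:, 0] - 0.05 * X[:, 1] + 5 + rng.normal(scale=2.0, size=n)

# Random 90/10 split; sorting restores time order within the training set.
idx = rng.permutation(n)
test_idx = idx[: n // 10]
train_idx = np.sort(idx[n // 10:])

# Fit on the first 10%, 20%, ..., 100% of the time-ordered training records
# and always evaluate on the same held-out test set.
results = {}
for pct in range(10, 101, 10):
    subset = train_idx[: len(train_idx) * pct // 100]
    coef, *_ = np.linalg.lstsq(design(X[subset]), y[subset], rcond=None)
    results[pct] = r2(y[test_idx], design(X[test_idx]) @ coef)
```

Keeping the test set fixed while only the training subset grows isolates the effect of sample size from the effect of a changing evaluation set.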

Predictor selection
Many previous works focus on using a single variable, the PA PM2.5 concentration, to run the calibration model. However, because the response of LCSs is controlled by a range of internal and ambient environmental factors, it can be beneficial to include additional influencing variables in the modeling process, as evidenced in the previous literature (Gao et al., 2015). Similar to the sample size effect, the key questions are whether more predictors will lead to better performance, and whether a plateau effect exists in the variable selection for certain algorithms.
In this experiment, we picked seven variables that are commonly used in calibration studies, namely PM2.5 concentration (PM2.5 conc), PM2.5 count (C2.5), PM1 count (C1), PM5.0 count (C5), PM10 count (C10), humidity (RH), and temperature (T). PM2.5 conc is the mass concentration generated by the proprietary algorithm developed by the laser counter manufacturer Plantower, which incorporates assumptions about the potentially varying density and shape of the particles. However, because these assumptions are undisclosed, the assumed particle properties are unlikely to match those observed in the field. With this consideration, we included the other type of PA output values, the particle counts in different size bins, which are the raw reports of airborne particle numbers. Some studies have found that particle counts have good explanatory power in the calibration model (Zusman et al., 2020).
We first tested the effects of each single predictor on explaining the variance of reference data using univariate linear regression. We then tested the combined effects of multiple predictors. Datasets 2-5 incorporated the RH and T to account for the known sensitivity of sensors to fluctuations in meteorological conditions (Castell et al., 2017). RH influences the LCS readings by changing the particle size and the refractive index when water condenses onto particles (Di Antonio et al., 2018;Molnár et al., 2020). The water moistening effect also partially explains the typical overestimation of LCSs, which is especially evident when RH exceeds 75%. Temperature interferes with the nature of the aerosol samples and impacts the sensor performance, especially in the ambient environment (Olivares and Edwards, 2015). However, how the sensors respond to the temperature is less studied and still unexplained.
The seven variables were combined into five datasets (Table 2). For example, Dataset 3 included three variables while Dataset 4 used seven variables. Dataset 0, which uses PM2.5 concentration as the single explanatory variable, served as the baseline for comparison. Other variables were gradually added according to their importance values obtained from the single-variable test (Tables S3 and S4).

Accuracy metrics
We used the coefficient of determination (R²) to quantify the proportion of the variation in the dependent variable that can be predicted by the model from the independent variables. Root mean squared error (RMSE) was used as an index of the average magnitude of the prediction error. In this paper, we report how those algorithms respond to adjustments in training data and parameters. Accuracy values were used as an indicator of the degree of response. However, we

Effects of Algorithm and Parameter Settings
We compared the effects of different algorithms on sensor data calibration using the baseline Dataset 0, tested with both the default and the optimal parameter settings (Fig. 3, Table 3). Between the two major categories, the regression-based methods achieve overall high accuracies, except for Lasso. The ensemble methods show the largest discrepancies in performance, with gradient tree boosting (GTB) proving to be the best and AdaBoost (AB) the worst. NN slightly outperforms some models, although at a higher computational cost.
On average, the KNN models tend to perform best using a uniform weight function with a lower number of input features and a distance-based weight function with a higher number of variables (Fig. S3). For NN, neuron count and layer count seem to have similar levels of impact on the performance; both increase the model's ability to create a representation of the input, but more neurons increase the amount of information gained while the number of layers increases attention to increasingly fine details (Fig. 4).
In the ensemble methods, AdaBoost performs best on average when using a smaller number of estimators (Fig. S4). The RF is very sensitive to the number of trees only when a few trees (fewer than seven) are used, and its performance largely stagnates as the number of trees increases further (Fig. S5). This is because the trends in the data can be largely accounted for using seven or more trees: most outliers are eliminated and overfitting to a particular input is diminished, so adding more trees has little effect. Gradient tree boosting improves as the number of trees increases until it passes 50, whereas the sample fraction produces only mild performance changes (Fig. S6).
Ridge regression, a modification of the linear regression model, performs nearly identically to MLR (Figs. S7 and S8). It appears that introducing a small amount of bias to the linear regression model does not significantly change the performance. Lasso regression performs the worst among all models, and its performance worsens further as alpha increases. As alpha decreases, meaning that Lasso approaches ordinary linear regression, the model fits the data better (Fig. S9). Bayesian ridge regression performs similarly to regular ridge regression, with a very slight increase in R² for the final dataset (Fig. S10).
For SVM, the kernel plays a big role in determining the model performance. The default RBF kernel outperforms the linear and polynomial kernels. The performance of the linear kernel increases more slowly than that of the other two kernels. RBF shows good performance whereas the polynomial kernel performs poorly, indicating that it is not a good fit for this dataset (Fig. S11). The underperformance of the linear kernel is likely because the dataset is not linearly separable due to the nature of PM2.5. Similarly, the relatively simple third-degree polynomial kernel used in this study does not fit well, especially to the datasets with fewer variables, as these are likely more linearly separable, as shown by the similar performance of the RBF and linear kernels on the datasets with fewer variables.

Sample Size Effects
Most algorithms respond positively to increased training sample sizes, except for Lasso (Fig. 5). The algorithm most affected by the training sample size is AB, whose R² rose by 100% as the training data grew from one-tenth to 80% of the whole dataset. SVM, Lasso, and NN are the least affected. With a very small dataset (i.e., two weeks of hourly data, about 340 data points in this study), SVM, NN, and Lasso can produce relatively good results. When the dataset is rich (i.e., half a year of hourly data), nine out of eleven algorithms reach an R² higher than 0.8, with NN and RF especially high (over 0.9). Generally speaking, the sample size effect is most evident when the sample size drops below 30%.
Calibration duration has been recognized as a non-negligible factor in calibrating LCSs. The sample size effect can also provide insights into the optimal length of time to co-locate LCSs with a reference instrument. Using our compiled national dataset, there is a consensus among the various algorithms that the accuracy improves the most as the sample size increases to approximately 1000 records, which is equivalent to six weeks of continuously collected hourly data. Past this threshold, the accuracy improves more slowly or remains stable.
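The record counts quoted for the collocation durations follow from simple arithmetic, shown here as a short sketch. It assumes gap-free hourly reporting; the function name `hourly_records` is hypothetical.

```python
HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7


def hourly_records(weeks):
    """Number of hourly records in `weeks` of gap-free collection."""
    return weeks * DAYS_PER_WEEK * HOURS_PER_DAY


two_weeks = hourly_records(2)  # the "about 340 data points" small-sample case
six_weeks = hourly_records(6)  # the "approximately 1000 records" threshold
```

In practice a deployment needs somewhat longer than six calendar weeks to reach this count, since outages and data cleaning remove some records.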

Effects of Predictors
Figs. 4 and S3-S11 display the results of the predictor selection. As a comparison, we applied the US-wide correction model (Formula (1)) to our dataset, which yielded an R² of 0.76 and an RMSE of 2.63. For KNN, the input variables play a more significant role in the performance than the model parameters, as would be expected, with sharp performance increases between the first, second, and third datasets and a moderate performance boost between the third and final datasets (Fig. S3). Because KNN is a non-parametric method, more variables create a higher-dimensional space for the distance calculation, which typically leads to more refined predictions. As the dimension gets higher, the advantages of multivariate distance calculation become weaker. In NN, the effect of the number of layers and neurons depends on the sample size. With more data, higher numbers of both layers and neurons improve the model's performance, although the gains begin to stagnate at the highest values. This contrasts with the results at low levels of data, where the lower values perform better. This is likely because, at lower sample sizes, the larger neural networks are more likely to overfit the small amount of training data, since there is not enough data to generalize well at that level of detail.
AdaBoost performs best on average when using a smaller number of estimators. This trend is especially apparent for Datasets 3 and 4, where performance decreases drastically as the number of estimators increases (Fig. S4). Gradient tree boosting shows larger performance gains with the larger datasets than with the smaller ones.
The MLR model performance increases slowly after the second dataset is introduced. For both Ridge and Bayesian ridge regression, the dataset used does not significantly increase performance, except for going from the first dataset to the second. For Lasso, the dataset used makes almost no difference to the poor performance. For SVM, the inclusion of more predictors can lead to about a 10% increase in R² from Dataset 0 to Dataset 4, regardless of the kernel type used.
In general, the inclusion of humidity and PM2.5 count can improve model performance, as these two factors carried the heaviest coefficient weights in the four regression models (Table S3). PM2.5 also obtained the highest importance score in the three tree-based models (Table S4). Temperature and particle counts at other sizes only slightly influence the outcome, even though temperature shows a slightly better correlation with the reference data than RH when each variable is evaluated on its own (Table S2). This can be attributed to two reasons. First, ambient temperature has not been proven to significantly influence the physicochemical properties of PM particles. Second, all particle counts are strongly correlated, and including highly correlated variables can introduce multicollinearity and data redundancy issues to the model. Particle counts at the different sizes all show high correlation (Table S1), and they can be good proxies for PA concentration data when the count-to-concentration conversion formula is not publicly available.
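The multicollinearity concern raised above can be checked with a correlation matrix and variance inflation factors (VIFs). The sketch below is an illustration only: the three particle-count columns are synthetic and deliberately constructed to be correlated, so the numbers do not correspond to Table S1.

```python
import numpy as np

# Synthetic, deliberately correlated particle counts (hypothetical values).
rng = np.random.default_rng(7)
base = rng.lognormal(mean=6.0, sigma=0.5, size=500)
counts = np.column_stack([
    base,                                     # stand-in for C1
    0.6 * base * rng.normal(1.0, 0.05, 500),  # stand-in for C2.5
    0.1 * base * rng.normal(1.0, 0.10, 500),  # stand-in for C10
])

# Pearson correlation between the predictor columns.
corr = np.corrcoef(counts, rowvar=False)

# The VIF of each predictor equals the matching diagonal entry of the
# inverse correlation matrix; values well above 5 to 10 are a common
# rule-of-thumb sign of multicollinearity.
vif = np.diag(np.linalg.inv(corr))
```

When predictors show VIFs this large, dropping all but one of the correlated counts usually loses little information, which is consistent with the minor role found for particle counts at other sizes.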
We note that some potentially important predictors, such as sensor age, are not included in the analysis due to data limitations. Dust sensors lose sensitivity and their accuracy drifts over time (De Vito et al., 2020; Jiao et al., 2016), which becomes another potential source of measurement artifact (Hasenfratz et al., 2012). The PurpleAir sensor has a shorter shelf life than high-end reference instruments, and its accuracy is found to degrade 1 to 1.5 years after deployment (informal communication through PurpleAir User Group). Other factors influencing LCS performance include wind speed, sensor temperature, and sensor type (Liang, 2021).

CONCLUSIONS
Failure to invest in calibration may leave large uncertainties in LCS data, which further hinders their broader application. As a result, field calibration of LCSs has been recognized by a larger user group as a critical and necessary step before LCS deployment for evaluating their reliability and improving their accuracy. Despite the increasing interest, there is an evident knowledge gap on how data-driven algorithms affect calibration performance. This paper aims to provide a first-hand report on the performance of each algorithm and on the impacts of sample size and predictor selection. The key findings are summarized below.
Algorithms respond differently to the baseline dataset, and the variation among them is large. While this study implies that NN and GTB slightly outperform the other methods, users should test the algorithms on their own data, as datasets behave differently. Regression-based methods show the most consistently high accuracy, and we thus recommend them as a viable option for studies with limited effort in parameter tuning and method selection.
The sample size effect is evident in our experiment, especially when the sample size is small. Regardless of the algorithm type, the accuracy drops significantly when the calibration model is trained using fewer than 1000 records, which is equivalent to six weeks of continuously collected hourly data. However, more training data does not always lead to higher accuracy. The accuracy plateaus when the training sample reaches a certain level, which varies slightly among algorithms.
More predictors lead to better accuracy, but the improvement is most evident when PM2.5 particle count and humidity are added to the data models. Temperature and particle counts at other sizes play a minor role. Considering the tradeoff between computational efficiency and more predictors, we suggest including PM2.5 concentration, particle count, and humidity when building the model.