Multiple PM Low-Cost Sensors, Multiple Seasons’ Data, and Multiple Calibration Models

In this study, we combined state-of-the-art data modelling techniques (machine learning [ML] methods) and data from state-of-the-art low-cost particulate matter (PM) sensors (LCSs) to improve the accuracy of LCS-measured PM2.5 (PM with aerodynamic diameter less than 2.5 microns) mass concentrations. We collocated nine LCSs and a reference PM2.5 instrument for 9 months, covering all local seasons, in Bengaluru, India. Using the collocation data, we evaluated the performance of the LCSs and trained around 170 ML models to reduce the observed bias in the LCS-measured PM2.5. The ML models included (i) Decision Tree, (ii) Random Forest (RF), (iii) eXtreme Gradient Boosting, and (iv) Support Vector Regression (SVR). A hold-out validation was performed to assess the model performance. Model performance metrics included (i) coefficient of determination (R²), (ii) root mean square error (RMSE), (iii) normalised RMSE, and (iv) mean absolute error. We found that the bias in the LCS PM2.5 measurements varied across different LCS types (RMSE = 8–29 µg m⁻³) and that SVR models performed best in correcting the LCS PM2.5 measurements. Hyperparameter tuning improved the performance of the ML models (except for RF). The performance of ML models trained with significant predictors (fewer in number than the number of all predictors, chosen based on recursive feature elimination algorithm) was comparable to that of the ‘all predictors’ trained models (except for RF). The performance of most ML models was better than that of the linear models. Finally, as a research objective, we introduced the collocated black carbon mass concentration measurements into the ML models but found no significant improvement in the model performance.


INTRODUCTION
Over the last decade, air pollution low-cost sensors (LCSs) have become popular and are complementing the existing air pollution monitoring capacity around the world (Kumar et al., 2015; Rai et al., 2017; Gupta et al., 2018; Morawska et al., 2018). The number of air quality studies using LCSs has increased tremendously in recent years. LCSs are easier to handle, install, and maintain than reference-grade monitors. Further, they are capable of measuring air pollutants at a high temporal resolution and can improve the granularity of monitoring. Strategically placed LCSs can provide detailed information on air quality and its variability within a region.
The affordability and simplicity of LCSs, however, come with the trade-off of accuracy. In general, LCS measurements of air pollutants are often less accurate than reference-grade measurements (Clements et al., 2017). In the case of particulate matter (PM) LCSs, most LCSs quantify PM mass concentrations using the light scattering (nephelometric) principle. This technique is sensitive to aerosol microphysical properties and environmental factors (e.g., aerosol size distribution, aerosol refractive index, and humidity) in addition to the particle mass concentration. Moreover, LCSs suffer from declining sensing accuracy with age. These aspects can introduce bias in PM measurements, thereby requiring evaluation and correction to ensure accuracy. A common practice for evaluating the performance of LCSs and deriving correction factors for LCS-measured pollutant concentrations is to collocate the LCS and reference-grade instrument and analyse the collocation data. Several studies have applied a range of training-based models (from simple linear regression to machine learning [ML] algorithms) to the collocation data to derive regression coefficients/functions that have been used to correct the LCS-measured PM2.5 (e.g., Barkjohn et al., 2020; deSouza et al., 2022). Studies have shown that data correction increases the accuracy of LCS PM2.5 measurements and the corrected values are comparable to reference-grade PM2.5 measurements (Tryner et al., 2020; McFarlane et al., 2021; deSouza et al., 2022; Sreekanth et al., 2022).
One of the most extensive and systematic evaluations of LCS measurements has been performed by the Air Quality Sensor Performance Evaluation Center (AQ-SPEC; www.aqmd.gov/aqspec) of the South Coast Air Quality Management District, United States. The program evaluated 39 PM LCSs based on chamber experiments and field collocations and found that the performance of the LCSs varied considerably among manufacturers and models (AQ-SPEC, 2019). However, similar institutional-level LCS evaluation facilities are lacking in other countries, especially in developing countries where regulatory PM2.5 monitoring devices are scarce and sparsely located (Brauer et al., 2019). Pollutant concentrations in low- and middle-income countries (LMICs) are often extreme and the sources heterogeneous, unlike in high-income countries that have the capacity for extensive LCS testing. It is therefore critical that LCSs are rigorously evaluated in different LMIC settings and that their measurements are corrected accordingly.
In India, studies have applied statistical models to correct LCS measurements of PM2.5 (Sreekanth et al., 2022), but only a limited number of studies have used ML methods (Kumar and Sahu, 2021). In this study, we investigated the performance of multiple PM2.5 LCSs and trained several ML models using collocation data to correct the hourly mean LCS PM2.5. The collocation experiment was conducted in Bengaluru city (south India), and the study period covered all major seasons (December 2021–August 2022). To our knowledge, this is one of the first studies from LMICs to evaluate multiple PM2.5 LCSs at one geographical location. As a case study, we also introduced collocated black carbon (BC) mass concentration measurements as an additional predictor in the ML models to investigate any possible improvement in model performance.

MATERIALS AND METHODS
In total, nine PM2.5 LCSs (Fig. S1) were collocated with a beta attenuation monitor (BAM, a reference instrument for measuring PM2.5) on the roof terrace of the Center for Study of Science, Technology, and Policy (CSTEP) building. CSTEP (13.04°N, 77.57°E) is located in the northern part of Bengaluru city. Bengaluru is the administrative capital of Karnataka and is located at an elevation of 900 m above mean sea level. The city experiences a tropical savanna climate throughout the year, with an annual rainfall of ~960 mm (June, July, August, and September are the monsoon months). The annual PM2.5 concentrations are ~27 µg m⁻³, with higher values during winter (~35 µg m⁻³), followed by pre-monsoon, post-monsoon, and monsoon seasons. In this study, collocated PM2.5 measurements from BAM and the LCSs were obtained for 9 months from December 2021 to August 2022.

BAM
In the current study, we used a BAM (BAM1022; Met One Instruments, Inc., Grants Pass, USA) to measure hourly mean ambient PM2.5 levels. BAM1022 is a United States Environmental Protection Agency-certified Federal Equivalent Method class instrument for measuring PM2.5 levels. BAM1022 uses carbon-14 (¹⁴C) as a beta particle source and operates at a nominal flow rate of 16.67 litres per minute. Based on the difference in the beta attenuation of the glass fibre filter tape before and after PM2.5 loading and the flow rate, BAM estimates the PM2.5 mass concentration. The PM2.5 measurement range of the BAM is between −15 µg m⁻³ and 10,000 µg m⁻³. The BAM comprises a heater that removes moisture from the sampled ambient airflow. It is equipped with a meteorological sensor that is capable of measuring ambient temperature, relative humidity (RH), and pressure. More details on the BAM and the precision of its PM2.5 measurements have been reported previously (Kushwaha et al., 2022). PM2.5 data from the hourly channel of the BAM were used for our analyses and model training.

LCSs
All LCSs used in the study were compact, Internet of Things (IoT)-based devices comprising a laser PM sensor and a meteorological sensor. The following LCSs were used: Aerogram (https://aerogram.in/), Airveda (https://www.airveda.com/), Atmos I and Atmos II (http://urbansciences.in/), BlueSky (https://tsi.com/), PAQS (https://paqs.biz/), Prana Air (https://www.pranaair.com/), Prkruti (https://www.prkruti.com/), and PurpleAir (https://www2.purpleair.com). The internal laser PM sensor consists of a micro fan to draw the ambient air inside the optical chamber, where particles are detected using the light scattering technique. Data logging and averaging intervals of these LCSs varied between 30 s and 30 min. All LCSs were equipped with a meteorological sensor capable of measuring the temperature and RH. The PM2.5 measurement range of the LCSs was between 0 and 1,000 µg m⁻³. Most LCSs were equipped with a microSD card, which stores data locally in addition to cloud storage. Two versions of Atmos were used in this study, with a different basic laser PM sensor in each. Plantower-based Atmos was named Atmos I, whereas Sensirion-based Atmos was named Atmos II. Most LCSs investigated in this study had Plantower (PMS5003/PMS7003) as the internal laser PM sensor, whereas other LCSs were equipped with Sensirion, Nova, Honeywell, PAS-OUT-01, and Winsen laser sensors. PurpleAir was equipped with dual Plantower sensors and output PM2.5 data in two channels labelled CF_1 and CF_ATM (Barkjohn et al., 2020). We trained individual ML models for the PM2.5 values from PurpleAir for both CF_1 and CF_ATM channels. Technical and operational details of the LCSs are listed in Table S1.

Aethalometer (AE33)
We used a rack-mount Aethalometer (AE33, Aerosol Co. Ljubljana, SI) to measure BC mass concentrations. AE33 measures filter attenuation (before and after aerosol loading) at seven wavelengths and is capable of providing high temporal-resolution PM absorption mass concentration data. Values measured at 880-nm wavelength were considered BC mass concentrations. AE33 uses DualSpot™ technology that compensates for loading errors, which are commonly observed in most filter-based optical analysers. The instrument was configured to operate at a flow rate of 2 litres per minute and log 1-min average concentrations. A 2.5-micron cut cyclone was installed in the inlet of AE33 to allow particles smaller than 2.5 µm into the detection chamber.

ML Models
We explored four different ML models: Decision Tree (DT), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Support Vector Regression (SVR). All ML models (except XGBoost) were trained using the scikit-learn package in the Python programming language. We used the xgboost package for training the XGBoost models.

DT
DT is a rule-based model that divides the entire dataset into homogeneous groups. Briefly, it iteratively splits the dataset into regions based on the predictors, choosing the splits that result in the maximum reduction of the error. The estimate for a new sample is the average of the training-set true values at the terminal node to which the sample is assigned by the tree's rules on the predictors. The DT regression models have a few limitations: (i) high variance, (ii) lower predictive performance if the relation between predictors and the response variable is not defined accurately, and (iii) a limited set of predicted values (bounded by the number of terminal nodes).
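Limitation (iii) can be illustrated with a minimal sketch, assuming scikit-learn's DecisionTreeRegressor and synthetic data. The data, the `max_leaf_nodes=8` cap, and all variable names are illustrative and are not taken from the study. A fitted regression tree can output at most one distinct value per terminal node.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic, illustrative data: one predictor and a noisy linear response.
x = rng.uniform(0, 100, size=500).reshape(-1, 1)
y = 0.6 * x.ravel() + rng.normal(0, 3, size=500)

# Cap the tree at 8 terminal nodes (an arbitrary value chosen for the demo).
tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(x, y)

pred = tree.predict(x)
n_leaves = tree.get_n_leaves()
n_distinct = np.unique(pred).size  # each leaf predicts the mean of its samples

print(f"terminal nodes: {n_leaves}, distinct predicted values: {n_distinct}")
```

However many samples are predicted, the number of distinct outputs cannot exceed the number of terminal nodes.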

RF
RF is an ensemble method where parallel DTs are grown on bootstrapped random samples, i.e., subset samples drawn with replacement from the dataset (Breiman, 2001). Each DT is built on a subset of predictors, which reduces the correlation between trees. For a new sample, the predictions from all trees are averaged. One of the drawbacks of RF regression is that the predictions are always within the range of the training samples' true values.
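The range limitation noted above can be seen in a small sketch, assuming scikit-learn's RandomForestRegressor and synthetic data (the data, the number of trees, and the query point are illustrative choices, not the study's settings). The model is trained on a linear relation observed only for predictor values between 0 and 50, and is then queried far outside that interval.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic training data: y is roughly 2 * x, observed only for x in [0, 50].
x_train = rng.uniform(0, 50, size=200).reshape(-1, 1)
y_train = 2.0 * x_train.ravel() + rng.normal(0, 1, size=200)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(x_train, y_train)

# At x = 200 the underlying relation would give about 400, but every tree
# returns an average of training responses, so the forest cannot exceed
# the largest response it saw during training.
y_far = float(rf.predict(np.array([[200.0]]))[0])

print(f"prediction at x=200: {y_far:.1f}; training maximum: {y_train.max():.1f}")
```

In a calibration setting, this means that concentrations higher than any seen during collocation cannot be reproduced by an RF correction.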

XGBoost
XGBoost is an optimised distributed gradient boosting library that implements machine learning algorithms under the gradient boosting framework, also known as the gradient boosting machine (GBM). It builds an ensemble model in a stage-wise manner, where each weak learner (regression tree) is fitted on the residuals from the previous learners (Chen and Guestrin, 2016). Unlike in DTs, each leaf of each regression tree is assigned a score. At each iteration, a new observation is assigned to a leaf of the regression tree according to the tree's rules, and the final prediction is the sum of the corresponding leaf scores across all trees. To prevent overfitting, an additional regularisation term is added to the objective function. The newly added leaf scores are shrunk by a factor eta, and a column subsample is used to find the best feature to split.
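The stage-wise idea can be sketched without the xgboost package. The following is a simplified, hand-rolled boosting loop for squared error that uses scikit-learn regression trees as weak learners. It is not XGBoost itself: it omits the regularised objective, second-order information, and column subsampling, and the function names, the data, and the values of `n_rounds`, `eta`, and `max_depth` are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_rounds=50, eta=0.1, max_depth=2):
    """Stage-wise boosting for squared error: each tree fits the current residuals."""
    base = float(np.mean(y))              # initial constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred               # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        pred = pred + eta * tree.predict(X)   # shrink the new leaf scores by eta
        trees.append(tree)
    return base, eta, trees


def predict_boosted_trees(model, X):
    """Final prediction = base value + sum of shrunken leaf scores over all trees."""
    base, eta, trees = model
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred = pred + eta * tree.predict(X)
    return pred


# Synthetic, illustrative data with a non-linear component.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(300, 1))
y = 10.0 + 0.5 * X[:, 0] + 5.0 * np.sin(X[:, 0] / 10.0)

model = fit_boosted_trees(X, y)
rmse_const = float(np.sqrt(np.mean((y - y.mean()) ** 2)))
rmse_boost = float(np.sqrt(np.mean((y - predict_boosted_trees(model, X)) ** 2)))
print(f"training RMSE, constant model: {rmse_const:.2f}; boosted trees: {rmse_boost:.2f}")
```

A smaller `eta` makes each tree contribute less, which typically calls for more rounds but reduces the risk of overfitting.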

SVR
Linear SVR estimates the relation between input and output variables by finding a linear function such that the deviation of the estimates from the true values of the training points is less than or equal to a specified margin, called the maximum error epsilon (ε), while the function is as flat as possible (Smola and Schölkopf, 2004). The margin ε is described as a tube around the function. Deviations outside the tube are counted as errors. In the case of non-linear SVR, the input space is mapped to a new feature space and linear SVR is applied in that space. The kernel functions of the support vector machine represent the dot product in the new feature space. The drawbacks of a support vector machine are that it is not scale invariant and that it is better suited to smaller datasets, as the computation cost or time increases with the number of training data points. In our study, z-transformed datasets were therefore used for training the non-linear SVR models.
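A minimal sketch of a z-transformed, non-linear SVR follows, assuming scikit-learn's StandardScaler and SVR combined in a pipeline. The synthetic data, the RBF kernel, and the values of `C` and `epsilon` are illustrative assumptions. Here only the predictors are standardised, which may differ from the preprocessing used in the study.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-ins: a reference PM2.5 series and an LCS reading that is
# inflated at high relative humidity (an assumed, simplified bias).
ref_pm = rng.uniform(5, 100, size=n)
rh = rng.uniform(30, 95, size=n)
temp = rng.uniform(15, 35, size=n)
lcs_pm = ref_pm * (1.0 + 0.01 * (rh - 30)) + rng.normal(0, 2, size=n)
X = np.column_stack([lcs_pm, rh, temp])

# The scaler is fitted on the training rows only, so the held-out rows do not
# leak into the z-transform.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=0.5))
model.fit(X[:225], ref_pm[:225])
corrected = model.predict(X[225:])

rmse_raw = float(np.sqrt(np.mean((ref_pm[225:] - lcs_pm[225:]) ** 2)))
rmse_cor = float(np.sqrt(np.mean((ref_pm[225:] - corrected) ** 2)))
print(f"RMSE uncorrected: {rmse_raw:.2f}; RMSE after SVR correction: {rmse_cor:.2f}")
```

Without the scaling step, predictors with large numeric ranges would dominate the kernel distances, which is the scale sensitivity described above.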

Predictors
We used three types of predictor variables to train the ML models: (i) continuous variables, (ii) categorical variables, and (iii) cyclic variables. The continuous variables included mass concentrations of all PM size fractions obtained from the LCSs, meteorological parameters (temperature and RH) from the LCSs, and BAM PM2.5 measurements. As a case study, we used collocated BC data as a predictor, which is a continuous variable. Using the timestamp of the data, we created new categorical variables related to (i) hour of the day, (ii) month of the year, (iii) season, and (iv) weekend/weekday. These variables were transformed before using them in the modelling exercise. We converted weekend/weekday and season-related variables to dummy variables; 'hour of the day' and 'month of the year' were converted to cyclic values using both sine and cosine transformations. Reference PM2.5 (BAM PM2.5) was used as the response variable.
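The timestamp-derived predictors can be sketched as follows. This is an illustrative reading of the transformations described above: the function and key names are hypothetical, the weekend is assumed to be Saturday and Sunday, the season-to-month mapping follows the season definitions given in the Results, and a full one-hot encoding is assumed for the dummy variables.

```python
import math
from datetime import datetime

# Season definitions as used in this paper (JF, MAM, JJAS, OND).
SEASONS = {
    "winter": (1, 2),
    "pre_monsoon": (3, 4, 5),
    "monsoon": (6, 7, 8, 9),
    "post_monsoon": (10, 11, 12),
}


def cyclic_encode(value, period):
    """Map a periodic value onto the unit circle as a (sine, cosine) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def time_features(ts):
    """Build cyclic and dummy predictors from one timestamp."""
    hour_sin, hour_cos = cyclic_encode(ts.hour, 24)
    month_sin, month_cos = cyclic_encode(ts.month - 1, 12)
    features = {
        "hour_sin": hour_sin,
        "hour_cos": hour_cos,
        "month_sin": month_sin,
        "month_cos": month_cos,
        "is_weekend": int(ts.weekday() >= 5),  # assumption: Saturday/Sunday
    }
    for name, months in SEASONS.items():
        features["season_" + name] = int(ts.month in months)
    return features


def circle_gap(a, b):
    """Euclidean distance between two (sine, cosine) encodings."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


example = time_features(datetime(2022, 1, 1, 0))
gap_23_to_0 = circle_gap(cyclic_encode(23, 24), cyclic_encode(0, 24))
gap_0_to_1 = circle_gap(cyclic_encode(0, 24), cyclic_encode(1, 24))
print(example)
```

Using both sine and cosine keeps 23:00 and 00:00 adjacent, whereas the raw hour values 23 and 0 would appear far apart to a model.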

Hyperparameter Tuning
All four types of ML models were investigated for their performance in correcting the LCS PM2.5 measurements with and without hyperparameter tuning. ML models with default hyperparameters in the scikit-learn/xgboost packages were termed as untuned models, whereas ML models with hyperparameter tuning performed using the Grid Search algorithm were termed as tuned models. To evaluate the effect of tuning, all ML models using all predictors were trained with and without tuning. The hyperparameter space for the different LCS and ML models is presented in Table S2.

Recursive Feature Elimination (RFE)
Simpler ML models with significant and uncorrelated features can run more efficiently, using less computational space and less execution time. Therefore, we defined three sets of predictors for each of the LCSs for the ML model training: (i) all predictors, (ii) uncorrelated predictors, and (iii) significant predictors. To arrive at the uncorrelated predictor sets for each LCS, we calculated the pairwise Pearson correlation coefficient for all predictors from the complete list and dropped one predictor of a pair if the coefficient was greater than the defined threshold of 0.9. In the next step, we used the RFE implementation in the scikit-learn library to further narrow down the predictor list for each of the LCSs. We selected the optimal number of predictors (significant predictors) by comparing the R² value of linear regression models trained on different numbers of predictors selected using RFE. The significant predictors for each LCS were finalised at the point where adding further predictors to the linear regression yielded no significant improvement in R². The list of predictors for all three types of predictor sets for each of the LCSs is provided in Table S3.
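The two-step predictor selection can be sketched as below, assuming scikit-learn's RFE with a linear regression estimator. Several details are assumptions made for illustration: the absolute value of the correlation is compared with the 0.9 threshold, predictors are standardised before RFE so that coefficient magnitudes are comparable, R² is computed on the training data, and an R² gain below `min_gain=0.01` is treated as insignificant. The helper names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def drop_correlated(X, names, threshold=0.9):
    """Keep a predictor only if |r| with every already-kept predictor is <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]


def select_significant(X, y, names, min_gain=0.01):
    """Smallest RFE subset beyond which one more predictor adds < min_gain to R2."""
    Xs = StandardScaler().fit_transform(X)
    n = Xs.shape[1]
    r2, masks = {}, {}
    for k in range(1, n + 1):
        rfe = RFE(LinearRegression(), n_features_to_select=k).fit(Xs, y)
        masks[k] = rfe.support_
        subset = Xs[:, rfe.support_]
        r2[k] = LinearRegression().fit(subset, y).score(subset, y)
    best_k = next((k for k in range(1, n) if r2[k + 1] - r2[k] < min_gain), n)
    return [name for name, used in zip(names, masks[best_k]) if used]


# Synthetic, illustrative predictors: pm10 is almost a rescaled copy of pm25.
rng = np.random.default_rng(0)
n = 500
pm25 = rng.uniform(5, 100, n)
pm10 = 1.5 * pm25 + rng.normal(0, 1, n)
rh = rng.uniform(30, 95, n)
temp = rng.uniform(15, 35, n)
y = 0.5 * pm25 + 0.1 * rh + rng.normal(0, 2, n)

names = ["pm25", "pm10", "rh", "temp"]
X = np.column_stack([pm25, pm10, rh, temp])

X_unc, names_unc = drop_correlated(X, names)
significant = select_significant(X_unc, y, names_unc)
print("uncorrelated predictors:", names_unc)
print("significant predictors:", significant)
```

Which member of a correlated pair is dropped depends on the column order in this sketch; the paper does not state the rule that was used.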

Cross-validation and Performance Metrics
To understand a model's performance on unseen data, we performed a hold-out validation exercise for all models trained, wherein 75% of the data were used for model training and the remaining 25% were used for testing. In addition, the test data were not used during the model hyperparameter tuning. The accuracy of model-corrected PM2.5 was quantified based on the (i) coefficient of determination (R²), (ii) root mean square error (RMSE), (iii) normalised root mean square error (NRMSE), and (iv) mean absolute error (MAE).
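The metric equations are not reproduced in this text. A conventional set of definitions, assumed here, is given below. In particular, normalising RMSE by the mean of the true values is an assumption, although it agrees with the reported ranges (an RMSE of 8–29 µg m⁻³ against a mean BAM PM2.5 of ~32 µg m⁻³ gives roughly the reported NRMSE range of 0.26–0.89). R² may alternatively have been computed as the squared correlation of a linear fit.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2}{\sum_{i=1}^{n} \left( Y_i - \bar{Y} \right)^2}

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2}

\mathrm{NRMSE} = \frac{\mathrm{RMSE}}{\bar{Y}}

\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right|
```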
where Yi, Ȳ, and Ŷi represent the true value, the mean of the true values, and the estimated value, respectively, and n is the number of paired data points. An increase in R² indicates improvement in the performance, whereas a decrease in RMSE, NRMSE, and MAE indicates improvement in the performance.
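For illustration, the four metrics can be computed as in the sketch below. The function name and the example values are hypothetical, R² is taken as one minus the ratio of residual to total sum of squares, and NRMSE is assumed to be RMSE divided by the mean of the true values.

```python
import math


def performance_metrics(y_true, y_pred):
    """Return R2, RMSE, NRMSE (RMSE / mean of true values), and MAE."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    rmse = math.sqrt(ss_res / n)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        "nrmse": rmse / mean_true,  # assumption: normalised by the mean
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Hypothetical reference and corrected values (µg m⁻³), for illustration only.
reference = [10.0, 20.0, 30.0, 40.0]
corrected = [12.0, 18.0, 33.0, 37.0]
metrics = performance_metrics(reference, corrected)
print(metrics)
```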

RESULTS
The PM and meteorological data from the LCSs and BC from AE33 were averaged to 1-h intervals to match the temporal resolution of hourly BAM datasets. PurpleAir PM2.5 data were quality checked based on the difference between their values from the dual Plantower sensors, following Barkjohn et al. (2020). Fill values from all devices were removed from the analysis. The hourly PM2.5 data availability chart is shown in Fig. S2. Prkruti PM2.5 had the highest amount of data unavailability (due to instrument malfunction and IoT issues), and Atmos II was installed only in February 2022. Only records having data for all predictor variables and the response variable were considered for the model training.
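The preprocessing steps above can be sketched as follows. The fill value of -999, the labelling of each hour by its start time, and the helper names are assumptions made for illustration; the devices' actual fill values and the study's averaging convention are not stated here.

```python
from collections import defaultdict
from datetime import datetime

FILL_VALUE = -999.0  # hypothetical placeholder for missing readings


def hourly_means(records, fill_value=FILL_VALUE):
    """Average (timestamp, value) pairs into hourly bins, skipping fill values."""
    buckets = defaultdict(list)
    for ts, value in records:
        if value == fill_value:
            continue
        hour_start = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}


def complete_hours(*hourly_series):
    """Hours for which every series (predictors and response) has a value."""
    common = set(hourly_series[0])
    for series in hourly_series[1:]:
        common &= set(series)
    return sorted(common)


# Illustrative sub-hourly LCS readings and hourly reference values.
lcs_raw = [
    (datetime(2022, 1, 1, 10, 0, 30), 10.0),
    (datetime(2022, 1, 1, 10, 30, 0), 20.0),
    (datetime(2022, 1, 1, 11, 5, 0), FILL_VALUE),
    (datetime(2022, 1, 1, 11, 10, 0), 30.0),
    (datetime(2022, 1, 1, 12, 0, 0), 40.0),
]
bam_hourly = {datetime(2022, 1, 1, 10): 14.0, datetime(2022, 1, 1, 11): 28.0}

lcs_hourly = hourly_means(lcs_raw)
paired_hours = complete_hours(lcs_hourly, bam_hourly)
print(lcs_hourly)
print(paired_hours)
```

Only the hours present in every series are passed on, mirroring the requirement that each training record has all predictors and the response.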

Bias in LCS PM2.5
The average value of the BAM PM2.5 for the study period was ~32 µg m⁻³. Detailed statistics on the hourly concentrations of PM2.5 from BAM and LCSs are given in Table S4. Scatter plots between hourly uncorrected LCS and BAM PM2.5 revealed that the bias of the LCS PM2.5 differed across sensors (Fig. 1). All LCS PM2.5 values maintained a linear relationship with BAM PM2.5. Across the LCSs, the R² values of the linear fit varied between 0.63 and 0.89. The bias of the LCS PM2.5 (in terms of RMSE) varied between 8 µg m⁻³ and 29 µg m⁻³. The NRMSE of the LCS PM2.5 ranged between 0.26 and 0.89. The observed bias could be because of the differences in the geometry of the optics chamber and the wavelengths used in the laser PM sensors (e.g., Hapidin et al., 2019). The highest bias was observed in PAQS PM2.5, while the lowest was in Atmos I PM2.5. PAQS substantially underestimated the reference PM2.5. Comparatively, Plantower-based LCSs (Aerogram, Atmos I, and PurpleAir) performed better. Atmos II and BlueSky use the same laser PM sensor (Sensirion), and their RMSE values ranged between 11 µg m⁻³ and 15 µg m⁻³. The performance metrics of the uncorrected LCS PM2.5 are listed in Table 1. Further, the performance of all LCSs was evaluated on a seasonal scale. The calendar year was divided into four seasons, namely, winter (JF), pre-monsoon (MAM), monsoon (JJAS), and post-monsoon (OND). In the current study, OND comprised only the December 2021 data. The PM2.5 values were lowest during the monsoon season, followed by pre-monsoon, winter, and post-monsoon. Scatter plots between uncorrected LCS PM2.5 and BAM PM2.5 (data obtained during different seasons are shown in different colours) are shown in Fig. S3, and the corresponding performance metrics are given in Tables S5 and S6. Box plots of season-wise performance metrics of uncorrected LCS PM2.5 are given in Fig. 2.
Relative to other seasons, the performance (in terms of NRMSE) of the LCSs during monsoon season was poor. No seasonality in the performance was observed during other seasons. The observed bias in the LCS PM2.5 was consistent with that in previous laboratory and field evaluations (Badura et al., 2018; Feenstra et al., 2019; Kim et al., 2019; Levy Zamora et al., 2019). Based on a multi-season field evaluation of PurpleAir sensors, Magi et al. (2020) reported an RMSE (MAE) of ~7.5 µg m⁻³ (5.8 µg m⁻³) for the uncorrected PM2.5. Feenstra et al. (2019) presented the field evaluations of 12 commercially available LCSs under ambient conditions as a part of the AQ-SPEC sensor evaluation program spanning over a 3-year period. Their performance evaluation revealed that 6 of 12 sensors performed with an average R² > 0.70 and MAE ranging between 4.4 µg m⁻³ and 7.0 µg m⁻³ (for PM2.5 concentration range < 50 µg m⁻³).
We also compared the temperature and RH measurements by the LCSs with the BAM ambient meteorological measurements. Compared with the BAM measurements, most of the LCSs overestimated the temperature and underestimated the RH. Temperature (RH) measurements by PAQS (Prana Air) were more accurate than those by other LCSs. The RMSE values of LCS temperature measurements ranged between 3°C and 7°C, whereas those of LCS RH measurements ranged between 7% and 30% (see Figs. S4 and S5). These errors could be due to the placement of the meteorological sensor with respect to the LCS electronics. In most of the LCSs, the electronics were compactly packed, and the heat emitted by these electronics could impact the temperature and RH measurements.
Fig. 4 depicts the performance of the tuned ML models trained using 'all predictors', 'uncorrelated predictors', and 'significant predictors'. The metrics are given in Tables S9 and S10. The performance metrics provided in the tables were derived based on the testing dataset. The performance of the models trained using 'all predictors', 'uncorrelated predictors', and 'significant predictors' in correcting the LCS PM2.5 was comparable. In terms of R², the degradation in the performance of models trained using 'significant predictors' was around 10%, compared with the performance of models trained using 'all predictors'. In terms of NRMSE, the degradation in the performance of the ML models trained using 'significant predictors' was < 25% compared with that of the models trained using 'all predictors'. The highest increase (from 'all predictors'-trained models) in the NRMSE values for 'significant predictors'-trained models was observed for RF-corrected Aerogram PM2.5. The performance of ML models trained using 'uncorrelated predictors' was intermediate between that of 'all predictors'- and 'significant predictors'-trained ML models.
Fig. 4. Comparison of R², NRMSE, RMSE, and MAE of the corrected PM2.5 across the tuned 'all predictors' models, tuned 'uncorrelated predictors' models, and tuned 'significant predictors' models.
For each of the LCSs, the best performing models are listed in Table 2. The best performing model was chosen based on NRMSE values of the corrected LCS PM2.5. If two models were characterised by the same NRMSE values, R² was used as the criterion. Out of the ten LCS datasets (nine LCSs, with PurpleAir contributing two channels), SVR emerged as the best performing model for nine and XGBoost as the best performing for one. Of note, the differences in the performances of the XGBoost and SVR models were marginal. The best performing models were 'all predictors'-trained models for four LCSs, 'uncorrelated predictors'-trained models for three LCSs, and 'significant predictors'-trained models for the other three LCSs. The scatter plots between the predicted hourly PM2.5 by the LCS-wise best performing ML models and the corresponding hourly BAM PM2.5 are shown in Fig. 5. As shown in Fig. S6, the quantile-quantile plots revealed that the residuals were normally distributed. The NRMSE of the LCS PM2.5 corrected by the best performing models improved by 37%–81% compared with that of the uncorrected PM2.5. The highest improvement was observed for PurpleAir_CF1 and the least for Aerogram. PurpleAir_CF1 PM2.5 also showed the highest improvement in terms of RMSE (~77%, ~19 µg m⁻³ reduction in RMSE).
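The selection rule (lowest NRMSE, with R² as the tie-breaker) and the percentage improvement can be sketched as below. The candidate metrics are hypothetical numbers for illustration, not values from the study's tables, and comparing NRMSE after rounding to two decimals is an assumption about what "the same NRMSE" means.

```python
def pick_best(candidates, ndigits=2):
    """Lowest NRMSE wins; among equal (rounded) NRMSE values, highest R2 wins."""
    return min(candidates, key=lambda c: (round(c["nrmse"], ndigits), -c["r2"]))


def percent_improvement(before, after):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (before - after) / before


# Hypothetical candidates for one LCS (illustrative values only).
candidates = [
    {"model": "SVR", "predictors": "significant", "nrmse": 0.18, "r2": 0.91},
    {"model": "XGBoost", "predictors": "all", "nrmse": 0.18, "r2": 0.90},
    {"model": "RF", "predictors": "all", "nrmse": 0.21, "r2": 0.88},
]

best = pick_best(candidates)
# Hypothetical RMSE before and after correction, in µg m⁻³.
rmse_gain = percent_improvement(25.0, 5.75)
print(best["model"], best["predictors"], f"{rmse_gain:.0f}% RMSE reduction")
```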

Performance of Linear Models versus ML Models
To investigate the improvement in the performance of the ML models over statistical models, we trained multi-linear regression (MLR) models using 'all predictors' for all LCSs and compared their performances against the best performing ML models (see Fig. 6 and Table S11). The MLR models also improved the performance of LCS PM2.5. However, except for Aerogram, we observed that ML models performed better in correcting the LCS PM2.5 than the statistical models. For example, when PurpleAir_CF1 PM2.5 was corrected using the MLR model, its RMSE improved by 74%, whereas it improved by around 77% when corrected using the tuned SVR model trained using 'significant predictors'. For Aerogram, the performance metrics of the MLR model and SVR model were nearly identical. Compared with MLR, the highest improvement in the ML model performance was observed for Prana Air (NRMSE improved by ~42%), followed by Airveda and Prkruti. ML models can capture more complex nonlinear effects that simple statistical models cannot. Earlier studies (Liu et al., 2019; Considine et al., 2021; Liang, 2021; Gupta et al., 2022; deSouza et al., 2022) have also demonstrated that ML models could perform better than statistical models in correcting the LCS PM2.5.
Fig. 6. Comparison of R², NRMSE, RMSE, and MAE of the corrected PM2.5 across the statistical model (a multi-linear regression model) and LCS-wise best performing ML models.

Case Study
As a case study, we included the collocated BC data as an additional predictor in the best performing ML models and investigated whether the model performance improved. As all LCSs quantify PM based on light scattering, the inclusion of information on the light-absorbing PM (BC) could impact the model performance. A marginal improvement was observed in the BC-added ML models (Fig. 7). For example, the RMSE of corrected LCS PM2.5 using the best performing ML models ranged between 4.2 µg m⁻³ and 7.7 µg m⁻³, while it was between 3.5 µg m⁻³ and 7.5 µg m⁻³ for the models in which BC was also included as an additional predictor (see Table S12 for more details). With the addition of BC, the highest improvement in the model performance was observed for PAQS (22% in NRMSE), followed by BlueSky and Prana Air. No improvement was observed for Aerogram and PurpleAir.
Fig. 7. Comparison of R², NRMSE, RMSE, and MAE of the corrected PM2.5 across the LCS-wise best performing models and BC-included best performing ML models.

CONCLUSIONS
We collocated nine different PM LCSs with a BAM in Bengaluru and observed a range of bias in the LCS-measured PM2.5. ML models performed considerably better in improving the LCS PM2.5 accuracies than statistical models. We also observed that the performance of ML models (in correcting the LCS PM2.5) trained using RFE-shortlisted predictors was comparable to that of the 'all predictors' models. In the case of RF models, the scikit-learn default hyperparameters performed better than the tuned hyperparameters. We therefore recommend exploring the default hyperparameters and conducting a thorough exploratory data analysis to eliminate insignificant predictors from the list. In this study, the effects of hyperparameter tuning and the choice of predictors differed across the LCSs. Further, the model performance improved when variables related to the periodicities in the continuous variables were included. We created new variables related to the hour of the day, weekday/weekend, month, and season. The inclusion of collocated optical absorption-based BC mass concentration measurements in the ML models did not significantly improve the ML models' performance in correcting the LCS PM2.5.
The study has a few limitations. Our study is limited to one geography; the bias in the uncorrected LCS PM2.5 and the performance of calibration models might vary for other geographies. The amount of data available from each of the sensors varied due to the intermittent malfunctioning/IoT issues of a few LCSs. It should be noted that the version of PAQS LCS used in the study was intended for indoor air pollution measurements. For the post-monsoon season, only December data were available. As the LCS technology is continuously evolving, a few of the LCSs used in the study have since been upgraded to new versions in terms of the internal laser PM sensors.