Assessment of Malaysia-wide PM2.5 Forecasts from a Global Model

Airborne particulate matter with an aerodynamic diameter of less than 2.5 µm (PM2.5) is a major air pollutant worldwide. In Malaysia, transboundary ‘haze’ episodes with elevated PM2.5 concentrations linked to fires are common, causing health and economic harms. Forecasting PM2.5 can reduce these impacts by enabling effective PM2.5 management and decision-making. Until now, PM2.5 forecasts from a global mechanistic chemical transport model (CTM) have not been evaluated for Malaysia, where operational PM2.5 forecasting systems for preventive warnings are not yet deployed. Hence, this study aims to evaluate the performance of PM2.5 forecasts produced by a global CTM and to assess their suitability for nationwide use in Malaysia. We used the surface PM2.5 forecasts from the Copernicus Atmosphere Monitoring Service’s (CAMS) global atmospheric composition forecast dataset (CAMS-GACF) and evaluated them against hourly PM2.5 observations recorded throughout Malaysia from 2018 to 2020 via exceedance and accuracy analyses. We found that the performance of cycle 46r1 CAMS-GACF in Malaysia (critical success index (CSI) = 31%, R² = 0.36) was generally weaker than that reported in studies focused on other countries (CSI = 20–54%, R² = 0.32–0.79), across multiple metrics in both analyses. CAMS-GACF did not accurately capture local-scale spatial and diurnal variations in PM2.5. However, it better captured the increased regional PM2.5 pollution during the transboundary ‘haze’ episode of 2019. Based on our findings, we also propose recommendations on integrating CAMS-GACF in early-warning systems in Malaysia and on improving forecasts via bias-correction.


INTRODUCTION
Airborne particulate matter (PM) is a major air pollutant. It has natural (e.g., wind-blown dust, sea-salt) and anthropogenic sources (e.g., combustion, forest fires) (Amil et al., 2016; Amin Jaafar et al., 2018; Ooi et al., 2015; Roberts and Wooster, 2021; Suradi et al., 2021). Globally, PM pollution reduces up to 10 million years of life expectancy every year, affects regional water availability, and contributes to climate change (Lelieveld et al., 2019). Southeast Asia is no exception, with frequent biomass burning and subsequent PM pollution throughout the whole region (Adam et al., 2021). In Malaysia, haze episodes (mostly transboundary) with heightened PM mass concentrations are regular, occurring almost annually in the last decade (Department of Environment Malaysia (DOE), 2022). These episodes have severe health and economic consequences (Amil et al., 2016; Phung et al., 2022; Sahani et al., 2014): the 2013 Southeast Asian Haze alone cost Malaysia MYR 410 million in hospitalisation bills, medical leave, and personal protective equipment (PPE), and up to MYR 1 billion more in lost income opportunities (Manan et al., 2018). PM with an aerodynamic diameter of less than 2.5 µm (PM2.5) is known to dominate the mix during these episodes (Adam et al., 2021; Kusumaningtyas and Aldrian, 2016). The U.S. Environmental Protection Agency (U.S. EPA, 2022) identified PM2.5 as posing "the greatest health risks", higher than those of coarser PM, because it is respirable and can readily enter the bloodstream, affecting the respiratory and cardiovascular systems (Manan et al., 2018). Hence, PM2.5 is not only the major air pollutant in Malaysia but is also very detrimental to Malaysia's health and economy.
One way to reduce the impacts of PM2.5 pollution is through PM2.5 forecasting. PM2.5 early-warning systems have been implemented in many countries and cities worldwide, often using forecasts derived from mechanistic, computer-driven chemical transport models (CTMs) (Casallas et al., 2020; Celis et al., 2022; Cho et al., 2021; Roux et al., 2020; Savage et al., 2013; Varga-Balogh et al., 2020). Forecasting should aid advance preparation for bad air quality and early decision-making to reduce exposure and improve resilience via personal and institutional means, e.g., mandating PPE, instituting advanced quarantine orders, and implementing dynamic abatement measures (e.g., abating traffic, industrial emissions, and fires) (Lyu et al., 2017; Zhou et al., 2010). To these ends, forecasting should, at minimum, be able to predict elevated PM2.5 events such as the Southeast Asian haze episodes. Since PM2.5 can remain airborne long enough to be a regional and long-duration problem like the Southeast Asian hazes (Dahari et al., 2019; Fujii et al., 2016), CTMs for PM2.5 forecasts commonly cover large spatiotemporal scales. For effective PM2.5 management, forecasts should also be made 3 to 5 days in advance so that decisions can be translated into actions (Lyu et al., 2017). Therefore, a global CTM with a medium-range forecast is appropriate for forecasting PM2.5.
However, the few air quality forecasting studies that focus on Malaysia are solely based on statistical and machine learning (ML) models applied over limited scales and resolutions in space and time (Koo et al., 2020; Lim et al., 2008; Wong et al., 2021). In fact, Malaysia currently only provides reactionary warnings based on currently observed pollution levels, limiting the efficacy of pollution response and decisions (Wong et al., 2021). There are currently no operational PM2.5 forecasts for preventive warnings in Malaysia, nor any studies in the literature utilising CTMs, much less evaluating their performance Malaysia-wide. Accordingly, in this study, we aim to evaluate the performance of PM2.5 forecasts from a global mechanistic CTM, and to assess their suitability for use nationally in Malaysia. Specifically, we aim to investigate the difference in forecast performance between: (a) geographical regions, (b) non-haze and haze episodes, (c) forecast horizons, and (d) model versions, and to provide implications and recommendations for using a global CTM for PM2.5 forecasting in Malaysia.

Haze Episodes in Malaysia
Malaysia is located within maritime Southeast Asia, consisting of Peninsular Malaysia and Malaysian Borneo separated by the South China Sea (Fig. 1). PM pollution in Malaysia is thought to be highly dependent on the geographical region and the monsoon seasonality, i.e., the southwest monsoon (SWM) occurring around July, the northeast monsoon (NEM) around January, and the inter-monsoons between them (INM) (Juneng et al., 2009). Haze episodes, informally defined as periods with impaired visibility and elevated PM2.5 concentrations, have occurred almost every year in the last two decades according to DOE (2017a, 2018, 2019, 2020, 2022). Among all haze episodes Malaysia experienced between 2010 and 2019, the episodes in June 2013, September-October 2015, and August-September 2019 were particularly severe, affecting Malaysia nationally (see Supplementary Material 1 (SM1)). These episodes all occurred during the regionally drier SWM, which increases the risk of wildfires, and all are thought to be largely transboundary, sourced from Indonesian forest and peat fires (Reddington et al., 2014; Tacconi, 2016; Zainal et al., 2021). Numerous HYSPLIT backward trajectory analyses also revealed that air parcels usually travelled from Sumatra and Kalimantan to various locations in Malaysia within 1 to 4 days during the SWM hazes (Dahari et al., 2019; Dotse et al., 2016; Kusumaningtyas and Aldrian, 2016; Reddington et al., 2014; Show and Chang, 2016; Zainal et al., 2021). This further highlights the regional nature of extreme PM2.5 pollution and the suitability of using a global forecast over several days to predict haze in Malaysia.

Model Details
We used forecasts produced by the Copernicus Atmosphere Monitoring Service's (CAMS) Integrated Forecasting System (IFS) as the PM2.5 forecasts in our analyses. IFS is a global forecast and assimilation system that was initially developed and used by the European Centre for Medium-Range Weather Forecasts (ECMWF) solely for weather forecasting, but extra modules were developed to also forecast atmospheric composition (henceforth referred to as CAMS-IFS) (ECMWF, 2022a). CAMS-IFS utilises four-dimensional variational data assimilation (4D-Var), which combines meteorological and atmospheric composition observations with past forecasts to produce an initial state closer to reality (the analysis), improving the next sets of forecasts (Bannister, 2007; Benedetti et al., 2009). CAMS-IFS may be suitable for forecasting PM2.5 in Malaysia because it has a global extent, and its time-horizon is relevant for timely decision-making and for forecasting the SWM hazes.
The major module of interest within CAMS-IFS is IFS-AER. It models the aerosol components, including chemical transformations, transport, and deposition (Rémy et al., 2019). The aerosol emissions are obtained from a combination of natural and anthropogenic emission inventories derived from pre-established inventories (Granier et al., 2019). For example, the monthly anthropogenic emission inventory used (CAMS-GLOB-ANT) was derived by extrapolating EDGAR emissions using trends from CEDS, with approximately 10 km resolution (Crippa et al., 2018; Hoesly et al., 2018). Of particular relevance to fire-sourced haze, CAMS-IFS-AER also uses daily biomass burning emissions estimated by the Global Fire Assimilation System (GFAS) from real-time remotely sensed fires (ECMWF, 2022a, 2022b). The transport, chemical transformation, and deposition of emitted aerosols are then modelled according to their size and chemical characteristics (Rémy et al., 2019). The resulting PM2.5 concentrations are calculated from the simulated concentrations of different aerosols and their sizes at each time step.
Near-term forecasts provided by the previous run are first constrained via satellite aerosol optical depth (AOD) observations over a 12-hour assimilation window. CAMS-IFS-AER then provides 120 hours (5 days) of surface PM2.5 concentration forecasts with approximately 40 km spatial resolution every 12 hours at 08:00 and 20:00 MYT, made available through CAMS's global atmospheric composition forecast dataset (CAMS-GACF) (https://ads.atmosphere.copernicus.eu/cdsapp#!/dataset/cams-global-atmospheric-composition-forecasts?tab=overview). Only the 20:00 MYT forecasts were used in this study because they provide forecasts closest to the next day in Malaysia. CAMS-IFS underwent a major upgrade to cycle 46r1 in 2019, which included increasing the vertical resolution, coupling CAMS-IFS-AER with chemistry modules to model nitrate and ammonium aerosols, and adding diurnal cycles to emissions (ECMWF, 2019; Rémy et al., 2019), drastically affecting PM2.5 forecasts (Basart et al., 2019). The major 2019 transboundary haze episode also occurred during the operation of 46r1 CAMS-IFS. Therefore, this study focuses on forecasts produced by the 46r1 model, but other model versions within the study period were also assessed. More details on the model, its configurations and upgrades, and the CAMS-GACF are provided by Rémy et al. (2019) and ECMWF (2022a).

Datasets
(a) Ground observations. Malaysia has had an air quality monitoring network since 1995, but PM2.5 was only included as an air quality monitoring parameter in 2017, and as a subindex in the air pollutant index (API) in 2018 (DOE, 2017b, 2018). Hence, long-term consistent PM2.5 records are limited. As of 2022, there are 65 continuous air quality monitoring stations (CAQMS) in operation throughout Malaysia. All CAQMS sample PM2.5 using TEOM™ 1405-DF Continuous Dichotomous Ambient Air Monitors (https://www.thermofisher.com/order/catalog/product/TEOM1405DF?SID=srch-srp-TEOM1405DF), with the data undergoing reasonable quality control and assurance by the operating company before being published. The PM2.5 concentration data are used in the national air quality index system and are also commonly used for air quality research in Malaysia (e.g., Ahmad Mohtar et al., 2022; Sobri et al., 2021).
We obtained hourly PM2.5 concentrations for all CAQMS from 1 January 2018 to 31 August 2020, provided by DOE. The 65 CAQMS were grouped by the DOE into five geographical regions, i.e., North, Central, South, East, and Borneo (Fig. 1). In this study, North, South, and East were combined into one region named 'Peninsular'. The Central region was kept as a distinct region because it contains the largest urban area in Malaysia, the Greater Kuala Lumpur region. We thus proceed with three defined regions: 'Peninsular', 'Central', and 'Borneo'.
(b) Model forecasts. The 20:00 MYT hourly surface PM2.5 forecasts were obtained from CAMS-GACF. The forecasts were then bilinearly interpolated to the latitude-longitude coordinates of each CAQMS. During the two-and-a-half-year study period, CAMS-IFS was upgraded twice, i.e., on 26 June 2018 and 9 July 2019 (ECMWF, 2022a), resulting in forecasts produced by three model cycles: 43r3, 45r1, and 46r1. As mentioned before, this study focuses on the latest model version, 46r1, but 43r3 and 45r1 are also evaluated and compared.
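The bilinear interpolation step can be illustrated with a minimal sketch. It is not the processing code used in the study: the function name `bilinear_interpolate` and the toy grid are hypothetical, and the sketch assumes a regular grid with ascending latitude and longitude coordinates (a real CAMS-GACF file may store latitudes in descending order and would need reordering first).

```python
import numpy as np

def bilinear_interpolate(lats, lons, field, lat, lon):
    """Bilinearly interpolate a 2-D gridded field to a single point.

    lats, lons: 1-D ascending grid coordinates (assumed regular grid).
    field: 2-D array of shape (len(lats), len(lons)).
    """
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float)
    field = np.asarray(field, dtype=float)
    # Index of the lower-left corner of the grid cell enclosing the point.
    i = int(np.clip(np.searchsorted(lats, lat) - 1, 0, len(lats) - 2))
    j = int(np.clip(np.searchsorted(lons, lon) - 1, 0, len(lons) - 2))
    # Fractional position of the point inside that cell (0 to 1).
    ty = (lat - lats[i]) / (lats[i + 1] - lats[i])
    tx = (lon - lons[j]) / (lons[j + 1] - lons[j])
    # Interpolate along longitude on both latitude rows, then along latitude.
    south = (1 - tx) * field[i, j] + tx * field[i, j + 1]
    north = (1 - tx) * field[i + 1, j] + tx * field[i + 1, j + 1]
    return float((1 - ty) * south + ty * north)

# Toy 2 x 2 grid (hypothetical values, µg m⁻³) with a station at the cell centre.
grid_lats = [2.0, 2.4]
grid_lons = [101.0, 101.4]
pm25 = [[10.0, 20.0], [30.0, 40.0]]
station_value = bilinear_interpolate(grid_lats, grid_lons, pm25, 2.2, 101.2)
```

At the cell centre the result is simply the mean of the four surrounding grid values, which illustrates how a roughly 40 km grid smooths out sub-grid variability at a station.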

Forecast Evaluation
The ground-observed and model-forecasted hourly PM2.5 concentrations were first averaged over each time-horizon day, i.e., the 24 hours from 21:00 MYT to 20:00 MYT the next day. The first to fifth time-horizon days are referred to as F1-F5, covering the 120 hours of the forecasts. Although we are particularly interested in the larger timescales of transboundary hazes, diurnal variations are also important when assessing CAMS-GACF (e.g., Wu et al., 2020). Thus, the diurnal accuracies of the forecasts are also evaluated separately, as described in Section 2.4.2.
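As an illustration of this averaging, the following sketch splits one 120-hour forecast into the five time-horizon days. The function name `horizon_day_means` is hypothetical, and two assumptions are made that the text does not state: the first hourly value is the 21:00 MYT forecast of a 20:00 MYT run, and missing hours are skipped rather than invalidating the day.

```python
import numpy as np

def horizon_day_means(hourly, hours_per_day=24):
    """Average an hourly series into time-horizon days (F1, F2, ...).

    hourly: 1-D sequence whose first value is assumed to be the 21:00 MYT
    forecast of a 20:00 MYT run; its length must be a multiple of 24.
    """
    values = np.asarray(hourly, dtype=float)
    if values.size % hours_per_day != 0:
        raise ValueError("series length must be a whole number of days")
    # Each row is one 21:00-to-20:00 window; nanmean skips missing hours
    # (an assumption, the handling of gaps is not described in the text).
    return np.nanmean(values.reshape(-1, hours_per_day), axis=1)

# Hypothetical 120-hour forecast: constant within each horizon day.
forecast_hourly = np.repeat([10.0, 20.0, 30.0, 40.0, 50.0], 24)
f1_to_f5 = horizon_day_means(forecast_hourly)
```

The same function can be applied to the hourly observations of the matching 21:00 to 20:00 MYT windows so that forecasts and observations are averaged identically.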
The forecasts were then evaluated against the observations (assumed to be true benchmarks) via exceedance and accuracy analyses, which are described in the following sections. Our evaluation considers all five time-horizon days, except when explicitly evaluating differences across time-horizons (see Section 2.4.3).

Exceedance analysis
Exceedance analysis dichotomously classifies the PM2.5 status into 'normal' and 'bad' PM2.5 air quality and assesses the forecasts' ability to predict them. This dichotomous classification is typically intrinsic to decision-making and early warnings. The analysis would evaluate whether the model can produce functional forecasts that may be useful in PM2.5 management.
The threshold concentration levels used to classify 'bad' and 'normal' PM2.5 levels can be rather arbitrary and are defined more by policy (Doswell, 2004). Past health studies classify 'bad' PM levels as concentrations above pre-defined standards or guidelines (Phung et al., 2022; Sahani et al., 2014). In this study, we defined the thresholds according to Malaysia's National Ambient Air Quality Standards (MAAQS). The MAAQS comprise three successively stricter standards: 75 (IT-1, 2015), 50 (IT-2, 2018), and 35 µg m⁻³ (IT-3, 2020) (DOE, 2014). Days with averaged PM2.5 concentrations (rounded to the nearest µg m⁻³) above these thresholds are considered exceedances. Using these thresholds, the forecasts' performance was evaluated using three metrics: probability of detection (POD), false alarm ratio (FAR), and critical success index (CSI) (see SM2). They reveal whether users can be confident in the forecasts' ability to predict bad PM2.5 days.
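The exact metric definitions are given in SM2, which is not reproduced here. The sketch below therefore assumes the standard contingency-table forms (POD = hits / (hits + misses), FAR = false alarms / (hits + false alarms), CSI = hits / (hits + misses + false alarms)); the function name and sample values are hypothetical.

```python
import numpy as np

# MAAQS thresholds from the text, in µg m⁻³.
MAAQS_THRESHOLDS = {"IT-1": 75, "IT-2": 50, "IT-3": 35}

def exceedance_scores(observed, forecast, threshold):
    """POD, FAR and CSI for daily means, using standard definitions."""
    # Round to the nearest µg m⁻³ before comparing with the threshold.
    # np.rint rounds exact halves to the nearest even integer; the
    # rounding convention used in the study is not stated.
    obs_exc = np.rint(np.asarray(observed, dtype=float)) > threshold
    fc_exc = np.rint(np.asarray(forecast, dtype=float)) > threshold
    hits = int(np.sum(obs_exc & fc_exc))
    misses = int(np.sum(obs_exc & ~fc_exc))
    false_alarms = int(np.sum(~obs_exc & fc_exc))

    def ratio(num, den):
        # Undefined scores (no events) are returned as NaN.
        return num / den if den else float("nan")

    return {
        "POD": ratio(hits, hits + misses),
        "FAR": ratio(false_alarms, hits + false_alarms),
        "CSI": ratio(hits, hits + misses + false_alarms),
    }

# Hypothetical daily means (µg m⁻³) for five station-days.
obs_daily = [40.0, 36.0, 20.0, 50.0, 10.0]
fc_daily = [38.0, 30.0, 37.0, 60.0, 12.0]
scores = exceedance_scores(obs_daily, fc_daily, MAAQS_THRESHOLDS["IT-3"])
```

In this toy case there are two hits, one miss, and one false alarm, so POD is 2/3, FAR is 1/3, and CSI is 1/2. Higher POD and CSI and lower FAR indicate better performance.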

Accuracy analysis
Accuracy analysis uses PM2.5 concentration values directly to compute and aggregate measures of accuracy using different metrics. While this analysis does not directly evaluate the forecasts' use in the policy domain, it links the forecasts' performance and improvement to specific areas in the forecasts. Since PM2.5 pollution is spatiotemporally heterogeneous in Malaysia, we need to fully evaluate the forecasts in space and time. This was done via four characterisation methods (adapted and altered from Meroni et al., 2013): (1) Total characterisation. Accuracies were aggregated across all CAQMS and time: overall forecast performance is characterised by a single number; (2) Spatial characterisation. Accuracies were aggregated across time: temporal performances are characterised for each CAQMS, represented by a map of Thiessen polygons; (3) Temporal characterisation. Accuracies were aggregated across CAQMS: spatial performances are characterised for each day, represented by a timeline; and (4) Diurnal characterisation. Since diurnal components were removed via the daily averaging, diurnal variations in the raw hourly observed and forecasted PM2.5 were averaged across CAQMS and time.

[Figure: Space-Time-Horizon Data Cube]

The accuracies were aggregated using five accuracy metrics: mean bias (MB), modified normalised mean bias (MNMB), root mean square error (RMSE), fractional gross error (FGE), and coefficient of determination (R²) (see SM3). They are all used by the CAMS validation server in Europe (CAMS, 2022), while some are also commonly used to report model performance in the literature (e.g., Savage et al. (2013); see SM7). These metrics allow a comprehensive comparison with different regions and models in other studies. While MB and RMSE are intuitive because they are expressed in real units (µg m⁻³), the normalised metrics, i.e., MNMB, FGE, and R², are better for comparisons between CAQMS and periods with different observed PM2.5 concentrations. Therefore, we only used the normalised metrics for spatial and temporal characterisations. Since diurnal variations are important when evaluating CAMS-GACF, we also assessed diurnal characteristics by simple visual inspection of pattern differences between observations and forecasts, rather than through accuracy metrics.
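The formal definitions of the five metrics are in SM3, which is not reproduced here. The sketch below assumes the forms commonly used in CAMS validation (MNMB and FGE normalise each error by the sum of forecast and observation and scale by 2, so MNMB is bounded by ±2 and FGE by 0 and 2) and takes R² as the squared Pearson correlation; both choices are assumptions, and the function name is hypothetical.

```python
import numpy as np

def accuracy_metrics(observed, forecast):
    """MB, MNMB, RMSE, FGE and R² for paired observations and forecasts."""
    o = np.asarray(observed, dtype=float)
    f = np.asarray(forecast, dtype=float)
    diff = f - o  # positive values mean overprediction
    return {
        "MB": float(np.mean(diff)),                           # µg m⁻³
        "MNMB": float(2.0 * np.mean(diff / (f + o))),         # unitless
        "RMSE": float(np.sqrt(np.mean(diff ** 2))),           # µg m⁻³
        "FGE": float(2.0 * np.mean(np.abs(diff) / (f + o))),  # unitless
        "R2": float(np.corrcoef(o, f)[0, 1] ** 2),            # assumed form
    }

# The same 5 µg m⁻³ underprediction at a cleaner and a more polluted site
# (hypothetical values).
clean_site = accuracy_metrics([10.0, 12.0], [5.0, 7.0])
polluted_site = accuracy_metrics([40.0, 42.0], [35.0, 37.0])
```

Both toy sites have identical MB and RMSE, but the cleaner site has the larger FGE, which shows why proportional metrics tend to be higher where concentrations are low.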

Comparing variables
The forecasts' exceedance and accuracy performances were also evaluated against different comparing variables. Differences between our three geographical regions were assessed as a primary comparing variable, in addition to three secondary comparing variables (within which differences between regions were also assessed): non-haze and haze episodes, time-horizon days (F1-F5), and model versions (43r3, 45r1, and 46r1).

Overall Performance
The results of the exceedance analysis are shown in Table 1. Overall, the 46r1 CAMS-GACF performed better with lower exceedance thresholds. The best POD, FAR, and CSI were obtained for the most stringent threshold (IT-3) at 46%, 52%, and 31%, respectively. POD and CSI increased while FAR decreased (i.e., all improved) with more stringent PM2.5 thresholds for all regions except Central, where FAR was highest and CSI was lowest (i.e., both worst) when using IT-2. The poorer performance in Central using IT-2 can be attributed to the overall overprediction there: forecasted levels exceeded the IT-2 threshold while the corresponding observed levels did not, causing a higher (poorer) FAR.
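The three reported IT-3 scores can be checked for internal consistency. Assuming the standard contingency-table definitions (the paper's own are in SM2), and assuming all three scores come from the same pooled table, the identity 1/CSI = 1/POD + 1/(1 − FAR) − 1 holds. The helper name below is hypothetical.

```python
def csi_from_pod_far(pod, far):
    """CSI implied by POD and FAR under standard definitions.

    With h hits, m misses and a false alarms:
    1/POD = (h + m)/h and 1/(1 - FAR) = (h + a)/h,
    so their sum minus 1 equals (h + m + a)/h = 1/CSI.
    """
    return 1.0 / (1.0 / pod + 1.0 / (1.0 - far) - 1.0)

# Reported Malaysia-wide IT-3 values: POD = 46 %, FAR = 52 %.
implied_csi = csi_from_pod_far(0.46, 0.52)
```

The implied value is approximately 0.31, matching the reported CSI of 31%.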
The results of the accuracy analysis total characterisation are shown in Table 2. During the study period, the overall mean observed PM2.5 (o̅) was 17.6 µg m⁻³, while the overall mean forecasted PM2.5 (f̅) was lower at 14.3 µg m⁻³. The 46r1 CAMS-GACF had an overall negative bias in Malaysia (underprediction; MB = -3.3 µg m⁻³, MNMB = -0.38). Regionally, Peninsular and Borneo had a negative bias (MB, MNMB) while Central had a positive bias. While RMSE was highest at Central followed by Borneo, FGE was highest at Borneo followed by Peninsular. However, FGE, which measures proportional errors, is naturally higher in regions with lower PM2.5 concentrations despite similar or lower RMSE, which measures additive errors. The overall R² was 0.36, i.e., 36% of the spatiotemporal variations in observed PM2.5 can be explained by the model. Regionally, R² exceeded 0.4 at Peninsular and Central but was only 0.25 at Borneo. These regional differences were also evident in the forecasts' spatial characterisation (Fig. 3). Forecasted PM2.5 was evidently higher than observed around the Central region in Figs. 3(a) and 3(b), but lower elsewhere. Higher FGE was found at eastern Peninsular and central Borneo. The R² at each CAQMS was generally high, at around 0.75, but certain CAQMS at northern and eastern Peninsular and central Borneo had low R²; the temporal variations at these stations are not well represented in the forecasts.
When aggregating model accuracy spatially for each day (temporal characterisation), there were some temporal variations in MNMB Malaysia-wide and in the three regions (Fig. 4). FGE remained relatively constant except in Central and Borneo, where FGE was slightly higher and lower during INMs, respectively. Malaysia-wide daily spatial R² was consistently around 0.3, similar to the total R², except during the 2019 haze episode and the March 2020 INM, when R² was lower. Regionally, spatial R² was generally lower and had more temporal variation than Malaysia-wide R². Peninsular generally followed the Malaysia-wide trend in spatial R². Central and Borneo R² were around 0.1 but were higher at the end of the haze and during February-March 2020. We also found anomalously overpredicted forecasts at Central from October to December 2019, causing high MNMB and FGE and low R² during this period. This anomaly is only briefly discussed below (more information in SM5).
Finally, overall diurnal variations in 120 hours of forecasted PM2.5 concentrations mostly fit the observations (Fig. 5). Forecasts were generally lower than observed throughout the whole day for all regions except Central, where night-time forecasts were higher. Regardless, the peaks in observed PM2.5 concentration at 9 am and 8 pm were not present in the forecasts.
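The diurnal characterisation amounts to averaging the hourly series by hour of day across stations and days. The following is a minimal sketch of that aggregation, not the study's code: the function name, the array layout (stations × hours), and the skipping of missing values are assumptions.

```python
import numpy as np

def diurnal_profile(hourly, first_hour=21):
    """Mean concentration for each local hour of day (index 0 to 23).

    hourly: 2-D array (n_stations, n_hours) whose first column is the
    value at local hour `first_hour`; n_hours must be a multiple of 24.
    """
    values = np.asarray(hourly, dtype=float)
    n_stations, n_hours = values.shape
    if n_hours % 24 != 0:
        raise ValueError("series length must be a whole number of days")
    # Average over stations and days, keeping the 24 within-day slots.
    by_slot = np.nanmean(values.reshape(n_stations, -1, 24), axis=(0, 1))
    # Slot k holds local hour (first_hour + k) % 24; rotate so that
    # index h of the result corresponds to local hour h.
    return np.roll(by_slot, first_hour)

# Toy check: three stations, two days, each value equal to its local hour.
local_hours = (21 + np.arange(48)) % 24
toy_series = np.tile(local_hours, (3, 1))
profile = diurnal_profile(toy_series)
```

Applying the same function to observations and to forecasts gives two 24-value curves that can be compared visually, for example to see whether the observed morning and evening peaks appear in the forecasts.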

Non-haze and Haze Episodes
Next, we compared CAMS-GACF performances during non-haze and haze episodes, with the results shown in Table S2. Firstly, the forecasts predicted exceedances better during haze than non-haze periods for all thresholds. POD, FAR, and CSI behaved as found above according to the different thresholds: overall, POD and CSI increased and FAR decreased (i.e., all improved) with more stringent PM2.5 thresholds Malaysia-wide and in the three regions during both non-haze and haze episodes. The forecasts predicted exceedances better using IT-3 during both periods.
CAQMS recorded an o̅ three times higher during haze (42.1 µg m⁻³) than during non-haze periods (13.5 µg m⁻³), while f̅ was lower than o̅ during both periods (29.2 and 11.8 µg m⁻³, respectively). Accuracy patterns during both non-haze and haze episodes were similar to the overall performance: there were negative (positive) biases during both periods in Peninsular and Borneo (Central). However, negative biases were more negative while positive biases were less positive during haze. Although errors (RMSE, FGE) were higher during haze than non-haze episodes in Peninsular and Borneo, R² was higher during haze episodes across Malaysia and in all regions.
In spatial characterisation, the patterns observed during both periods were similar to those of the overall performance above, with positive (negative) bias around the Central region (elsewhere) and FGE being higher at eastern Peninsular and Borneo (Fig. 6). R² was visibly higher at all except a few CAQMS during the haze episode, suggesting that temporal variations were better represented in the forecasts during haze than non-haze periods. Lastly, the overall forecasts' diurnal variation largely followed that of the observations during both haze and non-haze periods (Fig. 5). However, the forecasted rise in PM2.5 concentrations from normal levels during haze was smaller than the observed rise (also found in Table S2). In addition, forecasts at Central overestimated the rise in night-time concentration relative to daytime during haze, while forecasts at Borneo underestimated it.

Time-horizon Days
To assess how forecast accuracy varies with forecast time-horizon, Fig. S2 shows the results of the exceedance analysis (using IT-3) and accuracy analysis, segregated by time-horizon day. In general, the first time-horizon day (F1), i.e., the first 24 hours of the forecasts, performed the best while the last time-horizon day (F5) performed the worst. Looking at the exceedance metrics, POD and CSI increased and FAR decreased (i.e., all improved) with decreasing time-horizons (from F5 to F1). All regions followed the same trend, except Central, where FAR increased and CSI decreased (i.e., both worsened) with decreasing time-horizons instead. Similar patterns were observed for IT-1 and IT-2. From the accuracy analysis, biases increased with decreasing time-horizons: forecasts at Peninsular and Borneo were less underpredicted, while forecasts at Central were more overpredicted. While errors decreased with decreasing time-horizons for most regions, they were lowest at F2 and F3 in Central instead of F1. R² also increased with decreasing time-horizons, but R² at Peninsular and Central was highest at F2.
Comparing the time-horizons temporally (Fig. S3), the forecasted concentration tended to increase with decreasing time-horizons, i.e., forecasts made more recently tended to be higher than those made further back in the past. In fact, forecasted PM2.5 appeared to move closer to the observed with decreasing time-horizons. This F1-F5 gap was more pronounced during the 2019 haze than during non-haze and reflects the forecasts' diurnal variations (Fig. S4). Nevertheless, F1 forecasts in Central appeared distinctly higher than forecasts made at other time-horizons during both haze and non-haze periods, particularly at night when forecasted PM2.5 were much higher than other time-horizons.
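To make the F1-F5 gap concrete, the sketch below compares, for each valid day, the forecast issued most recently (F1) with the one issued four days earlier (F5). The function name, the array layout (one row per daily run, one column per horizon day), and the toy data are hypothetical.

```python
import numpy as np

def f1_f5_gap(daily_forecasts):
    """F1 minus F5 forecast for each valid day covered by both.

    daily_forecasts: 2-D array (n_runs, n_horizons) from consecutive
    daily runs; column h of run r is valid for day r + h (0-based), so
    the shortest and longest horizons for the same valid day come from
    runs that are n_horizons - 1 days apart.
    """
    fc = np.asarray(daily_forecasts, dtype=float)
    lag = fc.shape[1] - 1
    # Row i + lag, column 0 and row i, column lag share valid day i + lag.
    return fc[lag:, 0] - fc[:-lag, lag]

# Toy data: forecasts drop by 2 µg m⁻³ per extra horizon day.
truth = np.arange(20, dtype=float)
runs = np.array([[truth[r + h] - 2.0 * h for h in range(5)] for r in range(10)])
gap = f1_f5_gap(runs)
```

A positive gap means the most recent forecast is higher than the one made four days earlier for the same day, which is the pattern described above.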

Model Versions
Finally, we compared the performances of the three different model versions within our study period, i.e., 43r3, 45r1, and 46r1 (Table S3). Recall that there were no severe haze episodes occurring during cycles 43r3 and 45r1; only non-haze periods were compared. In general, both POD and FAR decreased (i.e., worsened and improved, respectively) with each new model version. To untangle this opposing trend, we used the CSI to determine 'good' or 'bad' model forecasts. Overall, 45r1 performed the best, followed by 46r1. However, the version that produced the best forecasts differed for different regions: 45r1 for Central and Borneo, and 46r1 for Peninsular.
Looking at the accuracy metrics, the 43r3 and 45r1 forecasts overall overpredicted PM2.5 concentrations, while the 46r1 forecasts overall underpredicted them. While RMSE was higher for 43r3 and 45r1, the FGE of 46r1 was higher (again, lower forecasts tend to have higher FGE). The 43r3 forecasts had the highest R², while the R² of 45r1 and 46r1 were similar. However, regionally, the R² of the 45r1 forecasts was higher than that of 46r1 in Borneo, but lower elsewhere. Regionally, biases were positive (overprediction) in all regions for 43r3 and 45r1; only 46r1 produced forecasts with negative biases (underprediction), at Peninsular and Borneo. Errors generally increased in Peninsular and Borneo and decreased in Central with each new version. Spatially, there were obvious changes with the 46r1 upgrade (Fig. S5). Firstly, as noted above, there was a switch from overprediction to underprediction at most CAQMS. Similarly, FGE became higher in areas where it was previously low (eastern Peninsular, Borneo, etc.) and lower in areas where it was previously high (Central), probably due to the lowered PM2.5 forecasts. Lastly, R² was lower at most CAQMS, suggesting that forecasts from the 46r1 model captured less of the temporal variations at most places than past versions.
Finally, the diurnal variations of cycle 46r1 were distinct from those of past versions, with lower night-time forecasted PM2.5 concentrations across all regions (Fig. 7). In fact, the diurnal cycle in 46r1 forecasts fitted observations better than that of past versions. Nevertheless, 46r1 still overpredicted night-time PM2.5 in Central, but to a lesser degree than past versions. The observed morning and evening PM2.5 peaks were not captured in any of the three versions.

Exceedances and Early Warnings
Firstly, we assessed the fitness of CAMS-GACF for use in the policy domain, such as in early-warning systems and the broader scope of PM2.5 management. Malaysia-wide, CAMS-GACF performed best in the exceedance analysis during both haze and non-haze periods when delineating exceedance levels using IT-3 (35 µg m⁻³), the newer MAAQS. This suggests that the introduction of the new MAAQS, in addition to its promising health benefits, would be associated with improved CAMS-GACF performance in predicting exceedances of PM2.5 in Malaysia. However, overall CAMS-GACF performed less well during non-haze periods (8% CSI), with an exceedance performance worse than that found for regional CTMs and statistical forecasting models in other countries (38-54% CSI) (Celis et al., 2022; Cho et al., 2021; Huang et al., 2017) (see SM6). In contrast, CAMS-GACF performed on par with or better than these studies (44% CSI versus 20-54% CSI) during haze episodes, when PM2.5 levels are elevated and forecasts are of most value. The weaker performance of CAMS-GACF during non-haze periods should thus not devalue its potential in providing early warnings of extreme PM2.5 events in Malaysia.

Large- and small-scale variability
CAMS-GACF performed as expected for a global model forecast: it performed better at larger spatiotemporal scales than at smaller ones. We assessed this using the R² metric, which measures the proportion of variability explained by the forecasts. We found higher Malaysia-wide spatial R² (in temporal characterisation) than for the smaller regions of Malaysia at most times. The 40 km and 10 km resolutions of CAMS-IFS and its emission inventories, respectively, hinder the representation of processes with smaller spatial scales, which often affect local-scale variations in PM2.5.
The reduced robustness of CAMS-GACF at local scales is limited to the spatial dimension. R² was not improved by removing the temporal component from the data through temporal characterisation. Rather, Malaysia-wide total R² was lower than the regional ones, and regional total R² were lower than local ones (i.e., at individual CAQMS). Hence, a large proportion of the reduced robustness (or R²) at local scales can be attributed to poor representation of Malaysia's PM2.5 spatial heterogeneity in the forecasts.
Nevertheless, some temporal variations were also not accounted for in the forecasts. MNMB and FGE showed some intra-annual variations, while the diurnal variations and the local-scale processes affecting them (e.g., traffic peaks) were poorly captured by the forecasts. This is consistent with other studies that also used CAMS products (Varga-Balogh et al., 2020; Wu et al., 2020). However, the poor intra-annual temporal representation is limited to CAQMS in eastern Peninsular and central Borneo with high FGE and low temporal R².
Conversely, larger-scale variability was captured by the forecasts. The forecasts performed better in all regions in terms of R² during haze than non-haze periods. This highlights the key strength of CAMS-IFS: emissions from fires are captured through GFAS and their regional transport is modelled well.

Emission sources and diurnal cycle
Our analysis also showed that the accuracy of CAMS-GACF differs between periods when PM2.5 pollution is most influenced by local emission sources and periods when it is most influenced by external ones. For example, during the 2019 transboundary haze event, with external sources dominant, PM2.5 in the Central region was less overpredicted (i.e., improved accuracy) while PM2.5 elsewhere became more underpredicted (i.e., less accurate). R² was also higher during the 2019 haze than non-haze periods. We can make two related deductions: (1) 46r1 CAMS-IFS underestimated the amount of PM2.5 transported away from pollution sources, and would underpredict PM2.5 concentrations during transboundary haze and overpredict when local pollution is dominant; and (2) CAMS-IFS can forecast regional-scale PM2.5 pollution like the 2019 transboundary haze well, but is less adept at forecasting non-haze periods when local factors dominate. This is consistent with our exceedance analysis results, in that CAMS-GACF can detect major transboundary haze but not regular, minor, locally driven exceedances.
CAMS-GACF also overpredicted PM2.5 in Central during both haze and non-haze periods, but underpredicted elsewhere. Wu et al. (2020) also found similar overprediction for more populous and polluted areas in China. CAMS-GACF appeared to overestimate the retention of PM2.5 at pollution sources. The night-time overprediction only in the urban, more polluted Central region points towards inaccurate diurnal modelling of a nocturnal inversion layer (NIL). NIL can inhibit vertical mixing and cause accumulation of pollutants on the surface at night. Figs. S8 and S9 reinforce the hypothesis on NIL modelling, where greater PM2.5 retention was found for areas with higher emissions in Central and Peninsular. While high observed night-time PM2.5 is theoretically possible in the Central region, local-scale processes (e.g., potentially greater rainfall, shorter-lasting or less shallow NIL than modelled) probably caused lower observed night-time PM2.5 (Sani, 1977). Similarly, CAMS-IFS uses extrapolated monthly emission inventories which are likely unrepresentative of current emissions in Malaysia. Diurnal emissions are also likely modelled through a simple function that did not capture local reality. Again, CAMS-IFS and its inventories are not designed to capture these local-scale processes. This finding is consistent with past CAMS-related studies (Marécal et al., 2015;Wu et al., 2020).

Assimilation across time-horizons
The difference in forecasts between time-horizon days can likely be attributed to the 4D-Var assimilation in CAMS-IFS. As the forecast time-horizon decreases (i.e., F5 to F1), the forecasted PM2.5 concentration agrees better with observations (Fig. S3). F1 forecasts performed the best, which is expected of any forecast that employs data assimilation. The F1-F5 gap showed improvements in forecasts due to assimilation: if no gap was observed (i.e., similar forecasts were produced over five days), either the forecasts were accurate or there were insufficient satellite observations to assimilate. The wider F1-F5 gap during haze suggests that haze forecasts benefited most from the assimilation system. Since the gap is present for all regions at most times, it suggests that useful satellite observations, e.g., from MODIS (Benedetti et al., 2009; Rémy et al., 2019), commonly exist in the region. However, periods like the March INMs and regions like Central and Borneo with poor spatial R² might be suffering from poor satellite observations (e.g., due to cloud cover). Wu et al. (2020) also found increasing accuracies from F5 to F1 but with an evident diurnal variation (also seen in this study). Satellite assimilation might not be sufficient to improve diurnal variations of PM2.5 concentrations, particularly at night when aerosol-related observations are poor or unavailable.
Peculiarly, F2 and F3 forecasts were better than F1 in Central despite data assimilation. The higher PM2.5 forecasted for F1 was mainly caused by higher night-time forecasts, pointing again to a problem with diurnal and NIL modelling, satellite AOD assimilation methods (e.g., vertical distributions that amplify modelled night-time concentrations), and/or erroneous emission inventories. These issues might also be the cause of the October-December 2019 anomaly. However, the sudden change in PM2.5 forecasts right on 1 January 2020 is more indicative of a change in emission inventory, though the true cause remains uncertain (see details in SM5).

Past and future performances
Finally in this section, we assess the implications of the model upgrades for the prospect of using CAMS-GACF for PM2.5 forecasts in Malaysia. Among the three model versions we studied, cycle 45r1 performed the best when we considered only 'non-haze' periods. However, a minor transboundary haze not defined in this study as a 'haze episode' occurred during 45r1 operation and might have inflated its performance. Cycle 45r1 may also have been better calibrated (hence, more accurate) after continual upgrades, before the major change introduced with 46r1. Nevertheless, regional R² of 46r1 forecasts in the Peninsular and Central regions were higher than those of 45r1 despite lower R² at each CAQMS, suggesting that spatial variations of PM2.5 in Peninsular and Central were better represented with the newer CAMS-IFS version. The opposite was true for Borneo. Regardless, 46r1 improved PM2.5 forecasts through an improved diurnal cycle. While poor diurnal representation still persists in more urban and/or polluted environments, it was improved in other regions (e.g., from changes leading to better particle-size binning and better emission, chemistry, and deposition diurnal cycles).
Given the past changes to CAMS-GACF, we can expect frequent upgrades to the forecasting system. Since 46r1 is no longer the newest version at the time of writing, some characteristics highlighted in this study might change (see SM5). Nevertheless, we can expect CAMS-GACF to improve in the future.

Recommendations
Overall, we found that CAMS-GACF performance was weaker in Malaysia than is reported in other studies, focused on different countries and/or with different forecasting models, in both exceedance and accuracy analyses. CAMS-GACF cannot be expected to capture local-scale variability, which hinders accurate forecasting of PM2.5 pollution that normally has both regional and local influences. Nevertheless, CAMS-GACF performed on par with or better than forecasts from those studies when we consider haze periods only, can capture regional-scale PM2.5 variability, employs data assimilation to improve forecasts, and provides suitable forecasting time-horizons for timely decision-making. It is therefore worthwhile to provide some recommendations on how best to take advantage of CAMS-GACF despite its shortcomings.
Firstly, we recommend development of a robust early-warning system for use in Malaysia. CAMS-GACF should not be used as the sole information for early warning, with 100% confidence given to the forecasts. Confidence in the forecasts depends on the forecasts' systematic errors (which can be reduced via local bias-correction; see below) and on other auxiliary information (e.g., known hotspot locations, consistent exceedances predicted by most time-horizons). Uncertainty from the forecast background cannot be reduced, only quantified based on human interpretation and/or a good early-warning framework incorporating relevant auxiliary information (Doswell, 2004; Gerapetritis and Pelissier, 2004). Early warnings are then issued when the confidence of exceedance events exceeds a threshold defined based on early-warning goals, the basic units (e.g., CAQMS, districts, states), and the evaluation metrics (e.g., CSI, potential economic loss prevented by warnings). With a robust early-warning system in place, CAMS-GACF can contribute to better early-warning performance than reported here simply as exceedances, even without bias-correction.
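As an illustration of using "consistent exceedances predicted by most time-horizons" as auxiliary information, the sketch below issues a warning only when a minimum number of the F1 to F5 forecasts for the same target day exceed a limit, and then scores the warnings with CSI = hits / (hits + misses + false alarms). The limit, the minimum-horizon rule, the function names and all values are hypothetical choices for illustration; they are not the framework or thresholds used in this study.

```python
from typing import List, Sequence, Tuple


def consensus_warning(horizon_forecasts: Sequence[float], limit: float,
                      min_horizons: int) -> bool:
    """Warn when at least `min_horizons` of the forecasts exceed `limit`."""
    exceeding = sum(1 for value in horizon_forecasts if value > limit)
    return exceeding >= min_horizons


def contingency(warned: Sequence[bool],
                observed: Sequence[bool]) -> Tuple[int, int, int]:
    """Count hits, misses and false alarms from paired boolean flags."""
    hits = misses = false_alarms = 0
    for w, o in zip(warned, observed):
        if w and o:
            hits += 1
        elif o:
            misses += 1
        elif w:
            false_alarms += 1
    return hits, misses, false_alarms


def csi(hits: int, misses: int, false_alarms: int) -> float:
    """Critical success index; NaN when no event was forecast or observed."""
    total = hits + misses + false_alarms
    return hits / total if total else float("nan")


LIMIT = 35.0       # hypothetical daily PM2.5 limit (µg/m³)
MIN_HORIZONS = 3   # hypothetical rule: at least 3 of 5 horizons must agree

# Each row holds hypothetical F1..F5 forecasts for one target day.
forecasts_by_day: List[List[float]] = [
    [50.0, 48.0, 45.0, 40.0, 30.0],
    [36.0, 30.0, 28.0, 25.0, 20.0],
    [20.0, 22.0, 38.0, 41.0, 25.0],
    [60.0, 55.0, 52.0, 47.0, 44.0],
    [37.0, 39.0, 36.0, 30.0, 28.0],
]
observed_pm25 = [55.0, 30.0, 40.0, 70.0, 25.0]

warnings = [consensus_warning(day, LIMIT, MIN_HORIZONS) for day in forecasts_by_day]
observed_flags = [value > LIMIT for value in observed_pm25]
counts = contingency(warnings, observed_flags)
print(warnings, counts, csi(*counts))
```

Raising `MIN_HORIZONS` trades false alarms for misses, so in practice it would be tuned against the chosen evaluation metric and early-warning goals.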
Secondly, CAMS-GACF can be improved by incorporating the uncaptured local-scale variability via local bias-correction techniques. Being a global model, CAMS-GACF will not change to capture local-scale variability in the foreseeable future. Hence, the influences of local-scale processes can be incorporated via two methods. First, we can downscale CAMS-IFS by using a regional/local CTM. Cho et al. (2021) downscaled CAMS-IFS output via a regional CTM and found satisfactory results. Second, we can incorporate local-scale variability via statistical techniques. Many studies employ this method, ranging from simple statistical models like regression linking past forecasts/biases to future (corrected) forecasts (Konovalov et al., 2009), to more complex frameworks classifying forecasts to past analogues and associated bias-correction models (Huang et al., 2017; Lyu et al., 2017; Neal et al., 2014). These studies reported improved performance, particularly in correcting over- and underpredictions. Therefore, we recommend exploration of downscaling CAMS-GACF using a regional CTM, and of correcting the resulting bias via statistical models and bias-correction frameworks. Statistical and ML models used in Malaysia-focused forecasting studies (Koo et al., 2020; Lim et al., 2008; Wong et al., 2021) can be repurposed for the latter. Regardless of the correction methods, model users who wish to derive maximum benefits from CAMS-GACF might want to prioritise improvements in more polluted and populous regions like Central, while those who wish to improve overall CAMS-GACF accuracy might want to prioritise improvements in Borneo and eastern Peninsular, where FGE was high.
Finally, we offer some additional considerations. Since PM2.5 chemical composition can be used to determine its source(s) (Adam et al., 2021), forecasted and observed composition can aid in both early warnings and bias-corrections as auxiliary information. Similarly, since PM (as aerosols) and meteorology are interdependent (Adam et al., 2021; Dahari et al., 2020; Ku Yusof et al., 2019; Sobri et al., 2021), existing weather forecasting and PM2.5 forecasting are synergistic and can mutually improve each other. Lastly, due to the frequent upgrades to CAMS-IFS, we recommend developing robust early-warning systems and simple bias-correction frameworks that ensure easy re-calibration to new upgrades. The forecast analogue approach might help in this regard.
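The simplest form of the statistical approach described above can be sketched as an ordinary least-squares regression of past observations on past forecasts at one station, which is then applied to new forecasts. This is a generic illustration under stated assumptions, not the method of any of the cited studies; the function names and the training values are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_linear(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def bias_correct(forecast: Sequence[float], slope: float,
                 intercept: float) -> List[float]:
    """Apply the fitted relation; clip at zero as concentrations cannot be negative."""
    return [max(0.0, intercept + slope * f) for f in forecast]


# Hypothetical training pairs (µg/m³) at one station where the raw
# forecast systematically overpredicts the observed concentration.
past_forecast = [20.0, 30.0, 40.0, 60.0]
past_observed = [15.0, 20.0, 25.0, 35.0]

slope, intercept = fit_linear(past_forecast, past_observed)
corrected = bias_correct([50.0, 80.0], slope, intercept)
print(slope, intercept, corrected)
```

Because the fit is a two-parameter relation per station, it is cheap to re-estimate whenever a model upgrade changes the forecast bias, although it cannot correct errors that vary with time of day or season unless fitted separately for those subsets.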

CONCLUSION
In this study, we evaluated the performance of a global mechanistic CTM forecast, CAMS-GACF, in forecasting PM2.5 in Malaysia qualitatively and quantitatively. It provided a regional outlook on CAMS-GACF PM2.5 forecasting performance in Malaysia for the first time. In summary, the change in MAAQS would not jeopardise but rather improve CAMS-GACF performance in predicting exceedance events. The model performed slightly worse than forecasting models used in other countries, but it performed on par or better when forecasting is of most value, i.e., during the 2019 haze event. Accuracy-wise, CAMS-GACF performed worse in Malaysia than in other countries. It tended to overpredict PM2.5 in polluted urban areas but underpredict elsewhere, likely due to emission inventory limitations and challenges in diurnal and NIL modelling. Data assimilation in CAMS-IFS has proved effective, with forecasts improving from F5 to F1; haze forecasts benefited most from this feature. However, CAMS-GACF performed poorly at small scales in Malaysia, particularly in the spatial dimension. Short-term temporal variations were also not fully represented in the forecasts, particularly the diurnal variations in polluted urban areas. CAMS-GACF also performed poorly when forecasting local pollution and exceedances. Nevertheless, CAMS-IFS receives frequent upgrades, and we can expect improvements to PM2.5 forecasts in Malaysia in the future.
PM2.5 is a dangerous and prevalent pollutant in Malaysia that has both local and external factors influencing its concentrations. Currently, Malaysia only issues air quality deterioration warnings based on observed concentrations (Wong et al., 2021). Preventive warnings based on forecasts can benefit Malaysia in mitigating the impacts of elevated PM2.5 by allowing governments and individuals to plan for exceedances (Celis et al., 2022;Lyu et al., 2017). CAMS-GACF is suitable for forecasting PM2.5, with relevant time-horizons for decision-making and its considerations of regional processes. Hence, we provided recommendations to allow us to take advantage of CAMS-GACF despite its shortcomings: (1) develop a robust early-warning system around CAMS-GACF to maximise early warning efficacies; and (2) correct inaccuracies of CAMS-GACF via downscaling and utilising statistical bias-correction techniques. Potential focus areas and synergies were also highlighted.
This study provides a comprehensive initial review of CAMS-GACF performance in forecasting PM2.5 in Malaysia. Future studies should further quantify the degree to which CAMS-GACF captures local-scale variations by scrutinising forecasts at individual CAQMS (particularly the diurnal variations), and the improvements after applying the bias-correction techniques recommended above. Future studies should also evaluate CAMS-GACF performance in forecasting PM2.5 composition species, and indeed other air pollutants. If the outcomes are satisfactory, CAMS-GACF could form the basis for a working air quality forecasting system for Malaysia.