**Rick Pernak, Matthew Alvarado, Chantelle Lonsdale, Marikate Mountain, Jennifer Hegarty, Thomas Nehrkorn**

Atmospheric and Environmental Research, Lexington, MA 02421, USA

Received: December 12, 2018

Revised: June 6, 2019

Accepted: October 6, 2019


Cite this article:

Pernak, R., Alvarado, M., Lonsdale, C., Mountain, M., Hegarty, J. and Nehrkorn, T. (2019). Forecasting Surface O3 in Texas Urban Areas Using Random Forest and Generalized Additive Models. *Aerosol Air Qual. Res.* 19: 2815-2826. https://doi.org/10.4209/aaqr.2018.12.0464

**Highlights**

- GAM quantitative and probabilistic models were built for six Texas urban areas.
- A random forest machine learning algorithm was applied for classification.
- The probabilistic models show no skill.
- Quantitative and classification models exhibit some degree of success.
- Predictive success increased in urban areas with less extreme ozone events.

**ABSTRACT**

We developed and evaluated three types of statistical forecasting models (quantitative, probabilistic, and classification) for predicting the maximum daily 8-hour average concentration of ozone based on meteorological and ozone monitoring data for six Texas urban areas from 2009 to 2015. The quantitative and probabilistic forecasting models were generalized additive models (GAMs), whereas the classification forecast used the random forest machine learning method. We found that for the quantitative forecasting models, five of the eight predictors (the day of week, day of the year, water vapor density, wind speed, and previous day’s ozone measurement) were significant at the *α* = 0.001 level for all urban areas, whereas the other three varied in significance according to the location. The quantitative forecasting for the 2016 ozone season agreed well with the associated measurements (*R*^{2} of ~0.70), but it tended to under-predict the ozone level for the days with the highest concentrations. By contrast, the probabilistic forecasting models showed little accuracy in determining the probability of concentrations exceeding policy-relevant thresholds during this season. The success rate for the random forest classification models typically exceeded 75% and would likely increase if the training data sets contained more extreme events.

**Keywords:** Ozone MDA8; Ozone prediction; Generalized additive models; Random forest.

**INTRODUCTION**

Ozone (O_{3}) at the surface can have adverse effects on public health. Lippmann (1989) summarized many studies suggesting that with sustained ozone exposure, lungs age more quickly, lung capacity diminishes, and airflow resistance increases. Devlin *et al.* (1997) examined lung inflammation and changes in lung function and found that these afflictions are likely due to elevated ozone and that those with pre-existing respiratory conditions are more prone to experience them. Studies performed on data from the United States, Canada, and Europe have linked air pollution to chronic obstructive pulmonary disease (COPD), increased hospitalization for respiratory illnesses (e.g., asthma), and lower forced expiratory volume (FEV). Anderson *et al.* (1997) confirmed an association of ozone and other pollutants with COPD in six European cities with different climates. Burnett *et al.* (1997) studied sixteen Canadian cities and found that, in a population of over 12 million people over more than ten years, even low levels of surface O_{3} can lead to an increase in hospitalizations due to respiratory diseases. Brown *et al.* (2008) found that FEV decreases by a statistically significant amount among young healthy adults after long exposures to ozone in the 60–80 ppb range. Finally, Bell *et al.* (2004) studied data spanning over a decade for 95 US cities (covering 40% of the population) and showed that mortality rates increase a small but statistically significant amount when O_{3} levels are raised by just 10 ppb. The EPA provides many more references on the effects of ozone on public health on their website (https://www.epa.gov/ozone-pollution-and-your-patients-health/references-ozone-and-your-patients-health) and in the Integrated Science Assessment (2013).

Given the harmful side effects of ozone inhalation, it is imperative that people know when to expect poor air quality days due to elevated ozone and what to do in such cases. State and local agencies across the country, in conjunction with the EPA and AirNow, recognize "Action Days" (https://www.airnow.gov/index.cfm?action=airnow.actiondays) when ozone and particulate matter concentrations are expected to be high. AirNow provides suggestions for residents of these areas to reduce the amount of pollutants in the air on these Action Days (https://www.airnow.gov/index.cfm?action=resources.whatyoucando), and the EPA provides services like the Air Quality Alert Program so that individuals who want poor air quality alerts can be notified by email or phone (https://www3.epa.gov/region1/airquality/smogalrt.html). State and local governments can also issue alerts via radio and television outlets so that individuals can decide to stay in their homes and reduce their exposure to pollution. These kinds of precautions require accurate air quality forecasting, so the ability to predict tropospheric O_{3} concentrations is advantageous. O_{3} production depends on meteorological and photochemical conditions, so knowledge of both is necessary for accurate predictions.

Numerical models exist for this purpose, but statistical modeling techniques have also been used to forecast surface O_{3}. For example, Thompson *et al.* (2001) reviewed statistical methods for modeling the dependence of surface-level O_{3} on meteorology and emissions to obtain air quality forecasts, estimate O_{3} time trends, and increase understanding of the underlying photochemistry responsible for the generation of ozone. These statistical methods were classified as regression (linear, tree-based, and non-linear), extreme value, and spatio-temporal modeling, none of which were found to be uniformly more appropriate than the others. Thompson *et al.* (2001) suggest that the best type of model depends on the region being considered and the machine learning technique employed.

Most of the literature focuses on applying forecasting models to urban areas, sometimes with a parallel application to rural areas as a baseline. For example, Feister and Balzer (1991) developed non-linear forecasting models for five (mostly non-urban and non-industrial) stations in Germany using over 300 predictors and found that recent O_{3} surface levels and solar irradiance were the most significant predictors but explained no more than 46% of the variance. Camalier *et al.* (2007) determined the effect of meteorology on O_{3} in 39 urban areas in the eastern United States using a generalized linear model and found that ozone concentration increases with temperature and decreases with increasing relative humidity. They also showed that the temperature dependence is more pronounced in the cities at higher latitudes, while humidity is the more significant predictor at lower latitudes. Kgabi and Sehloho (2012) investigated the effect of both emissions and meteorology by monitoring surface O_{3} in a rural area (Botsalano Game Reserve near the Botswana–South Africa border) and an industrial area (Marikana, South Africa) and measuring the associated temperature, relative humidity, wind speed, and wind direction. They observed consistently higher ozone concentrations at the industrial location, a negative correlation between O_{3} and humidity, and a positive correlation between temperature and O_{3}. More recently, statistical models have been used to identify unusual conditions, such as wildfires, that may have contributed to ozone formation. For example, Gong *et al.* (2017) developed a generalized additive model (GAM) approach to estimate O_{3} in several cities and then showed that days with measurable fire smoke impact tended to have positive residuals, suggesting that the fires were contributing to O_{3} formation on those days.

In addition to the GAM approach, other statistical forecasts for surface O_{3} have used artificial neural networks, or ANNs (Abdul-Wahab and Al-Alawi, 2002; Luna *et al.*, 2014), and random forests. ANNs are non-linear in nature, and the relationships that are discovered are deduced with limited prior knowledge of a given process. They are thus helpful in forecasting ozone for particular locations and provide the ability to model the change in O_{3} concentrations due to meteorology and photochemical processes together (by including concentrations of pollutants, specifically primary pollutants like NO_{x} and NMHC, as predictors in the model). Abdul-Wahab and Al-Alawi (2002) implemented this procedure for Kuwait, while Luna *et al.* (2014) applied it to Rio de Janeiro in Brazil. Both studies produced ozone predictions that yielded > 90% correlation with measurements in training and testing, thus providing models that account for nearly all of the variation in the observations. Abdul-Wahab and Al-Alawi (2002) found that meteorology explains up to 41% of the variance in their models, while variability in the emissions accounted for the rest of the variability in O_{3} concentrations. Luna *et al.* (2014) included CO, NO, moisture content, and NO_{x} as inputs into their ANN model and showed that the statistical model produced forecasts that are consistent with our knowledge of the chemical processes of ozone production.

Random forests are classifying algorithms that utilize a large number of independent, identically distributed (i.i.d.) decision trees with randomly selected predictors and average the results to reduce noise (Breiman, 2001; Hastie *et al.*, 2017). Yu *et al.* (2016) apply this technique to categorize air quality conditions in Shenyang, China, using the Chinese AQI and found that their random forest algorithm (RAQ) was more precise and accurate than other machine learning classification methods (Naïve Bayes, Logistic, etc.), though their application was not necessarily a forecast of ozone levels since the Chinese AQI is a maximum value for a number of pollutants (SO_{2}, NO_{2}, CO, PM_{2.5}, PM_{10}, and O_{3}).

In this work, we developed statistical forecasting models of surface O_{3} for six urban areas in Texas (Table 1). We focused on forecasting ozone during eastern Texas’s “ozone season” of May through October. We used generalized additive models—GAMs (Wood, 2006)—to describe the potentially non-linear relationship between the measured maximum daily 8-hour average ozone concentrations (*O*_{3,MDA8}), selected meteorological variables from weather forecasts, and measurements of the previous day’s surface O_{3} levels based on data for 2009 to 2015. We developed two GAM-based models, a quantitative forecast that predicts the numerical value of *O*_{3,MDA8} and a probabilistic model that predicts the probability that *O*_{3,MDA8} will exceed a given threshold. We then used the random forest method to develop a classification forecast of the O_{3} air quality index (AQI)—to our knowledge this is the first time this method has been used to forecast surface O_{3} concentrations directly. Here we describe our model development and the data used to train and evaluate the models. We then assess the performance of the models in forecasting *O*_{3,MDA8} during the 2016 O_{3} season in the urban areas of interest.

**METHODS**

**Training and Evaluation Data**

To develop the quantitative and probabilistic forecasting models, we fitted GAMs to calculate the predicted *O*_{3,MDA8} for day *D*_{i} in each urban area listed in Table 1 using ozone season data from 2009 to 2015, with the day *D*_{i-1} *O*_{3,MDA8} calculated from the Texas Commission on Environmental Quality (TCEQ) monitoring data as a predictor. The other predictors used in these models are shown in Table 2. *O*_{3,MDA8} values from the Texas Air Monitoring Information System (TAMIS) for *D*_{i-1} were the only measurements used in the forecasting model (available at https://doi.org/10.5281/zenodo.2032330). We also used seven forecasted (*D*_{i}) meteorological parameters from the Model Output Statistics (MOS) post-processing of the National Centers for Environmental Prediction (NCEP) 12-km North American Model (NAM-12) forecasts (available at https://doi.org/10.5281/zenodo.1979000). We discuss the selected predictors for each data set further in this section.

**Texas Commission on Environmental Quality Monitoring Data**

TCEQ provided air quality data from the monitoring network operated by TCEQ, its grantees, or local agencies, whose data are stored in TAMIS, for sites in and near the urban areas listed in Table 1. Historical data for 2009–2015 were provided by TCEQ and used in the training of the forecasting models. We used these data to calculate *O*_{3,MDA8} for a given day over all sites around an urban area with acceptable data. For the 2016 ozone season evaluation period, the operational forecasting models gathered these *O*_{3,MDA8} data in near-real time (NRT) from the TCEQ website (https://www.tceq.texas.gov/cgi-bin/compliance/monops/daily_average.pl), which provided raw hourly measurements of O_{3} from all of the monitoring stations in its network. Note that while the training data were subjected to a quality control process by TCEQ, this was not true of the NRT data used in the operational forecasts. These processing differences had a small effect on our forecasts. For example, the mean and standard deviation for the NRT Austin *O*_{3,MDA8} data were 43.17 and 9.68 ppbv, respectively, while the moments of the corresponding quality-controlled distribution were 44.46 and 9.95 ppbv.
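To make the *O*_{3,MDA8} calculation concrete, a minimal sketch in R follows. It implements only the standard maximum daily 8-hour average; the choice of 8-hour windows beginning 00:00–16:00 is an assumption, and TCEQ's data-completeness criteria are not reproduced.

```r
# Minimal sketch: O3,MDA8 from one site's 24 hourly O3 values (ppb) for a day.
# Window set and missing-data handling are assumptions, not TCEQ's exact rules.
mda8 <- function(o3_hourly) {
  avgs <- sapply(1:17, function(h) mean(o3_hourly[h:(h + 7)], na.rm = TRUE))
  max(avgs, na.rm = TRUE)   # maximum of the 8-hour running means
}

# Urban-area value from several sites' hourly series (list "site_hourly");
# taking the maximum over acceptable sites is an assumption.
# o3_area <- max(sapply(site_hourly, mda8))
```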

**North American Mesoscale Numerical Forecasts**

NCEP provides daily output from the 12-km North American Model (NAM-12) weather forecasting model. For a given model runtime, there are a number of forecast hours and grid types, but the product most relevant for this application is the one produced for the continental United States on a 12-km Lambert conformal conic grid. Every 6 hours, model output is written to a single file and archived in the NOAA Operational Model Archive and Distribution System (NOMADS; Rutledge *et al.*, 2006), which is available to the public (http://nomads.ncep.noaa.gov/pub/data/nccf/com/nam/prod). Historical data (data set ds609.0) and forecasts (ds335.0) are archived in the Research Data Archive (RDA; NCEP/NWS/NOAA/ECMWF/UCAR, 2003) at the University Corporation for Atmospheric Research (UCAR) (https://rda.ucar.edu/). Both the NRT and historical data were processed in the same way: grid points closest to the coordinates of the areas of interest (Table 1) were determined; then the temperature, relative humidity, and geopotential height fields were extracted for the 500, 700, 850, and 925 mbar pressure levels. As mentioned, there are four forecast times (00:00, 06:00, 12:00, and 18:00 UTC), but only the 12:00 UTC forecast was used in this work, as this was the time found by Camalier *et al.* (2007) to give the best *O*_{3,MDA8} fit for cities in the eastern U.S. This forecast time is 07:00 CDT for our urban areas during the ozone season.
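A sketch of the grid-point extraction step is below. The names `lats`, `lons`, `temp`, and `grib_levels` are hypothetical stand-ins for arrays read from the NAM-12 GRIB files, and the squared-degree distance is a simplification of whatever nearest-neighbor metric was actually used.

```r
# Minimal sketch: nearest NAM-12 grid point to an urban area, then extraction
# of one field at the four pressure levels used in this work.
idx    <- which.min((lats - site_lat)^2 + (lons - site_lon)^2)  # nearest point
ij     <- arrayInd(idx, dim(lats))                              # matrix indices
levels <- c(500, 700, 850, 925)                                 # mbar
t_prof <- temp[ij[1], ij[2], match(levels, grib_levels)]        # temperature profile
```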

**Model Output Statistics Forecasts**

Model output statistics (MOS) is a technique used by the National Weather Service (NWS) to objectively interpret numerical model output (Carter *et al.*, 1989) and produce site-specific guidance (Glahn *et al.*, 2008). Archived and NRT-forecasted parameters that were used in this study include temperature, dew point, wind speed, and wind direction. The forecasts are provided every 3 hours for the subsequent ~2.5 days. All are calculated for the same day (*D _{i}*) relative to the model runtime (which is at midnight). MOS forecasts are archived by the Iowa Environmental Mesonet (IEM) at Iowa State University (https://mesonet.agron.iastate.edu/mos/fe.phtml)—the archived NAM-12 forecasts were used in the model training for this study. NRT NAM-12 forecasts are provided by NWS (http://www.nws.noaa.gov/mdl/synop/products/bullform.met.php#Texas) and were used for forecasting. Queries were performed for the MOS sites listed in Table 1.

**National Air Quality Forecast System**

The National Air Quality Forecast System (NAQFC; Pan *et al.*, 2014) is a physical modeling system provided by NOAA that couples the Weather Research and Forecasting Non-hydrostatic Mesoscale Model (WRF-NMM) (Janjic, 2003) with the Community Multiscale Air Quality (CMAQ) regional chemical transport model (Byun and Schere, 2006). CMAQ is a multi-scale 3D Eulerian chemical transport model used to simulate air quality (e.g., tropospheric O_{3} and PM_{2.5}) on urban to regional scales. During this study, the CB05 chemical mechanism (Yarwood *et al.*, 2005) and AERO5 aerosol module (Carlton *et al.*, 2010) were used in CMAQ v4.7.1 within the NAQFC system. The lateral boundary conditions used in the simulation are monthly averaged profiles extracted from GEOS-Chem (Bey *et al.*, 2001) simulation results. The current operational model has a horizontal resolution of 12 km. The O_{3} forecasts are produced twice daily from the 06:00 and 12:00 UTC cycles, with forecast time projections extending from a minimum of 30 hours (06:00 run) to a maximum of 48 hours (12:00 run). Note that NOAA provided their forecast output for a subset of our testing period, 19 May 2016 through 22 September 2016.

**Forecasting Models**

**Quantitative and Probabilistic Models**

In these two procedures, we used the “mgcv” GAM modeling library (version 1.8–23) in R (version 3.2.2) discussed in Wood (2006) to fit the *O*_{3,MDA8} value for day *D _{i}* using a number of meteorological and air quality parameters as predictors. The general GAM formulation used in our forecasting models can be written as follows:

*g*(*µ*_{i}) = *β*_{0} + *f*_{1}(*x*_{i,1}) + *f*_{2}(*x*_{i,2}) + ⋯ + *f*_{n}(*x*_{i,n}) + *f*_{p}(*D*_{i}) + *D*_{W} (1)

where *µ*_{i} is the expected value of *O*_{3,MDA8} for day *i* in the quantitative models and the probability that *O*_{3,MDA8} exceeds a given threshold in the probabilistic models, and *g* is the link function: a log link is used for the quantitative forecasting models, and a logistic link is used for the probabilistic forecasting models. The *j* = 1, *...*, *n* meteorological and air quality parameters are denoted with *x*_{i,j}, with the corresponding *f*_{j}(*x*_{i,j}) being an initially unknown smooth function of *x*_{i,j} made from a cubic-spline basis set. Following Camalier *et al.* (2007), two other predictors are also included: a smooth function *f*_{p}(*D*_{i}) of the Julian day of the year (*D*_{i}) and a factor for the day of the week, *D*_{W}. As we are only fitting O_{3} data during the O_{3} season (May–October), *f*_{p}(*D*_{i}) is built with a non-periodic cubic-spline basis.
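For concreteness, a minimal mgcv sketch of an Eq. (1)-style fit follows. The column names are hypothetical stand-ins for the Table 2 predictors, and the response family is an assumption (the text specifies only the link functions).

```r
# Minimal sketch (R, mgcv) of an Eq. (1)-style GAM; column names are
# hypothetical stand-ins for the Table 2 predictors.
library(mgcv)

fit <- gam(
  o3_mda8 ~ s(o3_mda8_prev, bs = "cr") +  # previous day's O3,MDA8
            s(t_2100, bs = "cr") +        # forecasted afternoon temperature
            s(dt_daily, bs = "cr") +      # forecasted daily temperature difference
            s(wind_speed, bs = "cr") +    # daily averaged wind speed
            s(wind_dir, bs = "cr") +      # daily averaged wind direction
            s(rho_wv, bs = "cr") +        # daily averaged water vapor density
            s(doy, bs = "cr") +           # f_p(D_i): non-periodic cubic spline
            day_of_week,                  # factor D_W
  family = gaussian(link = "log"),        # log link per the text; the Gaussian
                                          # family itself is an assumption
  data = train
)
# For the probabilistic models, swap in a Boolean exceedance response and
# family = binomial(link = "logit").
summary(fit)   # p-values and effective degrees of freedom per smooth
```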

**Random Forest O_{3} Classification Forecasts**

We also used the random forest approach to develop a third forecasting model. Bosch *et al.* (2007) and Hastie *et al.* (2017) provide detailed documentation on how decision trees and their associated random forests operate (also see slides 25–36 in http://www.robots.ox.ac.uk/~az/lectures/ml/lect4.pdf), which we will only summarize. Given a set of data and parameters that will be used in event classification, a set of true–false tests are performed (tree nodes), and the end result is a number of end nodes (leaves, where the tests stop) that yield a percentage of each class that each leaf contains. The set of parameters that defines a class is the one that yields the highest probability of a given class. Extending this methodology to a number of trees—each of which contains a random subset of the training data and performs node tests on a random subset of the given predictors—is what is known as a random forest, and it reduces the risk of over-fitting. For the forest, the probability of a given class given a set of parameters is averaged over the ensemble of trees, and class designation is again defined by a maximum probability (this time, over the ensemble average). For the random forest implementation in this study, we used the same parameters that were selected for the quantitative and probabilistic forecasting models (Table 2) and the color-coded classification scheme for O_{3} air quality indices (AQIs) in Table 3.
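A sketch of this setup, using the same hypothetical predictor names as the GAM sketch above; the `ntree` and `mtry` values are illustrative, not the settings used in the study.

```r
# Minimal sketch (R, randomForest): classifying days into the Table 3 O3 AQI
# categories from the Table 2 predictors (hypothetical column names).
library(randomForest)

train$aqi <- factor(train$aqi)       # AQI color class as the response
rf <- randomForest(
  aqi ~ o3_mda8_prev + t_2100 + dt_daily + wind_speed
      + wind_dir + rho_wv + doy + day_of_week,
  data  = train,
  ntree = 500,                       # ensemble of i.i.d. trees (illustrative)
  mtry  = 3                          # random predictor subset per split (illustrative)
)
pred <- predict(rf, newdata = test)  # class with the highest ensemble-average probability
table(predicted = pred, observed = test$aqi)   # confusion matrix
```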

**Predictor Selection**

We performed several sensitivity studies that assessed model performance for different combinations of predictors with other data sources. The sensitivity studies included:

- a base model with the parameters listed in Table 2;
- replacing the *O*_{3,MDA8} derived from TCEQ measurements with the EPA AirNow AQI;
- utilizing a stability parameter—Δ*T*_{NCEP}—defined as the difference between the NCEP-predicted temperature at 925 mbar, *T*_{925}, and the temperature at 850 mbar, *T*_{850};
- using relative humidity or dew point temperature as a proxy for water vapor density;
- replacing MOS and NCEP forecasts with TCEQ current conditions;
- using HYSPLIT 24-hour back trajectory distances and bearings in addition to the TCEQ, MOS, and NCEP parameters; and
- using HYSPLIT-deduced synoptic types instead of the bearings and distances.

Ultimately, each sensitivity study produced statistics—best-fit line slope and intercept, Pearson *r* correlation coefficient, and mean and standard deviation of residuals—that were generally worse than the model described in Table 2 when validated with data from the evaluation period.

Δ*T _{NCEP}*, the difference in NCEP-forecasted temperature between 925 and 850 mbar, was only significant to the 5% level for Austin and San Antonio; thus, with this predictor, over-fitting of the data was a potential danger. In addition, by omitting this parameter, NCEP forecast gathering was no longer necessary. An added advantage with this improvement was that the forecaster became much more efficient—NCEP processing took approximately 40 seconds per city, so excluding this data set saved 4 minutes of runtime per forecast.

The moisture-related MOS-forecasted parameters *H*_{R} (relative humidity) and *T*_{DP} (dew point temperature) were attempted before water vapor density because of their usage in Camalier *et al.* (2007), and they were usually significant to at least the 1% level. The significance level was dependent on the urban site, particularly when *T*_{DP} and *H*_{R} were used together. Fit separately, *T*_{DP} is a significant predictor at the 0.1% level for all sites except DFW, while *H*_{R} is significant to the highest level (0.1%) for all locations. We also investigated how an absolute measure of moisture could affect the predictions by transforming these variables into water vapor density (*ρ*_{WV}). Like *H*_{R}, *ρ*_{WV} is significant to the highest level (0.1%) for all locations. All of these significances are summarized in Supplementary Table S1. Water vapor density better reflects the impact water vapor can have on the chemical production of ozone than relative humidity or dew point temperature (as chemical reactions depend on concentrations in units of molecules cm^{−3}), though, and consequently is the preferred moisture predictor. *H*_{R} and *T*_{DP} were thus removed from our model and replaced with an average water vapor density that is a function of temperature (in °C) and relative humidity. *ρ*_{WV} is computed every hour between 06:00 UTC and 21 hours later, and then an average *ρ*_{WV} is found.
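A minimal sketch of this conversion is below. The Magnus saturation vapor pressure approximation is an assumption; the paper does not state which formulation was used.

```r
# Minimal sketch: hourly water vapor density from MOS temperature (deg C) and
# relative humidity (%). The Magnus approximation for saturation vapor
# pressure is an assumption.
water_vapor_density <- function(temp_c, rh_pct) {
  e_s <- 610.94 * exp(17.625 * temp_c / (temp_c + 243.04))  # sat. vapor pressure (Pa)
  e   <- (rh_pct / 100) * e_s                               # actual vapor pressure (Pa)
  R_v <- 461.5                                              # gas constant of water vapor (J kg^-1 K^-1)
  e / (R_v * (temp_c + 273.15))                             # density (kg m^-3)
}

# Average over the hourly values from 06:00 UTC through 21 hours later
rho_wv_daily <- mean(water_vapor_density(hourly_temp_c, hourly_rh_pct))
```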

**RESULTS AND DISCUSSION**

To assess the performance of the forecasting models, we compared the observed and predicted *O*_{3,MDA8} for the 2009–2015 training data set and for the 2016 ozone season. Assessing accuracy of the models in the training period is important when, for example, determining whether models should be applied in an operational environment to compute NRT predictions of O_{3}. If the residuals do not follow a normal distribution, the fitting needs to be refined. To quantify model accuracy completely, though, one must calculate the O_{3} level predictions and then compare them with associated measurements.

**Evaluating the Statistical Fit**

**Quantitative Forecasts**

Table 4 presents the *R*^{2} correlation coefficient for the quantitative forecasting models for each urban area. The models typically can explain over 70% of the variance (*R*^{2} > 0.7), with the best fits found for Austin/Round Rock and the worst for Tyler/Longview/Marshall (the only model that does not explain at least 70% of the variance). Examination of the quantitative forecasting model residuals shows that the fits to the training data set are generally good, with a normal distribution of residuals, little trend in the residuals with the magnitude of the predicted or fitted values, and a linear relationship between the measured and fitted *O*_{3,MDA8}. An example is provided in Fig. 1 for the Austin/Round Rock quantitative forecasting model, which plots the quantile-quantile (Q-Q) comparison, residuals as a function of the log of predictor value, residual distribution, and modeled *O*_{3,MDA8} as a function of measured *O*_{3,MDA8}. Additionally, Fig. 2 exhibits the functional dependence of the *O*_{3,MDA8} predictions with each predictor. As expected, the forecasted *O*_{3,MDA8} increases as the previous day’s *O*_{3,MDA8} increases and as the forecasted afternoon temperature increases, but decreases as water vapor density and wind speed increase.
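For a fitted mgcv model, diagnostics like those in Figs. 1 and 2 can be generated directly; a minimal sketch, where `fit` is the GAM object from the earlier sketch:

```r
# Minimal sketch: residual diagnostics for a fitted mgcv GAM "fit".
library(mgcv)
gam.check(fit)                       # Q-Q, residual, and response-vs-fitted panels (cf. Fig. 1)
plot(fit, pages = 1, shade = TRUE)   # per-predictor smooths with confidence bands (cf. Fig. 2)
```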

Supplementary Table S1 shows the estimated significance of each predictor in the quantitative forecasting models. We see that five of the eight predictors are significant for all urban areas to the *α* = 0.001 level. The remaining predictors are forecasted afternoon temperature, forecasted daily temperature difference, and average forecasted wind direction. The significance of each depends on the city, but even in most of these cases, the predictors are significant to at least the *α* = 0.05 (afternoon temperature) or *α* = 0.10 (diurnal temperature difference and wind direction) level. Wind direction at TLM is the one exception. *p*-values for separate runs with *T*_{DP}, *H*_{R}, *T*_{DP} and *H*_{R} together, Δ*T*_{NCEP}, and *ρ*_{WV} demonstrate that no significance is lost when omitting Δ*T*_{NCEP} and *T*_{DP}, and that *H*_{R} and *ρ*_{WV} (both of which are derived from the MOS *T*_{DP} and surface temperatures) are equally significant (i.e., to the *α* = 0.001 level) in predicting ozone levels. In selecting a moisture predictor, we decided to use *ρ*_{WV}, which yielded a substantially higher correlation coefficient for Austin, Beaumont/Port Arthur, and Houston.

**Fig. 1.** Training period residual analysis for the Austin/Round Rock quantitative model. The top-left panel is a Q-Q plot, which implies a normal distribution if the data points exhibit a *y* = *x* relationship (represented by the red line). The top-right panel plots the data-model residuals as a function of the log of the measurements. A residual distribution is presented in the bottom-left panel. Finally, model predictions as a function of their associated measurements are plotted in the bottom-right panel. For ARR, the plots show that the quantitative model produces residuals that are normally distributed.

**Fig. 2.** Effects (i.e., weights) of each predictor as a function of predictor value, as represented by smoothed curves determined in the GAM fitting. The effective degrees of freedom (EDF) for each smooth function are given in the title of each panel, and 95% confidence levels are displayed as red regions around the black curves. As an example, one interpretation of the *T*_{2100} panel is that its effect on the GAM response (predicted O_{3}) is estimated by a smooth curve with 3.56 degrees of freedom (Wood, 2006).

**Probabilistic Forecasts**

We trained probabilistic forecasting models that determine the odds of predicted *O*_{3,MDA8} being greater than or equal to four thresholds—55 ppb, 71 ppb, 86 ppb, and 105 ppb—given the MOS forecasts and the previous day’s *O*_{3,MDA8}. However, most urban areas did not have sufficient historical data to fit GAMs for the upper two thresholds, and none had enough data to fit a 105-ppb model. Thus, most urban areas only have 55-ppb and 71-ppb probabilistic forecasting models, while DFW and HGB also have an 86-ppb model.

Supplementary Table S2 shows the *R*^{2} and percentage variance explained for the probabilistic forecasting models. In general, the explained variance was relatively low for the probabilistic models, with values between 45% and 60%. Most, but not all, predictors were significant to at least the *α* = 0.10 level, as shown in Supplementary Table S3. There is much more variance in predictor significance with the probabilistic models—the *p*-value changes with not only the urban area but the threshold level—compared to the quantitative models, whose predictors were typically significant to the *α* = 0.001 level.

Examining the model fits shows some interesting differences between the urban areas. For example, the left panel of Fig. 3 shows results for HGB for a threshold of 71 ppb. Only the dependences of *O*_{3,MDA8} on *T*_{2100} and *W*_{S} are shown, with all other variables held constant at their mean values. These *O*_{3,MDA8} forecasts are primarily a function of forecasted afternoon temperature (*T*_{2100}), showing a strong increase in the probabilities between 23°C and 35°C. However, the results for SA in the right panel of Fig. 3 show that in this area, the previous day's *O*_{3,MDA8} and the forecasted *W*_{S} are more important in predicting when the *O*_{3,MDA8} will exceed 71 ppb.

**Fig. 3.** Probabilities of the forecasted *O*_{3,MDA8} being ≥ 71 ppb in Houston-Galveston-Brazoria (HGB) and San Antonio (SA) when the *O*_{3,MDA8} from the previous day is 70 ppb. The *x*-axis is the MOS NAM-12 forecast of afternoon mean temperature for a given day, while the *y*-axis is the MOS NAM-12 forecast of daily averaged wind speed for that day. All other model predictors are held constant at their mean values.

**O_{3} Classification Forecasts**

Accuracy in classification models can be computed by utilizing the associated confusion matrix, which is a table of what the forecaster predicted compared to what was observed (Kuhn, 2008). The diagonal elements of the matrix are successful classifications, and accuracy is given by the ratio of the number of successes to the total number of events. The sample size for the training period is 1280, and the success rate for each site during the training period is given in Table 5. Success rates for our classification models range between 65% and 85%, depending on the urban area of interest.

**Forecasting Performance**

**Quantitative Forecasts**

The quantitative models were used to forecast the entire 2016 evaluation period and then validated with the observations from TCEQ monitoring stations. Statistics to specify the forecast-measurement agreement were then calculated—these included the coefficients of the linear ordinary least-squares (OLS) fit (slope, *a*, and intercept, *b*), Pearson linear correlation coefficient (*r*), and the moments of the forecast-data residuals (mean bias, *µ*, and standard deviation, *σ*). These statistics are provided in Table 6. While not a one-to-one correspondence with the observations, the models do exhibit some predictive power, given that the correlation coefficients for all cities are on the order of 0.7 or higher.
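A minimal sketch of these Table 6 statistics for one urban area, given vectors `obs` (TCEQ *O*_{3,MDA8}) and `fcst` (model forecasts):

```r
# Minimal sketch: forecast-measurement agreement statistics (Table 6).
ols   <- lm(fcst ~ obs)     # ordinary least-squares fit
b     <- coef(ols)[1]       # intercept
a     <- coef(ols)[2]       # slope
r     <- cor(obs, fcst)     # Pearson linear correlation coefficient
res   <- fcst - obs         # forecast-data residuals
mu    <- mean(res)          # mean bias
sigma <- sd(res)            # residual standard deviation
```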

We emphasize the slopes of the fits, all of which are less than unity, suggesting that the models underestimate the higher *O*_{3,MDA8} measurements and overestimate the lower ones. This behavior is an artifact of the regression model: expectation values are biased toward the mean because the mean minimizes the loss error in the fit to the training data. However, there are a few reasons this kind of disagreement might be expected. As noted in the TCEQ monitoring data section, the NRT and training data were not subject to identical quality control protocols, and the training data are provided as floating point numbers while the NRT data are given as integers. While these processing differences will at most account for an error of a couple ppb at a given data point, they can contribute to some of the noise exhibited in the validation statistics (e.g., correlation coefficient *r* < 1).

Another possible, and probably more significant, factor is a change in emissions, which is not accounted for in our models. With our training set, we assume that year-to-year changes in emissions and other variables are small enough throughout the training period that the year did not need to be used as a predictor. Furthermore, we assume that the emissions during the 2016 testing period resemble those from the 7-year training period. While the former assumption is mostly true, the latter is perhaps not accurate. In Table 7, the spread of the evaluation period is much smaller than the spread of any of the training data years. When comparing 2016 to the training sample, its spread has a *z*-score of 3.6, so it is clear that 2016 was anomalous. This behavior does not explain our test period error, but it does highlight a limitation of this type of forecasting.
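The quoted *z*-score can be reproduced as below, assuming `spreads` holds the per-year *O*_{3,MDA8} standard deviations for the 2009–2015 training years from Table 7:

```r
# Minimal sketch: z-score of the 2016 spread against the training years.
z_2016 <- abs(spread_2016 - mean(spreads)) / sd(spreads)   # ~3.6 per the text
```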

Finally, one must consider the complexity of the model when contemplating what the test error is relative to the training error. An overly complex model with respect to degrees of freedom (DOF) will yield very low error in the training period while exhibiting substantial error in the testing period; this is because the probability of over-fitting the data increases with the DOF. Though Table 2 indicates only 8 fitted parameters, each one of the predictors has a corresponding value for the effective degrees of freedom (EDOF), which is calculated in the cubic spline of the associated smoothed function for the predictor (these smoothed functions are fitted in the GAM; see Fig. 2). For example, in the Austin quantitative model, there are 32.017 EDOF when considering the smoothing of all of the predictors. Test error as a function of degrees of freedom can be characterized as high bias and low variance at small DOF and low bias and high variance at large DOF (Hastie *et al.*, 2017). The function is minimized at some DOF, and it is possible that our models are beyond this test error minimization point.

To further assess the quality of the quantitative models, we compared the performance of the quantitative statistical model with the WRF/CMAQ-based NAQFC numerical forecast for our six urban areas. NOAA provided us with the output from NAQFC numerical forecasts for the six Texas urban areas, but only for a subset (19 May 2016 through 22 September 2016) of the evaluation period used for the statistics in Table 6. Evaluation statistics for these WRF/CMAQ forecasts are given in Table 8 (labeled "NOAA") along with the statistics for the quantitative statistical forecasts (labeled "GAM") in the same time period. Our statistical models have a consistently higher correlation coefficient with the measured *O*_{3,MDA8} than the NOAA numerical forecasts, and the standard deviation of the residuals is much smaller in our statistical models; both results suggest that our statistical models capture some of the variability in *O*_{3,MDA8} that the NOAA numerical forecasts do not. Furthermore, our forecasts show a smaller absolute mean bias for all urban areas except Dallas/Fort Worth. This may be because the NOAA forecasts use the CMAQ model, which is most commonly used to investigate high-O_{3} events in similar large urban areas that are in non-attainment of the National Ambient Air Quality Standards (NAAQS), and thus may tend to over-predict O_{3} in less polluted environments.

**Probabilistic Forecasts**

To assess our probabilistic model performance, we attempted to use the reliability and relative operating characteristic (ROC) area statistics, which assess how well the predicted probabilities of an event correspond to their observed frequencies and how well the model discriminates between events and non-events, respectively. For the purpose of our study, an event is defined as the *O*_{3,MDA8} exceeding a given threshold for a given urban area on a given day. The reliability and ROC area for the models are given in Supplementary Table S4. Perfect scores for reliability are zero and for ROC area are unity, so our models ostensibly exhibit substantial capability in their event forecasting. However, it is difficult to ascertain anything conclusive given the small number of extreme events that occurred during the evaluation period. Our test period consists of a modest sample of 184 days, but for probabilistic validation, this sample is broken down even further—first into five probability bins for both statistics (where most of the forecasts are contained in the first bin) and subsequently into events and non-events (most of which are non-events) in the ROC metric. The reliability and ROC area are thus calculated at small sample sizes. Additionally, given the *R*^{2} in the fits to the training data (Supplementary Table S2), there is little reason to believe that the probabilistic models will be as successful as implied by Supplementary Table S4 when applied to a larger evaluation sample.

**O_{3} Classification Forecasts**

Evaluation of classification models is routinely done using the confusion matrix, as was done to assess performance on the training set. In our study, diagonal elements of the matrix are considered successful classifications or true positives, elements above the diagonal are false negatives (i.e., a poor air quality event was missed or its severity was under-predicted by the forecaster), and elements below the diagonal are false positives (i.e., the forecaster incorrectly predicted poor air quality). The confusion matrix for each model (i.e., each urban area) is presented in Supplementary Tables S5–S10. Total model success rate is defined as the ratio of true positives to all events, which is effectively the sum of the diagonal elements divided by the total number of elements in the confusion matrix. Likewise, the false alarm rate is defined as the ratio of false positives to all events, and the miss rate is defined as the ratio of false negatives to all events. These rates are given in Table 9.
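These rates follow directly from the confusion matrix; a sketch, assuming a matrix `cm` with predictions in rows and observations in columns, both ordered from "Good" (green) to the worst AQI category:

```r
# Minimal sketch: Table 9 rates from a confusion matrix "cm"
# (rows = predicted class, columns = observed class, ordered best to worst).
success_rate     <- sum(diag(cm)) / sum(cm)            # true positives
miss_rate        <- sum(cm[upper.tri(cm)]) / sum(cm)   # severity under-predicted
false_alarm_rate <- sum(cm[lower.tri(cm)]) / sum(cm)   # severity over-predicted
```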

Overall, the O_{3} classification models perform well, but the tables in the supplemental data show that the more extreme the events become, the less accurate the forecasting classifier is—most of the success is due to the "green" days, and the areas that do have more "orange" days (Dallas and Houston) are significantly less accurate. This again is likely due to the lack of extreme events on which to train the models and is the random forest equivalent of the underestimation of high *O*_{3,MDA8} days that the quantitative forecasts exhibit. The classification models can thus be considered binary air quality forecasters—they can successfully predict either "Good" or "Bad" air quality days based on EPA AQI and *O*_{3,MDA8} specifications (https://www.tceq.texas.gov/cgi-bin/compliance/monops/ozone_summary.pl#interpret).

**Alternative Predictors**

Our forecasting scheme predicts the ozone for a given day based on meteorological forecasts for that day and ozone measurements from the day before the models are run. However, for forecasting purposes, it may be more practical to use an earlier ozone measurement or to predict levels on a more intermediate scale, since *O*_{3,MDA8} calculations are not known until shortly before our forecast. To address this issue, we replaced the *O*_{3,MDA8} from day *D*_{i-1} in our models with the *O*_{3,MDA8} from day *D*_{i-2}, and we predicted the *O*_{3,MDA8} for both day *D*_{i} and day *D*_{i+1}.

Ozone levels are dependent on photochemistry, so it is also important to investigate the significance of solar radiation in our predictions. We accomplished this by utilizing the "CLD" field in the NAM MOS forecasts, which classifies total sky cover for a given hour. The categories are given in Supplementary Table S11. In training and forecasting, we assign a numerical value to the categories as shown in Supplementary Table S11 and then average over the same time period that was used for the *W*_{S}, *W*_{D}, and *ρ*_{WV} daily averages.
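A sketch of this category-to-number conversion is below; the category codes and numeric values shown are hypothetical placeholders, since the actual mapping is given in Supplementary Table S11.

```r
# Minimal sketch: numeric encoding and daily averaging of the MOS "CLD"
# sky-cover categories. The codes and values are hypothetical placeholders
# for the Supplementary Table S11 mapping.
cld_map <- c(CL = 0, FW = 1, SC = 2, BK = 3, OV = 4)
cld_avg <- mean(cld_map[hourly_cld], na.rm = TRUE)  # same window as W_S, W_D, rho_WV
```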

One limitation that exists with these new parameters is that the MOS forecasts only extend 6–72 hours from the NAM MOS runtime. This forecast range inhibits any of our models from predicting past day *D*_{i+2}. Additionally, the training data that we collected do not have forecasts for 21:00 UTC on day *D*_{i+2}, and this time is required for *T*_{2100} and the predictors that are daily averages. Consequently, our new forecasts only use the new predictors for a day *D*_{i} and *D*_{i+1} forecast.

As we did for our quantitative models, we compared the predictions from these three new models with the observations by computing the linear coefficients of the best-fit lines, the linear correlation coefficient, and the mean and standard deviation of the differences for each model and observation set (i.e., for each of the three new models and six cities). Results are given in Supplementary Tables S12–S14. When compared with Table 6, it is clear that using older ozone observations in the models and attempting to predict the levels further out is less precise than the baseline model using the predictors given in Table 2. Supplementary Table S12 provides statistics for a model where the clouds predictor is added to the baseline model and the *O*_{3,MDA8} predictor was modified to be for day *D*_{i-2}. The altered model yields smaller slopes and correlation coefficients but larger best-fit intercepts, biases, and spread compared to the baseline. A model with the same predictors (clouds and *O*_{3,MDA8} for day *D*_{i-2}) but whose target is the *O*_{3,MDA8} for day *D*_{i+1} was also developed, and its agreement with the data exhibits similar trends as the previous model. This agreement is provided in Supplementary Table S13, which illustrates that the statistics further degrade as the prediction becomes more intermediate rather than short-term.

One more model we trained was the baseline model (Table 2) with only the clouds predictor added to isolate the effect of including cloud forecasts. Statistics are provided in Supplementary Table S14. Again, aggregate agreement worsens but only slightly compared to the baseline. It is worth noting that the slope and bias for San Antonio improve, as does the bias for Dallas, though for both of these sites, the spread in the model-data differences increases.

**CONCLUSIONS AND FUTURE WORK**

For six urban areas in eastern Texas, we developed three types of forecasting models that predict the maximum daily 8-hour-averaged O_{3} (*O*_{3,MDA8}) for a given day based on meteorological forecasts for the same day and the *O*_{3,MDA8} from the previous day. The three types of modeling were quantitative forecasting of *O*_{3,MDA8} (based on a GAM with a log-link function), probabilistic forecasting that predicted the probability of *O*_{3,MDA8} being above a given threshold (based on a GAM with a logit function), and random forest classification for predicting the AQI classifications. We performed a number of sensitivity studies to find the best set of predictors to use for all three types (and for all six urban areas). These parameters included the MOS-forecasted afternoon high temperature; MOS-forecasted daily temperature difference; daily averaged MOS wind speed, wind direction, and water vapor density; day of the week; day of the year; and *O*_{3,MDA8}. The models were then tested by forecasting the 2016 ozone season (May through October).

The quantitative forecasts explain 69% or more of the variance in each urban area except for TLM. Most of the predictors were statistically significant, and those that were not varied with location. The probabilistic forecasting models exhibited little accuracy and thus are not recommended for forecasting unless significant improvements can be implemented. The ozone classification forecast using the random forest technique correctly predicted the O_{3} AQI classification 67% of the time in HGB, 77% of the time in DFW, and over 86% of the time elsewhere. False negatives (where the severity of a poor air quality event is under-predicted) were generally more common than false positives (where air quality is predicted to be worse than observed).

While the quantitative and classification forecasting models exhibit accuracy in forecasting O_{3} and can thus be used to assist with issuing public statements and advisories regarding air quality, issues remain that must be addressed in future work. The quantitative model forecasts trend toward the mean and have less variance than the observations, resulting in best-fit slopes between 0.5 and 0.75 rather than near unity. Additionally, our classification scheme can best be described as binary: it correctly identifies good (green) and poor air quality days but fails to distinguish between different degrees of elevated O_{3} (yellow, orange, red, and maroon). The likely reason for this characteristic is the small number of extreme events in the training set.

Funding limited the scope of this project to urban areas in Texas, but future research will focus on expanding the models to work in areas outside of Texas and improving their forecasting performance. Extending the forecasts to other states and perhaps countries should be possible, but more work is needed to determine how the methods in this paper will perform outside of Texas. We intend to address the performance with techniques such as weighting the aforementioned extreme events more heavily in the model training and applying neural networks instead of generalized additive models.

**ACKNOWLEDGMENTS**

This study was funded by TCEQ Contract No. 582-15-50414 to AER. We thank E. Gribbin of TCEQ for providing the meteorological and air quality data used for our training and forecast evaluation. We thank Pius Lee, Jeff McQueen, and Daniel Tong of NOAA for providing the numerical O_{3} forecasts that we used in our validation. We also thank the UCAR Research Data Archive support staff, specifically Douglas Schuster and Chi-Fan Shih, for greatly facilitating the collection of historical NCEP NAM-12 model output. Finally, we express our gratitude to the anonymous reviewers who enriched the content of this publication.