Application of Regression Kriging to Air Pollutant Concentrations in Japan with High Spatial Resolution

The application of regression kriging to air pollutants in Japan was examined for the purpose of providing a practical method to obtain a spatial distribution with sufficient accuracy and a high spatial resolution of 1 × 1 km. We used regulatory air monitoring data from the years 2009 and 2010. Predictor variables at 1 × 1 km resolution were prepared from various datasets to perform regression kriging. The prediction performance was assessed by indicators, including root mean squared error (RMSE) and R, calculated from the leave-one-out cross validation results, and was compared to the results obtained from a linear regression method, often referred to as land use regression (LUR). Regression kriging wellexplained the spatial variability of NO2, with R values of 0.77 and 0.78. Ozone (O3) was moderately explained, with R values of 0.52 and 0.66. The reason for this difference in performance between NO2 and O3 might be the characteristics of these pollutants primary or secondary. Regression kriging outperformed the linear regression method with regard to RMSE and R. The performance of regression kriging in this work was comparable to that found in previous studies. The results indicate that regression kriging is a practical procedure that can be applied for the prediction of the spatial distribution of air pollutants in Japan, with sufficient accuracy and a high spatial resolution.


INTRODUCTION
Spatial distribution of air pollutants with sufficient accuracy and high spatial resolution is the key piece of information for assessing the risks to human health or evaluating the air quality policy quantitatively.A regression method has been used successfully for predicting the spatial distribution of the long-term mean concentrations of air pollutants mainly for epidemiological studies (e.g., Briggs et al., 2000;Ross et al., 2007).This approach, often referred to as land use regression (LUR), uses measured data of air pollutants and develops linear regression models using predictor variables that may influence the concentrations.The regression model obtained is then applied to unmonitored locations in the domain in order to predict the concentrations.In the case of NO 2 , for instance, predictor variables such as land use, traffic intensity and population are used to construct a regression model (e.g., Briggs et al., 2000;Hoek et al., 2008).This technique is significantly less complicated in terms of computation compared to widely-used chemical transport models which simulate physical and chemical processes including emission, transportation, transformation and deposition.The predictions of the regression method have been reported to correlate better with monitored concentrations than chemical transportation models for long-term statistics such as annual mean (Cyrys et al., 2005;Marshall et al., 2008).On the other hand, application of the regression method to finer temporal resolution (e.g., hourly mean) is difficult because the predictor variables at such high temporal resolution are usually unavailable.
A few studies have applied regression kriging for the prediction of the spatial distribution of air pollutants (e.g., Beelen et al., 2009;Pearce et al., 2009).Regression kriging is a geostatistical approach -ordinary kriging with the residuals of regression analysis.In this method, a variable of interest is modelled as a sum of the deterministic and stochastic components separately (Hengl, 2007).The deterministic component is estimated by a regression analysis, which is equivalent to the regression method described above.The residuals of the regression model are interpolated using ordinary kriging and added to the estimated deterministic part (Hengl, 2007).There are other terms such as universal kriging and kriging with external drift (KED), but these methods including regression kriging are basically the same technique (Hengl et al., 2003).Some studies compared regression kriging with the regression method and reported that the prediction performance of regression kriging was better (Beelen et al., 2009;Pearce et al., 2009;Mercer et al., 2011).This method has been successfully applied at the EU-wide scale at 1 × 1 km resolution for the prediction of annual means of air pollutant concentrations (Beelen et al., 2009), which would be computationally demanding for chemical transportation models.Taking account of these facts, regression kriging is a reasonable and practical method for the prediction of the spatial distribution of air pollutants for long-term statistics.
In Japan, air pollution is an unsolved problem and an issue of social concern: thus, the spatial distribution of air pollutants is needed.As regression kriging includes an empirical model in contrast to the chemical transportation models, limitations in transferability may be inevitable to some extent (Hoek et al., 2008).To our knowledge, however, no study has applied regression kriging to predict the spatial distribution of air pollutants in Japan or even in Asian countries to date.
In this study, we examined the possibility of applying regression kriging to air pollutants in Japan in order to provide a practical method to obtain a spatial distribution with sufficient accuracy and high spatial resolution.We validate the performance of regression kriging and discuss the possibility of improvement.

Study Area
Japan consists of four major islands stretching from north to south and has an area of 377914 km 2 and a population of 127 million.The study area included the main islands of ) but remote or small islands were excluded.We constructed a prediction grid at 1 × 1 km resolution only over land areas.The concentrations of air pollutants were predicted at each cell of the grid.

Monitored Air Quality Data
The air quality data for the year 2009 and 2010 was obtained from the database of the regulatory monitoring network, which is maintained by the local government according the national law.The monitored data is collected and summarized by the Japanese Ministry of Environment.There are two types of monitoring station: road side and general environment.The road side stations are located at crossroads or road sides and monitor air pollutants from automobile traffic, whereas the general environment stations are located where they are not affected by specific emission sources.We used only the general environment station data, because of the difficulty to model the small spatial variation near the road side with our prediction grid at 1 × 1 km resolution.The number of general environment stations in Japan in the year 2010 was reported to be 1503 according to the Japanese Ministry of Environment.
The objective pollutants were NO 2 and O 3 .Regarding NO 2 , we predicted the spatial distribution of the annual 98 percentile daily mean concentrations, which is the air quality standard for NO 2 in Japan.We used only the data from monitoring stations with annual data coverage of more than 6000 hours, because conformity to the air quality standards is evaluated only for those stations.As for O 3 , the air quality standard in Japan is set as an hourly mean.Some variables including meteorological parameters with the temporal resolution of one hour are necessary for the application of regression kriging to hourly mean concentrations of O 3 .Because the meteorological data with such high temporal resolution which covers the main islands of Japan is not publicly available, a meteorological model is required to obtain those variables.This would be computationally demanding and reduce the efficiency that is one of the advantages of regression kriging.Besides, it would be rational to apply a chemical transport model instead of regression kriging because that is designed to be driven by a meteorological model.For these reasons, instead of hourly mean values, we predicted the annual 26th highest daily maximum 8 hour mean concentrations which is the air quality standard in the EU.Also, we can compare our result to the previous studies (De Smet et al., 2010, 2011) that applied regression kriging to the EU standard.The daily maximum 8 hour mean concentrations were calculated only for the day when the valid operation time was more than 20 hours, and the stations with more than 250 daily maximum 8 hour means available for a year were provided for the analysis.The annual statistics for NO 2 and O 3 were calculated for a fiscal year in accordance with the report of the Japanese Ministry of Environment.

Datasets
The datasets used to construct the 1 × 1 km resolution predictor variables are presented in Table 1 and described in detail below.The selection of datasets was made principally in consideration of the key factors in the spatial distribution of air pollutants including emission, transportation, transformation and deposition.The accessibility and usability were also considered.If necessary, we spatially aggregated, disaggregated or resampled the original data into a 1 × 1 km prediction grid and/or calculated the annual means for a fiscal year from the data with finer temporal resolution (e.g., monthly).
Vegetation is one of the main emission sources of volatile organic compounds (VOCs), which are the precursors of O 3 .Normalized Difference Vegetation Index (NDVI) indicates the amount of vegetation.Annual mean NDVI were derived from Moderate Resolution Imaging Spectroradiometer (MODIS) Terra data (MOD13A3) downloaded through the online Data Pool at the NASA Land Processes Distributed Active Archive Center (LP DAAC), and USGS/Earth Resource Observation and Science (EROS) Center, and provided for a predictor variable representing the emission of VOCs.
For one of the predictor variables for O 3 , we used the amount of UV which promotes O 3 formation through a photochemical reaction.We obtained half monthly mean UV-A (wave length: 320-400 nm) from Japan Aerospace Exploration Agency (JAXA) Satellite Measurements for As for the meteorological parameters, we obtained TRMM Product 3B43 precipitation data (JAXA/NASA/ NICT) through Goddard Earth Science (GES) Data and Information Service Center (DISC), and modelled monthly mean surface temperature and wind velocity data from GLDAS NOAH025_M (Rodell et al., 2004) through Goddard Earth Science (GES) Data and Information Service Center (DISC).
The altitude data was obtained from ASTER Global Digital Elevation Model (GDEM) version 2, which is a product of METI and NASA.The population data was obtained from the National Census of the year 2005 through the Statistics Bureau of Japan.As these two variables are skewed to zero, we transformed them to a natural logarithmic scale before analysis.Build-up area ratio in a grid cell was calculated from land use data obtained from Global Map Japan version 1.2.1 downloaded from Geospatial Information Authority of Japan (GSI).
Transport is an important emission source for NO x (NO + NO 2 ).The concentrations of NO 2 decrease with distance from roads, whereas the concentrations of O 3 tend to decrease towards roads because it is scavenged by NO.Hence, the distance to a road was provided as a predictor variable.The road network data was obtained from Global Map Japan version 2 downloaded from GSI.In this data, road types are classified into three categories; primary, secondary and highway.The distance to a road was calculated for each grid cell centroid for each of these three categories.Likewise road length was obtained from the National Land Numeric Information Data downloaded through the Japanese Ministry of Land, Infrastructure, Transportation and Tourism.This road length data is classified into 10 categories depending on the road width.We reclassified them into three new categories; Road A (road width ≥ 19.5 m), Road B (13 ≤ road width < 19.5 m) and Road C (5.5 m ≤ road width < 13 m).
When typical land and sea breezes dominated, polluted air parcels are transported from industrial or urban areas in coastal regions to inland areas and O 3 concentrations increase via a photochemical reaction during the transportation (Kannari, 2006).Therefore, we used distance to coastline as a predictor variable for O 3 .This distance was calculated for each grid cell centroid as the nearest straight-line distance to coastline, which was obtained from Global Map Japan version 2.

Regression Kriging
Linear regression models were constructed prior to ordinary kriging on its residuals.Candidates for predictor variables of linear regression models for each pollutant are presented in Table 2 with the pre-specified direction of effect according to the physical or chemical relationship between the pollutants and the predictor variables (Beelen et al., 2009).Linear regression models were developed using backward stepwise procedure to select the significant variables (Hengl, 2007).The selected variables that had coefficients that conformed to the pre-specified direction of effect were retained in the final linear regression models, but others were discarded (Beelen et al., 2009).Ordinary kriging was performed on the residuals of the final linear regression models developed as above.The concentrations of pollutants were transformed to a natural logarithmic scale before analysis, and the predictions were back transformed after analysis.This procedure has the advantage that predicted concentrations are positive, which is found not to be the case when analyses are performed without transformation (Beelen et al., 2009).

Validation
The validation was carried out by leave-one-out cross validation.This procedure has been applied in many previous studies for model validations (e.g., Pearce, et al., 2009;De Smit et al., 2010).In this method, one measurement point is removed and then the concentration at that point is predicted by using the rest of the points.This procedure is repeated for all measurement points.The predicted and measured values are compared at all points.From the result of the cross validation, we computed root mean squared error (RMSE) and R 2 between the predicted and measured values as indicators of the prediction performance.RMSE should be as small as possible.
Data analysis was carried out using R statistical software m/km 2 + a +: positive direction, -: negative direction b ratio of land use type 2.15.1 (R Core Team, 2012) with the raster package (Hijmans et al., 2012) for the integration and construction of the grid data of predictor variables and with the gstat package (Pebesma et al., 2004) for the performance of regression kriging.

NO 2
Linear regression methods moderately explained the variability of the 98 percentile daily mean concentrations of NO 2 for both years with R 2 values of 0.53 and 0.58.The retained predictor variables and their coefficients are given in Table 3.Most of the potential predictor variables were retained in the models except for distance to primary road and road length C. The retained variables were the same in both years and their coefficients were similar in general to the corresponding ones in the other year.
The validation results of regression kriging are given in Table 4 with those of linear regression methods.Regression kriging well explained the spatial variation of NO 2 concentrations with R 2 values of 0.79 and 0.78, which were much higher than those for linear regression methods.RMSE values were less than 5 ppb and smaller than those for the linear regression methods.Figs.1(a) and 1(b) are the scatter plots of the predicted and observed concentrations of NO 2 from cross validation results in the year 2009 and 2010 respectively.These plots show good correlation between the predicted and observed values, but some over predictions are found at the higher concentrations.These overestimations generally occurred at locations at short distance to highway and long road length A and B that were the retained predictor variables in the regression models.These variables represent basically the same emission source from automobile traffic, and simultaneous selection of these variables in the regression models might bring some positive bias to the predictions.
There was no prediction value above the air quality standard in Japan of 60 ppb either in the year 2009 or 2010.Fig. 2 illustrates the spatial distribution of NO 2 in the year 2010 predicted by regression kriging.Relatively high concentration areas are located in urban areas such as metropolitan Tokyo and along major highways, which generally reflects the distribution of the selected predictor variables in the final regression model.

O 3
Linear regression methods poorly explained the spatial variation of the 26th highest daily maximum 8-hour mean concentrations of O 3 in the year 2009 with an R 2 value of 0.23, whereas they moderately explained spatial variation in the year 2010 with an R 2 value of 0.40.The retained predictor variables were not consistent for each year; UV-A and altitude were only included in 2009, while distance to primary road was retained only in the year 2010 (Table 5).Temperature, wind velocity, distance to coastline and road length B were included in the models for both years.
The validation results for regression kriging of O 3 are presented in Table 4 with those for regression methods.Regression kriging moderately explained the spatial variation of O 3 with R 2 values of 0.52 and 0.66, which were much higher than those for the linear regression methods.RMSE values were 4.9 and 4.8 ppb, marginally smaller than those for the linear regression methods.Figs.1(c) and 1(d) show the scatter plots of predicted and observed concentrations  where over/under predictions are not significant.However, remarkable over predictions at relatively lower measured values are not fully explained by this effect.This might be the influence of a near-by stationary source of NO that scavenges O 3 around these measurement points, which we were not able to model because such emission sources were not considered in our study.
For reference, we compared our prediction values to the EU standard of 120 µg/m 3 , which is transformed to 60 ppb assuming 1 ppb is 2 µg/m 3 (this will also apply for O 3 unit conversion where necessary).The percentages of the grid cells with predicted values above 60 ppb were 74% and 65% in the year 2009 and 2010 respectively.The population of areas above the standard value was calculated to be 94% and 91% in the years 2009 and 2010 respectively using the population grid data prepared as one of the predictor variables.The reported values of this percentage for Greece, which is located at a similar latitude of Japan, are 85 and 59% in the years 2008 and 2009 respectively (De Smet et al., 2010, 2011).
Fig. 3 illustrates the spatial distribution of O 3 in the year 2010 predicted by regression kriging.High concentration areas are generally located in inland areas of large cities, broadly reflecting the distance to coastline, which was a retained predictor variable in the regression model.This   spatial feature is remarkable for inland areas of metropolitan Tokyo (inset of Fig. 3), suggesting that typical land and sea breezes contributed to high concentrations of O 3 in these areas.

Prediction Performance
The retained predictor variables for NO 2 in each year were equivalent and their coefficients were broadly similar (Table 3).Also, R 2 values were moderately high for both years (Table 3).These facts demonstrate the robustness of the linear regression model for NO 2 developed in this study.However, the retained variables for O 3 were different in each year (Table 5) and R 2 values were lower than those for NO 2 (Table 4).The reason for this might be the difference in the characteristics of these pollutants.NO 2 is a primary pollutant, whereas O 3 is a secondary pollutant which is formed through a complicated photochemical reaction involving NO 2 and VOCs.Although UV-A and temperature were considered in our study as predictor variables related to secondary pollutant formation, these variables did not sufficiently contribute to the prediction performance.This resulted in a poorer performance using the linear regression methods with O 3 than with NO 2 .
Regarding the comparison of regression kriging to linear regression methods, RMSE values of regression kriging were smaller and R 2 values were significantly higher for all cases.Taking account of these validation results, regression kriging outperformed the linear regression methods, which is consistent with the reported results (Beelen et al., 2009;Pearce et al., 2009;Mercer et al., 2011) as described earlier.
We compare our result to those from previous studies.Beelen et al. (2009) applied regression kriging to the annual mean concentrations of NO 2 at the EU-wide scale and reported an R 2 value of 0.61.Mercer et al. (2011) predicted 2-week average NO x concentrations at the urban scale in Los Angeles, US and reported R 2 values of 0.72-0.75.These values are similar to the R 2 values of 0.79 and 0.78 in this study (Table 4).
We applied regression kriging to predict 26th highest maximum daily 8-hour mean concentrations of O 3 with RMSE values of 4.9 ppb and 4.8 ppb and R 2 values of 0.52 and 0.66 (Table 4).Beelen et al. (2009) predicted the annual mean concentrations of O 3 at the EU-wide scale and the reported R 2 value was 0.70.De Smet et al. (2010Smet et al. ( , 2011) estimated 26th highest daily maximum 8-hour mean for rural and urban areas separately and reported RMSE values of 4.1-4.7 ppb and R 2 values of 0.56-0.69.These RMSE and R 2 values from previous studies were similar to those of our study.Overall, the performance of regression kriging in our study was comparable to the performance reported in previous studies.

Further Development
In this study, we implemented predictor variables representing NO 2 emissions from automobile traffic, whereas variables representing stationary sources such as power plants were not included.The NO x emissions from automobile traffic accounts for 40% of all NO x emissions in Japan (Kannari et al., 2007).Thus, more than half of NO 2 emissions were not included in our linear regression models.Regarding VOCs emissions, we implemented NDVI as a predictor variable representing biogenic VOCs emissions, whereas anthropogenic VOCs were not considered.Kannari et al. (2007) estimated biogenic and anthropogenic VOCs emissions in Japan as 1.5 and 2.1 (Tg/year) respectively.Thus, more than half of the emissions of VOCs were not considered in this study.Therefore, providing more variables representing other emission sources should be considered in order to improve prediction performance.From a practical point of view, however, too many predictor variables would reduce the efficiency of this procedure.Applying emission inventory at sufficiently high spatial resolution is a good alternative to this, where the quantitative relationship between the classes of emission sources can be considered.Also, the traffic related variables in our regression models were proxies of emissions from automobile traffic and represented basically the same emission source.This could result in biased predictions when more than two of these variables are simultaneously retained in the regression models.In this regard, introducing emission inventory would improve the predictions.
Meteorological parameters such as temperature and wind speed were significant variables for the regression models in our study (Tables 3 and 5).However, the spatial resolution of these variables is 0.25 deg, which is much coarser than that of our prediction grid resolution of 1 km.Therefore, higher spatial resolution of meteorological parameters could improve the performance of the prediction.

CONCLUSIONS
This study aimed to develop a practical procedure to provide spatial distribution of air pollutants in Japan with sufficient accuracy at 1 × 1 km resolution.We applied regression kriging to the concentrations of NO 2 and O 3 with predictor variables representing the key factors in the spatial distribution of air pollutants including emission, transportation, transformation and deposition in two consecutive years.The performance of regression kriging assessed by cross validation was satisfactory, outperformed the conventional regression method and was comparable to previous studies.This brought us to the conclusion that regression kriging is a practical procedure that can be applied to the prediction of the spatial distribution of air pollutants in Japan with sufficient accuracy and high spatial resolution.Although further improvement is required, distribution maps produced using regression kriging can be a basis for risk assessment or policy evaluation.

Fig. 1 .
Fig. 1.Scatter plot of the predicted and observed concentrations of the annual 98 percentile daily mean NO 2 in the years a) 2009 and b) 2010, and the annual 26th highest daily maximum 8-hour mean O 3 in the years c) 2009 and d) 2010.The solid lines represent regression lines and dotted lines are 1:1 lines.

Fig. 2 .
Fig. 2. The spatial distribution of the annual 98 percentile daily mean concentrations of NO 2 for the year 2010 predicted using regression kriging at 1 × 1 km resolution (ppb).

Fig. 3 .
Fig. 3.The spatial distribution of the annual 26th highest daily maximum 8-hour mean concentrations of O 3 for the year 2010 predicted using regression kriging at 1 × 1 km resolution (ppb).

Table 1 .
Summary of the data used in this study.

Table 2 .
Predictor variables and predefined directions of effect a .

Table 3 .
Result of regression models for NO 2 (ppb).

Table 4 .
Validation results for linear regression methods and regression kriging.

Table 5 .
Result of regression models for O 3 (ppb).