Analysing the sensitivity of pollen based land-cover maps to different auxiliary variables

Realistic depictions of past land cover are needed to investigate prehistoric environmental changes and anthropogenic impacts :: to ::::::::: determine ::: the ::::: scale :: of ::::::::::::: anthropogenic ::::::::::: deforestation ::: on ::::::::::: environment ::: and ::: to ::::: study :::: long ::::: term :::: land :::::::::::: cover-climate :::::::: feedbacks. However, observation based reconstructions of past land cover are rare :::: while :::::::::: commonly :::: used :::::::: modelled :::::: based :::::::::::: reconstructions :::::: exhibit ::::::::::: considerable ::::::::: differences ::: and :::: give :::::::: diverging :::::: results ::::: when :::: used :: by ::::::: climate :::::::: modellers :: to ::::: study ::: the :::::: effects :: of ::: past :::::::::: land-cover :::::::: dynamics :: on ::::::: climate. Recently Pirzamanbein et al. (2015, arXiv:1511.06417) developed a statistical inter5 polation method that produces spatially complete reconstructions of past land cover from pollen assemblage. These reconstructions incorporate a number of auxiliary datasets raising questions regarding both the method’s sensitivity to the choice of auxiliary data and the unaffected transmission of observational data. Here the sensitivity of the method is examined by performing spatial reconstructions for northern Europe during three time periods (1900 CE, 1725 CE and 4000 BCE), based on irregularly distributed pollen based land cover, available for ca 25% 10 of the area, and different auxiliary datasets. The spatially explicit auxiliary datasets considered include the most commonly utilized sources of the past land-cover data — estimates produced by a dynamic vegetation (DVM) and anthropogenic landcover change (ALCC) models — and modern elevation. Five different auxiliary datasets were considered, including different climate data driving the DVM and different ALCC models. The resulting reconstructions were evaluated using deviance information criteria and cross validation for all the time periods. For the recent time period, 1900 CE, the different land-cover 15 reconstructions were also compared against a present day forest map. The tests confirm that the developed statistical model provides a robust spatial interpolation tool with low sensitivity to differences in auxiliary data and high capacity to un-distortedly transmit the information provided by sparse pollen based observations ::::: proxy :::: data. Further, usage of auxiliary data with high spatial detail improves the model performance for the areas with complex topography or where observational data is missing. 20


Introduction
The importance of terrestrial land cover for the global carbon cycle and its impact on the climate system is well recognized (e.g.Claussen et al., 2001;Brovkin et al., 2006;Arneth et al., 2010;Christidis et al., 2013).Many studies have found large climatic effects associated with changes in land cover.Forecast simulations evaluating the effects of human induced global warming predict a considerable amplification of future climate change, especially for Arctic areas.The amplification is due to a number of biogeophysical and chemical feedbacks brought by the northward advancement of boreal shrub and treeline (Zhang et al., 2013;Richter-Menge et al., 2011;Chapman and Walsh, 2007;Koenigk et al., 2013;Miller and Smith, 2012).The past anthropogenic deforestation of the temperate zone in Europe was lately demonstrated to have an impact on regional climate similar in amplitude to present day climate change (Strandberg et al., 2014).However, studies on the effects of vegetation and land-use changes on past climate and carbon cycle often report considerable differences and uncertainties in their model predictions (de Noblet-Ducoudré et al., 2012;Olofsson, 2013).
One of the reasons for such widely diverging results could be the differences in past land-cover descriptions used by climate modellers.Possible land-cover descriptions range from static present-day land cover (Strandberg et al., 2011), over simulated potential natural land cover from dynamic (or static) vegetation models (DVMs) (e.g.Brovkin et al., 2002;Hickler et al., 2012), to past land-cover scenarios combining DVM derived potential vegetation with estimates of anthropogenic land-cover change (ALCC) (Strandberg et al., 2014;Pongratz et al., 2008;de Noblet-Ducoudré et al., 2012).Differences in input climates, inherent mechanistic and parametrisation differences of DVMs (Prentice et al., 2007;Scheiter et al., 2013), and significant variation in land-use estimates between the existing ALCC scenarios (e.g.Kaplan et al., 2009;Pongratz et al., 2008;Klein Goldewijk et al., 2011;Gaillard et al., 2010) further contribute to the differences in past land-cover descriptions.These differences can lead to largely diverging estimates of past land-cover dynamics even when the most advanced models are used (Strandberg et al., 2014;Pitman et al., 2009).
The palaeoecological observation ::::: proxy : based land-cover reconstructions (LCR) recently published by Pirzamanbein et al. (2014Pirzamanbein et al. ( , 2015) ) were designed to overcome the above described problems.And to provide an alternative, observation :::: proxy : based, land-cover description applicable for a range of studies on past vegetation and its interactions with climate, soil and humans.
These reconstructions use the pollen based land-cover composition (PbLCC) published by Trondman et al. (2015) as a source of information on past land-cover composition.The PbLCC are point estimates, depicting the land-cover composition of the area surrounding each of the studied sites.To fill the gaps between these observations and to acquire a spatially continuous land-cover reconstruction, spatial interpolation is necessary.Conventional interpolation methods might struggle when handling noisy, spatially heterogeneous data (Heuvelink et al., 1999;De Knegt et al., 2010), but alternative statistical methods for handling spatially structured data exist (e.g.Gelfand et al., 2010;Blangiardo and Cameletti, 2015).
All above described sets of auxiliary data were up-scaled to 1 • ×1 • spatial resolution, matching the pollen based reconstructions, before usage as model input.) and :::::::: unforested ::: land : (UF).The remaining columns gives (from left to right) the pollen based land-cover composition (PbLCC, Trondman et al., 2015) and the four model based compositions that could be used as covariates: K-LRCA3, K-LESM, H-LRCA3, and H-LESM.Here K/H indicates KK10 (Kaplan et al., 2009) or HYDE (Klein Goldewijk et al., 2011) land use scenarios and LRCA3/LESM indicates vegetation model driven by climate from the Rossby Centre Regional Climate Model (Samuelsson et al., 2011) or Earth System Model (Mikolajewicz et al., 2007).The three rows representing (from top to bottom) the time periods 1900 CE, 1725 CE, and 4000 BCE.
To account for observational uncertainty in the compositions, the PbLCC are modelled as draws from a Dirichlet distribution given concentrated parameter α (controlling the uncertainty) and the vector of proportions Z, To account for the spatial dependence in the proportions, Z is modelled as a transformation, f , of a latent GMRF, η: The inverse of f is called the additive log-ratio transformation (alr, Aitchison, 1986), i.e.
The alr transformation is the multivariate extension of a logit transformation.
The latent field is modelled with a mean structure µ and a spatially dependent residual X, Here X is GMRF with a separable covariance structure; where Q(κ) is the precision matrix of a spatially dependent GMRF (Lindgren et al., 2011), κ is the scale parameter which controls the range of spatial dependency and ρ controls the variation within and between the fields X k (see Pirzamanbein et al., 2015, for details).
The mean structure is modelled as a linear regression µ = Bβ, i.e. a combination of covariates B and regression coefficients β.The main focus of this paper is to evaluate the model sensitivity to the choice of covariates (e.g. the auxiliary datasets).The PbLCC is modelled based on six different sets of covariates (Figure 1): 1) Intercept, 2) SRTM elev , 3) K-L ESM , 4) K-L RCA3 , 5) H-L ESM , and 6) H-L RCA3 .Table 2 shows the different models and the corresponding covariates included in the model.
The model description is completed by specifying prior distributions for the model parameters.Wide but proper priors are assigned for α, κ, ρ and β.Specifically, a Gamma prior is chosen for the uncertainty and scale parameters, α and κ, i.e.Γ(1.5, 0.1) and Γ(1.5, 0.1).A Gaussian prior for the regression parameters β, with zero expectation and small precision q β = 10 −3 .The ρ is assigned an inverse wishart prior, IW (I, 10), where I is a 2 × 2 identity matrix.

Inference and associated uncertainties
The Markov Chain Monte Carlo (MCMC) method (Brooks et al., 2011) is used to estimate the parameters and to reconstruct the land-cover composition, Z LCRs , with 100 000 MCMC samples and a burn-in sample size of 10 000.Details of the MCMC implementation can be found in Pirzamanbein et al. (2015).2. Six different models and corresponding covariates.SRTMelev is elevation (Becker et al., 2009), K/H indicates KK10 (Kaplan et al., 2009) or HYDE (Klein Goldewijk et al., 2011) land use scenarios and LRCA3/LESM indicates vegetation model driven by climate from the Rossby Centre Regional Climate Model (Samuelsson et al., 2011) or Earth System Model (Mikolajewicz et al., 2007).
In each MCMC iteration, the samples of η are obtained by adding the spatial dependency field X and the effect of covariates through Bβ.Applying the alr transformation to the η samples, MCMC samples for Z are obtained.The land-cover reconstruction is then computed by averaging the MCMC samples, giving Z LCRs = E(Z|Y PbLCC ).
The uncertainties of the land-cover reconstruction, V(Z|Y PbLCC ), are assessed by constructing predictive regions (PR) using the MCMC samples at each location.The predictive regions are constructed to represent the uncertainty associated with the reconstructions; including uncertainties in both model parameters (α, κ, ρ, and β) and underlying fields (Z).The predictive regions are constructed by first creating elliptical 95%-predictive regions (i.e. containing 95% of the MCMC samples) in R 2 (Figure 3 left plot) before transforming these to ternary predictive region in (0, 1) 3 (Figure 3 right plot) (see Pirzamanbein et al., 2015, for details).In order to compare the uncertainties of different model land-cover reconstructions, we report the fraction of the unit triangle covered by the ternary PR.This is done by distributing points in the ternary diagram and computing the fraction as the number of points laying inside the PR divided by total number of points in the ternary triangle.

Testing the model performance
To evaluate the model performance, we compared the land-cover reconstructions from different models for the 1900 CE time period with the European Forest Institute forest map (EFI-FM) by computing the average compositional distances (ACD).The compositional distances between two different compositions, U and V , are computed as (Aitchison et al., 2000) where u = alr(U ), v = alr(V ) and H is a 2×2 matrix, neutralizing the choice of denominator in the alr transformation, with elements H ij = 2 if i = j, and H ij = 1 if i = j.These distances are then averaged over all locations.This measure is similar to root mean square error in R 2 but it accounts for compositional properties, i.e. each component of the compositions is between (0, 1) and sum of all the components is 1.Since no independent observational data exists for the 1725 CE and 4000 BCE time periods, we applied a 6-fold crossvalidation scheme (Friedman et al., 2001, Ch. 7.10) for all six models and three time periods.The PbLCC data were divided into 6 randomly selected groups and, in each round, the distance between the left out data, Y PbLCC,l , and predictions for group l given a model fitted to the rest of the data, E(Z l |Y PbLCC,k k / ∈ l), were computed.
To compare the predictive performance of the models, the Deviance Information Criteria (DIC; see Gelman et al., 2014, Ch. 7.2) is also computed for all models and time periods.
Today, central and northern Europe have, at the subcontinental spatial scale, the highest density of palynologically investigated sites on Earth.However, the collection of pollen data is very time consumingand , : cannot be performed everywhere ::: and ::::: hardly ::::::::: applicable :: by ::::::: climate :::::::: modellers :: as ::::: input :: in :::: their ::::::: original :::::: format ::: and :::::::: therefore :::::: heavily :::::::::: underused.Regrettably, the available auxiliary datasets (Table 2) exhibit large variation in the extent of coniferous and broadleaved forests, and un-forested areas for all of the studied time periods (Figure 1).These substantial differences illustrate large deviances between model based estimates of the past land-cover composition due to differences in climate forcing and/or applied ALCC scenarios.Differences in climate model outputs (e.g.Harrison et al., 2014;Gladstone et al., 2005) and ALCC model estimates (Gaillard et al., 2010) have been recognized in earlier comparison studies and syntheses.The effect of the differences in input climate forcing and ALCC scenario on DVM estimated land-cover composition presented here are especially pronounced for central and western Europe, and for elevated areas in northern Scandinavia and the Alps (Figure 1).In general the KK10 ALCC scenario produces larger un-forested areas, notably in western Europe, compared to the HYDE scenario.Compared to the ESM climate forcing; the RCA3 forcing results in higher proportions of coniferous forest, especially for central, northern and eastern Europe.The described differences are clearly recognizable for all the considered time periods and are generally larger between time periods than within each time period.
To assess the impact of the different auxiliary datasets, the statistical model was used to create a set of observation ::::: proxy based reconstructions of past land cover for central and northern Europe during three time periods (1900 CE, 1725 CE and 4000 BCE; see Figures 4-6).Each of the reconstructions were based on the irregularly distributed observed pollen data (PbLCC), available for ca 25% of the area, together with one of the six models described in Table 2; each model uses a different combination of auxiliary data (Figure 1).The spatial dependence in the PbLCC data was modelled using GMRFs, along with additional spatial structure inferred from the auxiliary datasets (Pirzamanbein et al., 2015).
To illustrate the structure of the statistical model, step by step advancement from auxiliary data (model derived land cover) to final statistical estimates, for 1725 CE, are given in Figures 7 and 8. Figure 7 shows, for two locations, how the large differences in K-LRCA3 and K-LESM are reduced by scaling with the regression coefficients, β; capturing the empirical relationship between covariates and PbLCC data.The two land-cover estimates are then further subject to similar adjustments due to intercept and SRTM elev , and finally similar spatial dependent effects.The corresponding decomposition of contributions to the continental map for 1725 CE, from SRTM elev , K-L ESM , and the spatial effects are given in Figure 8.

Data availability
The database containing the reconstructions of coniferous forest, broadleaved forest and un-forested land, three fractions of land cover, for the three time-periods presented in this paper, along with reconstructions for 1425 CE and 1000 BCE using only the K-L ESM are available for download from https://behnaz.pirzamanbin.name/phd/.

Figure 3 .
Figure 3.The left plot shows the 95% elliptical predictive regionin R 2 .The right ternary diagram shows the transformed 95% predictive region together with the corresponding fraction, 60%, compared to the whole triangle.

Figure 4 .
Figure 4. Land-cover reconstructions using pollen based land-cover compositions (PbLCC) for the 1900 CE time periods (top row).The reconstructions are based on six different models (see Table 2) with different auxiliary datasets.Locations and compositional values of the available PbLCC data are given by the black rectangles.Middle row shows the compositional distances between each model and the Constant model.Bottom row shows the compositional distances between each model and the EFI-FM.

Figure 5 .
Figure 5. Land-cover reconstructions using local estimates of pollen based land-cover compositions (PbLCC) for the 1725 CE time period (top row).The reconstructions are based on six different models (see Table 2) with different auxiliary datasets.Locations and compositional values of the available PbLCC data are given by the black rectangles.Bottom row shows the compositional distances between each model and the Constant model.

Figure 6 .Figure 7 .
Figure 6.Land-cover reconstructions using local estimates of pollen based land-cover compositions (PbLCC) for the 4000 BCE time period (top row).The reconstructions are based on six different models (see Table 2) with different auxiliary datasets.Locations and compositional values of the available PbLCC data are given by the black rectangles.Bottom row shows the compositional distances between each model and the Constant model.

Figure 8 .
Figure 8. Advancement of K-LESM models for the 1725 CE time period: (a) shows the effect of intercept and SRTMelev, (b) shows the mean structure, µ, including all the covariates, (c) shows the spatial dependency structure and finally (d) shows the resulting land-cover reconstructions, ZLCRs, obtained by adding (b) and (c).

Table 4 .
interpolation model is robust to the choice of covariates.The model is suitable for reconstructing spatially continuous maps of past land cover from scattered and irregularly spaced pollen based observations ::::: proxy :::: data.The average compositional distances among the six models (see Table2) fitted to the data for each of the three time periods.

Table 5 .
Deviance information criteria (DIC) for each of the six models (see Table2) and three time periods.