Co-kriging and co-SGS with Markov Model 1 (MM1)#

Gator Glaciology Lab, University of Florida#

We often have more extensively sampled secondary data that can provide some constraint on interpolations. For example, here we examine the case where there is ice surface elevation data that correlates with bed elevation. Unlike with bed measurements, the ice surface elevation is known at all locations. So if the ice surface and bed elevation are related, how do we use the ice surface to help improve our interpolations?

To answer this question, we can use co-kriging or co-SGS, which are the multivariate forms of kriging and SGS, respectively. While kriging computes interpolated values as the weighted sum of nearby meausurements, co-kriging uses the weighted sum of measurements from both variables. So co-kriging and co-SGS must account for the spatial relationships within and between both the primary and secondary data. This means that, in theory, one must compute the primary variogram (for the bed elevation data), the secondary variogram (for ice surface elevation), and the cross variogram. The cross variogram describes the spatial relationship between two variables:

\[ \gamma_{12}(h) = \frac{1}{2 N(h)} \sum^{N}_{\alpha=1} (Z_1(x_\alpha) - Z_1(x_\alpha + h))(Z_2(x_\alpha) - Z_2(x_\alpha + h)) \]

where \(x\) is a spatial location, \(Z_1\) is the primary variable (bed elevation), \(Z_2\) is the secondary variable (ice surface elevation) and \(N\) is the number of lag distances. However, using all of these different variograms can lead to very large, unstable systems of equations that are difficult to solve.

So in practice, people only use the nearest neighbor secondary point to the grid cell being estimated. Usually the nearest neighbor secondary data value is co-located with the grid cell of interest, meaning it is at the same location. So this approach is often referred to as collocated co-kriging or co-simulation.

This simplification requires an assumption of conditional independence (a Markov assumption). There are two possible Markov models, referred to as Markov Models 1 and 2, respectively. Markov Model 1 is the most straightforward option to model, so we use this version in our package.

Markov Model 1 (MM1)#

MM1 makes the following assumption of conditional independence:

\[ E(Z_2(u) | Z_1(u) = z_1, Z_1(u') = z_1(u')) = E(Z_2(u) | Z_1(u) = z_1) \]

This reads as: The expected value of the secondary variable at location \(u\), given knowledge of the primary data value at that same location \(u\) as well as primary data at a further away location, \(u'\), is the same as the expected value of \(Z_2(u)\) when given primary data information only at location \(u\). It means that the statistical relationship between the primary and secondary variable can be simplified. Under MM1 assumptions, the cross-correlogram model is written as:

\[ \rho_{12}(h) = \rho_{12}(0)\rho_1(h) \]

where \(h\) is the lag distance and \(\rho\) is the correlogram, calculated by:

\[ \rho(h) = 1 - \gamma(h) \]

assuming that \(\gamma(h)\) is the variogram for data with a standard normal distribution. So the correlogram is just the variogram flipped upside down plus one. While the variogram measures spatial covariance, the correlogram measures spatial correlation. This means that \(\rho_{12}(0)\), the correlogram at a lag distance of zero, is simply the correlation coefficient between variables 1 and 2.

In summary, to perform co-kriging and co-simulation under MM1 assumptions, all you need is the variogram of the primary (bed elevation) data, and the correlation coefficient between the primary and secondary data. We don’t even have to model the variogram for the secondary data.

Note that the primary and secondary data MUST undergo a normal score transformation prior to invoking MM1.

# load dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
from sklearn.preprocessing import QuantileTransformer 
import skgstat as skg
from skgstat import models
import GStatSim as gs
import earthpy as et
import earthpy.spatial as es
import earthpy.plot as ep
import random
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
/var/folders/3d/s79lt6q910z0whw_7c_z48lc0000gp/T/ipykernel_21622/3577745124.py in <module>
      7 import skgstat as skg
      8 from skgstat import models
----> 9 import GStatSim as gs
     10 import earthpy as et
     11 import earthpy.spatial as es

ModuleNotFoundError: No module named 'GStatSim'

Load and plot primary data#

df_bed = pd.read_csv('data/greenland_test_data.csv') # download data
df_bed = df_bed[df_bed["Bed"] <= 700]  # remove erroneously high values, presumably due to bad bed picks

# plot original data
im = plt.scatter(df_bed['X'],df_bed['Y'], c = df_bed['Bed'], vmin = -400, vmax = 600, marker=".", s = 0.5, 
                 cmap = 'gist_earth')       # scatter plot for location map
plt.title('Original Bed Data')                                    # add plot title
plt.xlabel('X (m)'); plt.ylabel('Y (m)')                          # set axis labels
cbar = plt.colorbar(im, orientation="vertical", ticks=np.linspace(-400, 600, 11)) # add vertical color bar
cbar.set_label("Bed (m)", rotation=270, labelpad=20)      # add labels to the color bar
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.0, top=1.0) # adjust the plot size
plt.axis('scaled')
plt.show
<function matplotlib.pyplot.show(close=None, block=None)>
_images/9_cokriging_and_cosimulation_MM1_3_1.png

Load and plot secondary data#

df_surface = pd.read_csv('Data/greenland_surface_data.csv') # download data

# make hillshade plot for visualizing
res = 200
xmin = np.min(df_surface['X']); xmax = np.max(df_surface['X'])     # min and max x values
ymin = np.min(df_surface['Y']); ymax = np.max(df_surface['Y']) 

# reshape grid
ylen = (ymax + res - ymin)/res
xlen = (xmax + res - xmin)/res
elevation =  np.reshape(df_surface['Surface'].values, (int(xlen), int(ylen)))
elevation = elevation.T 

hillshade = es.hillshade(elevation, azimuth = 210, altitude = 10) # hillshade

fig, ax = plt.subplots(figsize=(10, 6))
ep.plot_bands(
    elevation,
    ax=ax,
    cmap="cividis",
    title="Ice surface",
    vmin = 500, vmax = 2500,
)
ax.imshow(hillshade, cmap="Greys", alpha=0.1)
plt.show()
_images/9_cokriging_and_cosimulation_MM1_5_0.png
#fig.savefig('Figures/surface_data.jpg', dpi=300, bbox_inches = "tight")

Grid and transform data. Get variogram parameters#

See variogram tutorials for details

# Grid data
res = 1000  # resolution
df_grid, grid_matrix, rows, cols = gs.Gridding.grid_data(df_bed, 'X', 'Y', 'Bed', res) # grid data
df_grid = df_grid[df_grid["Z"].isnull() == False]  # remove coordinates with NaNs
df_grid = df_grid.rename(columns = {"Z": "Bed"}) # rename column for consistency

# normal score transformation
data = df_grid['Bed'].values.reshape(-1,1)
nst_trans = QuantileTransformer(n_quantiles=500, output_distribution="normal").fit(data)
df_grid['Nbed'] = nst_trans.transform(data) 

# compute experimental (isotropic) variogram
coords = df_grid[['X','Y']].values
values = df_grid['Nbed']

maxlag = 50000 # maximum range distance
n_lags = 70 #num of bins

V1 = skg.Variogram(coords, values, bin_func = "even", n_lags = n_lags, 
                   maxlag = maxlag, normalize=False)

V1.model = 'exponential' # use exponential variogram model
V1.parameters
[31852.813443811916, 0.7027482245687205, 0]

These outputs are the variogram range, sill, and nugget, respectively of the primary variogram.

# ice surface normal score transformation
df_surface_grid, surface_grid_matrix, rows, cols = gs.Gridding.grid_data(df_surface, 'X', 'Y', 'Surface', res) # grid data
df_surface_grid = df_surface_grid.rename(columns = {"Z": "Surface"}) # rename column for consistency

# normal score transformation
data_surf = df_surface_grid['Surface'].values.reshape(-1,1)
nst_trans_surf = QuantileTransformer(n_quantiles=500, output_distribution="normal").fit(data_surf)
df_surface_grid['Nsurf'] = nst_trans_surf.transform(data_surf) 

Find correlation coefficient between primary and secondary variables#

Here we compute \(\rho_{12}(0)\). We start by finding all of the co-located primary and secondary values using the GStat-Sim find_colocated function.

# find co-located secondary data (this may take a few minutes)
df_colocated = gs.NearestNeighbor.find_colocated(df_grid, 'X', 'Y', 'Nbed', df_surface_grid, 'X', 'Y', 'Nsurf')
# plot normalized co-located data
fig = plt.figure()
plt.subplot(121)     
im = plt.scatter(df_grid['X'],df_grid['Y'],c=df_grid['Nbed'],vmin=-3,vmax=3, cmap = 'gist_earth', 
                 alpha=0.8, marker=".")       # scatter plot for location map
plt.title('Normalized Bed')                                    # add plot title
plt.xlabel('X (m)')                                                   # set axis labels
plt.ylabel('Y (m)')
cbar = plt.colorbar(im, orientation="vertical", ticks=np.linspace(-3, 3, 7)) # add vertical color bar
cbar.set_label("Bed", rotation=270, labelpad=20)      # add labels to the color bar
plt.axis('scaled')

plt.subplot(122) 
im = plt.scatter(df_grid['X'],df_grid['Y'],c=df_colocated['colocated'],vmin=-3,vmax=3, cmap = 'cividis', 
                 alpha=0.8, marker=".")       # scatter plot for location map
plt.title('Normalized co-located surface')                                    # add plot title
plt.xlabel('X (m)')                                                   # set axis labels
plt.ylabel('Y (m)')
cbar = plt.colorbar(im, orientation="vertical", ticks=np.linspace(-3, 3, 7)) # add vertical color bar
cbar.set_label("Surface", rotation=270, labelpad=20)      # add labels to the color bar
plt.axis('scaled')


plt.subplots_adjust(left=0.0, bottom=0.0, right=2.0, top=1.2, wspace=0.2, hspace=0.3)
plt.show()
_images/9_cokriging_and_cosimulation_MM1_13_0.png
#fig.savefig('Figures/colocated_surface_data.jpg', dpi=300, bbox_inches = "tight")
# compute correlation coefficient
rho12 = np.corrcoef(df_grid['Nbed'],df_colocated['colocated'])
corrcoef = rho12[0,1]
print('Correlation coefficient between bed and surface: ' + str(corrcoef))
Correlation coefficient between bed and surface: 0.5716847267189373
# scatter plot
fig = plt.figure()
im = plt.scatter(df_grid['Nbed'],df_colocated['colocated'],alpha=0.5,marker='.')
plt.title('Bed vs. surface')  
plt.xlabel('Normalized bed')                                                   # set axis labels
plt.ylabel('Normalized surface')
plt.subplots_adjust(left=0.0, bottom=0.0, right=1, top=1, wspace=1, hspace=1) # adjust the plot size
plt.axis('scaled')
(-5.719271340871025, 5.719271340968869, -5.632308154734274, 3.8930444320971134)
_images/9_cokriging_and_cosimulation_MM1_16_1.png

Note that the scatter plot must be homoscedastic. In other words, if it looks like the strength of the correlation changes throughout the plot, then the secondary data shouldn’t be used. \(Z_1\) and \(Z_2\) must be linearly related for cokriging and cosimulation to work.

Initialize grid#

First we need to define a grid to interpolate. This stores an array of coordinates for the simulation.

# define coordinate grid
xmin = np.min(df_grid['X']); xmax = np.max(df_grid['X'])     # min and max x values
ymin = np.min(df_grid['Y']); ymax = np.max(df_grid['Y'])     # min and max y values

Pred_grid_xy = gs.Gridding.prediction_grid(xmin, xmax, ymin, ymax, res)

Cokriging#

To perform cokriging, just enter the primary variogram and correlation coefficient parameters into the cokrige_mm1 function:

# set variogram parameters
azimuth = 0
nugget = V1.parameters[2]
# the major and minor ranges are the same in this example because it is isotropic
major_range = V1.parameters[0]
minor_range = V1.parameters[0]
sill = V1.parameters[1] 
vtype = 'Exponential'
vario = [azimuth, nugget, major_range, minor_range, sill, vtype] # save variogram parameters as a list

k = 100 # number of neighboring data points used to estimate a given point 
rad = 50000 # 50 km search radius

est_cokrige, var_cokrige = gs.Interpolation.cokrige_mm1(Pred_grid_xy, df_grid, 'X', 'Y', 'Nbed', 
                                     df_surface_grid, 'X', 'Y', 'Nsurf', k, vario, rad, corrcoef) 
100%|█████████████████████████████████████| 22500/22500 [04:12<00:00, 89.04it/s]
# reverse normal score transformation
var_cokrige[var_cokrige < 0] = 0; # make sure variances are non-negative
std_cokrige = np.sqrt(var_cokrige) # convert to standard deviation (this should be done before back transforming!!!)

# reshape
est = est_cokrige.reshape(-1,1)
std = std_cokrige.reshape(-1,1)

# back transformation
cokrige_pred_trans = nst_trans.inverse_transform(est)
cokrige_std_trans = nst_trans.inverse_transform(std)
cokrige_std_trans = cokrige_std_trans - np.min(cokrige_std_trans)
# make hillshade plot for visualizing

# reshape grid
ylen = (ymax + res - ymin)/res
xlen = (xmax + res - xmin)/res
elevation =  np.reshape(cokrige_pred_trans, (int(ylen), int(xlen)))
elevation = elevation 

hillshade = es.hillshade(elevation, azimuth = 210, altitude = 10) # hillshade

fig, ax = plt.subplots(figsize=(10, 6))
ep.plot_bands(
    elevation,
    ax=ax,
    cmap="gist_earth",
    title="Co-kriging with MM1",
    vmin = -400, vmax = 600,
)
ax.imshow(hillshade, cmap="Greys", alpha=0.1)
plt.show()
_images/9_cokriging_and_cosimulation_MM1_23_0.png
#fig.savefig('Figures/cokriging.jpg', dpi=300, bbox_inches = "tight")

Compared to the simple and ordinary kriging estimates, the elevation on the right side of the map is visibly higher. This is because the surface elevation data slopes upward to the right, which pulls the interpolated topography upwards.

Because only one secondary data point is used to estimate each grid cell, the collocated cokriging variance ends up being almost identical to the kriging variance. This can sometimes lead to an overestimation of the variance.

Sequential Gaussian co-simulation (Co-SGS)#

Now we will perform simulation using cokriging.

cosim = gs.Interpolation.cosim_mm1(Pred_grid_xy, df_grid, 'X', 'Y', 'Nbed', 
                                     df_surface_grid, 'X', 'Y', 'Nsurf', k, vario, rad, corrcoef) 
100%|█████████████████████████████████████| 22500/22500 [05:28<00:00, 68.41it/s]
# reverse normal score transformation
sgs_cosim = cosim.reshape(-1,1)
cosim_trans = nst_trans.inverse_transform(sgs_cosim)
# make hillshade plot for visualizing

# reshape grid
ylen = (ymax + res - ymin)/res
xlen = (xmax + res - xmin)/res
elevation =  np.reshape(cosim_trans, (int(ylen), int(xlen)))
elevation = elevation 

hillshade = es.hillshade(elevation, azimuth = 210, altitude = 10) # hillshade

fig, ax = plt.subplots(figsize=(10, 6))
ep.plot_bands(
    elevation,
    ax=ax,
    cmap="gist_earth",
    title="Co-simulation with MM1",
    vmin = -400, vmax = 600,
)
ax.imshow(hillshade, cmap="Greys", alpha=0.1)
plt.show()
_images/9_cokriging_and_cosimulation_MM1_28_0.png
#fig.savefig('Figures/cosimulation.jpg', dpi=300, bbox_inches = "tight")

Looks good! Compared to the previous simulation examples, the topography on the right is higher elevation.

Post-simulation analysis#

Let’s examine some of the statistical properties of the co-simulation:

# variogram comparison

# downsample random indices to speed this up
rand_indices = random.sample(range(np.shape(Pred_grid_xy)[0]),5000) # generate random numbers to keep

coords_s = Pred_grid_xy[rand_indices]
values_s = cosim[rand_indices]  # notice that we're using the simulated topography values prior to back transformation 
#because the experimental variogram was computed on normalized data

VS = skg.Variogram(coords_s, values_s, bin_func = "even", n_lags = n_lags, 
                   maxlag = maxlag, normalize=False)
# plot variogram

# extract variogram values

# experimental variogram (from beginning of script)
xe = V1.bins
ye = V1.experimental

# simple kriging variogram
xs = VS.bins
ys = VS.experimental

plt.plot(xe,ye,'og', label = 'Variogram of data')
plt.plot(xs,ys,'ok', label = 'Variogram of Co-SGS with MM1 simulation')
plt.title('Variogram Comparison')
plt.xlabel('Lag (m)'); plt.ylabel('Semivariance')  
plt.subplots_adjust(left=0.0, bottom=0.0, right=1.5, top=1.5) # adjust the plot size
plt.legend(loc='lower right')
<matplotlib.legend.Legend at 0x7ff388f08820>
_images/9_cokriging_and_cosimulation_MM1_32_1.png
# find correlation coefficient between total surface dataset and simulation 

rho12 = np.corrcoef(cosim,df_surface_grid['Nsurf']) # compute correlation coefficient
corrcoef = rho12[0,1]
print('Correlation coefficient between simulated bed and surface: ' + str(corrcoef))
Correlation coefficient between simulated bed and surface: 0.7260143741564793

Based on our post-simulation analysis, it appears that the MM1 simulation reproduces the variogram statistics of observations, but exaggerates the correlation betweehn the bed and the surface. This is a known issue with MM1 and is something to watch out for. This over-correlation can sometimes create simulation artifacts as well. Generally speaking, MM1 works best if the secondary data has a higher variance (i.e. is rougher) than the primary data.

VS.parameters
[18430.230400387863, 0.5389357123737512, 0]