# Masking NetCDF datasets using Rasterio

This post brings a couple of ideas from previous posts together, namely downloading data using NetCDF and masking raster layers with vector data.

The end result is the ability to query raster datasets using irregular shapes and only download the data you actually need.

For this we're going to be using the CHESS PET dataset, created by CEH: http://doi.org/10.5285/80887755-1426-4dab-a4a6-250919d5020c

Let's start by importing the modules we're going to use.

In [1]:
import netCDF4
import rasterio.features
from shapely.geometry import Point
import numpy as np
import matplotlib.pyplot as plt
from affine import Affine
%matplotlib inline


Next we define the URL to access the data. The data has been split into a different file for each month. We use a template to simplify the construction of the URL.

In [2]:
URL_TEMPLATE = "http://thredds-water.ceh.ac.uk/thredds/dodsC/PETDetail/chess_pet_wwg_{year:04d}{month:02d}.nc"

year = 2010
month = 5
url = URL_TEMPLATE.format(year=year, month=month)
url

Out[2]:
'http://thredds-water.ceh.ac.uk/thredds/dodsC/PETDetail/chess_pet_wwg_201005.nc'

Now we're ready to open the dataset. The object representation (repr()) gives us some useful information about the data. In particular, it tells us the dimensions of the data (3 dimensions: time, x, y) and the variables which include pet.

In [3]:
dataset = netCDF4.Dataset(url)  # open the dataset
dataset

Out[3]:

root group (NETCDF3_CLASSIC data model, file format DAP2):
title: Potential evapotranspiration over well-watered grass.
institution: CEH Wallingford - NERC
geospatial_lat_min: 49.771
geospatial_lat_max: 59.3952
geospatial_lon_min: -9.00452
geospatial_lon_max: 2.49124
time_coverage_start: 2010-05-01
time_coverage_end: 2010-05-31
version: 1.2
Conventions: CF-1.6
description: Daily potential evapotranspiration (mm/day) at 1km resolution over Great Britain. Calculated using the Penman-Monteith equation assuming a well-watered grass surface.
date_created: 2015-09-24
creator_name: Eleanor Blyth, Douglas Clark, Jon Finch, Emma Robinson, Alison Rudd
creator_email: chess@ceh.ac.uk
publisher_name: Centre for Ecology and Hydrology
publisher_url: http://www.ceh.ac.uk
publisher_email: enquiries@ceh.ac.uk
source: Climate Hydrology Ecosystem research Support System (CHESS)
licence: Licensing conditions apply (datalicensing@ceh.ac.uk)
dimensions(sizes): time(31), x(656), y(1057)
variables(dimensions): float32 time(time), float32 y(y), float32 x(x), int32 crs(), float32 lat(y,x), float32 lon(y,x), float32 pet(time,y,x)
groups: 

That's all well and good, but what does it actually look like? Let's plot the data for the first day in the timeseries.

To do this we slice the dataset with [0,:,:] which corresponds to the zeroth day, and all of the data in the x and y dimensions.

Note that the 0,0 coordinate is in the lower left, near Cornwall. This will make it easier to convert to geographic coordinates later.

In [4]:
plt.figure(figsize=(6,7))
plt.imshow(dataset['pet'][0,:,:], origin='lower')
plt.colorbar()
plt.xticks(np.arange(0, 800, 100))
plt.yticks(np.arange(0, 1100, 100))
plt.axis("equal")
plt.grid(True)
plt.show()


Alternatively we can take a slice that returns a timeseries for a single location.

In [5]:
PET = dataset['pet'][:, 200, 500]
plt.plot(np.arange(1, len(PET)+1), PET)
plt.xlabel("Day of month")
plt.ylabel("PET")
plt.grid(True)
plt.show()


Next we extract a slice of the data defined by a 50km buffer around a point. We use a lambda function to translate geographic coordinates (in British National Grid) into array coordinates.

In [6]:
# create a geometry
p = Point(450000, 96000)  # British National Grid coordinates for the Isle of Wight
circle = p.buffer(50000)  # 50km radius circle

# find it's bounding box in array coordinates
index = lambda x,y: (int(x // 1000), int(y // 1000))
bounds = circle.bounds
ll = index(*bounds[0:2])  # lower left
ur = index(*bounds[2:4])  # upper right

# extract a box of data
data = dataset['pet'][0, ll[1]:ur[1],ll[0]:ur[0]]

print("Shape:", data.shape)
print("")
print("Minimum:", data.min())
print("Maximum:", data.max())
print("Mean:", data.mean())

Shape: (100, 100)

Minimum: 1.60718
Maximum: 1.95777
Mean: 1.80544057783


Finally we're ready to use the rasterio.features.rasterize function to create a mask of an irregular shape to use with the dataset. In this case we're using the circle geometry that was created in the previous step, but this could be anything (for example, some data read from a shapefile).

One approach would be to create a mask with the same dimensions as the entire dataset. This isn't very efficient though, as it would require requesting all of the data in the slice even if the mask only revealed a single cell.

Instead we create a mask that's shifted to match the subset of data we're requesting. This is done by apply an X and Y offset in the affine transform.

See the documentation for rasterio.features.rasterize, the affine module and also the Wikipedia article on affine transformation matrices.

In [7]:
geometry = circle

# offset for the subset of data
xoff = ll[0]
yoff = ll[1]

# affine transformation from pixel coordinates to geographic coordinates
# 1 pixel = 1000m x 1000m
# the offset is needed because we're masking a subset of the dataset

a = 1000        # change in x with x
b = 0           # change in y with x
c = xoff * 1000 # x offset
d = 0           # change in y with x
e = 1000        # change in y with y
f = yoff * 1000 # y offset

shifted_affine = Affine(a, b, c, d, e, f)

# create a mask using the geometry
# if we needed more than one day of data we could re-use this mask
# anything the geometry touches with be 0, otherwise 1
[(geometry, 0)],
out_shape=data.shape,
transform=shifted_affine,
fill=1,
all_touched=True,
dtype=np.uint8
)

# make sure the data is a masked array
data = np.ma.array(data)

# apply the mask to the data (bitwise OR)