
NCI data access training

Monash University, July 2017

Trainers

  • Dr Jingbo Wang

  • Dr Joseph Antony

  • Dr Adam Steer

Aims:

Students should leave with an understanding of how to:

1. Access data from NCI remotely using a web coverage service request

  • directly in a web browser or programmatic web request

  • using Python's OWSLib

2. Access data remotely using the NetCDF subset service

  • directly in a web browser or programmatic request

  • using Python's Siphon library

Assumptions:

  1. Some familiarity with Python 3 and the Jupyter environment

  2. Some familiarity with netCDF files

  3. Students have been provided NCI materials on data discovery

Caution - Most of NCI's example notebooks are developed using Python 2. This is a Python 3 environment - if in doubt about some aspect of the code, please ask.
# First, let's set up some basic libraries we need for the exercise. More specific software will be pulled in later.
# We need matplotlib later to draw some pictures, let's import it now
import matplotlib.pyplot as plt
import numpy as np

# This is a Jupyter cell magic, which tells the notebook to draw plots in a notebook cell
%matplotlib inline

# The next two modules let us stream data from a binary blob into a numpy array
# so that we can visualise data without saving and reopening a file.
# Note - if you do this at home, you need to install pillow (an imaging library) to use scipy.misc
from scipy.misc import imread
import io

Task: extracting a data subset from a massive file - ocean colour, 15.65 GB

We want to grab ocean colour in a small region off the east coast of Australia (say, the coast of Victoria), but we don't want to download the whole 15 GB file to do that. Here is our dataset in the NCI THREDDS catalogue:

http://dapds00.nci.org.au/thredds/catalog/u39/public/data/modis/oc.stacked/v201503/catalog.html

We'll use two different services - the Web Coverage Service (WCS) and the NetCDF Subset Service (NCSS) - to get some data.
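If you'd like to see what a raw programmatic web request looks like before we use OWSLib, here is a minimal sketch of a WCS GetCapabilities call against the same endpoint. It assumes the standard WCS 1.0.0 key-value-pair parameters and that the requests library is installed:

import requests

# A sketch of a raw GetCapabilities request, built from standard
# WCS 1.0.0 key-value-pair parameters rather than via OWSLib
endpoint = 'http://dapds00.nci.org.au/thredds/wcs/u39/public/data/modis/oc.stacked/v201503/chl_oc3.2015.nc'
params = {'service': 'WCS', 'version': '1.0.0', 'request': 'GetCapabilities'}
response = requests.get(endpoint, params=params)
print(response.text[:500])  # the start of the XML capabilities document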

# Import OWSLib - a library for handling many types of OGC web service requests
from owslib.wcs import WebCoverageService
# Create a WCS object - here, we add a URL to the data source
wcs = WebCoverageService('http://dapds00.nci.org.au/thredds/wcs/u39/public/data/modis/oc.stacked/v201503/chl_oc3.2015.nc?')
We might not know what is in our data - how can we find out?
# What's in our object? What layers are available?
print(wcs.contents)

We see two available layers - 'chl_oc3' and 'l2_flags'. Since we are after ocean colour and not QA flags, let's proceed with 'chl_oc3'.
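If you just want the layer identifiers on their own, wcs.contents behaves like a dictionary, so its keys list them directly:

# wcs.contents maps layer identifiers to layer metadata objects
list(wcs.contents.keys())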

# Create an OWSLib object using our chosen layer, and call it something we can remember
oceancolour = wcs['chl_oc3']
# dir() tells us what attributes are available on our object. Looking at the list of items
# without preceding underscores, we can find out some things about our layer
dir(oceancolour)
# What formats can we download parts of our file as?
oceancolour.supportedFormats
# What is the geospatial boundary of our file?
# Question to ponder - why are no other bounding boxes defined?
oceancolour.boundingBoxWGS84
# What are the temporal bounds of our file?
oceancolour.timelimits
# What are our time slices?
print(oceancolour.timepositions)
# Which coordinate reference systems are supported?
oceancolour.supportedCRS

Now we know enough to build a query and get some data:

  • spatial and temporal boundaries

  • available file formats

  • CRS information is not available, so let's try WGS84 (EPSG:4326 / OGC:CRS84)

# Get some data!
# Here we choose a layer (identifier), a timestamp from the list above, a bounding box
# inside the data domain, and an output format.
# For THREDDS, choosing 'GeoTIFF' gets you data scaled to the range 0-255.
# 'GeoTIFF_Float' gets you unscaled data.
mychunkofocean = wcs.getCoverage(identifier='chl_oc3', time=['2015-05-04T00:00:00Z'],
                                 bbox=(140, -44, 150, -36.5), format='GeoTIFF_Float',
                                 crs='OGC:CRS84')
# Show what our URL would look like - try this in a web browser, you should get a GeoTIFF file!
mychunkofocean.geturl()
# Let's make that pretty - remove all the escaped characters.
from urllib import parse
parse.unquote(mychunkofocean.geturl())
Try pasting the URL above into a web browser - or modify any of the parameters to see what you get.
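For example, here is the same request with the output format swapped to the scaled 8-bit 'GeoTIFF' mentioned above - a sketch reusing the parameters from the earlier getCoverage call:

# Same request as before, but asking for data scaled to the range 0-255
scaledchunk = wcs.getCoverage(identifier='chl_oc3', time=['2015-05-04T00:00:00Z'],
                              bbox=(140, -44, 150, -36.5), format='GeoTIFF',
                              crs='OGC:CRS84')
scaledchunk.geturl()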
# Visualise the data - we don't want to download the GeoTIFF here, we can dump it straight
# into a numpy array. Here we read the binary stream from our WCS request into that array
oceancolour_array = imread(io.BytesIO(mychunkofocean.read()))
print('data subset size: {:.2f} MB'.format(oceancolour_array.nbytes/(1024*1024)))
# Make all our nodata values into NaNs for plotting
oceancolour_array[oceancolour_array < 0.0] = np.nan
# ...and use matplotlib to show it!
plt.imshow(oceancolour_array, cmap=plt.get_cmap('viridis'), vmin=0, vmax=2.5)
plt.colorbar()
# After all that, we can still write out a GeoTIFF - since we have not modified the original
# request content! Write the file out to disk if you like, or if you are using this notebook
# on your own computer.
# Note: if the file comes out empty, re-run the getCoverage cell first - some response
# objects can only be read once, and we already called read() above.
with open('mychunkofocean.tif', 'wb') as f:
    f.write(mychunkofocean.read())

WCS summary

In this section we showed how to build a Web Coverage Service request using data held as netCDF files at NCI. The reference notebook 'THREDDS_WMS_WCS.ipynb' shows in detail how to construct URLs and what all the components mean. Here, we used a Python library as a convenient tool to get data (1.15 MB from a 15 GB file) for a specific time and region, then did a quick visualisation in an interactive notebook without writing any files out.
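As a rough guide (the reference notebook has the full story), the URL that geturl() produced is just the WCS endpoint plus standard WCS 1.0.0 key-value parameters:

# Anatomy of the GetCoverage URL (a sketch, not an exact copy of the output):
#   <endpoint>?service=WCS&version=1.0.0&request=GetCoverage
#     &coverage=chl_oc3             the layer identifier
#     &time=2015-05-04T00:00:00Z    one of the advertised time positions
#     &bbox=140,-44,150,-36.5       minx,miny,maxx,maxy
#     &crs=OGC:CRS84                the coordinate reference system
#     &format=GeoTIFF_Float         one of the supported formats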

If you're racing ahead, pick some other NCI data examples and try the same process. See what you come up with!

GDAL and cartopy are also available in this environment - can you make a prettier map?
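Here's one possible starting point - a sketch using cartopy, assuming the returned array sits on a regular WGS84 grid spanning the bounding box we requested:

import cartopy.crs as ccrs

# Plot the WCS subset over coastlines; the extent matches our requested bbox
# (140, -44, 150, -36.5), which is an assumption about the returned grid
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines(resolution='10m')
img = ax.imshow(oceancolour_array, extent=(140, 150, -44, -36.5),
                transform=ccrs.PlateCarree(), cmap='viridis', vmin=0, vmax=2.5)
plt.colorbar(img)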
Task: subsetting the same data with the NetCDF Subset Service

# Import the libraries we need for NCSS requests
from netCDF4 import Dataset
from siphon import catalog, ncss
import datetime
# Choose a dataset - we'll use ocean colour again
url = 'http://dapds00.nci.org.au/thredds/catalog/u39/public/data/modis/oc.stacked/v201503/catalog.xml'
# Create a Siphon THREDDS catalog object
tds = catalog.TDSCatalog(url)
# Get a list of dataset objects from the catalog object
datasets = list(tds.datasets.values())
# Get a list of the names of the dataset objects
datasetnames = list(tds.datasets.keys())
# Show the dataset names
datasetnames
# We will cheat madly and just count down the list to find the dataset we want - in this case
# ocean colour for 2015, which is dataset number 14. Since Python is 0-indexed, we need
# dataset index 13. This could be automated, e.g. search the name list for 'oc3.2015'
# and use its index here.
oc_2015 = datasets[13]
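As the comment suggests, the lookup can be automated - a sketch, assuming exactly one dataset name contains 'oc3.2015':

# Find the index of the 2015 ocean colour dataset by name instead of counting
idx = next(i for i, name in enumerate(datasetnames) if 'oc3.2015' in name)
oc_2015 = datasets[idx]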
# So, for our desired dataset, what are its access URLs?
print(oc_2015.access_urls)
# Out of all those, let's use the NetcdfSubset URL to create an NCSS object
ncss_oc2015 = ncss.NCSS(oc_2015.access_urls['NetcdfSubset'])
Now that we have a dataset of choice, how can we find out more about it before we request data?

The next few cells more or less execute various netCDF metadata queries and return the results.

# Now, let's see what is in our file - what variables does it contain?
# (This is a bit like doing ncdump -h)
ncss_oc2015.metadata.variables
# What are the axes of our variable(s)?
ncss_oc2015.metadata.axes
# What is our spatial coverage?
ncss_oc2015.metadata.lat_lon_box
# What is the time span?
ncss_oc2015.metadata.time_span
Now that we know something about the data, we can construct a request to get part of it.
# Now construct a NetCDF subset query object
query = ncss_oc2015.query()
# ...and add attributes to it:
# spatial bounds (note the different coordinate order to WCS)
bbox = (140, 150, -44, -36.5)
query.lonlat_box(north=bbox[3], south=bbox[2], east=bbox[1], west=bbox[0])
# variable of interest
query.variables('chl_oc3')
# temporal range - new for NCSS! In WCS we could select one time slice.
query.time_range(datetime.date(year=2015, month=1, day=1),
                 datetime.date(year=2015, month=5, day=1))
# Now we download actual data - until now we have only looked at metadata
# and have not retrieved anything!
data = ncss_oc2015.get_data(query)
# What does our returned object look like?
data
# Siphon organises data into numpy arrays for us - so we can access it directly.
# We also don't need to write out the result, just handle it in memory until
# we're satisfied that we want to write a result to disk.
data['time'][:]
# Let's see what ocean colour data we have here. Hopefully more than one time slice!
# Let's take the first time slice.
plt.imshow(data['chl_oc3'][0,:,:], cmap=plt.get_cmap('viridis'), vmin=0, vmax=2.5)
plt.colorbar()
# ...and the next time slice...
plt.imshow(data['chl_oc3'][1,:,:], cmap=plt.get_cmap('viridis'), vmin=0, vmax=2.5)
plt.colorbar()
# ...and yes we do! Here is ocean colour at another time. If we spent a bit more time
# we could make a prettier way to query time - and then show a bunch of ocean colour
# plots with nice time labels. Feel free to do so!
plt.imshow(data['chl_oc3'][2,:,:], cmap=plt.get_cmap('viridis'), vmin=0, vmax=2.5)
plt.colorbar()
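For instance, here is a sketch of plots with time labels, assuming the returned object behaves like a netCDF4 Dataset whose time variable carries a CF-style units attribute:

from netCDF4 import num2date

# Convert raw time values to datetimes using the time variable's units attribute
times = num2date(data['time'][:], data['time'].units)
for i, t in enumerate(times[:3]):
    plt.figure()
    plt.imshow(data['chl_oc3'][i, :, :], cmap='viridis', vmin=0, vmax=2.5)
    plt.title('chl_oc3 at {}'.format(t))
    plt.colorbar()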
# One last question - how big is our data chunk? Lots bigger than the WCS version!
# But it does have 124 time steps...
thedata = np.array(data['chl_oc3'])
print('data subset size: {:.2f} MB'.format(thedata.nbytes/(1024*1024)))

NCSS summary

We've shown here how to generate a programmatic request for data subsets using the NetCDF subset service available on NCI's THREDDS data server. We've also shown that we can visualise data subsets quickly and easily.

How is NCSS different from WCS?

NCSS primarily gives us the ability to pull n-dimensional data subsets. In the WCS example we could grab only a single time slice; with NCSS we can create little blocks of data in three or more dimensions. Both are essentially lazy - the data can be queried before download. NCSS allows a little deeper interrogation than WCS, since WCS must remain agnostic about the underlying data format, whereas NCSS is specifically designed to expose netCDF file properties.

Which is better for your task?

That's your personal decision. Go forth and analyse!