
NCI data access training

Monash University, July 2017

Trainers

  • Dr Jingbo Wang

  • Dr Joseph Antony

  • Dr Adam Steer

Aims:

Students should leave with an understanding of how to:

1. Access data from NCI remotely using a web coverage service request

  • directly in a web browser or programmatic web request

  • using Python's OWSLib

2. Access data remotely using the NetCDF subset service

  • directly in a web browser or programmatic request

  • using Python's Siphon library

Assumptions:

  1. Some familiarity with Python 3 and the Jupyter environment

  2. Some familiarity with netCDF files

  3. Students have been provided NCI materials on data discovery

Caution - Most of NCI's example notebooks are developed using Python 2. This is a Python 3 environment - if in doubt about some aspect of the code, please ask.
# First, let's set up some basic libraries we need for the exercise. More specific software will be pulled in later.
# We need matplotlib later to draw some pictures, let's import it now
import matplotlib.pyplot as plt
import numpy as np

# This is a Jupyter cell magic, which tells the notebook to draw plots in a notebook cell
%matplotlib inline

# The next two modules let us stream data from a binary blob into a numpy array
# so that we can visualise data without saving and reopening a file.
# Note - if you do this at home, you need to install pillow (an imaging library) to use scipy.misc
from scipy.misc import imread
import io

Task: extracting a data subset from a massive file - ocean colour, 15.65 GB

We want to grab ocean colour in a small region off the east coast of Australia (say, the coast of Victoria), but we don't want to download the whole 15 GB file to do that. Here is our dataset in the NCI THREDDS catalogue:

http://dapds00.nci.org.au/thredds/catalog/u39/public/data/modis/oc.stacked/v201503/catalog.html

We'll use two different services - the Web Coverage Service (WCS) and the NetCDF Subset Service (NCSS) - to get some data.
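If you'd like to see what a raw programmatic web request looks like before we use OWSLib, here is a minimal sketch of a WCS GetCapabilities call against the same endpoint. It assumes the standard WCS 1.0.0 key-value-pair parameters and that the requests library is installed:

import requests

# A sketch of a raw GetCapabilities request, built from standard
# WCS 1.0.0 key-value-pair parameters rather than via OWSLib
endpoint = 'http://dapds00.nci.org.au/thredds/wcs/u39/public/data/modis/oc.stacked/v201503/chl_oc3.2015.nc'
params = {'service': 'WCS', 'version': '1.0.0', 'request': 'GetCapabilities'}
response = requests.get(endpoint, params=params)
print(response.text[:500])  # the start of the XML capabilities document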

# Import OWSLib - a library for handling many types of OGC web service requests
from owslib.wcs import WebCoverageService
# Create a WCS object - here, we add a URL to the data source
wcs = WebCoverageService('http://dapds00.nci.org.au/thredds/wcs/u39/public/data/modis/oc.stacked/v201503/chl_oc3.2015.nc?')
We might not know what is in our data - how can we find out?
# What's in our object? What layers are available?
print(wcs.contents)

We see two available layers - 'chl_oc3' and 'l2_flags'. Since we are after ocean colour and not QA flags, let's proceed with 'chl_oc3'.
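If you just want the layer identifiers on their own, wcs.contents behaves like a dictionary, so its keys list them directly:

# wcs.contents maps layer identifiers to layer metadata objects
list(wcs.contents.keys())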

# Create an OWSLib object using our chosen layer, and call it something we can remember
oceancolour = wcs['chl_oc3']
# dir() tells us what attributes are available on our object. Looking at the list of items
# without preceding underscores, we can find out some things about our layer
dir(oceancolour)
# What formats can we download parts of our file as?
oceancolour.supportedFormats
# What is the geospatial boundary of our file?
# Question to ponder - why are no other bounding boxes defined?
oceancolour.boundingBoxWGS84
# What are the temporal bounds of our file?
oceancolour.timelimits
# What are our time slices?
print(oceancolour.timepositions)
# Which coordinate reference systems are supported?
oceancolour.supportedCRS

Now we know enough to build a query and get some data:

  • spatial and temporal boundaries

  • available file formats

  • CRS information is not available, so let's try WGS84 (EPSG:4326 / OGC:CRS84)

# Get some data!
# Here we choose a layer (identifier), a timestamp from the list above, a bounding box
# inside the data domain, and an output format.
# For THREDDS, choosing 'GeoTIFF' gets you data scaled to the range 0-255.
# 'GeoTIFF_Float' gets you unscaled data.
mychunkofocean = wcs.getCoverage(identifier='chl_oc3', time=['2015-05-04T00:00:00Z'],
                                 bbox=(140, -44, 150, -36.5), format='GeoTIFF_Float',
                                 crs='OGC:CRS84')
# Show what our URL would look like - try this in a web browser, you should get a GeoTIFF file!
mychunkofocean.geturl()
# Let's make that pretty - remove all the escaped characters.
from urllib import parse
parse.unquote(mychunkofocean.geturl())
Try pasting the URL above into a web browser - or modify any of the parameters to see what you get.
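For example, here is the same request with the output format swapped to the scaled 8-bit 'GeoTIFF' mentioned above - a sketch reusing the parameters from the earlier getCoverage call:

# Same request as before, but asking for data scaled to the range 0-255
scaledchunk = wcs.getCoverage(identifier='chl_oc3', time=['2015-05-04T00:00:00Z'],
                              bbox=(140, -44, 150, -36.5), format='GeoTIFF',
                              crs='OGC:CRS84')
scaledchunk.geturl()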
# Visualise the data - we don't want to download the GeoTIFF here, we can dump it straight
# into a numpy array. Here we read the binary stream from our WCS request into that array
oceancolour_array = imread(io.BytesIO(mychunkofocean.read()))
print('data subset size: {:.2f} MB'.format(oceancolour_array.nbytes/(1024*1024)))
# Make all our nodata values into NaNs for plotting
oceancolour_array[oceancolour_array < 0.0] = np.nan
# ...and use matplotlib to show it!
plt.imshow(oceancolour_array, cmap=plt.get_cmap('viridis'), vmin=0, vmax=2.5)
plt.colorbar()
# After all that, we can still write out a GeoTIFF - since we have not modified the original
# request content! Write the file out to disk if you like, or if you are using this notebook
# on your own computer.
# Note: if the file comes out empty, re-run the getCoverage cell first - some response
# objects can only be read once, and we already called read() above.
with open('mychunkofocean.tif', 'wb') as f:
    f.write(mychunkofocean.read())

WCS summary

In this section we showed how to build a Web Coverage Service request using data held as netCDF files at NCI. The reference notebook 'THREDDS_WMS_WCS.ipynb' shows in detail how to construct URLs and what all the components mean. Here, we used a Python library as a convenient tool to get data (1.15 MB from a 15 GB file) for a specific time and region, then did a quick visualisation in an interactive notebook without writing any files out.
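As a rough guide (the reference notebook has the full story), the URL that geturl() produced is just the WCS endpoint plus standard WCS 1.0.0 key-value parameters:

# Anatomy of the GetCoverage URL (a sketch, not an exact copy of the output):
#   <endpoint>?service=WCS&version=1.0.0&request=GetCoverage
#     &coverage=chl_oc3             the layer identifier
#     &time=2015-05-04T00:00:00Z    one of the advertised time positions
#     &bbox=140,-44,150,-36.5       minx,miny,maxx,maxy
#     &crs=OGC:CRS84                the coordinate reference system
#     &format=GeoTIFF_Float         one of the supported formats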

If you're racing ahead, pick some other NCI data examples and try the same process. See what you come up with!

GDAL and cartopy are also available in this environment - can you make a prettier map?
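Here's one possible starting point - a sketch using cartopy, assuming the returned array sits on a regular WGS84 grid spanning the bounding box we requested:

import cartopy.crs as ccrs

# Plot the WCS subset over coastlines; the extent matches our requested bbox
# (140, -44, 150, -36.5), which is an assumption about the returned grid
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines(resolution='10m')
img = ax.imshow(oceancolour_array, extent=(140, 150, -44, -36.5),
                transform=ccrs.PlateCarree(), cmap='viridis', vmin=0, vmax=2.5)
plt.colorbar(img)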
Task: subsetting the same data with the NetCDF Subset Service

# Import the libraries we need for NCSS requests
from netCDF4 import Dataset
from siphon import catalog, ncss
import datetime
# Choose a dataset - we'll use ocean colour again
url = 'http://dapds00.nci.org.au/thredds/catalog/u39/public/data/modis/oc.stacked/v201503/catalog.xml'
# Create a Siphon THREDDS catalog object
tds = catalog.TDSCatalog(url)
# Get a list of dataset objects from the catalog object
datasets = list(tds.datasets.values())
# Get a list of the names of the dataset objects
datasetnames = list(tds.datasets.keys())
# Show the dataset names
datasetnames
# We will cheat madly and just count down the list to find the dataset we want - in this case
# ocean colour for 2015, which is dataset number 14. Since Python is 0-indexed, we need
# dataset index 13. This could be automated, e.g. search the name list for 'oc3.2015'
# and use its index here.
oc_2015 = datasets[13]
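As the comment suggests, the lookup can be automated - a sketch, assuming exactly one dataset name contains 'oc3.2015':

# Find the index of the 2015 ocean colour dataset by name instead of counting
idx = next(i for i, name in enumerate(datasetnames) if 'oc3.2015' in name)
oc_2015 = datasets[idx]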
# So, for our desired dataset, what are its access URLs?
print(oc_2015.access_urls)
# Out of all those, let's use the NetcdfSubset URL to create an NCSS object
ncss_oc2015 = ncss.NCSS(oc_2015.access_urls['NetcdfSubset'])
Now that we have a dataset of choice, how can we find out more about it before we request data?

The next few cells more or less execute various netCDF metadata queries and return the results.

# Now, let's see what is in our file - what variables does it contain?
# (This is a bit like doing ncdump -h)
ncss_oc2015.metadata.variables
# What are the axes of our variable(s)?
ncss_oc2015.metadata.axes
# What is our spatial coverage?
ncss_oc2015.metadata.lat_lon_box
# What is the time span?
ncss_oc2015.metadata.time_span
Now that we know something about the data, we can construct a request to get part of it.
# Now construct a NetCDF subset query object
query = ncss_oc2015.query()
# ...and add attributes to it:
# spatial bounds (note the different coordinate order to WCS)
bbox = (140, 150, -44, -36.5)
query.lonlat_box(north=bbox[3], south=bbox[2], east=bbox[1], west=bbox[0])
# variable of interest
query.variables('chl_oc3')
# temporal range - new for NCSS! In WCS we could select one time slice.
query.time_range(datetime.date(year=2015, month=1, day=1),
                 datetime.date(year=2015, month=5, day=1))
# Now we download actual data - until now we have only looked at metadata
# and have not retrieved anything!
data = ncss_oc2015.get_data(query)
# What does our returned object look like?
data
# Siphon organises data into numpy arrays for us - so we can access it directly.
# We also don't need to write out the result, just handle it in memory until
# we're satisfied that we want to write a result to disk.
data['time'][:]
# Let's see what ocean colour data we have here. Hopefully more than one time slice!
# Let's take the first time slice.
plt.imshow(data['chl_oc3'][0,:,:], cmap=plt.get_cmap('viridis'), vmin=0, vmax=2.5)
plt.colorbar()
# ...and the next time slice...
plt.imshow(data['chl_oc3'][1,:,:], cmap=plt.get_cmap('viridis'), vmin=0, vmax=2.5)
plt.colorbar()
# ...and yes we do! Here is ocean colour at another time. If we spent a bit more time
# we could make a prettier way to query time - and then show a bunch of ocean colour
# plots with nice time labels. Feel free to do so!
plt.imshow(data['chl_oc3'][2,:,:], cmap=plt.get_cmap('viridis'), vmin=0, vmax=2.5)
plt.colorbar()
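For instance, here is a sketch of plots with time labels, assuming the returned object behaves like a netCDF4 Dataset whose time variable carries a CF-style units attribute:

from netCDF4 import num2date

# Convert raw time values to datetimes using the time variable's units attribute
times = num2date(data['time'][:], data['time'].units)
for i, t in enumerate(times[:3]):
    plt.figure()
    plt.imshow(data['chl_oc3'][i, :, :], cmap='viridis', vmin=0, vmax=2.5)
    plt.title('chl_oc3 at {}'.format(t))
    plt.colorbar()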
# One last question - how big is our data chunk? Lots bigger than the WCS version!
# But it does have 124 time steps...
thedata = np.array(data['chl_oc3'])
print('data subset size: {:.2f} MB'.format(thedata.nbytes/(1024*1024)))

NCSS summary

We've shown here how to generate a programmatic request for data subsets using the NetCDF subset service available on NCI's THREDDS data server. We've also shown that we can visualise data subsets quickly and easily.

How is NCSS different from WCS?

NCSS primarily gives us the ability to pull n-dimensional data subsets. In the WCS example we could grab only a single time slice; with NCSS we can create little blocks of data in three or more dimensions. Both are essentially lazy - the data can be queried before download. NCSS allows a little deeper interrogation than WCS, since WCS must remain agnostic about the underlying data format, whereas NCSS is specifically designed to expose netCDF file properties.

Which is better for your task?

That's your personal decision. Go forth and analyse!