Loading and saving data
Often when we are working with models, we will need to load data that does not currently exist within the Python environment. Additionally, as we create data within the Python environment, we may want to save it to disk so that we can have access to it again later (e.g., after the IPython kernel is restarted).
One common use case for this is if we have a computationally intensive model that takes a long time to run. If we save out the data from the model after it finishes running, then we don't have to wait for it to run every time we want to use that data.
Another common use case is when you have data from a different source -- for example, someone else's dataset, or perhaps human data you collected by hand. In these cases, we need to be able to load the existing data into our Python environment before we can do anything with it.
Loading
Numpy arrays are typically stored with an extension of .npy or .npz. The .npy extension means that the file contains only a single numpy array, while the .npz extension indicates that there are multiple arrays saved in the file. To load either type of data, we use the np.load function and provide a path to the data.
If you go back to the notebook list, you'll see a directory called data. This directory is on the file system of the server that hosts this notebook. If you click on the directory, you'll see three files: experiment_data.npy, experiment_participants.npy, and panda.npy. For now, we'll be paying attention to the first two files (the panda.npy file will be used later on in this problem):
(Note: it is possible that you will see more files than that, if you have already run through this notebook or other notebooks.)
Let's load the experiment_data.npy file. We're currently in the Problem Set 0 directory, so we just need to specify the data directory beneath that:
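A minimal sketch of that call, assuming the file sits at data/experiment_data.npy relative to this notebook (the variable name data is just an illustrative choice):

```python
import numpy as np

# load the array of response times from the data directory
data = np.load("data/experiment_data.npy")
print(data)
```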
We can also load the participants:
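A similar sketch for the participants file, again assuming the path shown above:

```python
# load the array of participant identifiers
participants = np.load("data/experiment_participants.npy")
print(participants)
```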
Saving
Let's say we wanted to save a copy of data. To do this, we use np.save, with the desired filename as the first argument and the numpy array as the second argument:
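For example, something along these lines (the output filename here is just an illustrative choice):

```python
# save a copy of the response-time array as a new .npy file
np.save("data/experiment_data_copy.npy", data)
```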
After running the above cell, your data/ directory should look like this:
What if we wanted to save our data and our participants into the same file? We can do this with a different function, called np.savez. This takes as arguments the path to the data, followed by all the arrays we want to save. We use keyword arguments in order to specify the name of each array: x (the array holding our experiment data) will be saved under the key data, and p (the array of experiment participants) will be saved under the key participants:
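A sketch of that call, using the arrays loaded above (here named data and participants) in place of the x and p mentioned in the text:

```python
# save both arrays into a single .npz file, naming each one with a keyword argument
np.savez("data/experiment.npz", data=data, participants=participants)
```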
Note that we used the extension of .npz for this file, to indicate that it contains multiple arrays.
After saving this data, we should see the file in our data/ directory:
Now, if we were to load this data back in, we wouldn't immediately get an array, but a special NpzFile object:
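For instance:

```python
# loading an .npz file gives back an NpzFile object rather than a plain array
experiment = np.load("data/experiment.npz")
print(type(experiment))
```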
This special object acts like a dictionary. We can see what arrays are in this object using the keys() method:
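Continuing from the cell above:

```python
print(list(experiment.keys()))   # should show the keys 'data' and 'participants'
```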
And we can extract each array by accessing it as we would if experiment were a dictionary:
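For example:

```python
data_from_file = experiment["data"]
participants_from_file = experiment["participants"]
```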
Exercise: Saving files (0.5 points)
Write code to save a new .npz file which also contains the data and participants arrays (just like with data/experiment.npz). Your new file should additionally contain an array called "trials", which is a 1D array of integers from 1 to 300 (inclusive). You will need to create this array yourself. Save your file under the name full_experiment.npz into the data/ directory.
Vectorized operations on multidimensional arrays
Let's take a look at the data from the previous part of the problem a little more closely. This data is response times from a (hypothetical) psychology experiment, collected over 300 trials for 50 participants. Thus, if we look at the shape of the data, we see that the participants correspond to the rows, and the trials correspond to the columns:
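For example, assuming data is the array loaded earlier (300 trials for 50 participants):

```python
print(data.shape)   # (50, 300): 50 participants (rows) by 300 trials (columns)
```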
In other words, element $(i, j)$ would be the $i^{\text{th}}$ participant's response time on trial $j$, in milliseconds.
One question we might want to ask about this data is: are people getting faster at responding over time? It might be hard to tell from a single participant. For example, these are the response times of the first 10 trials from the first participant:
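For instance, using basic slicing:

```python
print(data[0, :10])    # response times (ms) for participant 0, trials 0 through 9
```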
And these are the response times on the last 10 trials from the first participant:
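A matching sketch for the last ten trials:

```python
print(data[0, -10:])   # response times (ms) for participant 0, last 10 trials
```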
At a glance, it looks like the last 10 trials might be slightly faster, but it is hard to know for sure because the data is fairly noisy. Another way to look at the data would be to see what the average response time is for each trial. This gives a measure of how quickly people can respond on each trial in a general sense.
To compute the average response time for a trial, we can use the np.mean function. Here is the first trial:
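For example:

```python
print(np.mean(data[:, 0]))    # mean response time across participants on the first trial
```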
And the last trial:
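And correspondingly:

```python
print(np.mean(data[:, -1]))   # mean response time across participants on the last trial
```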
On average, response times are faster on the last trial than on the first trial. How would we compute this for all trials? One answer would be to use a list comprehension, like this:
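One way to write that list comprehension, as a sketch:

```python
# mean response time for each of the 300 trials, one column at a time
trial_means = np.array([np.mean(data[:, j]) for j in range(data.shape[1])])
```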
However, the np.mean function can actually do this for us, without the need for a list comprehension. What we want to do is to take the mean across participants, or the mean for each trial. Participants correspond to the rows, or the first dimension, so we can tell numpy to take the mean across this first dimension with the axis flag (remember that 0-indexing is used in Python, so the first dimension corresponds to axis 0):
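For example:

```python
trial_means = np.mean(data, axis=0)   # average over participants (rows), one value per trial
```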
We can verify that we took the mean for each trial by checking the shape of trial_means:
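```python
print(trial_means.shape)   # should be (300,): one mean per trial
```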
Checking the shape of your array after doing a computation over it is always a good idea, to verify that you did the computation over the correct axis. If we had accidentally used the wrong axis, our mistake would be readily apparent by checking the shape: we know that if we want the mean for each trial, we should end up with an array of shape (300,). However, if we use the wrong axis, this is not the case (we instead have the mean for every participant):
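For instance:

```python
participant_means = np.mean(data, axis=1)   # the wrong axis if we want per-trial means
print(participant_means.shape)              # (50,): one mean per participant instead
```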
Note that if we didn't specify the axis, the mean would be taken over the entire array:
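```python
print(np.mean(data))   # a single number: the mean over all participants and trials
```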
So, be careful about keeping track of the shapes of your arrays, and think about when you want to use an axis argument or not!
Plotting
Now that we've computed the average response time for each trial, it would be useful to actually visualize it, so we can see a graph instead of just numbers. To plot things, we need a special magic function that will cause the graphs to be displayed inline in the notebook:
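In Jupyter/IPython notebooks, that magic is %matplotlib inline:

```python
# make matplotlib figures appear inline in the notebook
%matplotlib inline
```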
You only need to run this command once per notebook (though, note that if you restart the kernel, you'll need to run the command again). In most cases, we'll provide a cell with imports for you so you don't have to worry about remembering to include it. You additionally need to import the pyplot module from matplotlib:
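The conventional import looks like this:

```python
import matplotlib.pyplot as plt
```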
Let's load our data in again, and compute some descriptive statistics, just to make it clearer what it is that we are going to be plotting. Recall that the trials array is an array from 1 to 300 (inclusive), corresponding to the trial numbers. This array will serve as our $x$-values:
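One way this setup might look; here, as a sketch, we rebuild trials with np.arange and recompute the statistics directly from the raw data rather than loading them from the exercise file:

```python
import numpy as np

data = np.load("data/experiment_data.npy")
trials = np.arange(1, 301)                # trial numbers 1 through 300, our x-values
trial_means = np.mean(data, axis=0)       # mean response time on each trial
trial_medians = np.median(data, axis=0)   # median response time on each trial
```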
To plot things, we'll be using the plt.subplots function. Most of the time when we use it, we'll call it as plt.subplots(), which just produces a figure object (corresponding to the entire graph) and an axis object (corresponding to the set of axes we'll be plotting on):
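```python
fig, axis = plt.subplots()   # one figure containing a single set of axes
```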
The distinction between the figure and the axis is clearer when we want multiple subplots, which we can also do with plt.subplots, by passing it two arguments: the number of rows of subplots that we want, and the number of columns of subplots that we want. Then, the function will return a single object for the entire figure, and a separate object for each subplot axis:
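For example, with one row and two columns of subplots:

```python
fig, (axis1, axis2) = plt.subplots(1, 2)   # one figure, two side-by-side axes
```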
In general, the figure object is used for things like adjusting the overall size of the figure, while the axis objects are used for actually plotting data. So, once we have created our figure and axis (or axes), we plot our data on the specified axis using axis.plot. The third argument, 'k', is a shorthand for the color our data should be plotted in (in this case, black):
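A sketch, using the trials and trial_means arrays from above:

```python
fig, axis = plt.subplots()
axis.plot(trials, trial_means, 'k')   # trial means as a black line
```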
Displaying multiple lines on the same axis is as easy as calling the axis.plot method again with new data! Below, we plot the trial means and medians as separate lines on the same plot, with means in black ('k') and medians in blue ('b'). We also introduce a new argument, label, which we use to specify the name for each line in the legend. To display the final legend, we call plt.legend() after we have added all the data we wish to plot to our specified axis.
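A sketch of that (the label strings here are just illustrative choices):

```python
fig, axis = plt.subplots()
axis.plot(trials, trial_means, 'k', label='mean')
axis.plot(trials, trial_medians, 'b', label='median')
plt.legend()
```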
A better visualization of this data would be points, rather than a line. To do this, we can change the 'k' argument for trial means to 'ko', indicating that the data should again be plotted in black ('k') and using a dot ('o'), and the 'b' argument for trial medians to 'bs' to plot them as blue ('b') squares ('s'):
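For example:

```python
fig, axis = plt.subplots()
axis.plot(trials, trial_means, 'ko', label='mean')       # black dots
axis.plot(trials, trial_medians, 'bs', label='median')   # blue squares
plt.legend()
```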
We can also specify the color using a keyword argument and an RGB (red-green-blue) value. The following plots trial_means as green dots and trial_medians as red squares:
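A sketch of that approach (the particular RGB tuples are just illustrative choices for green and red):

```python
fig, axis = plt.subplots()
axis.plot(trials, trial_means, 'o', color=(0.0, 0.5, 0.0), label='mean')      # green dots
axis.plot(trials, trial_medians, 's', color=(1.0, 0.0, 0.0), label='median')  # red squares
plt.legend()
```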
For a full list of all the different color and marker symbols (like 'ko'), take a look at the documentation on axis.plot:
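One way to pull that documentation up right in the notebook:

```python
help(axis.plot)
```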
Frequently, we will want to specify axis labels and a title for our plots, in order to be more descriptive about what is being visualized. To do this, we can use the set_xlabel, set_ylabel, and set_title commands:
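For example (the label and title strings are just illustrative choices):

```python
fig, axis = plt.subplots()
axis.plot(trials, trial_means, 'ko')
axis.set_xlabel('Trial')
axis.set_ylabel('Response time (ms)')
axis.set_title('Mean response time by trial')
```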
Another plotting function that we will use is matshow, which displays a heatmap using the values of a 2D array. For example, if we wanted to display an array of pixel intensities, we could use matshow to do so.
The file data/panda.npy contains an array of pixels that represent an image of a panda. As before, we can load in the data with np.load:
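For instance (the variable name panda is just an illustrative choice):

```python
panda = np.load("data/panda.npy")
print(panda.shape)   # a 2D array of pixel intensities
```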
And now visualize the pixel intensity values using matshow:
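A sketch of that call:

```python
fig, axis = plt.subplots()
axis.matshow(panda, cmap='gray')   # grayscale colormap, as discussed below
```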
It's important to specify that the colormap (or cmap) is 'gray', otherwise we end up with a crazy rainbow image!
Exercise: Squared exponential distance (4 points)
We are going to compute a squared exponential distance, which is given by the following equation:

$$D_{ij} = \exp\left(-(x_i - y_j)^2\right)$$

where $x$ is an $n$-length vector and $y$ is an $m$-length vector. The variable $x_i$ corresponds to the $i^{\text{th}}$ element of $x$, and the variable $y_j$ corresponds to the $j^{\text{th}}$ element of $y$.
Complete the function plot_squared_exponential to plot these distances using the matshow function. Check the docstring of plot_squared_exponential for additional constraints -- going forward, in general we will give brief instructions in green cells like the one above, and more detailed instructions in the function comment.

Now, let's see what our squared exponential function actually looks like. To do this, we will look at 100 linearly spaced values from -2 to 2 on the x axis and 100 linearly spaced values from -2 to 2 on the y axis. To generate these values, we will use the function np.linspace.
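A sketch of generating those values (the variable names are just illustrative):

```python
import numpy as np

# 100 evenly spaced values from -2 to 2 (inclusive) for each axis
xs = np.linspace(-2, 2, 100)
ys = np.linspace(-2, 2, 100)
```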