Scientific Visualisation with Matplotlib

Overview

Teaching: 50 min
Exercises: 30 min

Questions

How can I visualise my data?

Objectives

Generate a heatmap of longitudinal, numerical tabular data.

Create graphs of mean, minimum, and maximum characteristics over time from data.

Create graphs showing multiple data characteristics within a single plot and separate plots.

Save a generated graph to local storage.

Write a script to visualise data from multiple data files.

Use a library function to get a list of filenames that match a certain pattern.

The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers,” and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture of its own, but we can explore a few features of Python’s matplotlib library here. While there is no official plotting library, matplotlib is the de facto standard.

Some IPython Magic

As we did yesterday, when using a Jupyter notebook you’ll need to use the following magic in order for your matplotlib images to appear in the notebook when show() is called:
%matplotlib inline
Again, note that you only have to execute this function once per notebook.

Lines beginning with a single percent are not python code: they control how the notebook deals with python code. Lines beginning with two percents are “cell magics”, that tell Jupyter notebook how to interpret the particular cell.

Visualising our Inflammation Data

First, we will import the pyplot module from matplotlib and use two of its functions to create and display a heat map of our data (you won’t need the line beginning data = if you’re continuing directly after the previous lesson and already have it loaded):

import matplotlib.pyplot

data = numpy.loadtxt(fname='../data/inflammation-01.csv', delimiter=',')
image = matplotlib.pyplot.imshow(data)
matplotlib.pyplot.show()

inflammation-heatmap-imshow

Blue pixels in this heat map represent low values, while yellow pixels represent high values. As we can see, inflammation rises and falls over a 40-day period.

Let’s take a look at the average inflammation over time:

ave_inflammation = numpy.mean(data, axis=0)
ave_plot = matplotlib.pyplot.plot(ave_inflammation)
matplotlib.pyplot.show()

inflammation-average-imshow

Here, we have put the average per day across all patients in the variable ave_inflammation, then asked matplotlib.pyplot to create and display a line graph of those values. The result is a roughly linear rise and fall, which is suspicious: we might instead expect a sharper rise and slower fall. Let’s have a look at two other statistics:

max_plot = matplotlib.pyplot.plot(numpy.max(data, axis=0))
matplotlib.pyplot.show()

inflammation-maximum-imshow

min_plot = matplotlib.pyplot.plot(numpy.min(data, axis=0))
matplotlib.pyplot.show()

inflammation-minimum-imshow

The maximum value rises and falls smoothly, while the minimum seems to be a step function. Neither trend seems particularly likely, so either there’s a mistake in our calculations or something is wrong with our data. This insight would have been difficult to reach by examining the numbers themselves without visualization tools.

Make Your Own Plot

Create a plot showing the standard deviation (using numpy.std) of the inflammation data for each day across all patients.
Solution
std_plot = matplotlib.pyplot.plot(numpy.std(data, axis=0))
matplotlib.pyplot.show()

Multiple Plots: Single Graph

Perhaps we want to compare the minimum, maximum, and average plots overlayed together. This would allow us to see the range of values across each day in the trial. Let’s use PyCharm to build a script called overlay_graphs.py that positions our three graphs in a single plot, or ‘figure’.

So Matplotlib divides a figure object up into axes: each pair of axes is one ‘subplot’. To make a boring figure with just one pair of axes, however, we can just ask for a default new figure, with brand new axes. The subplots() function returns a (figure, axis) pair, which we can deal out with parallel assignment.

Given we have a stacked set of graphs in a single figure, we use legend() on our axes to add one which uses our given plot labels.

import numpy
import matplotlib.pyplot

data = numpy.loadtxt(fname='../data/inflammation-01.csv', delimiter=',')

all_graphs, all_graphs_axes = matplotlib.pyplot.subplots()

all_graphs_axes.plot(numpy.mean(data, axis=0), label='average')
all_graphs_axes.plot(numpy.max(data, axis=0), label='max')
all_graphs_axes.plot(numpy.min(data, axis=0), label='min')
all_graphs_axes.legend()

matplotlib.pyplot.show()

inflammation-combined-imshow

Multiple Plots: Multiple Graphs

We can also group similar plots within a single figure using subplots next to each other within that figure. Let’s use PyCharm to build another script called multiple_graphs.py that positions our three graphs side-by-side and introduces a number of new commands.

The function matplotlib.pyplot.figure() creates a space into which we will place all of our plots. The parameter figsize tells Python how big to make this space.

Each subplot is placed into the figure using its add_subplot method. The add_subplot method takes 3 parameters. The first denotes how many total rows of subplots there are, the second parameter refers to the total number of subplot columns, and the final parameter denotes which subplot your variable is referencing (left-to-right, top-to-bottom).

Each subplot is stored in a different variable (axes1, axes2, axes3). Once a subplot is created, the axes can be titled using the set_xlabel() command (or set_ylabel()).

import numpy
import matplotlib.pyplot

data = numpy.loadtxt(fname='../data/inflammation-01.csv', delimiter=',')

fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

avg_axes = fig.add_subplot(1, 3, 1)
max_axes = fig.add_subplot(1, 3, 2)
min_axes = fig.add_subplot(1, 3, 3)

avg_axes.set_ylabel('average')
avg_axes.plot(numpy.mean(data, axis=0))

max_axes.set_ylabel('max')
max_axes.plot(numpy.max(data, axis=0))

min_axes.set_ylabel('min')
min_axes.plot(numpy.min(data, axis=0))

fig.tight_layout()

matplotlib.pyplot.show()

inflammation-separate-imshow

The call to loadtxt reads our data, and the rest of the program tells the plotting library how large we want the figure to be, that we’re creating three subplots, what to draw for each one, and that we want a tight layout. (If we leave out that call to fig.tight_layout(), the graphs will actually be squeezed together more closely.)

Moving Plots Around

Save a new version of the program which displays the three plots vertically instead of horizontally.

Solution

import numpy
import matplotlib.pyplot

data = numpy.loadtxt(fname='../data/inflammation-01.csv', delimiter=',')

# change figsize (swap width and height)
fig = matplotlib.pyplot.figure(figsize=(3.0, 10.0))

# change add_subplot (swap first two parameters)
avg_axes = fig.add_subplot(3, 1, 1)
max_axes = fig.add_subplot(3, 1, 2)
min_axes = fig.add_subplot(3, 1, 3)

avg_axes.set_ylabel('average')
avg_axes.plot(numpy.mean(data, axis=0))

max_axes.set_ylabel('max')
max_axes.plot(numpy.max(data, axis=0))

min_axes.set_ylabel('min')
min_axes.plot(numpy.min(data, axis=0))

fig.tight_layout()

matplotlib.pyplot.show()

Saving our Plots

We can also save our plots to disk. Let’s change our overlay_graphs.py script to do that, by adding the following just before we call matplotlib.pyplot.show():

all_graphs.savefig('overlay_graphs.png')

When we re-run the script, you should see a new overlay_graphs.png file in the same directory as the script.

Dealing with Multiple Datasets

We also have other inflammation datasets, located in the data directory. Let’s try to generate and save visualisations for each of these datasets so we can compare them against each other, to increase our confidence that we have sensible datasets.

First, we need to have a way of determining a list of all our inflammation data files. Their filenames all follow the pattern inflammation-XX.csv, where XX refers to the number of that dataset. We can use the glob library here to help us get these filenames.

The glob library contains a function, also called glob, that finds files and directories whose names match a pattern. We provide those patterns as strings: the character * matches zero or more characters, while ? matches any one character. We can use this to get the names of all the CSV files in the data directory which resides in the directory above like so:

import glob
filenames = sorted(glob.glob('inflammation*.csv'))
print(filenames)

Now, glob() returns a list of matching filenames (and directory paths) in arbitrary order, so we use the inbuilt sorted() Python function to sort this for us:

['../data/inflammation-01.csv', '../data/inflammation-02.csv', '../data/inflammation-03.csv', '../data/inflammation-04.csv', '../data/inflammation-05.csv', '../data/inflammation-06.csv', '../data/inflammation-07.csv', '../data/inflammation-08.csv', '../data/inflammation-09.csv', '../data/inflammation-10.csv', '../data/inflammation-11.csv', '../data/inflammation-12.csv']

This means we can loop over it to do something with each filename in turn.

We’d like to save each of the generated plots using the pattern inflammation-XX.png. Each of the filenames we have in filenames has a .csv. on the end. So how to go about replacing the file extension with a .png one? We can use the os library to extract the file path for us, e.g.

import os

filename = '../data/inflammation-01.csv'
base = os.path.splitext(filename)[0]
new_filename = base + '.png'
print(new_filename)

os.path.splitext() splits a filename into its path/filename, and file extension components. So we just append the .png extension to the path/filename part, and get:

'../data/inflammation-01.png'

Processing Multiple Inflammation Datasets

Modify our script that generates the three horizontal graphs in a single figure (multiple_graphs.py) so that it processes each of the inflammation datasets in turn (each with a filename of the form inflammation-XX.csv), generating the figure for each, and saving it to local disk as a PNG file with a filename of the form inflammation-XX.png.

Solution

import glob
import numpy
import matplotlib.pyplot

filenames = sorted(glob.glob('../data/inflammation*.csv'))
for f in filenames:
    data = numpy.loadtxt(fname=f, delimiter=',')

    fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

    avg_axes = fig.add_subplot(1, 3, 1)
    max_axes = fig.add_subplot(1, 3, 2)
    min_axes = fig.add_subplot(1, 3, 3)

    avg_axes.set_ylabel('average')
    avg_axes.plot(numpy.mean(data, axis=0))

    max_axes.set_ylabel('max')
    max_axes.plot(numpy.max(data, axis=0))

    min_axes.set_ylabel('min')
    min_axes.plot(numpy.min(data, axis=0))

    fig.tight_layout()
    fig.savefig(f + '.png')

Refactor your graph generation code within a new function named generate_graph() that takes a NumPy array as an argument, generates the figure from the input data, and returns the generated figure. Use this function within your loop instead.

Solution

import glob
import numpy
import matplotlib.pyplot

def generate_graph(data):
    fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

    avg_axes = fig.add_subplot(1, 3, 1)
    max_axes = fig.add_subplot(1, 3, 2)
    min_axes = fig.add_subplot(1, 3, 3)

    avg_axes.set_ylabel('average')
    avg_axes.plot(numpy.mean(data, axis=0))

    max_axes.set_ylabel('max')
    max_axes.plot(numpy.max(data, axis=0))

    min_axes.set_ylabel('min')
    min_axes.plot(numpy.min(data, axis=0))

    fig.tight_layout()

    return fig

filenames = sorted(glob.glob('../data/inflammation*.csv'))
for f in filenames:
    data = numpy.loadtxt(fname=f, delimiter=',')

    figure = generate_graph(data)

    figure.savefig(f + '.png')

Key Points

Use %matplotlib inline to show our graphs immediately after creation within a Notebook.

Use matplotlib.pyplot.plot(data) to generate a graph from data.

Use matplotlib.pyplot.show() to display a generated graph.

Matplotlib allows us to add multiple graphs within a single plot, or within separate plots using a figure.

Set vertical axes labels using set_ylabel('label').

Save a generated graph using graph.savefig('filename').

Use glob.glob(pattern) to create a list of files whose names match a pattern.

Use * in a pattern to match zero or more characters, and ? to match any single character.

previous episode

Functional Programming and Scientific Analysis and Visualisation

lesson home

Scientific Visualisation with Matplotlib

Overview

Some IPython Magic

Visualising our Inflammation Data

Make Your Own Plot

Solution

Multiple Plots: Single Graph

Multiple Plots: Multiple Graphs

Moving Plots Around

Solution

Saving our Plots

Dealing with Multiple Datasets

Processing Multiple Inflammation Datasets

Solution

Solution

Key Points

previous episode

lesson home