This lesson is being piloted (Beta version)

Scientific Visualisation with Matplotlib

Overview

Teaching: 0 min
Exercises: 70 min
Questions
  • How can I visualise my data?

  • What is Matplotlib and what can I use it for?

Objectives
  • Generate a heatmap of longitudinal, numerical tabular data.

  • Create graphs of mean, minimum, and maximum characteristics over time from data.

  • Create graphs showing multiple data characteristics within a single plot and separate plots.

  • Save a generated graph to local storage.

  • Write a script to visualise data from multiple data files.

  • Use a library function to get a list of filenames that match a certain pattern.

The mathematician Richard Hamming once said, “The purpose of computing is insight, not numbers,” and the best way to develop insight is often to visualise data. Visualisation deserves an entire lecture of its own, but we can explore a few features of Python’s Matplotlib library here. While Python has no official plotting library, Matplotlib is the de facto standard.

Introduction to Matplotlib (and NumPy)

If you haven’t already, see the topic video lecture and the PowerPoint slides with per-slide notes.

Adding Matplotlib to our Virtual Environment

As with NumPy, we need to install the external Matplotlib library before we can use it. First, make sure you’re on the Bash command line, exiting Python if needed.

If we need to reactivate our virtual environment we can do:

cd
cd se-day2/code
source venv/bin/activate

And then, to install the package:

pip3 install matplotlib

Matplotlib makes use of a renderer to allow us to view generated graphs and plots as images. The default renderer (‘agg’) isn’t really suitable for this since it doesn’t allow us to view plots as we generate them in Python. So if you’re using one of the provided Ubuntu laptops (or the SABS virtual machine) we need to install another one via the Ubuntu operating system’s package manager:

sudo apt-get install python3-tk

You’ll be asked for the dtcse user’s password. Once you have entered it and pressed Enter to confirm the installation, the package will be installed.

Visualising our Inflammation Data

Using Microsoft’s Windows Subsystem for Linux (WSL)?

If not, you can ignore this! We’re going to be using Matplotlib’s PyPlot show() function to display generated graphs. However, if you’re using Microsoft’s WSL, you’ll very likely find this doesn’t work, since WSL doesn’t support a graphical interface. We’ll be covering this later, but as a workaround you can save the graph to the filesystem as an image, then open the generated image under Windows.

So where you see:

matplotlib.pyplot.show()

Instead use:

matplotlib.pyplot.savefig('output.png')

Then you can find and open the output.png file under Windows.
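If you would rather not depend on a display at all, one option is to select the non-interactive ‘agg’ renderer explicitly before plotting. The following is a minimal sketch rather than one of the lesson scripts: the values array is made-up stand-in data, and output.png matches the filename used above.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive renderer: draws to files, never opens a window
import matplotlib.pyplot
import numpy as np

# Made-up stand-in data (an assumption, for illustration only)
values = np.array([0, 1, 4, 9, 16])

matplotlib.pyplot.plot(values)
matplotlib.pyplot.savefig('output.png')  # write the current figure to disk
matplotlib.pyplot.close()                # discard the figure once it is saved
```

With this renderer selected, show() will not open a window, so this approach only suits scripts that save their output.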

First, we will import numpy and the pyplot module from matplotlib and use two of its functions to create and display a heat map of our data (you won’t need the line beginning data = if you’re continuing directly after the previous lesson and already have it loaded):

import numpy as np
import matplotlib.pyplot

data = np.loadtxt(fname='../data/inflammation-01.csv', delimiter=',')
image = matplotlib.pyplot.imshow(data)
matplotlib.pyplot.show()

inflammation-heatmap-imshow

Blue pixels in this heat map represent low values, while yellow pixels represent high values. As we can see, inflammation rises and falls over a 40-day period.

When we close the generated graph, note that running matplotlib.pyplot.show() again doesn’t show us the graph. This odd behaviour comes down to Matplotlib’s hidden state and is a design decision: show() represents the end of the expected graph creation process and is only intended to be used once. So, annoyingly, in order to display the graph again we’d need to recreate it.
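The hidden state mentioned above can be observed directly. The sketch below is an illustration under assumptions: it uses the non-interactive ‘agg’ renderer so that it runs without a display, and it uses close('all') to stand in for closing the graph window, which is only a rough equivalent.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: no display is available, so no window is opened
import matplotlib.pyplot

matplotlib.pyplot.close('all')                 # start from a clean slate
before = matplotlib.pyplot.get_fignums()       # figures pyplot is tracking now

matplotlib.pyplot.plot([1, 2, 3])              # implicitly creates a figure
during = matplotlib.pyplot.get_fignums()

matplotlib.pyplot.close('all')                 # stands in for closing the window
after = matplotlib.pyplot.get_fignums()

print(before, during, after)
```

The plot() call adds a figure to the list that pyplot tracks, and closing removes it again, which is why a later show() has nothing left to display.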

Let’s take a look at the average inflammation over time:

ave_inflammation = np.mean(data, axis=0)
ave_plot = matplotlib.pyplot.plot(ave_inflammation)
matplotlib.pyplot.show()

inflammation-average-imshow

Here, we have put the average per day across all patients in the variable ave_inflammation, then asked matplotlib.pyplot to create and display a line graph of those values. The result is a roughly linear rise and fall, which is suspicious: we might instead expect a sharper rise and slower fall. Let’s have a look at two other statistics:

max_plot = matplotlib.pyplot.plot(np.max(data, axis=0))
matplotlib.pyplot.show()

inflammation-maximum-imshow

min_plot = matplotlib.pyplot.plot(np.min(data, axis=0))
matplotlib.pyplot.show()

inflammation-minimum-imshow

The maximum value rises and falls smoothly, while the minimum seems to be a step function. Neither trend seems particularly likely, so either there’s a mistake in our calculations or something is wrong with our data. This insight would have been difficult to reach by examining the numbers themselves without visualization tools.
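To see what axis=0 does in the calls above, it can help to try it on an array small enough to check by hand. The tiny array below is made up (two ‘patients’, three ‘days’) and is not taken from the inflammation files.

```python
import numpy as np

# Made-up data: each row is one patient, each column is one day
tiny = np.array([[0.0, 1.0, 2.0],
                 [2.0, 3.0, 4.0]])

per_day_mean = np.mean(tiny, axis=0)      # collapse the rows: one value per day
per_day_max = np.max(tiny, axis=0)
per_day_min = np.min(tiny, axis=0)
per_patient_mean = np.mean(tiny, axis=1)  # collapse the columns: one value per patient

print(per_day_mean)      # [1. 2. 3.]
print(per_day_max)       # [2. 3. 4.]
print(per_day_min)       # [0. 1. 2.]
print(per_patient_mean)  # [1. 3.]
```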

Make Your Own Plot

Create a plot showing the standard deviation (using NumPy’s std() function i.e. np.std) of the inflammation data for each day across all patients.

Solution

std_plot = matplotlib.pyplot.plot(np.std(data, axis=0))
matplotlib.pyplot.show()

Multiple Plots: Single Graph

Perhaps we want to compare the minimum, maximum, and average plots overlaid on one another. This would allow us to see the range of values across each day in the trial. Let’s use VSCode to build a script called overlay_graphs.py that positions our three graphs in a single plot, or ‘figure’.

Matplotlib divides a figure object up into axes: each pair of axes is one ‘subplot’. To make a boring figure with just one pair of axes, however, we can just ask for a default new figure, with brand new axes. The subplots() function returns a (figure, axes) pair, which we can unpack into two variables with parallel assignment.

Since we have several graphs overlaid in a single figure, we call legend() on our axes to add a legend, which uses the label we gave to each plot.

import numpy as np
import matplotlib.pyplot

data = np.loadtxt(fname='../data/inflammation-01.csv', delimiter=',')

all_graphs, all_graphs_axes = matplotlib.pyplot.subplots()

all_graphs_axes.plot(np.mean(data, axis=0), label='average')
all_graphs_axes.plot(np.max(data, axis=0), label='max')
all_graphs_axes.plot(np.min(data, axis=0), label='min')
all_graphs_axes.legend()

matplotlib.pyplot.show()

inflammation-combined-imshow

Multiple Plots: Multiple Graphs

We can also group similar plots within a single figure using subplots next to each other within that figure. Let’s use VSCode to build another script called multiple_graphs.py that positions our three graphs side-by-side and introduces a number of new commands.

The function matplotlib.pyplot.figure() creates a space into which we will place all of our plots. The parameter figsize tells Matplotlib how big to make this space, as a (width, height) pair in inches.

Each subplot is placed into the figure using its add_subplot method. The add_subplot method takes 3 parameters. The first denotes how many total rows of subplots there are, the second parameter refers to the total number of subplot columns, and the final parameter denotes which subplot your variable is referencing (left-to-right, top-to-bottom).

Each subplot is stored in a different variable (avg_axes, max_axes, min_axes). Once a subplot is created, its axes can be labelled using the set_xlabel() command (or set_ylabel()).
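The left-to-right, top-to-bottom numbering is easier to see in a grid with more than one row. This sketch is not one of the lesson scripts: it assumes a 2 by 2 grid, uses the non-interactive ‘agg’ renderer so that it runs without a display, and writes to a hypothetical file named subplot_numbering.png.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: run without a display
import matplotlib.pyplot

fig = matplotlib.pyplot.figure(figsize=(6.0, 4.0))

# 2 rows x 2 columns: positions are numbered 1 to 4, row by row
for position in range(1, 5):
    axes = fig.add_subplot(2, 2, position)
    axes.set_title('position ' + str(position))

fig.tight_layout()
fig.savefig('subplot_numbering.png')  # hypothetical output filename
matplotlib.pyplot.close(fig)
```

In the saved image, positions 1 and 2 fill the top row and positions 3 and 4 fill the bottom row.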

import numpy as np
import matplotlib.pyplot

data = np.loadtxt(fname='../data/inflammation-01.csv', delimiter=',')

fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

avg_axes = fig.add_subplot(1, 3, 1)
max_axes = fig.add_subplot(1, 3, 2)
min_axes = fig.add_subplot(1, 3, 3)

avg_axes.set_ylabel('average')
avg_axes.plot(np.mean(data, axis=0))

max_axes.set_ylabel('max')
max_axes.plot(np.max(data, axis=0))

min_axes.set_ylabel('min')
min_axes.plot(np.min(data, axis=0))

fig.tight_layout()

matplotlib.pyplot.show()

inflammation-separate-imshow

The call to loadtxt reads our data, and the rest of the program tells the plotting library how large we want the figure to be, that we’re creating three subplots, what to draw for each one, and that we want a tight layout. (If we leave out that call to fig.tight_layout(), the graphs will actually be squeezed together more closely.)
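As an alternative to calling add_subplot three times, the subplots() function used in overlay_graphs.py can also create a whole grid of axes in one call. The following is a sketch of that variation, not the lesson’s script: it uses a small made-up array in place of the CSV file, the ‘agg’ renderer so that it runs without a display, and a hypothetical output filename.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: run without a display
import matplotlib.pyplot
import numpy as np

# Made-up stand-in for the inflammation data: 4 patients x 6 days
data = np.array([[0, 1, 3, 2, 1, 0],
                 [0, 2, 4, 3, 1, 0],
                 [0, 1, 2, 2, 1, 0],
                 [0, 2, 5, 3, 2, 0]], dtype=float)

# One row, three columns: returns the figure and an array of three axes,
# which we unpack straight into three variables
fig, (avg_axes, max_axes, min_axes) = matplotlib.pyplot.subplots(1, 3, figsize=(10.0, 3.0))

avg_axes.set_ylabel('average')
avg_axes.plot(np.mean(data, axis=0))

max_axes.set_ylabel('max')
max_axes.plot(np.max(data, axis=0))

min_axes.set_ylabel('min')
min_axes.plot(np.min(data, axis=0))

fig.tight_layout()
fig.savefig('multiple_graphs_sketch.png')  # hypothetical output filename
matplotlib.pyplot.close(fig)
```

Both approaches produce the same layout; which one reads better is largely a matter of taste.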

Moving Plots Around

Save a new version of the program which displays the three plots vertically instead of horizontally.

Solution

import numpy as np
import matplotlib.pyplot

data = np.loadtxt(fname='../data/inflammation-01.csv', delimiter=',')

# change figsize (swap width and height)
fig = matplotlib.pyplot.figure(figsize=(3.0, 10.0))

# change add_subplot (swap first two parameters)
avg_axes = fig.add_subplot(3, 1, 1)
max_axes = fig.add_subplot(3, 1, 2)
min_axes = fig.add_subplot(3, 1, 3)

avg_axes.set_ylabel('average')
avg_axes.plot(np.mean(data, axis=0))

max_axes.set_ylabel('max')
max_axes.plot(np.max(data, axis=0))

min_axes.set_ylabel('min')
min_axes.plot(np.min(data, axis=0))

fig.tight_layout()

matplotlib.pyplot.show()

Saving our Plots

We can also save our plots to disk. Let’s change our overlay_graphs.py script to do that, by adding the following just before we call matplotlib.pyplot.show():

all_graphs.savefig('overlay_graphs.png')

When we re-run the script, we should see a new overlay_graphs.png file in the directory we ran the script from (a relative filename is resolved against the current working directory). If we want to view this image, we can use an image tool called Eye of GNOME, which is the default image viewer on Ubuntu. To view the image, start a new terminal, change to the directory where this image is located, and then run:

eog overlay_graphs.png

Dealing with Multiple Datasets

We also have other inflammation datasets, located in the data directory. Let’s try to generate and save visualisations for each of these datasets so we can compare them against each other, to increase our confidence that we have sensible datasets.

First, we need to have a way of determining a list of all our inflammation data files. Their filenames all follow the pattern inflammation-XX.csv, where XX refers to the number of that dataset. We can use the glob library here to help us get these filenames.

The glob library contains a function, also called glob, that finds files and directories whose names match a pattern. We provide those patterns as strings: the character * matches zero or more characters, while ? matches any one character. We can use this to get the names of all the CSV files in the data directory which resides in the directory above like so (assuming we’re in the code directory):

import glob
filenames = sorted(glob.glob('../data/inflammation*.csv'))
print(filenames)

glob() returns a list of matching filenames (and directory paths) in arbitrary order, so we pass the result to Python’s built-in sorted() function to sort it. The sorted list looks like this:

['../data/inflammation-01.csv', '../data/inflammation-02.csv', '../data/inflammation-03.csv', '../data/inflammation-04.csv', '../data/inflammation-05.csv', '../data/inflammation-06.csv', '../data/inflammation-07.csv', '../data/inflammation-08.csv', '../data/inflammation-09.csv', '../data/inflammation-10.csv', '../data/inflammation-11.csv', '../data/inflammation-12.csv']

This means we can loop over it to do something with each filename in turn.
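To see how * and ? differ, here is a small self-contained sketch. It does not touch the real data directory: it creates a temporary directory containing a few empty files whose names are made up for illustration, then matches them with two patterns.

```python
import glob
import os
import tempfile

# Build a throwaway directory with a few empty, made-up files
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['inflammation-01.csv', 'inflammation-02.csv',
                 'inflammation-10.csv', 'small-01.csv']:
        open(os.path.join(tmp_dir, name), 'w').close()

    # * matches any run of characters, including none
    star = sorted(glob.glob(os.path.join(tmp_dir, 'inflammation-*.csv')))
    # ? matches exactly one character
    question = sorted(glob.glob(os.path.join(tmp_dir, 'inflammation-0?.csv')))

    # Keep only the filenames, since the temporary path differs on every run
    star_names = [os.path.basename(path) for path in star]
    question_names = [os.path.basename(path) for path in question]

print(star_names)      # ['inflammation-01.csv', 'inflammation-02.csv', 'inflammation-10.csv']
print(question_names)  # ['inflammation-01.csv', 'inflammation-02.csv']
```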

We’d like to save each of the generated plots using the pattern inflammation-XX.png. Each of the filenames we have in filenames has .csv on the end, so how do we go about replacing the file extension with .png? We can use the os library to split the extension from the rest of the path for us, e.g.

import os

filename = '../data/inflammation-01.csv'
base = os.path.splitext(filename)[0]
new_filename = base + '.png'
print(new_filename)

os.path.splitext() splits a filename into two components: the path/filename part and the file extension. So we just append the .png extension to the path/filename part, and get:

../data/inflammation-01.png
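It can help to look at both parts that os.path.splitext() returns. The sketch below also shows one possible alternative from the standard library’s pathlib module. The lesson itself uses os.path, so the pathlib lines are an illustration only, and archive.tar.gz is a made-up filename.

```python
import os
from pathlib import PurePosixPath

filename = '../data/inflammation-01.csv'

# splitext returns a (root, extension) pair
root, extension = os.path.splitext(filename)
print(root)       # ../data/inflammation-01
print(extension)  # .csv

# Only the last extension is split off
print(os.path.splitext('archive.tar.gz'))  # ('archive.tar', '.gz')

# pathlib alternative: with_suffix swaps the extension directly
new_path = PurePosixPath(filename).with_suffix('.png')
print(new_path)   # ../data/inflammation-01.png
```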

Processing Multiple Inflammation Datasets

Modify our script that generates the three horizontal graphs in a single figure (multiple_graphs.py) so that it processes each of the inflammation datasets in turn (each with a filename of the form inflammation-XX.csv), generating the figure for each, and saving it to local disk as a PNG file with a filename of the form inflammation-XX.png.

Solution

import os
import glob
import numpy as np
import matplotlib.pyplot

filenames = sorted(glob.glob('../data/inflammation*.csv'))
for f in filenames:
    data = np.loadtxt(fname=f, delimiter=',')

    fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

    avg_axes = fig.add_subplot(1, 3, 1)
    max_axes = fig.add_subplot(1, 3, 2)
    min_axes = fig.add_subplot(1, 3, 3)

    avg_axes.set_ylabel('average')
    avg_axes.plot(np.mean(data, axis=0))

    max_axes.set_ylabel('max')
    max_axes.plot(np.max(data, axis=0))

    min_axes.set_ylabel('min')
    min_axes.plot(np.min(data, axis=0))

    fig.tight_layout()

    base = os.path.splitext(f)[0]
    new_filename = base + '.png'
    fig.savefig(new_filename)

Next, refactor your graph generation code into a new function named generate_graph() that takes a NumPy array as an argument, generates the figure from the input data, and returns the generated figure. Use this function within your loop instead.

Solution

import os
import glob
import numpy as np
import matplotlib.pyplot

def generate_graph(data):
    fig = matplotlib.pyplot.figure(figsize=(10.0, 3.0))

    avg_axes = fig.add_subplot(1, 3, 1)
    max_axes = fig.add_subplot(1, 3, 2)
    min_axes = fig.add_subplot(1, 3, 3)

    avg_axes.set_ylabel('average')
    avg_axes.plot(np.mean(data, axis=0))

    max_axes.set_ylabel('max')
    max_axes.plot(np.max(data, axis=0))

    min_axes.set_ylabel('min')
    min_axes.plot(np.min(data, axis=0))

    fig.tight_layout()

    return fig

filenames = sorted(glob.glob('../data/inflammation*.csv'))
for f in filenames:
    data = np.loadtxt(fname=f, delimiter=',')

    figure = generate_graph(data)

    base = os.path.splitext(f)[0]
    new_filename = base + '.png'
    figure.savefig(new_filename)
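One refinement worth considering: every figure created through matplotlib.pyplot.figure() stays open in pyplot’s hidden state until it is closed, and by default Matplotlib emits a warning once more than 20 such figures are open. With twelve datasets this is not a problem, but for larger batches each figure can be closed after it has been saved. The sketch below shows the idea under assumptions: generate_graph() is simplified to a single plot, the datasets are small made-up arrays rather than the CSV files, and the images are written to a temporary directory.

```python
import os
import tempfile
import matplotlib
matplotlib.use('Agg')  # assumption: run without a display
import matplotlib.pyplot
import numpy as np

def generate_graph(data):
    # Simplified stand-in for the lesson's generate_graph(): a single plot
    fig = matplotlib.pyplot.figure(figsize=(4.0, 3.0))
    axes = fig.add_subplot(1, 1, 1)
    axes.set_ylabel('average')
    axes.plot(np.mean(data, axis=0))
    fig.tight_layout()
    return fig

# Made-up datasets standing in for the inflammation CSV files
datasets = {'inflammation-01': np.array([[0.0, 1.0, 2.0], [0.0, 3.0, 1.0]]),
            'inflammation-02': np.array([[1.0, 2.0, 0.0], [1.0, 0.0, 2.0]])}

saved = []
with tempfile.TemporaryDirectory() as out_dir:
    for name, data in datasets.items():
        figure = generate_graph(data)
        new_filename = os.path.join(out_dir, name + '.png')
        figure.savefig(new_filename)
        matplotlib.pyplot.close(figure)  # release the figure now that it is saved
        saved.append(os.path.exists(new_filename))

print(saved)
print(matplotlib.pyplot.get_fignums())  # figures still tracked by pyplot
```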

Key Points

  • Use matplotlib.pyplot.plot(data) to generate a graph from data.

  • Use matplotlib.pyplot.show() to display a generated graph.

  • Matplotlib allows us to add multiple graphs within a single plot, or within separate plots using a figure.

  • Set vertical axes labels using set_ylabel('label').

  • Save a generated graph using graph.savefig('filename').

  • Use glob.glob(pattern) to create a list of files whose names match a pattern.

  • Use * in a pattern to match zero or more characters, and ? to match any single character.