Comparing Student Outcomes with Research Output (using R and ggplot2’s text labels)



In this post, I take a look at some league table data recently published by the Guardian. I also provide the R code for annotating the graphs for ggplot2.


It’s one of those fun aspects of teaching at a university that teaching itself isn’t the most important thing on our minds. Students often complain that ‘staff are too busy with their research to care about teaching’. Is that true? Do those who’d rather run experiments and write papers care so little about students that they ditch them and focus on their own needs instead?

"Please, Fry! I don't know how to teach. I'm a professor!"

There is a simple way to gain insight into whether this may be true. If academic staff care so much about research, and so little about what the students get up to, then logically, the universities that have the highest research output should also rank the lowest in how the students fare during their degrees. This is a simple correlation that I’ll now illustrate: higher scores for students should be correlated with lower scores for research output.

The Data

These data were obtained from the Guardian’s University Guide: Psychology (psychology is what I’m interested in, as that’s what I teach/research) and their most recent research ratings. First, the graph (click for a larger version, it’s big!):

Running a correlation on these data gives r = 0.69, which is significant (p < .0001). I haven’t tried to fit a line to the graph because I think there is enough on there already! Not only does this correlation go in the opposite direction to what the complaint would predict, it’s a pretty strong and significant correlation, too. Higher scores for research output were correlated with higher scores for the students and their outcomes.


It therefore looks like the claim that ‘staff are too busy with their research to care about teaching’ isn’t necessarily true. Granted, this is a correlation rather than any attempt to get insights into the direct cause and effect going on here, but I think it’s still interesting to explore this. I intend to point students to this post next time they complain about something like this! I’ll leave it to the reader to think about why the correlation might be going in this direction.

Quick note: I don’t want to claim credit for thinking up this kind of analysis; I’ve heard the correlation reported here discussed before, but never actually seen it for real. That was part of the reason I decided to take a look into it!

Quick note #2: there are other possible student metrics in the Guardian data that could be compared with research output. These may be worth exploring too, but I’ve focused here on the overall measure for students as that’s what is used to rank the league tables.

R Code

Here’s the R code for the graph and correlation:

library(ggplot2)

unis <- ggplot(uni_data, aes(x=research, y=student_score, label=name))+
geom_text()+
scale_x_continuous("Research Score")+
scale_y_continuous("Student Score")

cor.test(uni_data$research, uni_data$student_score)

ggsave(unis, file="unis.png")

To draw the text onto the graph, it’s a simple case of calling geom_text which draws the university names specified in the name column. This is set using the label aesthetic (aes). It’s surprising how easy it is to get graphs of this type together; though I do think this graph is a bit messy, simply because of the large number of names involved, many of which are quite long.


Bar Graphs in ggplot2



As part of my continuing fun and games getting to grips with ggplot2’s vast multitude of functions, here I give a basic intro to plotting bar graphs. Bit by bit, I’m slowly creating my own library of code to call on when needed!

The Setup

Let’s begin by making up some data. This is for a pretend experiment with four participants (labelled “ppt1”, “ppt2”, and so on). They take part at two different time points (Time 1 and Time 2 – cool name or what). Some kind of stimulus is presented to them in different regions in the display. We’ll call these two “Region A” and “Region B”. Their performance is indicated by mean_score. Here is the code to generate some fake data – note the use of randomisation here with rnorm, meaning your means will be different to those I illustrate here:

# 4 participants x 2 time points x 2 regions = 16 rows
ppt <- rep(c("ppt1", "ppt2", "ppt3", "ppt4"), 4)
time <- c(rep("Time 1", 8), rep("Time 2", 8))
mean_score <- c(rnorm(8, mean=500, sd=20), rnorm(8, mean=250, sd=80))
region <- c(rep("Region A", 4), rep("Region B", 4), rep("Region A", 4), rep("Region B", 4))
bar_data <- data.frame(ppt, time, region, mean_score)

This leaves us with a shiny new dataframe called bar_data. 

The Bar Graphs / Bar Charts

Now for the code. What we want to do is take a look at how each participant scored, with their mean_score for each Time and each Region.

bars <- ggplot(bar_data, aes(x=region, y=mean_score))+
geom_bar(stat="identity")+
facet_grid(facets = ppt~time)

ggsave(bars, file="bars.png")

We set up the x and y axes using the aes command, and geom_bar() draws up the bar chart. Next, all we need is facet_grid, which creates a grid-like arrangement of panels for the various combinations of factor levels. We specify facets = ppt~time, which tells ggplot to draw one bar chart for each combination of ppt and time. Finally, ggsave is a handy way to output a high-dpi image straight to your current working directory. This gives publication-quality plots in an instant. At last, we can rejoice as we find it gives us the following graph:

Reshape Package in R: Long Data format, to Wide, back to Long again

In this post, I describe how to use the reshape package to modify a dataframe from a long data format, to a wide format, and then back to a long format again. It’ll be an epic journey; some of us may not survive (especially me!).

Wide versus Long Data Formats

I’ll begin by describing what is meant by ‘wide’ versus ‘long’ data formats. Long data look like this:

As you can see, there is one row for each value that you have. Many statistical tests in R need data in this shape (e.g., ANOVAs and the like). This is the case even when running tests with repeated factors.

In the example above, let’s say that iv1 is a between-subjects factor and iv2 is a within-subjects factor. The same table, in a wide format, would look like this:

Here, each column represents a unique pairing of the various factors. SPSS favours this method for repeated-measures tests (such as repeated-measures ANOVAs or paired t-tests), and being able to move between the two formats is helpful when multiple people are working on a single dataset but using different packages (e.g., R vs SPSS).
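To make the two shapes concrete, here’s a tiny made-up example in plain base R (the column names subject, iv2 and dv are placeholders of my own, not from any real dataset):

```r
# Long format: one row per observation
long_data <- data.frame(
  subject = c("s1", "s1", "s2", "s2"),
  iv2     = c("t1", "t2", "t1", "t2"),  # within-subjects factor
  dv      = c(10, 12, 9, 14)
)

# The same data in wide format: one row per subject,
# one column per level of the within-subjects factor
wide_data <- data.frame(
  subject = c("s1", "s2"),
  dv_t1   = c(10, 9),
  dv_t2   = c(12, 14)
)
```

Note that the wide frame has one row per subject, which is exactly the shape SPSS wants for repeated-measures tests.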

Get in Shape! The Reshape Package

I’ll begin by going back to a dataset that I’ve been messing around with for some time. I’m going to select out the columns I need and rename one of them. One of them ended up getting called “X.” because of the way the data were tabbed. Here, I rename the “X.” column to “rank”, which is what it really should have been in the first place.

full_list_cutdown = data.frame("rank"=full_list_dps$X., "class"=full_list_dps$class,
"spec"=full_list_dps$spec, "dps"=full_list_dps$DPS)

The data look like this:

Rows truncated to stop them filling the entire page

Let’s begin by converting these data into a wide format. To do that, all we need to do is use the cast function. This has the general format of:

cast(dataset, factor1 ~ factor2 ~ etc., value=value column, fun=aggregation method)

Here, dataset refers to your target dataset. factor1 ~ factor2 ~ etc. lists the columns/factors that you want to split the data up by. value specifies the column that you want to calculate a value for. You can run all sorts of aggregation functions using the fun= argument; the default is length, which counts the number of cells for that combination of factor levels. To make my dataset into a wide format, all I need to run is:
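If you want to try cast without my dataset, here’s a self-contained sketch on toy data (the ranks, classes and dps numbers are invented for illustration; it assumes the reshape package is installed):

```r
library(reshape)

# Toy long-format data: two ranks, two classes
scores <- data.frame(
  rank  = c(1, 1, 2, 2),
  class = c("mage", "rogue", "mage", "rogue"),
  dps   = c(100, 90, 80, 70)
)

# Cast to wide: one row per rank, one column per class,
# each cell holding the mean dps for that rank/class pairing
wide <- cast(scores, rank ~ class, value = "dps", fun.aggregate = mean)
wide
```

With only one observation per cell, the mean just passes each dps value straight through into its rank/class cell.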

wide_frame = data.frame(cast(full_list_cutdown, rank~class, value=c('dps'), fun=mean))

Here, I create a wide dataframe based on the rank and class columns. The computed value is the mean of the dps column. It looks like this:

There and Back Again: Getting from Wide to Long Format

Say that we want to go back to the long format again (or, indeed, convert from wide to long in the first place!). How can we do that? We use the melt function!

melt(wide_frame, id=c("rank"))

This takes us right back to the start, where our exciting journey began.
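A minimal round trip with made-up numbers, again assuming the reshape package, shows what melt gives you back:

```r
library(reshape)

# A small wide-format frame: one row per rank, one column per class
wide <- data.frame(rank = c(1, 2), mage = c(100, 80), rogue = c(90, 70))

# melt stacks the value columns back into long format:
# one row per rank/class combination, with 'variable' and 'value' columns
long <- melt(wide, id = c("rank"))
long
```

The id= argument names the columns that identify each row; everything else gets stacked into the variable/value pair.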

One-way ANOVAs in R – including post-hocs/t-tests and graphs



In this post, I go over the basics of running an ANOVA using R. The dataset I’ll be examining comes from this website, and I’ve discussed it previously (starting here and then here). I’ve not seen many examples where someone runs through the whole process, including ANOVA, post-hocs and graphs, so here we go.

This is also my first post over at the new site. My old site will still exist with my old posts; it was just starting to churn a bit with so many visitors. It started to mess up images too, so I have decided to move!

One-way ANOVA

So here I’m going to ask the question: which class scored highest on DPS? I won’t be breaking this down by spec yet, that will come in future posts. The syntax for running an ANOVA is this:

aov(DV ~ IV, data=dataset)

It’s so simple! If you want to look at an interaction, or have more than one factor, go for:

aov(DV ~ IV1 * IV2, data=dataset)
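To see the syntax at work before touching my dataset, here’s a throwaway example on simulated scores for three hypothetical groups (names and means are made up):

```r
set.seed(1)  # make the simulated data reproducible

sim <- data.frame(
  group = rep(c("a", "b", "c"), each = 20),
  score = c(rnorm(20, mean = 10), rnorm(20, mean = 10), rnorm(20, mean = 15))
)

# One-way ANOVA: does score differ between the three groups?
fit <- aov(score ~ group, data = sim)
summary(fit)  # the F test for the effect of group
```

Since group “c” was simulated with a much higher mean, the F test comes out clearly significant here.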

Running the ANOVA

Let’s go for it:

btwn <- aov(DPS ~ class, data=full_list_dps)
summary(btwn)

This shows us that there is a highly significant effect of class on the dps score. But how should we visualise it?

Visualising the Results

I’ve seen various guides offer different approaches to visualising ANOVA results. I researched quite a few before settling on ggplot2 to do this. I’ll begin by summarising the data:

# AVERAGE is the mean DPS per class; SE is the standard error of the mean
graph_summary<-ddply(full_list_dps, c("class"), summarize,
AVERAGE=mean(DPS),
SE=sqrt(var(DPS)/length(DPS)))
This uses plyr, which you can see details of in some previous posts I’ve made (e.g., here). For the graph that I will make, I’d like to have class ordered by dps. To do that, I run this code:

graph_summary$class <- reorder(graph_summary$class, graph_summary$AVERAGE)

It reorders the class column by the AVERAGE of each class.
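reorder on its own, with throwaway numbers, shows what happens to the factor levels:

```r
f <- factor(c("low", "high", "mid"))
v <- c(1, 3, 2)

# Levels start out alphabetical...
levels(f)              # "high" "low" "mid"

# ...and reorder sorts them by the (mean) value of v for each level
levels(reorder(f, v))  # "low" "mid" "high"
```

That’s all the reordering does: the data stay put, but any plot that maps the factor to an axis will now lay the levels out in value order.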

Next up, we have the code for the graph:

dps_graph <- ggplot(graph_summary, aes(x=class, y=AVERAGE, colour=class))+
geom_point()+
geom_errorbar(aes(ymax=AVERAGE+SE, ymin=AVERAGE-SE))+
opts(axis.text.x = theme_text(angle = 90, hjust = 0, size=11),
axis.title.x = theme_text(size=14),
axis.title.y = theme_text(angle = 90, size=14))+
scale_y_continuous("DPS (Mean)")

dps_graph

This then gives us the following:

Click for a larger, nicer quality version.


Next up are the post-hoc tests to run. You can run Tukey’s HSD tests on the ANOVA object with the simple command:

TukeyHSD(btwn)
However, if you’re more inclined towards t-tests, there’s a great function that can do all of them for you. It’s this:

with(full_list_dps, pairwise.t.test(DPS, class, 
p.adj="bonferroni", paired=F))

The with command just tells R to use full_list_dps so we don’t have to write full_list_dps$DPS and so on when running the t-test. The syntax for the pairwise t-test follows the following formula:

pairwise.t.test(DV, IV, ADJUSTMENT, PAIRED)

DV and IV I assume you understand. There are lots of p-value adjustment options, and here I’ve gone for bonferroni. As these are not paired data, I’ve set paired to FALSE by writing paired=F. What pairwise.t.test does is run every combination of t-test for you based on your factor levels. Great! You’ll get results like the following, showing you p-values from t-tests comparing all the factor levels with each other:
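Here’s the same call on simulated data, so you can see the shape of the output without my dataset (group names and means are made up):

```r
set.seed(42)  # reproducible fake data
groups <- factor(rep(c("a", "b", "c"), each = 15))
scores <- c(rnorm(15, mean = 0), rnorm(15, mean = 0), rnorm(15, mean = 3))

# Every pair of groups t-tested, p-values Bonferroni-adjusted
res <- pairwise.t.test(scores, groups, p.adj = "bonferroni", paired = FALSE)
res$p.value  # a matrix: one adjusted p-value per pair of groups
```

Group “c” was simulated well above the other two, so its comparisons come out significant while a-versus-b does not.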

I’ve converted the output into a table and rounded the values to a few decimal places, but I find tables in this format a very easy way to check for significant effects.

References/Further Reading

The method I have presented here for averaging data and plotting it with error bars in ggplot2 was taken from/inspired by this post over at the fantastically-named i’m a chordata! urochordata! blog. I have no idea where the blog title comes from, but it sounds cool. I’m indebted to that author/site because, well, it was the first decent example I found that did what I was after!

I’d also like to reference this for helping me work out how to convert pairwise t-test results into a table like the one pictured above.

See also the personality project pages on ANOVAs for further reading.

Watching other People Watching other People/Events



Last year, I went to Stonehenge for the summer solstice. It wasn’t exactly as I was expecting. I was hoping for loads and loads of druids running around doing all sorts of entertaining things. Sadly, there weren’t many druids. It was basically just an all-night rave outside by a load of old stones. Some people who had run out of drugs and needed to sleep stole one of our blankets when we went on a two-hour trek to the toilets.

Anyway, one thing that did strike me during the summer solstice was the sunrise. The sunrise itself was, of course, very nice to look at. It was a perfect summer morning. What interested me, however, was how the people there had to experience it all through their mobile phones. Take a look.

I’ve picked up a load of similar pictures from other events since that time, but this one defines it best for me. I’m not trying to complain about mobile phones or people taking pictures. I’m just interested in how our own experience of something is not enough – instead we need a picture of it. I’m a hypocrite anyway, having immediately felt the need to take a picture. I can’t help but think if J.G. Ballard were still knocking around, he’d be writing books about the symbiosis of this kind of stuff and people’s minds, moving on from his days of writing about cars and tower blocks invading the human psyche.

Anyway, on to what inspired me to post this. There’s a link here which is a massive, high resolution image from the royal wedding. Now I don’t care about the wedding. At all. I’d be not even slightly bothered if all royalty everywhere suddenly fell off the planet. But it’s a great shot of hundreds and hundreds of people all experiencing an event through a camera.

Update: I found two more photos of this. In fact, it’s even better. Here’s the first:

Here, I have taken a photo of people watching a TV screen. You’ll notice that on the TV screen is some footage of two men. One of them is looking through a TV camera. And one more:

Here I am, being photographed taking a picture of these people watching the TV screen. One day I’d like to create a massive chain of photographs of people taking photographs of others photographing something else. It’ll probably create a black hole and consume the planet.