One of the most positive developments in assessment in recent years is the increased focus on progress over attainment as a way of judging the effectiveness of a school. As Rachel, our data analyst, explained in an earlier blog post, attainment on its own is a poor way of handling school accountability. By adding progress into the mix you can build up a much better picture of how a school is doing.
However, the more we’ve looked at primary progress at Assembly, the more we’ve become concerned about the quality of the underlying data. The ever-excellent James Pembroke has already raised important concerns about how Writing contributes to the primary progress calculation. In this blog we’ll explain why we’re worried about the measure on a more fundamental level, and offer our thoughts on what the DfE could do to improve things.
But first, here’s a brief recap on the headline progress measures at both primary and secondary from 2016-17 onwards:
At secondary, Progress 8 considers the journey between a child’s end-of-Key-Stage-2 grades (the baseline) and their “Attainment 8” result (a composite measure of average attainment across 8 priority subjects). The government works out an expected “Attainment 8” outcome for each baseline score, and then compares that to each child's actual outcome. So zero means you’ve made average progress; a positive score means you’ve made above-average progress; and a negative score means you’ve made below-average progress. The school's Progress 8 score is the average of its students' scores.
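To make the mechanics concrete, here is a minimal sketch of that calculation. The baseline groups and the expected Attainment 8 values are invented for illustration; the real figures are derived from the national cohort and published by the DfE each year.

```python
# Illustrative sketch of the Progress 8 calculation described above.
# The baseline groups and expected Attainment 8 values below are
# invented placeholders, not the DfE's published figures.

def school_progress8(pupils, expected_a8):
    """pupils: list of (ks2_baseline_group, actual_attainment8).
    Each pupil's score is actual minus expected for their baseline;
    the school's Progress 8 is the mean of its pupils' scores."""
    scores = [actual - expected_a8[baseline] for baseline, actual in pupils]
    return sum(scores) / len(scores)

# Hypothetical expected Attainment 8 outcomes per KS2 baseline group
expected_a8 = {"low": 35.0, "middle": 48.0, "high": 62.0}

pupils = [("middle", 50.0),  # above expectation -> positive pupil score
          ("high", 60.0),    # below expectation -> negative pupil score
          ("low", 35.0)]     # exactly as expected -> zero

print(school_progress8(pupils, expected_a8))  # 0.0
```

Note how the above- and below-expectation pupils cancel out: a school full of high attainers can still score zero (or below) on progress, which is exactly the point of the measure.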
At primary, progress calculations use a similar principle of measuring performance from a baseline to a result. However, the “baseline” is not currently linked to aptitude on school entry (i.e. the first term of Reception), but to Key Stage 1 results (i.e. the end of Year 2), three years after a child joins a school. That means that this year (and until 2019), the baseline score for KS2 attainment derives from KS1 levels. From 2020, the KS2 baseline will be calculated from the new KS1 accountability system that has just been introduced.
The secondary calculation seems sound: the key stage 2 baseline is obtained in test conditions just before pupils enter secondary, so should be reliable. And Attainment 8 is a well-designed measure that captures average performance without obsessing about a particular threshold (like the C/D borderline that was so important in the old 5 A*-C measure, for example). Rachel blogged about the problem with threshold attainment measures in a post to accompany the launch of our secondary benchmarking tool.
The primary calculation, on the other hand, strikes us as less robust. To help explain why, here’s a recap on how Key Stage 1 data collection and baseline calculation works, including any notable changes introduced in 2016:
Children sit national tests in reading and maths in May. The tests are taken in the classroom, and do not need to be timed.
Schools mark these tests themselves. Until 2015, these tests gave an outcome in the form of a sublevel. From 2016 onwards, schools use KS1 conversion tables published by the government to turn the raw scores into scaled scores ranging from 85 to 115. 100 is the “expected standard”.
Schools then also rate a child’s progress in Reading, Writing and Mathematics, using their own judgment. Until 2015, this involved assigning a sub-level. From 2016 onwards, it involves scoring a child as “Below”, “Working Towards”, “Expected”, or “Greater Depth” in each area, based on a range of evidence that a teacher should weigh in consultation with the statements in the DfE's new Interim Assessment Framework. (NB - there are actually loads of sub-categories for “Below”, but that’s a detail for another day).
Schools submit their teacher judgments to their local authority, who moderate at least 25% of those judgments. However, schools do not submit the test results themselves to the government or local authority.
The government takes each child’s set of teacher judgments and assigns a new score to them. Until 2015, the wide range of potential sublevel results that could be achieved led to over 200 potential permutations in the "Value Added" calculation, as James Pembroke explained here. From 2016 onwards there are far fewer groupings; the Reading & Writing scores are averaged to create an English score, and that average is then combined with the Maths score to get an overall composite baseline in the form of an Average Point Score. That process creates 21 different "prior attainment groups". (James Pembroke also recently tweeted the guidance doc along with a picture of the prior attainment group table for easy reference).
For every prior attainment group, an expected Key Stage 2 attainment score is calculated. (Note: as the 2016 conversion tables show, KS2 attainment uses a range of scaled scores from 80-120, not 85-115 as at KS1. I’m not sure why. I’m not sure anyone is.)
From 2016 onwards, progress is calculated separately for each subject by comparing the composite baseline with the student's result in that subject. A child whose result matches the average attainment achieved from their baseline gets a score of 0; a lower result gets a negative score, and a higher result gets a positive score. (Until 2015, average progress was a composite measure of reading, writing and maths, with 100 signalling average progress.) A school’s progress measure for each subject is basically the average of each child’s progress score. There is no longer a composite progress measure at primary.
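Putting the steps above together, the 2016-onwards pipeline can be sketched as follows. The point values per judgment, the way English and Maths are combined, and the expected KS2 scores per prior attainment group are all hypothetical placeholders; the DfE publishes the real tables.

```python
# Sketch of the 2016-onwards primary baseline and progress calculation
# described above. Judgment point values, the English/Maths combination
# and the expected KS2 scores are hypothetical, not the DfE's tables.

JUDGMENT_POINTS = {"Below": 1, "Working Towards": 2,
                   "Expected": 3, "Greater Depth": 4}

def baseline_aps(reading, writing, maths):
    """Average Reading and Writing into an English score, then combine
    with Maths to give the composite Average Point Score baseline."""
    english = (JUDGMENT_POINTS[reading] + JUDGMENT_POINTS[writing]) / 2
    return (english + JUDGMENT_POINTS[maths]) / 2

def subject_progress(pupils, expected_ks2):
    """pupils: list of (baseline_aps, actual_ks2_scaled_score).
    expected_ks2 maps each prior attainment group (keyed here by APS)
    to the average KS2 score achieved nationally from that baseline.
    Returns the school's progress score for one subject."""
    scores = [actual - expected_ks2[aps] for aps, actual in pupils]
    return sum(scores) / len(scores)

print(baseline_aps("Expected", "Expected", "Working Towards"))  # 2.5

# Two pupils whose deviations from expectation cancel out
print(subject_progress([(2.5, 105.0), (3.0, 104.0)],
                       {2.5: 103.0, 3.0: 106.0}))  # 0.0
```

The coarseness is visible even in this toy version: with only four judgment values per subject, the whole baseline collapses into a small set of APS groups, which is exactly the "blunt instrument" concern discussed below.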
Here are three reasons why this concerns us:
Teacher judgments are a poor basis for a high stakes baseline. By far the best way of gathering any normative assessment data is to set a question-based assessment, ideally in controlled conditions, which can be objectively marked. We like teacher judgments for many purposes - plenty of schools use them very effectively for formative assessment in the classroom, for example. But teacher judgments are problematic when used for a baseline which needs to be comparable across different schools, because it’s so easy for teachers to interpret statements differently. If you’re not yet persuaded of this, read Daisy Christodoulou railing against performance descriptors as a basis for summative judgments, or the recent Schools Week article highlighting inconsistencies between local authorities in KS2 reading and writing results. It's also worth bearing in mind the concerns of James Pembroke, who points out that the broader categories and consequent limited number of prior attainment groupings from 2016 onwards make the baseline a more "blunt instrument".
Of course, the irony here is that the government requires schools to sit a test composed of questions designed to assess attainment objectively… and then doesn’t collect that data! Partly this is because some people are distinctly uncomfortable that these key stage 1 tests exist in the first place, given the young age of the children sitting the tests. Defenders of the exercise would counter that the test conditions are designed to avoid any feeling of pressure in the mind of the student. But whichever side you’re on, the current situation feels like a compromise designed to be equally unsatisfying to everyone.
The baseline doesn’t start when primary school starts. Unfortunately, that means it’s not really a baseline at all, but an interim checkpoint almost halfway into a child’s time at a primary school. That means that variations in progress from reception to KS1 are completely ignored, unfairly removing any credit from schools that excel during this phase.
The “High Stakes” nature of Key Stage 1 assessment further compromises data quality. Education Datalab’s March 2015 report included a wonderful explanation of how primary schools use the leeway afforded by teacher judgments to deflate their Key Stage 1 baselines, thus achieving better progress scores. The report also explains how the different incentives of infant schools (who benefit from high Key Stage 1 outcomes, since it’s their final judgment on a student’s attainment) and junior schools (who benefit from low outcomes, because it’s their baseline) lead to different judgments in each setting.
So what do we think the government should do? Well, the absolute non-negotiable is that it mustn’t give up on measuring primary progress. It’s hard to do well, and we sympathise with those who point out the risks of testing at such a young age. But we need to care deeply about progress, since it is quite simply the only meaningful way of establishing whether a school did a good job while it was responsible for the education of a child. That means we need a better way.
We think the best option would be to move, in a phased way, to a reception baseline based on objectively gathered information. There has been much controversy over the way the government introduced - and then dropped - reception baselines earlier this year. But the biggest problem here was self-inflicted: the government introduced the concept of an objective, comparable baseline, but allowed multiple suppliers to offer those assessments while giving them considerable leeway in how the assessments were devised. This meant that (surprise, surprise) the baselines ended up not being comparable - which defeats the object of the exercise. That said, there was much to like in many of the baselines individually, and they were designed not to feel like “tests” to the children completing them. The problem was largely just the extent of the differences between them. That’s not a conceptual issue; it’s a design flaw, and one that would be pretty easy to solve.
In the absence of a better baseline, the current primary measure is still useful. However, its drawbacks should be understood when analysing primary data. When we use our primary benchmarking tool we always consider progress and attainment together, and we try to read between the lines. For example, a school with low Value Added, above-average attainment and above-average FSM may still be a great school - because the available measures could be concealing rapid progress (from a low baseline) made by pupils in the first few years.
To use data effectively, it’s important to question its validity. Primary accountability metrics are not yet good enough, and we shouldn’t be afraid to say so. But things are improving, and they’d get even better with a proper baseline. Now that would be progress.
Update on 5/9/16: this post was edited to clarify the changes in the KS1 data collection process between 2015 and 2016