Data analysis – Martin Rosenbaum

The art of the deal, LabLib style

Data analysis, Elections / 13 April 2025

I’ve been analysing recently published data from the Electoral Commission about local campaign spending by candidates at last year’s general election.

Newly released figures on campaign spending at the last election confirm the suggestion that Labour sharply limited its electioneering efforts in seats which the Liberal Democrats might win from the Tories.

My analysis of Electoral Commission data shows that in potential LibDem target seats, Labour spent on average about £4,600 less than it did in other comparable constituencies.

This chart demonstrates that in most places where the LibDems could be hopeful of victory, Labour generally spent a fairly small proportion of the maximum legally allowed for local campaigning at the 2024 election. This was in stark contrast to its general pattern of expenditure.

On average Labour candidates in these LibDem targets used only 26% of the legal limit, whereas Labour’s average in other seats was 68%.
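As a rough illustration of how a "share of the legal limit" average can be computed, here is a minimal sketch. The seat records are invented for the example and are not Electoral Commission figures; the function name is also hypothetical.

```python
from statistics import mean

# Hypothetical records: (is_libdem_target, labour_spend, legal_limit), in pounds.
# These numbers are invented for illustration only.
seats = [
    (True, 4_000, 18_000),
    (True, 5_500, 20_000),
    (False, 12_000, 17_000),
    (False, 14_500, 19_000),
]

def mean_share_of_limit(records, target):
    """Average, across seats, of spend as a percentage of that seat's own limit."""
    shares = [
        100 * spend / limit
        for is_target, spend, limit in records
        if is_target == target
    ]
    return mean(shares)

target_share = mean_share_of_limit(seats, True)
other_share = mean_share_of_limit(seats, False)
print(f"LibDem targets: {target_share:.0f}% of limit; other seats: {other_share:.0f}%")
```

Note that this averages the per-seat percentages. Dividing total spend by total limit across a group is a different calculation and can give a slightly different answer.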

While it could be argued that some of these seats were low Labour priorities as probably unwinnable, my statistical analysis shows that Labour put much less effort into campaigning in LibDem targets even after controlling for how well or badly Labour was positioned after the previous election in 2019.

Looking at seats then held by the Conservatives, and after taking account of the swing Labour would have needed to take the seat in 2024, Labour’s local campaign spending was £4,597 lower on average in these possible LibDem targets.
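The idea of "controlling for" the swing needed can be sketched with an ordinary least squares regression. The code below is only an illustration under stated assumptions: the seats are synthetic, the spending figures are generated from a formula with a built-in gap of £4,600 (chosen to echo the headline figure, not estimated from real data), and the model form is an assumption rather than a description of the analysis actually used.

```python
def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Synthetic seats: (swing Labour needed in points, 1 if a LibDem target else 0).
seats = [(2, 0), (5, 0), (8, 1), (10, 0), (12, 0), (15, 1), (18, 1), (20, 1)]

# Synthetic spending: falls by 300 per point of swing, and by a further 4,600 in targets.
spend = [15_000 - 300 * swing - 4_600 * target for swing, target in seats]

# Design matrix columns: intercept, target dummy, swing needed.
design = [[1.0, float(target), float(swing)] for swing, target in seats]
intercept, target_effect, swing_effect = ols(design, spend)
print(round(target_effect), round(swing_effect))
```

The coefficient on the target dummy is the average spending gap between target and non-target seats that need the same swing, which is the sense in which the swing is "controlled for".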

Party strategies

This fits with the claims made during the 2024 campaign that Labour was diverting its resources away from these seats, sometimes to the annoyance of disgruntled party activists.

In their local campaigning the LibDems concentrated resources very tightly indeed on their priority constituencies. Since these were predominantly seats held by the Conservatives, who were clearly in massive electoral difficulties, the LibDems did not spend much money in other seats which were Labour-held or Conservative-Labour marginals.

These strategies suited both parties nationally, as each could focus its energies on maximising Tory losses rather than competing against the other, without the need for any official agreement, which would have been highly difficult and controversial.

I’ve defined LibDem targets here as constituencies where after 2019 they were the party in second place after the Conservatives (along with the much smaller number of places they held). This is a clear and reasonable demarcation.

However, it would not be identical to the list decided on by their party HQ, which added or dropped seats in line with fluctuating political circumstances and local factors. In any case this is more about which seats Labour would regard as sensible LibDem targets than about which seats the LibDems themselves chose. Nevertheless, even if there is a discrepancy of a few locations, that would not invalidate the very strong pattern this analysis identifies.

The local data published last month also reveals a very interesting contrast between Labour and Conservative spending patterns, which reflected the different political situations they faced when going into the election.

This is indicated in the next two charts, which compare each party’s spending in each constituency in 2024 to its strength there after the previous general election in 2019.

The ^ shape in the Labour graph shows the party tended to spend most in marginals, with significantly less expenditure in many hopeless seats and safe ones.

The Conservatives, however, had to concentrate resources on the seats they already held and were not able to assume any were safe. So their spending was much higher in places where they’d done better in 2019.

This expenditure refers to money parties use for electioneering in each constituency, including leaflets, posters, letters, meetings and office costs.

The legal limit in each place depends on the size of the electorate and whether the constituency is urban or rural. In 2024 the maximum allowed was generally in the range of £16,000 to £21,000.

Overall Labour spent £7.2 million on local electioneering, while the Conservatives spent a little less, £7 million. For the LibDems the total was £3.2 million, for Reform £1.3 million, and for the Greens £700,000.

This is separate from money used for national campaigning. The legal cap on that in 2024 was £34 million for parties standing candidates throughout Great Britain. The Electoral Commission has not yet released the details of the national sums spent by the major parties.

There is some evidence that spending on constituency campaigning does matter. I analysed how many extra votes each party got for an extra £100 of local spending, after controlling for the party’s vote share in that constituency in the 2019 election.

The figures are given in the table below. However it is difficult to read much into this, as parties would also be devoting more effort to places where they felt things were going well, especially the LibDems, Greens and Reform. So the overall direction of causation is far from clear.

* This is the number of extra local votes in 2024 for that party associated with an increase of £100 in its local spending, after controlling for its votes in 2019 to take account of its position going into the election. (In the case of Reform, I used the figure for how the Brexit Party did in 2019, although they didn’t stand in many seats).


Where did the polls go wrong?

Data analysis, Elections / 6 January 2025 / politics, polling, psephology

The general election result last July was certainly a ‘Labour landslide’, but it wasn’t the even bigger, ginormous landslide which the polls predominantly predicted.

We were saved from the normal cliched headline ‘Polls Apart’, because the polls were all together on one side of reality, overstating Labour and understating the Tories.

I’ve been examining the reasons provided by those polling companies who have publicly tried to explain how these forecasts went wrong. They focus on the following factors: late swing, religion, turnout, ‘shy Tories’, and age.

The most recent company to publish its analysis was YouGov, which did so just before Christmas, also announcing that it would adopt a new methodology from January.

The election polls significantly overestimated Labour and underestimated the Conservatives, as shown in a chart from Will Jennings.

While the polls have often erred in this direction, in terms of the gap between the two parties this was their biggest miss since 1992, exaggerating it on average by 7 percentage points.

The constituency prediction models known as MRP polls were also all awry in the Labour direction, as demonstrated in the dataset collated by Peter Inglesby.

Of course some polls were much nearer to the actual outcome than others, as the companies that did reasonably well and got closest are naturally keen to stress, and as I myself have analysed in the past. But the industry as a whole clearly systematically over-predicted Labour, and that’s not good for the world of opinion research.

This is despite the fact that one can argue this was a tricky election to get right, with an increasingly volatile electorate, a very large swing, an important new party, the impact of independents, and changes in how demographic characteristics such as education and class link to voting behaviour.

If the result had been close, the level of polling error involved would have created a sense of chaos and surely have become a crisis for the industry. However the problem has been disguised by the fact that the only point at issue was the extent of the landslide, and so it did not disturb the central narrative of the election.

Pollsters are constantly seeking to improve their methods, and indeed the MRP models last July were a positive contribution to getting the overall impact correct, confirming the value of innovation. Companies have been reviewing their performance and what went wrong.

The British Polling Council (BPC) is collating relevant research from its members on its website. So far work from six organisations has been added. As well as YouGov, the others are BMG, Electoral Calculus, Find Out Now, More in Common, and Verian. It’s possible that more BPC members will add further submissions in due course.

I’ve been reading them to see what prevailing points emerge on an industry-wide basis.

It is important to note that the companies use different methodologies, so there could also be variations in what each got wrong. But the fact that they were all out in the same pro-Labour/anti-Tory direction suggests that as well as any individual aspects there is also something significant which is shared.

Although there is no unanimity, their findings do reveal some common themes. (None of them discuss the issue of ‘herding’, the claim that error can be exacerbated if some companies sometimes take decisions in such a way that they stay in line with the crowd – a charge which is very unpopular within the industry).

Late swing

Three companies – BMG, More in Common and YouGov – attribute the error partly to ‘late swing’, due to people changing their mind about how to vote at the last minute after final opinion surveying ended. A cynic might say that this is the most convenient excuse for the industry, as it is the least challenging to the accuracy of their methods. Maybe, but the fact that it is convenient doesn’t necessarily mean that it is wrong.

Beyond the data presented, I have to say I also find this plausible based on anecdotal evidence, with the forecasts of a huge Labour victory nudging some intending supporters into eventually switching to vote for someone else, such as the Greens. In this sense the polls ironically could have been their own enemies, almost a kind of partially self-negating prophecy.

However Electoral Calculus finds no evidence of late swing, and in any case none of the companies thinks it can approach the full explanation, which still leaves a methodological challenge for the industry.

Religion/ethnicity

The pollsters seem to have failed to reflect the increasing fragmentation of the ethnic minority electorate, with some Muslim/Pakistani & Bangladeshi voters abandoning Labour, often for independent candidates who campaigned about the situation in Gaza, while Hindu/Indian voters drifted towards the Tories. This factor is referred to by BMG, More in Common and YouGov. It is clear that election analysis can no longer crudely treat voters of Asian heritage (let alone all ethnic minorities) as if they are one political bloc.

More in Common suggests that Muslims who currently take part in online market research panels are probably not representative of the overall Muslim population, being more likely to be second or third generation immigrants, and less likely to have been born outside the UK or to speak no English. The company says it will probably modify its weighting scheme.

Similarly YouGov says it will incorporate a more detailed ethnicity breakdown into its modelling in future.

However, the numbers of voters involved, while crucial in certain seats, mean that this could also only be a very partial factor nationally.

Turnout

Taking account of likelihood-to-vote is a notoriously difficult problem for pollsters, who employ a range of strategies to estimate how many of each party’s proclaimed supporters will actually go to the trouble of casting a ballot. Three companies – BMG, Electoral Calculus and YouGov – include the overpredicting of Labour voters’ turnout as a factor in the 2024 error.

YouGov argues the cause stemmed from panels which over-represented people who would actually vote, especially among low-turnout demographic groups. The company says that from now on it will base turnout modelling purely on demographic data, rather than respondents’ self-reported likelihood-to-vote.

This sort of problem, over-sampling the more politically engaged (who tend to be keener to take part in this kind of survey), has been a general industry issue in the past.
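To show in general terms why turnout assumptions matter, here is a minimal sketch of likelihood-to-vote weighting. It is a generic illustration, not any company's actual method; the respondents and their probabilities of voting are invented.

```python
from collections import defaultdict

# Hypothetical respondents: (stated party, estimated probability of voting).
respondents = [
    ("Labour", 0.9), ("Labour", 0.5), ("Labour", 0.5),
    ("Conservative", 0.9), ("Conservative", 0.9),
]

def turnout_weighted_shares(sample):
    """Vote shares (%) where each respondent counts in proportion to their turnout probability."""
    totals = defaultdict(float)
    for party, p_vote in sample:
        totals[party] += p_vote
    grand_total = sum(totals.values())
    return {party: 100 * weight / grand_total for party, weight in totals.items()}

# Treating everyone as certain to vote versus applying the turnout estimates.
unweighted = turnout_weighted_shares([(party, 1.0) for party, _ in respondents])
weighted = turnout_weighted_shares(respondents)
print(unweighted, weighted)
```

In this toy sample Labour's share falls once its less committed supporters are down-weighted, so a model that overestimates their turnout would overstate Labour.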

However it is awkward for pollsters to get turnout adjustments correct. There is no guarantee that what worked best last time will be best next time, as the commitment of different groups to implement their asserted voting intentions may depend on the political circumstances of the moment. Ironically again, the forecast Labour triumph last July might have pushed some of the party’s less determined supporters into not bothering to go to the polling station on the big day.

Shy Tories

This has also been a traditional difficulty for the polling industry, where those of a Conservative outlook are somewhat less willing to express their allegiance – possibly because they feel in some sense disapproved of or intimidated (this is sometimes called ‘social desirability bias’), or perhaps alienated from polls. Again, the extent to which it happens can also depend on the political atmosphere of the time.

Over-estimating the Left and under-estimating the Right is not just a UK polling problem – it has cropped up as a fairly consistent (but not universal) pattern across many countries, as can be seen in the Deltapoll slide in this piece by Mark Pack.

The industry has tried to counteract this skew through various means of political weighting, such as using previous voting behaviour.

Electoral Calculus states there is indeed suggestive evidence of a ‘shy Tory’ effect in 2024, with people who refused to answer voting intention questions or who replied “don’t know” being more likely to be Tory voters. This is also consistent with the findings reported by BMG and by More in Common about ‘undecided’ voters who were then pressed.

YouGov suggests that its past vote weighting fell down in 2024 because at the previous election in 2019 the Brexit Party endorsed the Conservatives in many seats. The result was that its panels had too many 2019 Tories who actually preferred the Brexit Party and then voted Reform in 2024, and not enough firmly committed Conservatives. Their paper does not raise the issue of whether it is staunch Tories who are most likely to avoid voting intention opinion research, but it seems to me that this conclusion is compatible with their evidence.

Find Out Now (which only produced one unpublished poll during the 2024 campaign) argues against the ‘shy Tory’ hypothesis. But in my opinion their data only counters the hypothesis that online research panels under-represent Tories in general, as opposed to the hypothesis (advanced by Electoral Calculus) that Tories may be reasonably represented in panels but are disproportionately likely to refuse or reply “don’t know” when faced with a voting intention question in a survey.

More in Common also states that there is possible selection bias affecting online panels as the recruitment processes appeal to the ‘overly opinionated’.

Age

Age was very strongly associated with how people voted last July, with Tory support concentrated in the older section of the electorate.

The report from Verian (the polling company which came closest to the actual result on percentage vote shares) focuses entirely on the issue of age, and concludes that those companies whose samples contained a smaller proportion of over-65s (after weighting) tended to be less accurate. But its presentation adds that other biases would also have played a role.

Find Out Now raises a different possibility on age, that it failed to locate Conservatives who were younger and less politically engaged (a group that is hard for pollsters to reach).

Summary

At this stage we are left with the suggestion that perhaps four or five factors may have contributed together to the polling miss, and none explains it alone.

There can be a problem with this kind of analysis, dubbed the “Orient Express” approach by Electoral Calculus, where multiple possible causes are examined and all those which affect the error are deemed part of the solution. In other words, as in the Agatha Christie story, if everyone/everything is responsible for what happened, then eventually no one/nothing is actually held responsible, and nothing is done.

On the other hand, looking at the underlying fundamentals, it seems to me that predictive opinion polling is a difficult business given the level of precision required and the volatility of today’s voters. There are many sources of potential error (apart from normal sampling variation), arising from which people get contacted, whether they reply or tell the truth or change their minds later, how the electorate is modelled, and how the answers from different groups are weighted to aim at representativeness. And errors that arise are difficult to eliminate methodologically, as they depend on political circumstances which vary from one election to the next, and also on the communications technology for conducting research which is constantly evolving and in different ways for different social groups.

Inevitably therefore pollsters are bound to make some mistakes (and not all will make the same ones). When they are lucky, the errors may cancel themselves out, more or less, and nobody notices them. When the pollsters are unlucky, the errors largely or entirely mount up in the same direction.

Further, more thorough analysis will be possible once detailed data becomes available from the academic British Election Study and its extensive voter research.

The British Polling Council, to which all the main pollsters belong, is also planning to hold a public event to discuss these issues, probably in April.


The VAT cliff edge: How the threshold impedes small businesses

Data analysis, FOI / 9 October 2024 / HMRC, tax, VAT

As Rachel Reeves ponders her forthcoming budget and how to balance raising money against economic growth, one of her self-imposed constraints is her pledge not to raise the rate of VAT. However the impact of taxes also depends greatly on the thresholds from which they apply, even though this tends to get a lot less attention in public debate (as is certainly the case for income tax).

So what about the annual turnover level at which businesses have to register for VAT?

Data I have recently obtained from HMRC under the freedom of information law shows the dramatic impact of the VAT threshold in restricting the growth of some of the UK’s small businesses.

In 2021/22 the UK had 21,752 businesses with annual turnover in the range £84,000-£85,000, just below the then threshold. But there were only 10,096 businesses just over the limit, in the range £85,000-£86,000.

In other words the number of businesses clustered just under the VAT threshold was more than double the number just above, as businesses curtail their activities to remain outside the VAT registration system.

The graph above clearly shows the cliff edge in the data.

Many small businesses are desperate to keep their annual turnover under the VAT level, so that they avoid the bureaucracy and costs of registration and they don’t have to charge VAT to customers, which would make them less competitive. However the consequence is that they then won’t grow further into larger, more successful operations.

For some businesses the VAT threshold functions as a ceiling constraining their growth.

Research by Warwick University in 2022 concluded that earlier data of this kind reflected genuine curtailment of business activity rather than false reporting to HMRC.

This is the latest data available from HMRC, which says that more recent information is still being processed. The VAT threshold is now £90,000, as the figure was increased by the Conservative government before the general election.

The UK’s VAT threshold is high compared to other European countries which tend to impose VAT registration on businesses at a much lower level. While the UK policy saves many small businesspeople from the compliance burden of VAT, the significantly lower thresholds elsewhere make it less likely that enterprises will be found bunched and held back just under the relevant level of turnover.

I also wanted to get a breakdown of the data by sector of the economy, to see which kinds of businesses were most affected. HMRC said it could provide this for 2019/20, as it had previously extracted the information involved, but that more recent breakdowns would probably exceed the FOI cost limit.

According to these 2019/20 figures, the most dramatic effect is in the construction sector.

This data shows 4,445 construction businesses with an annual turnover of £84,000-£85,000, but only 1,425 in the range £85,000-£86,000. So the number of construction businesses appearing to have kept themselves just below the limit is over three times the number who grew a little more and just exceeded it.
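The "more than double" and "over three times" comparisons are simple ratios of the counts either side of the threshold. A small sketch using the figures quoted in this piece (the function name is just for illustration):

```python
def bunching_ratio(just_below, just_above):
    """Businesses in the £1,000 band just under the threshold per business in the band just over."""
    return just_below / just_above

# All businesses, 2021/22: £84,000-£85,000 versus £85,000-£86,000.
all_businesses_2021_22 = bunching_ratio(21_752, 10_096)

# Construction sector, 2019/20: the same two bands.
construction_2019_20 = bunching_ratio(4_445, 1_425)

print(f"All businesses: {all_businesses_2021_22:.2f}")
print(f"Construction:   {construction_2019_20:.2f}")
```

A ratio near 1 would indicate no bunching at the threshold; the further above 1, the sharper the cliff edge.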

The chart shows the impact for construction and some other economic sectors with large numbers of small enterprises.

These FOI releases from HMRC constitute the latest and most thorough official evidence of what the tax expert Dan Neidle of Tax Policy Associates has called ‘the VAT growth brake’.

The full HMRC spreadsheets can be downloaded here:

1) Summary data for 2019/20, 2020/21, 2021/22

2) 2019/20 sectoral breakdown


FOI: Which complaints are upheld by the ICO?

Data analysis, FOI / 18 July 2024 / ICO

Freedom of information requests can be rejected for a range of reasons, but some are much more likely to be overturned by the Information Commissioner’s Office than others.

The details of this are made clear by my analysis of a dataset recently released by the ICO covering nearly 22,000 decisions issued by the information rights regulator since FOI came into force.

For example, the ICO has upheld nearly half the complaints received from information requesters against FOI refusals linked to protecting commercial interests. But it has upheld only one in six objections to refusals based on international relations.

This table shows, for each of the legal grounds for dismissing FOI requests, the number of complaints about that reason which the ICO has ruled on and the percentage which it has upheld (ie backing the requester and overriding the public authority).

Subject matter (section of FOI Act)	Number of complaints	Percentage upheld
The economy (29)	27	56
Relations within UK (28)	17	53
Commercial interests (43)	1010	47
Future publication or research (22/22A)	213	44
Health and safety (38)	119	42
Policy formation (35)	622	38
Already accessible (21)	332	36
Effective conduct of public affairs (36)	967	35
Audits (33)	38	34
Confidential information (41)	605	34
Law enforcement (31)	860	30
Vexatious or repeated (14)	1498	23
Investigations (30)	318	21
Personal data (40)	3097	18
Monarchy and honours (37)	181	18
Defence (26)	41	17
National security (24)	299	17
International relations (27)	292	16
Legal privilege (42)	507	16
Otherwise prohibited (44)	406	14
Cost (12)	1491	12
Court records (32)	108	8
Security bodies (23)	304	7
Parliamentary privilege (34)	12	0

Source: Martin Rosenbaum, based on ICO data

Or in chart form:

So during FOI’s two decades of operation, the ICO has been much happier to overrule public authorities on matters like commercial interests and policy formation than on topics like defence, security and international affairs.
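One way to put a number on this contrast is to pool some of the grounds in the table and weight each by its number of complaints. The sketch below does that; the choice of which grounds to group together is an assumption made for illustration, and because the published percentages are rounded the pooled figures are approximate.

```python
# (complaints ruled on, percentage upheld) for selected grounds, copied from the table.
grounds = {
    "Commercial interests (43)": (1010, 47),
    "Policy formation (35)": (622, 38),
    "Defence (26)": (41, 17),
    "National security (24)": (299, 17),
    "International relations (27)": (292, 16),
    "Security bodies (23)": (304, 7),
}

def pooled_upheld_rate(names):
    """Complaint-weighted percentage upheld across the named grounds."""
    total = sum(grounds[name][0] for name in names)
    upheld = sum(grounds[name][0] * grounds[name][1] / 100 for name in names)
    return 100 * upheld / total

commercial_policy = pooled_upheld_rate(
    ["Commercial interests (43)", "Policy formation (35)"]
)
security_related = pooled_upheld_rate(
    ["Defence (26)", "National security (24)",
     "International relations (27)", "Security bodies (23)"]
)
print(f"{commercial_policy:.1f}% vs {security_related:.1f}%")
```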

My analysis uses three spreadsheets with details of ICO rulings which were recently disclosed via the What Do They Know website, in response to a request from Alison Benson. The spreadsheets list the ICO’s formal decision notices from the first one in 2005 until last month.

The ICO maintains that it provided this material voluntarily ‘on a discretionary basis’, arguing that the information would be already available through its routine publication of decision notices.

However the supply of these three files makes the statistical analysis of ICO rulings much more practical than by trying to process all the individually published decisions. The ICO’s release of this dataset is therefore a positive and welcome step in terms of its own transparency.

Environmental information

Note that my analysis excludes environmental information, which falls under a different law, the Environmental Information Regulations. The EIR exceptions do not exactly correspond to the FOI exemptions, so the data cannot be combined.

The numbers of EIR cases are fewer than for FOI, but a similar pattern emerges. Thus the ICO has more frequently overruled public authorities when they base an EIR refusal on commercial confidentiality or the internal nature of communications, rather than when authorities rely, say, on protecting the course of justice.

Delay

It is also possible to analyse aspects of the dataset in more detailed ways. Here is one example.

This table shows the 15 public authorities against whom the ICO has most often upheld complaints about delay in processing FOI requests (under section 10 of the FOI Act), and how many times this has happened since 2005.

Public authority	Upheld complaints about FOI delay
Home Office	303
Ministry of Justice	173
NHS England	162
Cabinet Office	161
Dept of Health and Social Care	84
Metropolitan Police	82
Dept for Work and Pensions	79
Foreign Office	74
Sussex Police	74
BBC	60
Ministry of Defence	58
Dept for Education	54
Wirral Council	43
Croydon Council	39
Information Commissioner’s Office	35

Source: Martin Rosenbaum, based on ICO data

On this measure the public authorities with the biggest record of delay since FOI was implemented are the Home Office, the Ministry of Justice, NHS England and the Cabinet Office.

Ironically the authority which comes 15th on this list of shame is the ICO itself! This is clearly a very bad record for an organisation which should be setting a good example of prompt compliance with the law, but at least as a regulator it has been willing to point out its own failings.

Notes: 1) My analysis amalgamates bodies which at some point since 2005 had some change of name or scope but remained essentially the same organisation (eg NHS England with NHS Commissioning Board; Department of Health and Social Care with Department of Health). 2) The ICO is thoroughly and annoyingly inconsistent when naming authorities (eg sometimes using ‘Metropolitan Police Service’ and sometimes using ‘Commissioner of the Metropolitan Police Service’). I hope I have spotted all such instances and combined the figures accordingly, but it is possible I have missed some.
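Amalgamating name variants like these can be sketched as a lookup from each variant to one canonical label, followed by a count. The alias table below uses only the examples mentioned in the notes, and the list of rows is invented for illustration; a real clean-up would need a much longer mapping.

```python
from collections import Counter

# Variant name -> canonical label (examples taken from the notes above).
ALIASES = {
    "Commissioner of the Metropolitan Police Service": "Metropolitan Police",
    "Metropolitan Police Service": "Metropolitan Police",
    "NHS Commissioning Board": "NHS England",
    "Department of Health": "Dept of Health and Social Care",
    "Department of Health and Social Care": "Dept of Health and Social Care",
}

def canonical_name(raw):
    """Collapse stray whitespace, then map known variants to a single label."""
    cleaned = " ".join(raw.split())
    return ALIASES.get(cleaned, cleaned)

# Hypothetical authority names as they might appear in decision-notice rows.
rows = [
    "Metropolitan Police Service",
    "Commissioner of the  Metropolitan Police Service",
    "Metropolitan Police",
    "NHS Commissioning Board",
    "NHS England",
    "Department of Health",
]

counts = Counter(canonical_name(name) for name in rows)
print(counts.most_common())
```

Names not in the alias table pass through unchanged, so any missed variant shows up as a separate entry in the counts, which makes it easier to spot.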


Election prediction models: how they fared

Data analysis, Elections / 6 July 2024 / 2024, election, MRP, politics, polling, prediction, voting

Which predictive model for the results of the election was best – or the least bad?

I say ‘least bad’, because in what may seem like the frequent tradition of the British polling industry, they all overstated how well Labour would do.

However there was also a huge gap between the least bad and the much worse. In a close election discrepancies of this extent would have pointed during the campaign to very different political situations, creating the impression that the forecasting models were contradictory chaos. This level of variation is somewhat disguised by the universal prediction of what could be called a ‘Labour landslide’, now confirmed as fact (even if it isn’t as big as they all said it was going to be).

Labour seats

Let’s look at the forecasts for the total number of Labour seats. This determines the size of Labour’s majority and is the most politically significant single measure of how the electorate voted.

Actual result for Labour seats	412
Britain Predicts	418
More In Common	430
YouGov	431
Election Maps	432
Economist*	433
JL Partners	442
Focal Data	444
Financial Times	447
Electoral Calculus	453
Ipsos	453
We Think	465
Survation**	470
Savanta	516

I have listed the models which predicted votes for each constituency in Great Britain and were included in the excellent aggregation site produced by Peter Inglesby. (If that means any model is missing which should have been added, my apologies.)

Note that what I am comparing here are the statistical models which aimed to forecast the voting pattern in each seat, not normal opinion polls which only provide national figures for vote share. These competing models are all based on different methodologies, the full details of which are not made public.

The large number of such models was a new feature of this election, linked to the growing adoption of MRP polling along with developments in the techniques and capacity of data science.

On this basis the winner would be the Britain Predicts model devised by Ben Walker and the New Statesman. Well done to them.
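The ranking can be reproduced directly from the table by subtracting the actual result from each forecast. A short sketch using the figures listed above:

```python
ACTUAL_LABOUR_SEATS = 412

# Forecast Labour seat totals, copied from the table.
forecasts = {
    "Britain Predicts": 418, "More In Common": 430, "YouGov": 431,
    "Election Maps": 432, "Economist": 433, "JL Partners": 442,
    "Focal Data": 444, "Financial Times": 447, "Electoral Calculus": 453,
    "Ipsos": 453, "We Think": 465, "Survation": 470, "Savanta": 516,
}

# Signed error: positive means the model over-predicted Labour.
errors = {model: seats - ACTUAL_LABOUR_SEATS for model, seats in forecasts.items()}

# Rank by size of miss, smallest first.
ranked = sorted(errors.items(), key=lambda item: abs(item[1]))

for model, error in ranked:
    print(f"{model:20s} {error:+d}")
```

Every signed error is positive, which is the sense in which all the models overstated Labour.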

This model is not based on a single poll itself, but takes published polling data and mixes it into its analysis. This is also true of some of the others around the middle of the table, such as the Economist and the Financial Times.

On the other hand polling companies like YouGov and Survation base their constituency-level forecasts on their own MRP polls (Multilevel Regression and Post-stratification), combining large samples and statistical modelling to produce forecasts for each seat.

The closest MRP here is the More in Common one, with YouGov narrowly next. However the models at the bottom of the table are also MRP polls rather than mixed models – We Think, Survation and Savanta. (It should be noted that the Savanta one was conducted in the middle of the campaign and so was more vulnerable to late swing.)

Constituency predictions

However a different winner emerges from a more detailed examination of the constituency level results. This is based on my analysis using the data aggregated on Peter Inglesby’s website.

Although Britain Predicts was closest for the overall picture, it got 80 individual seats wrong in terms of the winning party. This was often in opposite directions, so at the net level they cancelled each other out. It predicted Labour would win 33 seats that they lost, while also predicting they would lose 26 seats which the party actually won.

In contrast YouGov got the fewest seats with the wrong party winning, just 58. So well done to them. And I’m actually being a bit harsh to YouGov here, as this is counting the 10 seats they predicted as a ‘tie’ as all wrong – on the basis that (a) the outcome wasn’t a tie (haha), and (b) companies shouldn’t get ranked with a better performance via ambiguous forecasts which their competitors avoid. If you disagree with that, and the alternative might be the more measured approach, you can score them at 53.

The two models that did next best at the constituency level were Election Maps (62 wrong) and the Economist (76 wrong). The worst-scoring models were We Think and Savanta, which both got 134 seats wrong.
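The difference between seat-level errors and the net seat total, and the effect of how ties are scored, can be sketched as follows. The five seats and their results are invented, and the lenient rule shown (a tie is simply not counted as an error) is only one possible way of being kinder to tied forecasts.

```python
# Hypothetical forecasts and results; "Tie" marks a tied forecast.
predicted = {"Seat A": "Lab", "Seat B": "Con", "Seat C": "Tie", "Seat D": "Lab", "Seat E": "LD"}
actual = {"Seat A": "Lab", "Seat B": "Lab", "Seat C": "Con", "Seat D": "Con", "Seat E": "LD"}

def seat_errors(predicted, actual, count_ties_as_wrong=True):
    """Number of seats where the forecast winner differs from the actual winner."""
    wrong = 0
    for seat, winner in actual.items():
        forecast = predicted[seat]
        if forecast == "Tie":
            wrong += 1 if count_ties_as_wrong else 0
        elif forecast != winner:
            wrong += 1
    return wrong

def net_seat_error(predicted, actual, party):
    """Forecast seat total for a party minus its actual seat total."""
    forecast_total = sum(p == party for p in predicted.values())
    actual_total = sum(w == party for w in actual.values())
    return forecast_total - actual_total

strict = seat_errors(predicted, actual)
lenient = seat_errors(predicted, actual, count_ties_as_wrong=False)
labour_net = net_seat_error(predicted, actual, "Lab")
print(strict, lenient, labour_net)
```

Here the Labour total is exactly right even though two Labour-related seats are called wrongly in opposite directions, which is how a model can lead on the overall picture while trailing on individual seats.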

This table shows the number of constituencies where the model wrongly predicted the winning party.

Model	Errors at seat level
YouGov	53
Election Maps	62
Economist	76
Britain Predicts	80
Focal Data	80
More in Common	83
JL Partners	91
Electoral Calculus	93
Financial Times	93
Ipsos	93
Survation	100
Savanta	134
We Think	134

Source: Analysis by Martin Rosenbaum, using data from Peter Inglesby’s aggregation site.

(In the table I’m adopting the slightly kinder option for YouGov.)

This constituency-level analysis also sheds light on the nature of the forecasting mistakes.

There were some common issues. Generally the models failed to predict the success of the independent candidates who appealed largely to Muslim voters and either won or significantly affected the result. On the one hand it is difficult for nationally structured models to pick up on anomalous constituencies. On the other it is possible that the models typically do not give enough weight to religion (as opposed to ethnicity).

On this point there’s increasing evidence of growing differences in voting patterns between Muslim and Hindu communities. It’s striking that 12 of the 13 models (all except YouGov) wrongly forecast that the Tories would lose Harrow East, a seat with a large Hindu population where the party bucked the trend and actually increased its majority.

The models also failed almost universally to predict quite how badly the SNP would do – ironically with the exception of Savanta, the least accurate model overall.

On the other hand there were also wide variations between the models in terms of where they made mistakes. In all there were 245 seats – 39% of the total – where at least one model forecast the wrong winning party.

The seats that most confused the modellers are as follows.

Seats where all the 13 modellers predicted the wrong winning party: Birmingham Perry Barr, Blackburn, Chingford and Woodford Green, Dewsbury and Batley, Fylde, Harwich and North Essex, Keighley and Ilkley, Leicester East, Leicester South, Staffordshire Moorlands, Stockton West, plus the final seat to declare: Inverness, Skye and West Ross-shire***.

Seats where 12 of the 13 modellers predicted the wrong winning party: Beverley and Holderness, Godalming and Ash, Harrow East, Isle of Wight East, Mid Bedfordshire, North East Hampshire, South Basildon and East Thurrock, The Wrekin.

Overall seats v individual constituency forecasts

So which is more important – to get closest to the overall national picture, or to get most individual seats right?

The statistical modelling processes involved are inherently probabilistic, and it’s assumed they will make some errors on individual seats that will cancel each other out. That’s the case for saying Britain Predicts is the winner.

But if you want confidence that the modelling process is working comparatively accurately, that would point towards getting the most individual seats right – and YouGov.

Note that this analysis is based just on the identity of the winning party in each seat. Comparing the actual against forecast vote shares in each constituency could give a different picture. I haven’t had the time to do that more detailed work yet.

Traditional polling v predictive models

The traditional (non-MRP) polls also substantially overstated the Labour vote share, as the MRP ones did, raising further awkward questions for the polling industry. However, there’s an interesting difference between the potential impact of the traditional polls compared to the predictive models which proliferated at this election.

Without these models, the normal general assumption for translating vote shares into seats would have been uniform national swing. (This would have been in line with the historical norm that turned out to be inapplicable to this election, where Labour and the LibDems benefitted greatly from differential swing patterns across the country.) And seat forecasts reliant on that old standard assumption would then have involved nothing like the massive Labour majorities suggested by the models.

Although the predictive modelling in 2024 universally overstated Labour’s position, it did locate us in broadly the correct political terrain – ‘Labour landslide’. We wouldn’t have been expecting that kind of outcome if we’d only had the traditional polling (even with the way it exaggerated the Labour share).

To that extent the result was some kind of vindication for predictive modelling and its seat-based approach in general, despite the substantial errors. The MRP polls and the models that reflected them succeeded in detecting some crucial differential swings in social/geographic/political segments of the population (while also exaggerating their implications).

However, it’s also possible that the models/polls could in a way have been self-negating predictions. By forecasting such a large Labour victory and huge disaster for the Tories, they could have depressed turnout amongst less committed Labour supporters, who then decided not to bother going to the polling station. They could also have nudged people who were until the end of the campaign intending to back Labour into voting LibDem, Green or independent (or indeed Reform).

Notes

*Note on Economist prediction: Their website gives 427 as a median prediction for Labour seats, but their median predictions for all parties sum up to well short of the total number of GB seats. In my view that would not make a fair comparison. Instead I have used the figure in Peter Inglesby’s summary table, which I assume derives from adding up the individual constituency predictions.

**UPDATE 1: Note on Survation prediction: After initially publishing this piece I was informed that Survation released a very late update to their forecast which cut their prediction for Labour seats from 484 to 470. The initial version of my table used the 484 figure, which I have now replaced with 470. However, despite reducing the extent of their error, this does not affect their position in the table as second last.

Other notes: (1) I haven’t been able to personally check the accuracy of Peter Inglesby’s data, for reasons of time, but I have no reason to doubt it. I should add that I am very grateful to him for his work in bringing all the modelling forecasts together in one place. (2) This article doesn’t take account of the outcome in Inverness, Skye and West Ross-shire, which at the time of writing was yet to declare.

***UPDATE 2: The eventual LibDem victory in Inverness, Skye and West Ross-shire was not predicted by any model; all of them forecast the SNP would win. This means the seat has to be added to my initial list of those which all the models got wrong, which therefore now totals 12 constituencies.


Absent on Fridays

Data analysis / 8 January 2024 / Fridays, school absence

Pupils are over 20 per cent more likely to be absent from school on Fridays compared to Wednesdays.

The average rate of absence last term in England’s state-funded schools was 7.5% on Fridays. This compares to 6.7% on Mondays, the next most common day for school absence, and the lower figures for the middle of the week: 6.3% for Tuesdays, 6.2% for Wednesdays and 6.4% for Thursdays.

I have derived these figures by analysing the detailed school attendance data collected and published by the Department for Education.
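The ‘over 20 per cent’ comparison follows from the daily rates quoted above. Here is a minimal sketch of that arithmetic; since the published rates are rounded to one decimal place, the result is approximate, and the function name is just mine for illustration.

```python
# Average absence rates (%) by weekday, as quoted in the text above.
absence_rate = {"Mon": 6.7, "Tue": 6.3, "Wed": 6.2, "Thu": 6.4, "Fri": 7.5}

def relative_excess(day: str, baseline: str) -> float:
    """How much higher (in %) the absence rate is on `day` than on `baseline`."""
    return (absence_rate[day] / absence_rate[baseline] - 1) * 100

friday_vs_wednesday = relative_excess("Fri", "Wed")
print(f"Friday absence is about {friday_vs_wednesday:.0f}% higher than Wednesday")
```

Note the difference between this relative figure and the gap in percentage points, which is 7.5 minus 6.2, or 1.3 points.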

The issue of school attendance is moving up the political agenda, as levels of absence are now much higher than before the covid pandemic.

The government has today announced what it calls ‘a major national drive to improve school attendance’, with measures targeted at tackling persistent absence. The Labour party is also focusing on the issue this week.

This weekly pattern of absence being highest on Fridays, and second-highest on Mondays, with better attendance mid-week, is a widespread feature of the current school system.

From my analysis of the DfE’s data, it applies in both primary and secondary schools, and also in all regions of England.

It is seen when looking at both authorised and unauthorised absences from school. It also applies to absence due to illness, which is the most common reason recorded for pupils not attending school.

It was also evident throughout the autumn term, as can be seen in this chart (with a particular peak on the Friday before half-term).

The DfE’s data on school attendance can be downloaded here.

In a previous post I examined how school attendance can be affected by when in the year pupils are born.


Absence from school and month of birth

Data analysis / 3 January 2024 / relative age effect, school absence

For school pupils, does when in the year they are born affect how often they are absent from school?

My analysis of government data suggests that secondary school pupils born in September to December have a somewhat higher absence rate than those born in May to August – which is actually the opposite of what I expected.

Absence from school is now significantly higher than before the covid-19 pandemic, and tackling this has been made a target of government educational policy.

Since 2022 the Department for Education (DfE) has been collecting centrally some remarkably detailed and up-to-date data on attendance records for individual pupils from many schools in England, and publishing regular summaries.

The data collated by the department makes it possible to quickly analyse a wide range of factors and potential connections with absences.

Since month of birth is definitely related to other aspects of school life, such as how well pupils do in exams and in sport – the so-called ‘relative age effect’ – I decided to explore any link with school attendance. Through a freedom of information request I obtained pupil attendance data from the DfE for the school year 2022/23, broken down by type of school, school year and month of birth.

This table shows the percentage of school sessions missed by pupils in selected year groups. It shows that for pupils in years 1 and 2 (aged 5/6 and 6/7), it was the summer-born pupils who had higher rates of absence. This was what I expected, given the well-documented school problems often faced by summer-born children.

But for pupils in years 8 to 11 (aged 12/13 to 15/16), it was those born in September to December who were more likely to be absent.

However the differences within the year groups are not massive, so this pattern (while clear) shouldn’t be overstated. For the intervening ages the data showed very little variation within each year group, so I haven’t presented the figures here. I haven’t obtained data for the reception year.

All this data relates to pupils at about 85% of state-funded schools in England, those which take part in the DfE scheme for automatically submitting daily attendance information.

The following graph shows the same data presented in the form of a line chart.

Persistent absence is a particular problem. This is defined as when pupils are absent for over 10% of school sessions. Analysing the data on persistent absence discloses a similar pattern.

This is indicated in the table below (which involves data from primary and secondary schools, but not special schools).

Generally rates of absence increase as pupils get older and move into higher year groups. Perhaps this trend could help to explain the fact that in secondary schools it’s the older pupils within the year group who tend to be absent more often.

But this can’t be a complete explanation – for example, the frequency of persistent absence is higher for year 10 September births (32.4%) than for the older pupils born in August and in year 11 (30.7%), and similarly for various other data points.

So it looks like there may be some kind of relative age effect involved here, if probably quite mild.

Bear in mind that this is just one year’s data, the period in the wake of the pandemic could be atypical, and there is also the possibility of random variation.

As another potential factor, some illnesses have been associated with when people are born within the year. However, this would not explain the jumps in this data between August and September births.

The DfE data distinguishes authorised and unauthorised absences, but this does not help much in explaining the pattern identified here.

It’s important to note that there are other characteristics which clearly have a bigger impact on school attendance, including levels of disadvantage (poorer pupils are more likely to be absent) and ethnicity (Caribbean and White ethnic groups have higher absence rates than Indian, African and Chinese groups).

The data spreadsheet supplied to me under FOI by the Department for Education is here.

For background on the government’s impressive automated collection of real-time school attendance data, you can watch a recent talk by Caroline Kempner, the DfE’s head of data transformation, given at one of the regular Institute for Government ‘Data Bites’ events (from 37’25” in the video).

It was hearing this presentation which prompted me to do this analysis.


A&E: when are waits shortest?

Data analysis / 22 November 2022 / A&E, NHS

Would you like to know what times of the week have the shortest or longest waits in your local A&E department?

I’ve obtained a spreadsheet from NHS Digital (via a freedom of information request) which reveals just that.

The spreadsheet gives data separately for each provider of urgent and emergency care in England for 2021/22. For patient arrivals in each hour of the week, it shows the average duration of attendance there until discharge or admission – ie, until leaving the hospital or being admitted as an inpatient.

The overall A&E pattern is very much that there are longer waits in the late evening and overnight, shorter waits in the morning, with the afternoons/early evenings in the middle.

Source: Analysis by Martin Rosenbaum from NHS Digital data

In this chart each row going across is a different provider of emergency/urgent care in England (I have excluded those with only partial data or which are not 24-hour services), and each column is an hour of the week, going from 0000-0059 on Monday to 2300-2359 on Sunday. The red cells show longer average waiting times, the green cells shorter waits, and the yellow ones intermediate times.

It makes clear that for almost all providers, patients who arrive just before midnight or in the hours afterwards experience the longest waits on average, while those who arrive in the morning have the shortest waits.

This pattern is the same on every day of the week, including weekends. The very longest waits of all tend to be overnight from Monday to Tuesday.

The exceptions to this are mainly urgent treatment centres rather than A&E departments – their busiest times are often late afternoon or early evening. They appear congregated towards the top of this chart due to the ordering of the NHS provider code system.

Some providers show much greater variation in waiting times across different points of the week than others do.
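For anyone wanting to work with the hour-of-week columns described above (running from 0000-0059 on Monday to 2300-2359 on Sunday), here is a minimal sketch of how an arrival time maps to a column. The function names and the zero-based index are my own assumptions for illustration, not the layout of the NHS Digital spreadsheet itself.

```python
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def hour_of_week(arrival: datetime) -> int:
    """Column index from 0 (Monday 0000-0059) to 167 (Sunday 2300-2359)."""
    # datetime.weekday() counts Monday as 0 and Sunday as 6.
    return arrival.weekday() * 24 + arrival.hour

def column_label(index: int) -> str:
    """Turn a column index back into a readable day and hour band."""
    day, hour = divmod(index, 24)
    return f"{DAYS[day]} {hour:02d}00-{hour:02d}59"

# Example: an arrival at 23:40 on Monday 4 April 2022.
late_monday = hour_of_week(datetime(2022, 4, 4, 23, 40))
print(late_monday, column_label(late_monday))
```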

Overall national statistics about busy times of the day in A&E are published routinely, but as far as I am aware this dataset broken down by different local providers and hour of the week has not been released before.

In a period when there is increasing concern over waiting times for emergency and urgent care, it is important and valuable localised information.


Scotland’s alphabet effect

Data analysis, Elections / 11 May 2022 / Names, politics, voting

Last week’s local election results appear to confirm how a candidate’s chance of getting elected to Scotland’s councils is dramatically influenced by a factor which is nothing to do with their abilities – alphabetical order of surnames.

This arises from the voting system used for Scottish council elections, the Single Transferable Vote (STV), where voters number candidates in their order of preference.

Parties will stand more than one candidate in a multi-member ward if they think they have a chance of getting more than one elected.

But of course lots of voters, who may have strong preferences between the parties, don’t particularly care about preferring one candidate from within a party to another.

It’s well established that under STV many voters have a tendency to number candidates from the same party just in the order they find them on the ballot paper, which is a major advantage for those listed first. In Scotland that is alphabetical order by surname.

To illustrate the striking extent of this I have looked at what happened last week in two Scottish councils, Aberdeen and West Lothian (the first and last councils alphabetically, in a limited attempt to avoid alphabetical bias in my selection).

I examined all the cases in these two councils where a party stood two or more candidates in one ward.
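The comparison can be sketched along the following lines. The rows here are entirely made up (hypothetical wards, parties, surnames and vote counts), and this is an assumed outline of the method rather than the actual working: it groups candidates by ward and party, keeps the groups with two or more candidates, and compares the first and second in alphabetical order.

```python
from collections import defaultdict

# Hypothetical rows: (ward, party, surname, first_preferences, elected).
candidates = [
    ("Ward 1", "Party X", "Adams", 1500, True),
    ("Ward 1", "Party X", "Young", 700, False),
    ("Ward 1", "Party Y", "Brown", 900, True),   # only one candidate: ignored
    ("Ward 2", "Party X", "Mills", 1300, True),
    ("Ward 2", "Party X", "Clark", 1100, True),
]

def alphabetical_pairs(rows):
    """Return (first, second) alphabetically for each party with 2+ candidates in a ward."""
    groups = defaultdict(list)
    for ward, party, surname, votes, elected in rows:
        groups[(ward, party)].append((surname, votes, elected))
    pairs = []
    for group in groups.values():
        if len(group) >= 2:
            ordered = sorted(group, key=lambda c: c[0].casefold())
            pairs.append((ordered[0], ordered[1]))
    return pairs

pairs = alphabetical_pairs(candidates)
first_ahead = sum(1 for first, second in pairs if first[1] > second[1])
mean_first = sum(first[1] for first, _ in pairs) / len(pairs)
mean_second = sum(second[1] for _, second in pairs) / len(pairs)
elected_first = sum(first[2] for first, _ in pairs) / len(pairs)
elected_second = sum(second[2] for _, second in pairs) / len(pairs)

print(len(pairs), first_ahead, mean_first, mean_second, elected_first, elected_second)
```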

In West Lothian, there were 14 examples. In 13 of these, the candidate who came first alphabetically from that party got more first preference votes than the candidate listed second alphabetically, sometimes by huge margins.

The candidates listed first alphabetically for a party averaged 1,669 first preference votes; the candidates from the same party listed second alphabetically only averaged 745 first preferences – fewer than half as many.

The result was that the candidates listed first alphabetically for a party had a 100% success rate at getting elected; the candidates from the same party listed second alphabetically only had a 64% success rate of election.

In Aberdeen, there were 16 examples. In 14 of these, the candidate who came first alphabetically from that party got more first preference votes than the candidate listed second alphabetically, again sometimes by huge margins.

The candidates listed first alphabetically for a party averaged 1,223 first preference votes; the candidates from the same party listed second alphabetically only averaged 554 first preferences – again, fewer than half as many.

The result here was that the candidates listed first alphabetically for a party had an 88% success rate at getting elected; the candidates from the same party listed second alphabetically only had a 56% success rate of election.

Obviously it would be ideal to do this analysis for all the 32 local authorities in Scotland. But given the different locations and formats in which all the results are published, that would be a very laborious exercise which is too time-consuming for me to do right now. If there was one single national database of all Scottish local election results in a convenient format for exporting data then it would be a lot more feasible! (I also haven’t examined the impact in the very different political circumstances of Northern Ireland).

It seems clear that the current position in Scotland represents a form of institutionalised systemic discrimination. A council seat is often a step towards building a powerful political career on a bigger stage.

In the past the Scottish government has considered various means of ameliorating this situation but has not implemented any change. Potential options would include randomising the ballot paper order or listing candidates in reverse alphabetical order on half the ballot papers.

Parties could counteract the effect if they had loyal, disciplined voters who would order candidates as instructed, with different instructions issued to different subsets of voters. Roughly equalising the number of first preferences would help to get more than one of their candidates elected.

There has been some evidence of alphabetical voting affecting results in English and Welsh elections, but this is to a much lesser extent because of the different voting systems. Alphabetical voting is also an international phenomenon.

And alphabetical bias also exists in other contexts – here’s an interesting paper on its impact in an academic discipline where co-authors of papers were listed alphabetically.

By the way, when drafting this piece I noticed I had automatically defaulted to providing the Aberdeen data before that for West Lothian, so I went back and reversed that. But I did leave Aberdeen first in the chart.

The acceptance of alphabetical order as an apparently natural and unproblematic method may have a deeper and more insidious grip on our minds, and more important consequences, than we may consciously realise.


From Morgan to Frankie

Data analysis / 24 October 2021 / Names

The most popular gender-neutral first names given to babies in England and Wales in 2020 were Frankie, River and Harley.

Looking back at a longer period, the most common gender-neutral first names over the past 25 years were Morgan, Charlie and Taylor.

This is according to my analysis of the baby name datasets for England and Wales issued by the Office for National Statistics, who released their figures for 2020 a few days ago.

The ONS compiles separate datasets for the names of boys and girls. Their annual lists of most popular boys’ and girls’ names are always widely reported. I decided to examine something they don’t analyse – the frequency of gender-neutral or unisex names.

In 2020 there were just 10 first names given at birth to both over 100 girls and over 100 boys. They are listed in this table:

They are ordered according to how often they were used for whichever sex they were less popular for. This measure is mine. As it reflects the frequency of the names in both cases, it seems to me to capture gender-neutrality or ‘unisexness’ better than any other criterion I came up with, although other approaches are possible.
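In code terms, the measure amounts to taking the smaller of the two counts for each name and ranking on that. Here is a minimal sketch using invented placeholder names and counts (not ONS figures), with the function name and threshold parameter being my own for illustration.

```python
# Hypothetical counts of babies given each name in one year (not real ONS data).
girls = {"NameA": 150, "NameB": 400, "NameC": 90, "NameD": 120}
boys = {"NameA": 300, "NameB": 110, "NameC": 500, "NameD": 125}

def unisex_ranking(girls: dict, boys: dict, threshold: int) -> list:
    """Rank names by their count for whichever sex they were less popular for.

    Only names given to more than `threshold` girls and more than `threshold` boys qualify.
    """
    scored = {}
    for name in girls.keys() & boys.keys():
        lower_count = min(girls[name], boys[name])
        if lower_count > threshold:
            scored[name] = lower_count
    # Highest lower count first; ties broken alphabetically for a stable order.
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))

ranking = unisex_ranking(girls, boys, threshold=100)
print(ranking)
```

A name heavily used for one sex but rarely for the other (like NameC here) drops out, which is the point of using the lower count instead of the total.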

Here is a comparable table compiled on the same basis for the past 25 years in total (the published ONS data goes back to 1996), featuring the 12 first names given at birth both to over 2,000 girls and over 2,000 boys:

So Morgan is the leading unisex first name over this time range, the only name to have been given to over 9,000 girls and also over 9,000 boys in the 25-year period from 1996 to 2020. However it has declined considerably in popularity in recent years, as have some other names in this table.

It’s often said that there has been a long-term phenomenon of unisex names becoming ‘feminised’. Some traditional boys’ names start to become popular for girls too, and then parents apparently no longer want to give them to boys (classic examples include Evelyn and Shirley).

However there seems to be little evidence of such a trend in the ONS data over the past 25 years.

As one way to get an overall impression of this, each line on this chart below represents one of the 50 most popular gender-neutral names, and each column is a year, going chronologically from 1996 on the left to 2020 on the right. For each name, cells are coloured more in red for years when they were more popular for girls and more in blue when more popular for boys. (The colour-coding may be stereotypical, but it does make the chart more intuitive to grasp easily).

As time advances, the names move more from the redder/pinker areas to bluer ones than in the opposite way (although by no means uniformly).

That suggests these gender-neutral names are not becoming feminised; if anything they appear to have become a bit more popular for boys (ie bluer) and less popular for girls.

However looking at the data in more detail it seems that what is happening is mainly a trend amongst girls: in particular it’s becoming less common to give girls names like Charlie and Jamie, which are largely boys’ names but which 15 to 25 years ago were also used for a fair number of girls.

What this does mean is that unisex names now are more likely to be broadly similar in popularity for both girls and boys, rather than include various predominantly boys’ names which are also given to some girls.

Finally, it’s important to note that generally these unisex or gender-neutral names aren’t very popular at all. So from my list of top 10 unisex names in 2020, Frankie, the highest for boys, is only 61st in popularity for boys’ names overall that year; and Eden, the highest for girls, only just squeezes into the top 100 girls’ names at 98th.

Parents do seem to prefer to give their children names which are clearly recognisable as belonging to either a girl or a boy.

Note: The ONS data (and therefore this analysis) is based on the specific spellings of names on birth certificates and does not take account of similar names. In other words, Charlie and Charley, for example, are treated as entirely different names.
