David Stickland: ‘Is judging fit for purpose?’

  • For eight years I have been studying dressage scores. A strange preoccupation, many might say, but I’m convinced there’s more information to be mined from them than we have extracted so far.

    These analyses have been used to justify half-points, to move to seven-member juries at our most important events, and to advise Olympic teams on selection criteria and tuning up performance. Most recently, they have been used to build dashboards (online tools that let judges see the details of their judging and how it compares with that of their peers) so that judges can get useful feedback.

    With the Dutch Equestrian Federation (KNHS) collecting the individual mark details for their higher-level events and giving their judges dashboard access for the past two years, the momentum of collecting and using scores is building.

    The best judges have a judging precision (the spread of their score differences from the other judges) of about 2%. Each judge’s scores typically differ enough from their colleagues’ to change at least one rank in the top five.

    That’s for grand prix.

    Both the precision figure and the rank differences are larger in lower-level dressage, though I’m not blaming judges for this.

    These are the numbers for our top judges; people who understand dressage deeply, and often spend every weekend making this sport possible.

    So we have to decide whether this is good enough, and if not, what we can do about it.

    What is ‘good enough’?

    Most of us would probably accept that in the middle of the field exactly correct ranking is hard, but we want the medals to go to the best in the right order.

    My estimate, using historical data, is that the medallists in Rio might be separated by just 0.6% per rider on average. The difference between fourth and fifth?

    A mere 0.07% per rider. The judges are going to have a hard time getting that “right”.

    Olympic judges average about 1.8% precision. But even with seven of them, I’ve calculated the final-score accuracy at about 0.7%. That means about two-thirds of the time the final score will be “right” within a range of plus or minus 0.7%. The rest of the time it will be less accurate.
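
    A figure of about 0.7% is what the usual rule for averaging independent measurements gives: the error of the average shrinks with the square root of the number of judges. The column does not say how the calculation was actually done, so this is only a minimal sketch, assuming each judge’s error is independent, has a spread of 1.8%, and is roughly normally distributed. The function name is illustrative.

```python
import math
from statistics import NormalDist

def panel_accuracy(judge_precision: float, n_judges: int) -> float:
    """Standard error of the average of n independent judges' scores.

    Assumption: judges' errors are independent and share the same spread.
    """
    return judge_precision / math.sqrt(n_judges)

# 1.8% per judge and seven judges are the figures quoted in the text
accuracy = panel_accuracy(1.8, 7)

# Share of results expected within +/- one standard error,
# assuming normally distributed errors
within_one = NormalDist().cdf(1) - NormalDist().cdf(-1)

print(f"final-score accuracy: {accuracy:.2f}%")            # 0.68%
print(f"share within one standard error: {within_one:.0%}")  # 68%
```

    Under these assumptions the result rounds to 0.7%, and the 68% share matches the “about two-thirds of the time” in the text. If judges’ errors are correlated, the real accuracy would be worse than this.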

    Even when we consider the team total score of three riders, the final accuracy will be 0.4%; so the separation between team medallists will probably be safe, but it’s at the limit of judges’ ability to separate cleanly and end up with the right result.
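
    The 0.4% team figure can be reproduced in the same hedged way. This sketch assumes the team score is the average of three riders’ final scores and that the riders’ errors are independent; neither assumption is stated in the column.

```python
import math

rider_accuracy = 0.7  # per-rider final-score accuracy quoted in the text, in %
team_size = 3         # riders counted towards the team score

# Averaging independent rider scores shrinks the error by sqrt(team_size)
team_accuracy = rider_accuracy / math.sqrt(team_size)

print(f"team accuracy: {team_accuracy:.2f}%")  # 0.40%
```

    Set against a separation of about 0.6% between medallists, an uncertainty of 0.4% leaves little margin, which is the sense in which the result is “at the limit”.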

    So my answer is no, the system isn’t good enough for top-level dressage.

    What can we do about it?

    In the short term, we can give the judges as much feedback as is useful and let them see their differences so they can improve. We can also set aside discussion time after every competition, with the data in front of them, to focus on difficult judgements.

    Eventually we must require judges to be better, according to agreed performance levels, to be promoted or even to stay in their current grading.

    Long term, we must use this as an opportunity to design judging systems that are more efficient, better defined, and more understandable to the riders and to the public.

    Our top riders, trainers and judges agree pretty well on what they want to see, but the current system, which combines everything the judge sees in a 10-second window into one mark, cannot produce a reproducible score precise enough to do justice to the riders and their mounts.

    Ref: Horse & Hound; 12 May 2016