Thursday, October 29, 2020

Shannon-Fano_Crowd-Handicapping.md

Shannon-Fano Crowd Handicapping

Let us indulge ourselves in a thought experiment on how we might handicap the Crowd!

We have access to a betting-line (Betfair) for a graded-stakes race with a low WCMI, as well as some simple, publicly available data. Can we reverse-engineer what the Crowd is most likely factoring into its calculation?

Granted, this is no Schrödinger's Cat, but it might nevertheless enlighten us as to whether the betting-line is vulnerable.

As outlined in Handicapping Twenty Questions Benford's Law And Shannon Entropy, taking our lead from Shannon-Fano Coding, we should iteratively divide the entrants into two groups of approximately equal win probability (i.e. about 50% of the group being split) and use Pairwise Comparison to eliminate the non-contenders using at most four questions.

When the sub-divisions produced by the splits are approximately equal (50%) in terms of implied probability, then the one bit of information (question) used to distinguish them is maximally efficient. So, using the implied probability (I/P) of the betting-line odds (B/X) as our starting point, we can make an initial split into two groups:

  • Alpha, Bravo; and
  • Charlie, Delta, Echo, Foxtrot, Golf, Hotel.
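
As a minimal sketch of how such a first split can be found, the snippet below converts decimal odds into normalised implied probabilities, looks for the cut that leaves the two groups as close to 50/50 as possible, and reports how many bits the resulting yes/no question yields. The odds in `ODDS` are hypothetical, invented purely for illustration (the actual Betfair prices are not reproduced here), and the function names are likewise assumptions of this sketch.

```python
from math import log2

# Hypothetical decimal odds, invented for illustration only
# (the actual Betfair prices are not given in the text).
ODDS = {
    "Alpha": 3.5, "Bravo": 4.5, "Charlie": 7.0, "Delta": 8.5,
    "Echo": 14.0, "Foxtrot": 16.0, "Golf": 18.0, "Hotel": 20.0,
}


def implied_probabilities(odds):
    """Convert decimal odds to implied probabilities that sum to 1."""
    raw = {name: 1.0 / price for name, price in odds.items()}
    total = sum(raw.values())  # book percentage, slightly above 1 here
    return {name: p / total for name, p in raw.items()}


def best_split(ranked):
    """Index that cuts a favourite-first list of (name, prob) pairs
    into the two most nearly equal halves."""
    total = sum(p for _, p in ranked)
    running, best_index, best_gap = 0.0, 1, float("inf")
    for i, (_, p) in enumerate(ranked[:-1], start=1):
        running += p
        gap = abs(total - 2 * running)  # |top share - bottom share|
        if gap < best_gap:
            best_index, best_gap = i, gap
    return best_index


def binary_entropy(p):
    """Bits gained from a yes/no question whose 'yes' share is p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


probs = implied_probabilities(ODDS)
ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
cut = best_split(ranked)
top = [name for name, _ in ranked[:cut]]
rest = [name for name, _ in ranked[cut:]]
share = sum(p for _, p in ranked[:cut])

print(top, rest)
print(f"top group share = {share:.3f}, bits = {binary_entropy(share):.4f}")
```

With these made-up odds the cut falls after the second runner, mirroring the Alpha/Bravo grouping above; real prices would of course need to be substituted before drawing any conclusion.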

Keeping our interpretation as simple as possible, it looks like the initial division is based on speed ratings. Then, in deciding between Alpha and Bravo, trainer rating appears to clinch it.

The next sub-division is:

  • Charlie, Delta; and
  • Echo, Foxtrot, Golf, Hotel.

Here, form ratings are the most likely rationale for the split with trainer rating again deciding the rank order within the group.

Next, we split the four remaining horses:

  • Echo, Foxtrot; and
  • Golf, Hotel.

Weight (a proxy for the fillies' allowance) is the deciding factor here, but the final rank orders within these two groups cannot easily be accounted for. We have reached the limits of our simplistic approach. Obviously, the betting-line accounts for more factors than our naive approach uses. But just because our model is wrong does not mean it is not useful!
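
The successive splits walked through above can be sketched as a recursive Shannon-Fano partition. The code below is an illustration under stated assumptions, not a reconstruction of the actual line: the decimal odds are hypothetical, and the function name `shannon_fano` is simply a label chosen for this sketch. Each character of a runner's code corresponds to one yes/no question, so the code length is the number of questions needed to isolate that runner.

```python
# Hypothetical decimal odds, invented for illustration only
# (the actual Betfair prices are not given in the text).
ODDS = {
    "Alpha": 3.5, "Bravo": 4.5, "Charlie": 7.0, "Delta": 8.5,
    "Echo": 14.0, "Foxtrot": 16.0, "Golf": 18.0, "Hotel": 20.0,
}


def shannon_fano(ranked, prefix=""):
    """Recursively split a non-empty, favourite-first list of
    (name, weight) pairs into near-equal halves; return {name: code}."""
    if len(ranked) == 1:
        return {ranked[0][0]: prefix}
    total = sum(w for _, w in ranked)
    running, cut, best_gap = 0.0, 1, float("inf")
    for i, (_, w) in enumerate(ranked[:-1], start=1):
        running += w
        gap = abs(total - 2 * running)  # imbalance between the halves
        if gap < best_gap:
            cut, best_gap = i, gap
    codes = shannon_fano(ranked[:cut], prefix + "0")
    codes.update(shannon_fano(ranked[cut:], prefix + "1"))
    return codes


# Normalise the raw implied probabilities so that they sum to 1.
raw = {name: 1.0 / price for name, price in ODDS.items()}
book = sum(raw.values())
ranked = sorted(
    ((name, p / book) for name, p in raw.items()),
    key=lambda item: item[1],
    reverse=True,
)

codes = shannon_fano(ranked)
for name, prob in ranked:
    print(f"{name:8s} p={prob:.3f} code={codes[name]} questions={len(codes[name])}")

# Probability-weighted average number of questions per runner.
expected = sum(prob * len(codes[name]) for name, prob in ranked)
print(f"expected questions: {expected:.2f}")
```

Under these invented odds the partition reproduces the groupings above, with the two favourites isolated in two questions and the four outsiders in four, which is consistent with the "at most four questions" bound.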

In summary, speed ratings appear to be the primary driving factor for the betting-line with trainer rating as the qualifier. Given the likely high correlation between speed and form ratings, the fact that both are used suggests an element of double-counting by the Crowd and, consequently, may indicate a vulnerable betting-line.
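
One way to make the double-counting concern concrete, offered as a sketch rather than as a claim about how the Crowd actually prices the race: let S stand for the speed rating and F for the form rating (symbols introduced here purely for illustration). The joint information carried by the two ratings is

```latex
H(S, F) = H(S) + H(F) - I(S; F)
```

where I(S; F) is their mutual information. A line that weights the two ratings as if they were independent effectively credits H(S) + H(F), overstating the true joint information by I(S; F), and that term is large precisely when the ratings are highly correlated.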

Is Schrödinger's cat alive, dead or both?