Tuesday, May 20, 2025

TF Pace-Map EPF Scenarios

WCMI Timeform Pace Maps: Early Position Scenarios

Timeform Pace Maps


If you subscribe to Timeform, you will most likely be familiar with its Flat-Racing Pace-Maps (see the first image below, horse names blurred out). The map aims to reflect the most likely early position scenarios for a specific race.

🔍 NOTE: The pace map and the Monte-Carlo simulations model early race positions (typically after two furlongs), not finishing positions. They help predict how the race will unfold tactically, not which horse will ultimately win.

Timeform Racecard Pace Map


Pace Map EPF Probability Analyzer

I confess that converting colour gradations into probability estimates is not my strong suit, but I wondered if my neighbourhood Large Language Model (LLM) might have such expertise.

Early Position Figures (EPFs) indicate where a horse is likely to be positioned in the early stages of a race. In Timeform's system, EPF 1 represents a front-runner that is expected to lead, while higher numbers (up to 9) indicate horses likely to be positioned further back in the field.

We can analyze these pace maps as follows:

  1. Image Encoding and Preparation:

    The pace map, typically provided as an image (screenshot), is first converted into a base64-encoded string suitable for analysis.

import base64

def encode_image_to_base64(image_path):
    """Convert image to base64 string for API submission"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
...  
def prepare_image(state: dict) -> dict:
    """Prepare the image for API submission"""
    new_state = state.copy()
    try:
        base64_image = encode_image_to_base64(new_state['image_path'])
        new_state['base64_image'] = base64_image
        print(f"Successfully encoded image for API submission")
    except Exception as e:
        print(f"Error encoding image: {str(e)}")
        raise
    return new_state
  2. LLM-based Image Analysis:

    Utilising a state-of-the-art Large Language Model (LLM) from an external API, the analyser decodes the graphical data. It identifies the predicted positions (marked with black dots) and interprets the colour intensities (shades of red) as probability distributions for each horse's Early Position Figure (EPF). Darker shades of red indicate higher probability, while lighter shades represent lower probability of a horse taking that position.

def analyze_with_llm(state: dict) -> dict:
    """Send the image to LLM API for analysis"""
    new_state = state.copy()
    
    client = llm.LLM(api_key=LLM_API_KEY)
    
    # Prepare the system prompt
    system_prompt = """
    You are an expert in analyzing horse racing pace maps. You will analyze the uploaded "Pace Map" image and extract Early Position Figure (EPF) data.
    
    For each horse, identify:
    1. The predicted EPF position (where the black dot is located)
    2. The probability distribution (from color intensity)
    
    Convert this data into parameters for a triangular probability distribution with:
    - EPFProbMin: the minimum probability estimate based on color intensity
    - EPFProbMode: the peak probability at the predicted position (black dot)
    - EPFProbMax: the maximum probability estimate
    
    Return results in CSV format with header: Horse;EPF;EPFProbMin;EPFProbMode;EPFProbMax
    Only include the CSV data in your response, no additional text or explanations.
    """
    
    # Prepare the user prompt
    user_prompt = "Using the attached 'Pace Map' for the horse race, analyze each horse's Early Position Figure (EPF) data and convert it into parameters for a triangular probability distribution. The black dots indicate predicted positions, while the heat map colors show probability densities. Please output only the CSV data with the following fields: Horse;EPF;EPFProbMin;EPFProbMode;EPFProbMax"
    
    try:
        # Create the message with the image
        response = client.messages.create(
            model="llm-3",
            system=system_prompt,
            max_tokens=6144, # 4096,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "image",
                            "source": {
                                "type": "base64",
                                "media_type": "image/jpeg",
                                "data": new_state['base64_image']
                            }
                        },
                        {
                            "type": "text",
                            "text": user_prompt
                        }
                    ]
                }
            ]
        )
        
        new_state['llm_response'] = response.content[0].text
        print("Successfully received analysis from LLM API")
    except Exception as e:
        print(f"Error calling LLM API: {str(e)}")
        raise

    return new_state
  3. Probability Distribution Generation:
    The program then translates the graphical data into parameters suitable for a triangular probability distribution. A triangular distribution is a simple probability model that uses three points - minimum, maximum, and most likely (mode) - making it ideal for this kind of analysis where we have limited information but can estimate these three key values. For each horse, it identifies:
    • EPFProbMin: the minimum likely probability,
    • EPFProbMode: the peak probability (indicated by the black dot),
    • EPFProbMax: the maximum likely probability.
def process_llm_response(state: dict) -> dict:
    """Extract and process CSV data from LLM's response"""
    new_state = state.copy()
    try:
        # Extract just the CSV data (removing any markdown formatting)
        response_text = new_state['llm_response']
        
        # Handle potential markdown code blocks
        if "```" in response_text:
            csv_data = response_text.split("```")[1]
            if csv_data.startswith("csv"):
                csv_data = csv_data[3:].strip()
        else:
            csv_data = response_text.strip()
            
        # Parse CSV data into a DataFrame
        df = pd.read_csv(io.StringIO(csv_data), sep=';')
        new_state['epf_results'] = df
        
        print(f"Successfully processed data for {len(df)} horses")
        print("\nPreview of parsed data:")
        print(tabulate(df.head(11), headers='keys', tablefmt='psql', showindex=False))
        
    except Exception as e:
        print(f"Error processing LLM response: {str(e)}")
        raise
        
    return new_state
  4. Structured Data Output:
    The final insights are presented in a clear CSV format, enabling easy integration with further analysis or betting models. The second image below shows this structured output, with each horse's EPF and associated probability parameters clearly displayed.
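To make the format concrete, the structured output looks something like the following (horse names and probability values here are purely illustrative, not read from the pace map above):

Horse;EPF;EPFProbMin;EPFProbMode;EPFProbMax
Alpha;1;0.55;0.75;0.90
Bravo;3;0.40;0.60;0.80
Charlie;7;0.30;0.50;0.70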

  5. Data Preparation for Simulation:
    Once we have the structured EPF data generated previously (minimum, mode, and maximum probabilities for each horse's predicted early position), we can load and prepare this data for a Monte Carlo simulation. This bridges our pace map analysis to a more dynamic model of race positioning.

def prepare_simulation_data(state: dict) -> dict:
    """ Prepare data for Monte Carlo simulation by organizing horse data. """

    new_state = state.copy()
    df = new_state['epf_results']  # structured EPF data stored earlier by process_llm_response
    
    # Create a data structure for each horse with its parameters
    horses_data = []
    for _, row in df.iterrows():
        horse_data = {
            'Horse': row['Horse'],
            'EPF': row['EPF'],
            'EPFProbMin': row['EPFProbMin'],
            'EPFProbMode': row['EPFProbMode'],
            'EPFProbMax': row['EPFProbMax']
        }
        horses_data.append(horse_data)
    
    new_state['horses_data'] = horses_data
    print(f"Prepared simulation data for {len(horses_data)} horses.")
    
    return new_state
  6. Executing the Simulation:

Each horse's early position is simulated numerous times (e.g., 100,000 iterations) using a triangular probability distribution. Each iteration introduces slight random variations, reflecting real-world uncertainties in how races unfold.

def run_monte_carlo_simulation(state: dict) -> dict:
    """ Run Monte Carlo simulation to predict early position running order. """
    
    new_state = state.copy()
    horses_data = new_state['horses_data']
    num_simulations = new_state.get('num_simulations', 10000)
    
    # Storage for all simulation results
    all_orders = []
    
    for sim in range(num_simulations):
        horse_positions = []
        
        for horse in horses_data:
            # Sample from triangular distribution to get certainty level
            certainty = np.random.triangular(
                horse['EPFProbMin'],
                horse['EPFProbMode'],
                horse['EPFProbMax']
            )
            
            # Calculate maximum possible variation based on certainty
            # Higher certainty = less variation
            max_variation = 3.0 * (1.0 - certainty)
            
            # Generate random variation within the max range
            variation = np.random.uniform(-max_variation, max_variation)
            
            # Calculate realized position
            realized_position = horse['EPF'] + variation
            
            horse_positions.append({
                'Horse': horse['Horse'],
                'EPF': horse['EPF'],
                'RealizedPosition': realized_position
            })
        
        # Sort horses by realized position (ascending)
        sorted_horses = sorted(horse_positions, key=lambda x: x['RealizedPosition'])
        
        # Extract the running order
        running_order = [horse['Horse'] for horse in sorted_horses]
        all_orders.append(running_order)
    
    new_state['simulation_results'] = all_orders
    print(f"Completed {num_simulations} Monte Carlo simulations.")
    
    return new_state
  7. Analyzing Simulation Results:
    • All simulated outcomes are aggregated, identifying the most frequent early running orders.
    • The method calculates the probability of each horse occupying each possible early position, delivering clear insights into each horse's likely early placement.
def analyze_simulation_results(state: dict) -> dict:
    """ Analyze the results of Monte Carlo simulations. """
    
    new_state = state.copy()
    all_orders = new_state['simulation_results']
    
    # Count frequency of each running order
    order_counter = Counter(tuple(order) for order in all_orders)
    total_simulations = len(all_orders)
    
    # Convert to probability and sort by frequency
    order_probabilities = [
        {
            'Running Order': order,
            'Count': count,
            'Probability': count / total_simulations
        }
        for order, count in order_counter.items()
    ]
    
    # Sort by probability (descending)
    order_probabilities.sort(key=lambda x: x['Probability'], reverse=True)
    
    # Take top N most likely running orders
    top_n = min(10, len(order_probabilities))
    top_orders = order_probabilities[:top_n]
    
    # Also analyze position probabilities for each horse
    horse_positions = {horse['Horse']: [0] * len(new_state['horses_data']) for horse in new_state['horses_data']}
    
    for order in all_orders:
        for position, horse in enumerate(order):
            horse_positions[horse][position] += 1
    
    # Convert to probabilities
    for horse in horse_positions:
        for position in range(len(horse_positions[horse])):
            horse_positions[horse][position] /= total_simulations
    
    # Store results
    new_state['top_running_orders'] = top_orders
    new_state['horse_position_probabilities'] = horse_positions
    
    print(f"Analyzed simulation results. Found {len(order_probabilities)} unique running orders.")
    
    return new_state
  8. Early Position Probabilities:
    Ultimately, we display the detailed position probabilities for each horse in a comprehensive table, as shown in the third image below. The green highlighted values indicate the most likely position for each horse. For example, horse "India" has a 0.757 (75.7%) probability of taking the first position (EP-1), while "Golf" has a 0.4386 (43.86%) probability of starting in position 9 (EP-9).
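As a minimal sketch (assumed, not the exact code behind the table shown in the image), the horse_position_probabilities dictionary built above can be pivoted into such a table with pandas; the green highlighting step is omitted:

import pandas as pd

def tabulate_position_probabilities(state: dict) -> pd.DataFrame:
    """Pivot horse_position_probabilities into a horses-by-positions table."""
    probs = state['horse_position_probabilities']
    n_positions = len(next(iter(probs.values())))
    # Columns EP-1 ... EP-n, one row per horse
    table = pd.DataFrame(
        probs,
        index=[f"EP-{i + 1}" for i in range(n_positions)]
    ).T
    # Most likely early position per horse (the values highlighted green in the image)
    table['MostLikely'] = table.idxmax(axis=1)
    return table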

Practical Applications

These probability distributions can be extremely valuable for handicappers and bettors who consider race dynamics in their analysis. For instance:

  • Identifying potential pace scenarios to spot races likely to favor frontrunners or closers
  • Finding horses that might be disadvantaged by their running style given the expected pace
  • Looking for overlays where a horse's tactical position might give it an advantage not fully reflected in its odds
  • Constructing exotic wagers (exactas, trifectas) based on likely early running positions

As ever, the code snippets are only a starting point for your own explorations. Tread carefully.

Enjoy!


Note: The final draft of this post was sanity checked by ChatGPT.

Monday, April 21, 2025

Expected Value and Likely Profit II: Different Playbooks

WCMI Expected Value and Likely Profit II: Different Playbooks

'EV' vs. 'EV+LP'

For the Bookmaker, EV adds up. For the Bettor, EV+LP goes forth and multiplies!

In our original post, we introduced Likely Profit (LP) as a complementary metric to Expected Value (EV). To recap clearly:

  • Expected Value (EV) measures the average profit per bet. It is an additive metric, suited to the bookmaker's scenario of multiple parallel bets.
  • Likely Profit (LP) measures the expected geometric (logarithmic) growth rate of our bankroll. It is a multiplicative metric, reflecting the bettor's sequential reality and limited bankroll.

Mathematically, these metrics are defined as follows:

\begin{align} \tag{a} EV = (WB \times P) + (LB \times (1 - P)) - 1 \end{align}

\begin{align} \tag{b} LP = (WB^{P} \times LB^{(1 - P)}) - 1 \end{align}

Where:

  • WB (Win Balance) is the bankroll multiplier if the bet wins: WB = 1 + (F * (O - 1)).
  • LB (Loss Balance) is the bankroll multiplier if the bet loses: LB = 1 - F.
  • F is the stake (as a fraction of the bankroll).
  • O is the decimal odds.
  • P is the probability of winning.
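To make formulas (a) and (b) concrete, here is a minimal Python sketch (the helper names are ours, chosen for this illustration):

def win_loss_balances(F: float, O: float) -> tuple[float, float]:
    """Bankroll multipliers for a stake fraction F at decimal odds O."""
    WB = 1 + F * (O - 1)  # balance multiplier if the bet wins
    LB = 1 - F            # balance multiplier if the bet loses
    return WB, LB

def expected_value(WB: float, LB: float, P: float) -> float:
    """Formula (a): additive expected value per unit of bankroll."""
    return (WB * P) + (LB * (1 - P)) - 1

def likely_profit(WB: float, LB: float, P: float) -> float:
    """Formula (b): expected geometric growth rate per bet."""
    return (WB ** P) * (LB ** (1 - P)) - 1

# Arbitrary illustrative numbers: 5% stake at decimal odds 3.0, 40% win probability
WB, LB = win_loss_balances(0.05, 3.0)
print(round(expected_value(WB, LB, 0.40), 4))  # ≈ 0.01 (+1% of bankroll per bet)
print(round(likely_profit(WB, LB, 0.40), 4))   # ≈ 0.0074 (+0.74% growth per bet)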

Bookmaker's Additive World: Why EV is King

Take a small edge from every bet and let volume do the rest.

Bookmakers handle hundreds or thousands of bets simultaneously, each priced with a built-in margin ("vig"), ensuring each bet has a positive expected value from the bookmaker's perspective. With effectively unlimited bankroll and diversified action, bookmakers invoke the law of large numbers, ensuring actual profits converge closely to the expected profits.

In the bookmaker's additive world, EV is the definitive metric because:

  • Volume smooths out variance, making EV directly translate into predictable profits.
  • Bankroll size is enormous, meaning the bookmaker does not worry about short-term fluctuations from individual outcomes.
  • Profitability is guaranteed over time, given a positive EV across many bets.

Thus, bookmakers focus exclusively on EV, setting odds to guarantee their additive advantage. They care about aggregate profit, not about individual outcomes.

Bettor's Sequential Reality: Why EV Needs LP

Maximize bankroll growth, not just average payout.

Now, switch seats to a bettor's perspective. Unlike bookmakers, bettors cannot make thousands of simultaneous bets. Instead, bettors sequentially place bets over time with limited bankrolls. Each bet outcome directly influences future betting capacity, making their reality multiplicative rather than additive. The multiplicative effect means that the order and size of wins and losses matter a lot.

Here is where Likely Profit (LP) becomes essential:

  • LP measures the expected geometric (logarithmic) growth rate of our bankroll.
  • LP is closely related to the Kelly criterion, well-known in finance and betting for maximizing long-run bankroll growth.
  • LP effectively recognises that bankroll growth is multiplicative, and thus variance and bet sizing matter greatly for survival and long-term wealth accumulation.

While EV alone indicates average profit per bet, LP indicates how our bankroll is expected to compound over time. A bet with positive EV but negative LP could lead to significant bankroll volatility, potentially resulting in ruin before the theoretical EV manifests.

Practical Example: EV vs EV+LP in Action

Scenario: We have a $1,000 bankroll and have decided to stake $100 (10% of the bankroll) on one of two bets. Both bets have the same EV (+$10) but differ significantly in risk profile:

  • Bet X (High-risk, high-reward bet):

    • Odds: +1000 (decimal 11.0)
    • Our estimated true probability: 10% (the offered odds imply 9.1%)
    • Crucially, LP is negative, indicating expected bankroll shrinkage over repeated bets of this type at this stake fraction.
  • Bet Y (Lower-risk, moderate-reward bet):

    • Odds: +100 (decimal 2.0)
    • Our estimated true probability: 55% (fair odds of roughly -122, versus the +100 on offer)
    • LP is positive, indicating expected bankroll growth over repeated sequential bets of this type.

Both bets have identical EV, yet Bet Y clearly outshines Bet X when evaluated holistically using EV+LP. Bet Y's positive LP means it's better suited for sustainable bankroll growth, lower variance, and greater certainty of survival.
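A short, self-contained sketch (variable names are ours) reproduces the contrast under the stated assumptions of a 10% stake fraction on a $1,000 bankroll:

for label, odds, prob in [("Bet X", 11.0, 0.10), ("Bet Y", 2.0, 0.55)]:
    F = 0.10                                     # stake as a fraction of bankroll
    WB, LB = 1 + F * (odds - 1), 1 - F           # win/loss balance multipliers
    ev = (WB * prob) + (LB * (1 - prob)) - 1     # formula (a)
    lp = (WB ** prob) * (LB ** (1 - prob)) - 1   # formula (b)
    print(f"{label}: EV = {ev * 1000:+.0f} USD, LP = {lp:+.2%} per bet")

# Approximate output:
# Bet X: EV = +10 USD, LP = -2.52% per bet
# Bet Y: EV = +10 USD, LP = +0.50% per bet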

This example starkly highlights that as a bettor, we must consider LP as well as EV to intelligently balance profit potential, risk, and bankroll longevity.

Different Goals, Different Metrics

Bookmakers and bettors have fundamentally different goals and constraints:

  • Bookmakers operate in an additive world with massive diversity and bankroll, making EV sufficient.
  • Bettors face multiplicative outcomes with limited bankrolls, making EV necessary but insufficient. They must also consider LP to ensure survival through inevitable volatility and to maximize bankroll growth.

Thus, the metrics they optimize diverge:

Perspective | Metric                   | Goal
Bookmaker   | EV (additive)            | Maximize total profit
Bettor      | EV + LP (multiplicative) | Maximize long-term bankroll growth

Playing the Correct Perspective

In sum, to succeed as a bettor, we must combine the mathematical rigor of EV identification with the strategic prudence of LP-based bankroll management. Selecting only positive-EV bets is necessary but not sufficient. We must also size our bets to ensure positive LP, aligning our strategy with geometric bankroll growth and survival.

In short:

  • EV answers: "How much do I win on average per bet?"
  • LP answers: "How will my bankroll realistically grow over many sequential bets?"

A savvy bettor understands both and uses them in tandem, ensuring a profitable journey not just theoretically, but practically.

By clearly understanding these dual metrics, we can meaningfully improve our betting strategy, aligning our actions with our real-world constraints and maximizing our probability of success in the long run.

Enjoy!


Note: The final draft of this post was sanity checked by ChatGPT.

Wednesday, March 12, 2025

Cheltenham (2025-03-13) - Selections

WCMI Cheltenham Live-Longshot Selections (Gallop Poll)

Cheltenham (2025-03-13) - Selections


Throwing caution to the wind and letting our private LLM bot loose on the hallowed Cheltenham turf to see how it performs. It made the following selections, and we have not curated its choices or supporting evidence.

Tread carefully!

Gallop Poll Selections and Supporting Evidence


Result:


1. Doddiethegreat, 27.0

2. Jeriko Du Reponet, 8.80

Sunday, February 23, 2025

Neigh Sayers and Gallop Polls

WCMI Neigh Sayers and Gallop Polls

Neigh Sayers


Imagine the scenario: "Two horses abreast heading for the final hurdle in a major handicap race of the Cheltenham Festival and one of those contenders ('Neigh Sayer') has £20 each-way of your hard-earned cash pinned on its success at 20/1. You can already anticipate congratulations from your mates on this successful convex bet..." As Weekend Warriors, we all dream of such improbable successes!

Live Longshot:


Now consider asking a ChatGPT-like private LLM bot to select (and justify) those bets for you. For the upcoming Cheltenham Festival, we are hopeful (though not guaranteeing) that we will provide some recommendations (from our Live-Longshot bot) for a few of the more challenging races!

Taster:


Here is a taster...

Enjoy!

Tuesday, December 24, 2024

Ensemble vs Time: Betting Strategy Simulation

WCMI Ensemble vs Time: Betting Strategy Simulation

Ensemble vs Time: Betting Strategy Simulation


Imagine a lively casino brimming with daydreamers hoping to spin meagre coins into great fortunes one bet at a time. Others position themselves on the sidelines, cheerfully tallying the incoming bets. The daydreamers are the players, and the sideliners are the house.


Volatility Drag

In the bustling arena of sports trading, there is a hidden force called volatility drag: the mathematical gap between the arithmetic and geometric averages of returns. It acts like a tax, imposing a lower compound return whenever returns vary over time. The gap between the average outcome a bookmaker tallies (across countless eager bettors) and a single bettor's actual lived experience can become very wide.

To illustrate, we have created a Betting Strategy Simulator that reveals how luck may bolster or batter your bankroll.


Two Sides of the Same Bet

  1. Time Perspective (Median)
    Regard this perspective as the lone trader, placing multiple bets in sequence. Each triumph or tumble weighs heavily on their well-being. Over many tries, their median final bankroll can suffer from volatility drag, meaning a few unlucky tumbles may gouge deeper than occasional victories can heal.
  2. Ensemble Perspective (Mean)
    Consider the bookmaker presiding over a torrent of simultaneous wagers. In that swirling chaos, all results average out. While heartbreak and jubilation strike individuals, the house sees a calm, aggregated mean that, if well-calculated, coasts along in a far more stable fashion than a single bankroll can hope for.

Key Parameters

  • Probability of Winning: A Percentage.
  • Odds: For example, odds of 2.30 yield the stake plus 130% more if we win.
  • Stake: The size of each bet (percentage).
  • Number of Bets: Length of the trader's journey or the repeated steps over which the bookmaker aggregates.
  • Number of Simulations: Number of scenario replications.

Running Simulation

  1. Inputs
    • Enter probability, odds, stake, and so on.
  2. Process
    • Click on Run Simulation to launch the simulation.
  3. Outputs
    • Monte Carlo (Time) Results: The Median Final Bankroll for the solitary trader forging through a sequence of wagers. The geometric rise (or descent!) becomes evident here.
    • Monte Carlo (Ensemble) Results: The Mean Final Bankroll from the vantage of the house. With each bet in parallel, the chaos yields a predictable average - an expected value.

Under the Hood

  • Time: We line up N traders, each living out M consecutive bets. Their final bankrolls vary widely, but the median is the honest sentinel of their fortunes.
  • Ensemble: We replicate N parallel bets for each of M rounds, calculate the mean outcome each time, and watch the bankroll grow in that aggregated manner.
  • Volatility Drag: If there is one lesson to learn, it is that a 50% dip requires a 100% surge to climb out of the pit. The bigger the stake, the more each stumble stings and volatility seldom shows mercy.
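For the curious, a stripped-down sketch of the two perspectives might look like the following (the parameter values are arbitrary, and this is not the simulator's actual code):

import numpy as np

def simulate(prob=0.52, odds=2.0, stake=0.10, n_bets=100, n_sims=10_000, seed=7):
    """Median (time) vs mean (ensemble) final bankrolls, starting from 1.0."""
    rng = np.random.default_rng(seed)
    wins = rng.random((n_sims, n_bets)) < prob                     # True where a bet wins
    multipliers = np.where(wins, 1 + stake * (odds - 1), 1 - stake)
    finals = multipliers.prod(axis=1)                              # each row is one trader's journey
    return np.median(finals), finals.mean()

time_median, ensemble_mean = simulate()
print(f"Time (median trader):    {time_median:.3f}")
print(f"Ensemble (mean outcome): {ensemble_mean:.3f}")
# With these arbitrary parameters the mean ends up well above 1.0
# while the median trader ends up below it: volatility drag in action.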


Reading Results

  1. Single Bettor's Plight
    The median bankroll can unravel if luck sends you through a rough patch. Even with a favourable win probability, sustained drawdowns hurt more than fleeting upticks help.

  2. Parallel Paradise
    The bookmaker's many concurrent bets form a serene environment where the mean bankroll (averaged across countless outcomes) marches forward in lockstep with the basic mathematics of expected value.

  3. Practical Wisdom

    • As a punter, carefully consider the violent power of sequential losses. Bankroll management becomes your shield, lest a string of flops knock you out.
    • As the house or aggregator of bets, relax behind a wide net of players, diluting the wilder swings of misfortune.

Looking Forward

Volatility drag is a subtle and cunning opponent that shrinks big dreams. The simulator reminds us that mean vs. median can diverge drastically. The sports trader sees the ephemeral illusions of large short-term gains, recognising the risk that a run of losses can carve away capital faster than big wins can restore it. Meanwhile, the bookmaker basks in the calm assurance of ensembles: a stable accumulation of profits gleaned from the grand churn of wagers.

If nothing else, remember that early wins are crucial in the time dimension but not so in the Ensemble dimension!

Enjoy!


Note: An LLM generated the first draft of this post based on our simulator code listing.

Wednesday, November 27, 2024

Up-Down Votes To In-The-Money Finishes

WCMI Up-Down Votes To In-The-Money Finishes

Up-Down Votes

In data analysis applied to horse racing (Equus Analytics), we frequently draw inspiration and fresh insights from other disciplines. Today, we are focusing on an adaptation of Evan Miller's Bayesian method for handling up-down vote scenarios in online content ranking. To that end, we have two primary goals:

  • Work with minimal data, and
  • Identify convex bets (live longshots).

It is always intriguing to determine what insights we can infer when working with minimal data (e.g. a horse's lifetime record):

Horse | Runs | Win | Place | Show
Zulu  | 23   | 7   | 5     | 2

This situation frequently arises when trading foreign racing circuits (e.g. Hong Kong, Japan), which have exceptional racing industries but for which we lack detailed past-performance records for most horses.

Bayesian Foundation

At its core, the Bayesian average rating system provides a statistically valid way to rank items based on positive and negative feedback, accounting for uncertainties due to limited data. In the context of up-down votes, it is straightforward: items receive up-votes (successes) and down-votes (failures), and we wish to rank them to balance their average rating with our confidence in that rating.

Translating this to horse racing, we consider:

  • Events Placed: Number of times a horse has finished "in the money" (e.g., first, second, or third).
  • Events Unplaced: Number of times a horse has raced but did not finish "in the money".
  • Time Since Placed/Unplaced: Time elapsed since the horse's last placed or unplaced finish, allowing us to weigh recent performances more heavily than older ones.
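For example, assuming the Win/Place/Show columns of Zulu's lifetime record above count first, second and third-place finishes respectively, the translation is straightforward:

# Zulu's lifetime line from the table above: 23 runs, 7 wins, 5 places, 2 shows
runs, wins, places, shows = 23, 7, 5, 2
events_placed = wins + places + shows     # finished "in the money" 14 times
events_unplaced = runs - events_placed    # finished out of the money 9 times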

Adaptation

The Bayesian model requires a prior belief and adjusts this belief based on new evidence. Here's how we adapt the model:

  1. Prior Beliefs (Pseudo-Events): Assign each horse a baseline level of performance. This prevents horses with very few races from being unfairly ranked at the extremes due to insufficient data.
  2. Updating with New Data: Each horse's actual race outcomes are added to the prior beliefs, giving us the total "events placed" and "events unplaced".
  3. Exponential Decay of Events: To ensure that recent performances have more impact, we apply an exponential decay to the events based on the time since they occurred. This approach mirrors how specific online platforms weigh newer votes more heavily than older ones.
  4. Computing the Bayesian Rating: We calculate the Bayesian average rating (the "sorting criterion") using the beta distribution, which balances the observed data and the uncertainty inherent in limited or decayed data.

Code Walkthrough

Below is an excerpt of the Python implementation. For brevity, we'll focus on the key components.

import numpy as np
import pandas as pd
from scipy.special import betaincinv
import time

def initialize_ratings(state):
    new_state = state.copy()
    ratings_df = pd.DataFrame(new_state['new_ratings'])
    ratings_df.rename(columns={'name': 'entry_id'}, inplace=True)
    # Add prior beliefs (pseudo-events)
    ratings_df['events_placed'] += new_state['pseudo_events_placed']
    ratings_df['events_unplaced'] += new_state['pseudo_events_unplaced']
    # Convert time since last events to absolute times
    current_time = time.time()
    ratings_df['last_placed_time'] = current_time - ratings_df['time_since_placed']
    ratings_df['last_unplaced_time'] = current_time - ratings_df['time_since_unplaced']
    new_state['ratings'] = ratings_df
    return new_state

In 'initialize_ratings', we set up our DataFrame with the horse entries, adjust for prior beliefs, and calculate the timestamps.

def decay_events(state):
    new_state = state.copy()
    current_time = time.time()
    half_life = new_state['half_life']
    ratings_df = new_state['ratings']
    # Calculate decay factors
    ratings_df['decay_factor_placed'] = 2 ** (-(current_time - ratings_df['last_placed_time']) / half_life)
    ratings_df['decay_factor_unplaced'] = 2 ** (-(current_time - ratings_df['last_unplaced_time']) / half_life)
    # Apply decay
    ratings_df['events_placed'] *= ratings_df['decay_factor_placed']
    ratings_df['events_unplaced'] *= ratings_df['decay_factor_unplaced']
    new_state['ratings'] = ratings_df
    return new_state

The 'decay_events' function applies exponential decay to the events. While we lack individual timestamps for all events, using the time since the last events (i.e. placed and unplaced) provides a pragmatic approximation under current data limitations.

def construct_sorting_criterion(state):
    new_state = state.copy()
    loss_multiple = new_state['loss_multiple']
    new_state['ratings']['sorting_criterion'] = new_state['ratings'].apply(
        lambda row: betaincinv(
            row['events_placed'] + 1,
            row['events_unplaced'] + 1,
            1 / (1 + loss_multiple)
        ), axis=1
    )
    return new_state

Here, 'construct_sorting_criterion' computes the Bayesian rating using the inverse incomplete beta function, factoring in our desired level of caution via the 'loss_multiple'.

Worked Example

Let us consider a race with several horses and their performance data:

initial_state = {
    'pseudo_events_placed': 5.0,  # Prior belief of 5 placed events
    'pseudo_events_unplaced': 5.0,  # Prior belief of 5 unplaced events
    'half_life': 3600 * 24 * 7,  # One week in seconds
    'new_ratings': [
        {'name': '1. Alpha', 'events_placed': 13, 'events_unplaced': 36, 'time_since_placed': 5, 'time_since_unplaced': 24},
        {'name': '2. Bravo', 'events_placed': 6, 'events_unplaced': 22, 'time_since_placed': 432, 'time_since_unplaced': 14},
        # Additional horse data...
    ],
    'loss_multiple': 5
}
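To tie the excerpts together, the steps can be chained along the following lines (a sketch; the full listing may differ):

# Hypothetical glue code: run the three steps, then rank by the sorting criterion.
state = initialize_ratings(initial_state)
state = decay_events(state)
state = construct_sorting_criterion(state)

rankings = state['ratings'].sort_values('sorting_criterion', ascending=False)
print(rankings[['entry_id', 'events_placed', 'events_unplaced', 'sorting_criterion']])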

After running the code, we obtain the following rankings:

Entry      | Placed | Unplaced | Ratings | S/P  | F/P
5. Echo    | 12.00  | 16.00    | 0.35    | 20/1 | 3/11
8. Hotel   | 8.00   | 11.00    | 0.32    | 5/1  | 1/11
11. Kilo   | 5.00   | 8.00     | 0.28    |      |
3. Charlie | 11.00  | 22.00    | 0.27    | 9/2  | 2/11
1. Alpha   | 18.00  | 41.00    | 0.25    | 11/8 | 4/11
...        | ...    | ...      | ...     | ...  | ...
  • Ratings: The Bayesian rating computed for each horse.
  • S/P (Starting Price): The odds offered at the start of the race.
  • F/P (Finishing Position): The horse's finishing position in the race.

Results

Notably, four of the top five horses in our rankings secured the first four positions in the race (as indicated in the 'F/P' column). Interestingly, Echo, with a starting price of 20/1, was ranked highest by our model and finished third. This suggests that the Bayesian approach might help identify "live longshots": horses with higher odds that nonetheless have a reasonable chance of performing well.

Limitations

An astute observer might point out that applying decay to the total event count based solely on the time since the last event is not entirely accurate. Ideally, we would decay each event individually based on when it occurred. However, lacking detailed timestamps, our current method provides a reasonable approximation, especially if:

  • Event Timing is Similar Across Entries: If most horses have events spaced similarly over time, the relative decay applied will be consistent.
  • Recent Performance is Indicative: If a horse's most recent performance strongly indicates its current form, weighting events based on the last event may be defensible.

While this is not a perfect solution, the model's success in our worked example suggests it holds practical value.

Power of Bayesian Analysis

This adaptation of the Bayesian average rating system demonstrates that we can extract meaningful insights even from minimal data with some mathematical ingenuity and a pragmatic approach to data limitations. The model doesn't guarantee winners but offers a statistically sound method to rank horses beyond surface-level metrics.

By highlighting horses like Echo, the model can point out potential value bets with favourable odds that may have slipped under the radar of the casual punter.

Moving Forward

For those interested in refining this approach:

  • Gather More Data: To improve the decay function, collect timestamps for individual events.
  • Experiment with Parameters: Adjust the 'half_life' and 'loss_multiple' to see how sensitive the model is to these parameters.
  • Re-Define Priors: Adjust 'pseudo_events_placed' and 'pseudo_events_unplaced' to see how sensitive the model is to these priors.

Conclusion

While we must remain mindful of the model's limitations, this Bayesian approach provides a solid starting point for horse racing analysis with minimal data. It blends statistical rigour with practical application.

As with all forms of betting and analysis, there are no certainties; there are only probabilities. This model helps us navigate those probabilities with more confidence, perhaps shining a light on those "live longshots".

Enjoy!


Note: The first draft of this post was generated by an LLM from our Python code listing and Evan Miller's original article.