Friday, May 24, 2024

Key Race Analysis: Prediction Or Retrodiction

WCMI Key Race Analysis: Prediction Or Retrodiction

Key Race Analysis (KRA)

Key Race analysis is a handicapping technique used to evaluate the quality of a horse race based on the subsequent performance of its participants in other races. The fundamental idea is that if horses from a particular race perform well above average in their next outings, the original race is considered a "key race," indicating that it was stronger than it initially appeared.

When applied to '2yo' and '3yo' horses, KRA can provide useful insights for several reasons:

  • Development: Young horses are still developing and often show significant improvement or regression. Identifying a strong race may highlight young horses that are progressing well.
  • Potential: Early performances can signal future potential, possibly making it worthwhile to identify which races are producing the most successful horses in subsequent outings.

KRA Process

  1. Identify Race: Start by selecting a race to analyse.
  2. Track Subsequent Performances: Monitor how the horses from that race perform in their next one or two races.
  3. Evaluate Outcomes:
    • Win, Place and Show (WPS): If multiple horses from the race finish 'in the money' in their next races, it suggests the original race might be a candidate key race.
    • Finish Position Rating (FPR): Because FPR takes the number of runners in each subsequent race into account, it gives a better estimate of performance than finishing position alone.
  4. Label Key Race: Once a pattern of strong subsequent performances is established, the original race can be tentatively labelled a "key race."
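Steps 2 and 3 can be sketched in a few lines. This is only an illustration, not part of the script in this post: the helper names finish_position_rating and next_runs, the sample rows, and the "first three home" reading of 'in the money' are assumptions. The FPR formula itself is the one the script uses.

```python
import pandas as pd


def finish_position_rating(finish_position: int, runners: int) -> float:
    """Smoothed share of rivals beaten: (wins + 1) / (wins + losses + 2)."""
    wins = runners - finish_position   # rivals that finished behind
    losses = finish_position - 1       # rivals that finished in front
    return (wins + 1) / (wins + losses + 2)


def next_runs(results: pd.DataFrame, horse: str, after: pd.Timestamp, n: int = 2) -> pd.DataFrame:
    """Return the horse's first n runs strictly after the given datetime."""
    later = results[(results['Horse'] == horse) & (results['RaceDate_RaceTime'] > after)]
    return later.sort_values('RaceDate_RaceTime').head(n)


# Illustrative data only: one horse with three later runs.
results = pd.DataFrame({
    'Horse': ['Alpha', 'Alpha', 'Alpha'],
    'RaceDate_RaceTime': pd.to_datetime(
        ['2024-05-01 14:00', '2024-05-15 15:30', '2024-06-01 16:00']
    ),
    'HorsePosition': [1, 4, 2],
    'RaceCount': [8, 12, 6],
})

# Keep only the next two runs after the race being assessed.
runs = next_runs(results, 'Alpha', pd.Timestamp('2024-04-20'), n=2).copy()
runs['FPR'] = [
    finish_position_rating(position, count)
    for position, count in zip(runs['HorsePosition'], runs['RaceCount'])
]
# Assumption: 'in the money' means finishing in the first three.
runs['InTheMoney'] = runs['HorsePosition'] <= 3
print(runs[['HorsePosition', 'RaceCount', 'FPR', 'InTheMoney']])
```

Limiting the window to the next one or two runs keeps the score tied to form shown soon after the race, rather than to everything a horse does later in its career.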

Retrodiction or Prediction

While KRA can be a helpful approach, there is a danger of it degenerating into retrodiction instead of prediction. Retrodiction refers to using data to explain past events, while prediction involves using data to forecast future events. In the context of Key Race analysis, retrodiction means looking at a race that has already been run and noting that it has since produced multiple winners. While this information can be interesting and potentially useful, it does not by itself help predict the outcome of future races.

Retrodiction can occur for the following reasons:

  • Confirmation Bias: We might focus on races where horses performed well subsequently and ignore those where horses did not, leading to biased conclusions. Elimination of non-contenders is always better than selection of contenders!
  • Overfitting: By looking back at past data, we might identify patterns that are specific to those races but do not generalise well to future races.
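One way to guard against both problems is to label key races using only the runs before a cutoff date, then check how their runners fare afterwards. The sketch below is an assumption-laden illustration rather than anything the script in this post does: holdout_check, the cutoff and the sample rows are hypothetical, and a single split like this is a rough check, not a proper validation.

```python
import pandas as pd


def fpr(position, runners):
    # Same smoothed ratio as the script: (rivals beaten + 1) / (runners + 1).
    return (runners - position + 1) / (runners + 1)


def holdout_check(results: pd.DataFrame, cutoff: pd.Timestamp, top_n: int = 1):
    """Label key races from pre-cutoff runs only, then score their runners post-cutoff."""
    df = results.copy()
    df['FPR'] = fpr(df['HorsePosition'], df['RaceCount'])
    train = df[df['RaceDate_RaceTime'] < cutoff]
    test = df[df['RaceDate_RaceTime'] >= cutoff]

    # Pair every pre-cutoff run with the same horse's later pre-cutoff runs.
    pairs = train.merge(train, on='Horse', suffixes=('', '_next'))
    pairs = pairs[pairs['RaceDate_RaceTime_next'] > pairs['RaceDate_RaceTime']]
    scores = pairs.groupby('RaceSortID')['FPR_next'].mean()
    key_races = scores.sort_values(ascending=False).head(top_n).index

    # Compare post-cutoff runs of key-race horses with everyone else.
    key_horses = set(train.loc[train['RaceSortID'].isin(key_races), 'Horse'])
    from_key = test['Horse'].isin(key_horses)
    return test.loc[from_key, 'FPR'].mean(), test.loc[~from_key, 'FPR'].mean()


# Illustrative rows only: (RaceSortID, datetime, Horse, HorsePosition, RaceCount).
rows = [
    ('R1', '2024-01-01 14:00', 'A', 1, 8), ('R1', '2024-01-01 14:00', 'B', 2, 8),
    ('R2', '2024-01-02 14:00', 'C', 1, 8), ('R2', '2024-01-02 14:00', 'D', 2, 8),
    ('R3', '2024-02-01 14:00', 'A', 1, 10), ('R4', '2024-02-02 14:00', 'B', 2, 10),
    ('R5', '2024-02-03 14:00', 'C', 10, 10), ('R6', '2024-02-04 14:00', 'D', 8, 10),
    ('R7', '2024-03-10 14:00', 'A', 1, 8), ('R7', '2024-03-10 14:00', 'C', 8, 8),
]
results = pd.DataFrame(
    rows, columns=['RaceSortID', 'RaceDate_RaceTime', 'Horse', 'HorsePosition', 'RaceCount']
)
results['RaceDate_RaceTime'] = pd.to_datetime(results['RaceDate_RaceTime'])

key_mean, other_mean = holdout_check(results, pd.Timestamp('2024-03-01'))
print(f'Key-race runners: {key_mean:.2f}  Others: {other_mean:.2f}')
```

If runners from the labelled races do no better after the cutoff than the rest, the "key race" label was probably describing the past rather than predicting anything.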

Key Race analysis can be an effective handicapping approach, especially for assessing the potential of young, developing horses. However, it is essential to apply the method carefully to avoid the trap of retrodiction.

Remember, while KRA can be useful in handicapping, it is just one of many factors to consider when evaluating a horse's potential performance.

Nevertheless, for those among you who might find this particular approach a useful entry point, we provide the following script.

# -*- coding: latin-1 -*-


# Imports --------------------------------------------------------------

import numpy as np
import pandas as pd

from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter
from datetime import date
from functools import reduce
from tabulate import tabulate


# Constants ------------------------------------------------------------

__date__ = str(date.today())

__title__ = 'Key Race Analysis'
__author__ = '(c) 2024, matekus. All Rights Reserved.'
__version__ = '1.0.5.'
__address__ = '[ http://vendire-ludorum.blogspot.com/ ]'


# Functions ------------------------------------------------------------

# Read and preprocess results.
def read_results(state: dict) -> dict:
    new_state = state.copy()
    # Read results from CSV
    results = pd.read_csv(
        new_state['input_results_path'],
        sep=new_state['input_results_separator'],
        encoding=new_state['input_results_encoding'],
    )
    # Combine and convert 'RaceDate' and 'RaceTime' into single datetime column
    results['RaceDate_RaceTime'] = pd.to_datetime(
        results['RaceDate'] + ' ' + results['RaceTime'],
        dayfirst=True,
        format='%d/%m/%Y %H-%M'  # Adjust this format to match your data if necessary
    )
    # Ensure data is sorted by datetime for subsequent processing
    results.sort_values('RaceDate_RaceTime', inplace=True)
    # Return updated state with preprocessed 'results'
    new_state['results'] = results
    return new_state

# Read and preprocess racecards.
def read_racecards(state: dict) -> dict:
    new_state = state.copy()
    # Read racecards from CSV
    racecards = pd.read_csv(
        new_state['input_racecards_path'],
        sep=new_state['input_racecards_separator'],
        encoding=new_state['input_racecards_encoding'],
    )
    # Using regular expression to remove country codes in parentheses
    racecards['Horse'] = racecards['Horse'].str.replace(r'\s*\([^)]*\)', '', regex=True)
    # Return updated state with preprocessed 'racecards'
    new_state['racecards'] = racecards
    return new_state

# Clean up finish positions.
def clean_finish_positions(state: dict) -> dict:
    new_state = state.copy()
    # Work on a copy so the assignments below do not raise SettingWithCopyWarning
    df = new_state['results'].copy()
    # Compare as strings so rows coded 99 are removed whether 'HorseFP' was read as integers or text
    df['HorseFP'] = df['HorseFP'].astype(str).str.strip()
    df = df[df['HorseFP'] != '99'].copy()
    # Strip trailing 'D' from otherwise numeric finish positions (e.g. '2D' -> '2')
    df['HorsePosition'] = df['HorseFP'].apply(lambda x: x[:-1] if x.endswith('D') and x[:-1].isdigit() else x)
    # If 'HorsePosition' contains only letters, replace it with the corresponding value in
    # 'RaceCount' column, i.e. treat the horse as having finished last.
    is_alpha = df['HorsePosition'].str.isalpha()
    df.loc[is_alpha, 'HorsePosition'] = df.loc[is_alpha, 'RaceCount']
    df['HorsePosition'] = df['HorsePosition'].astype(int)
    new_state['results'] = df
    return new_state

# Calculate subsequent performances.
def calculate_subsequent_performance(state: dict) -> dict:
    new_state = state.copy()
    df = new_state['results']
    # Get all unique races
    unique_races = df[['RaceDate', 'RaceTime', 'RaceCourse', 'RaceSortID']].drop_duplicates()
    race_scores = []
    for _, race in unique_races.iterrows():
        race_date, race_time, race_course, race_sort_id = race['RaceDate'], race['RaceTime'], race['RaceCourse'], race['RaceSortID']
        # Filter current race
        current_race_df = df[
            (df['RaceDate'] == race_date) &
            (df['RaceTime'] == race_time) &
            (df['RaceCourse'] == race_course)
        ]
        subsequent_performance = []
        for _, horse in current_race_df.iterrows():
            horse_name = horse['Horse']
            race_date_time = horse['RaceDate_RaceTime']
            # Filter subsequent races for current horse (every later run, not only the next one or two)
            subsequent_races = df[
                (df['Horse'] == horse_name) &
                (df['RaceDate_RaceTime'] > race_date_time)
            ]
            for _, subsequent_race in subsequent_races.iterrows():
                finish_position = subsequent_race['HorsePosition']
                race_count = subsequent_race['RaceCount']
                if race_count > 0:
                    # Calculate FPR (https://vendire-ludorum.blogspot.com/2016/10/juvenile-finish-position-ratings.html)
                    wins = race_count - finish_position
                    losses = ((race_count - 1) - wins)
                    fpr = (wins + 1) / (wins + losses + 2)
                    subsequent_performance.append(fpr)
        if subsequent_performance:
            score = np.mean(subsequent_performance)
        else:
            score = 0
        race_scores.append({
            'RaceDate': race_date,
            'RaceTime': race_time,
            'RaceCourse': race_course,
            'RaceSortID': race_sort_id,
            'Score': score
        })
    # Create DataFrame for race scores
    race_scores_df = pd.DataFrame(race_scores)
    # Sort by score descending and take top N races
    top_races = race_scores_df.sort_values(by='Score', ascending=False).head(new_state['top_races_count'])
    new_state['top_races'] = top_races
    return new_state

# Filter top runners.
def filter_top_runners(state: dict) -> dict:
    new_state = state.copy()
    results = new_state['results']
    racecards = new_state['racecards']
    top_races = new_state['top_races']
    # Collect runners from top races, each tagged with its RaceSortID
    top_runners = []
    for _, top_race in top_races.iterrows():
        race_date, race_time, race_course, race_sort_id = top_race['RaceDate'], top_race['RaceTime'], top_race['RaceCourse'], top_race['RaceSortID']
        # Filter top race
        top_race_runners = results[
            (results['RaceDate'] == race_date) &
            (results['RaceTime'] == race_time) &
            (results['RaceCourse'] == race_course)
        ]
        for horse in top_race_runners['Horse'].unique():
            top_runners.append({
                'Horse': horse,
                'RaceSortID': race_sort_id
            })
    # Filter racecards for today's runners that are in list of top runners
    today_date = date.today().strftime('%d/%m/%Y')
    todays_racecards = racecards[racecards['RaceDate'] == today_date]
    top_runners_today = todays_racecards[todays_racecards['Horse'].isin([runner['Horse'] for runner in top_runners])]
    # Merge RaceSortID information with today's runners
    top_runners_today = top_runners_today.merge(
        pd.DataFrame(top_runners), on='Horse', how='left'
    )
    # Select relevant columns to include in output
    top_runners_today = top_runners_today[['RaceDate', 'RaceTime', 'RaceCourse', 'Horse', 'HorseML', 'RaceSortID']]
    # Convert to list of dictionaries for easy tabulation
    top_runners_list = top_runners_today.to_dict(orient='records')
    new_state['top_runners'] = top_runners_list
    return new_state

# Print top races.
def print_top_races(state: dict) -> dict:
    new_state = state.copy()
    top_races = new_state['top_races']
    # Set index to 'RaceSortID' for better readability.
    top_races.set_index('RaceSortID', inplace=True)
    # Print top races.
    print(tabulate(top_races, 
                    headers='keys', 
                    tablefmt='grid',
                    colalign=['left','left','left','left','right'],
                    floatfmt='.2f'))
    top_races.reset_index(inplace=True)
    return new_state

# Print top runners.
def print_top_runners(state: dict) -> dict:
    new_state = state.copy()
    top_runners = new_state['top_runners']
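    # Convert 'top_runners' to DataFrame
    top_runners_df = pd.DataFrame(top_runners)
    # Nothing to sort or print if none of today's runners come from a top race
    if top_runners_df.empty:
        print()
        print('No runners from top races today.')
        return new_state
    # Sort 'top_runners_df' by 'RaceTime' in ascending order
    top_runners_df = top_runners_df.sort_values('RaceTime', ascending=True)
    # Set index to 'Horse' for better readability.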
    top_runners_df.set_index('Horse', inplace=True)
    # Print top runners.
    print()
    print(tabulate(top_runners_df, 
                    headers='keys', 
                    tablefmt='psql', 
                    floatfmt='.2f'))
    top_runners_df.reset_index(inplace=True)
    return new_state


# Main -----------------------------------------------------------------

def main() -> None:
    print('')
    print(__title__)
    print(__author__)
    print(__address__)
    print('')
    print('[', __date__, ']')
    print('')
    # Define pipeline with updated steps
    pipeline = [
        read_results,
        read_racecards,
        clean_finish_positions,
        calculate_subsequent_performance,
        print_top_races,
        filter_top_runners,
        print_top_runners,
    ]
    # Define initial state.
    initial_state = {
        'input_results_path': r'Results.csv',
        'input_results_separator': ';',
        'input_results_encoding': 'latin-1',
        'input_racecards_path': r'Racecards.csv',
        'input_racecards_separator': ';',
        'input_racecards_encoding': 'latin-1',
        'top_races_count': 5
    }
    # Run pipeline.
    _ = reduce(lambda v, f: f(v), pipeline, initial_state)
    print('')
    print('Fini!')
    print('')


# Start ----------------------------------------------------------------

if __name__ == '__main__':
    main()

# EOF.

Output

As ever, the script has little or no "errr0r" handling and is only a starting point for your own explorations. Enjoy!
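For anyone preparing their own data, here is a rough sketch of the layout the script appears to expect from Results.csv, inferred from the columns it reads. The rows, the course name and the 'PU' code are made up for illustration. Racecards.csv is read the same way and needs 'RaceDate', 'RaceTime', 'RaceCourse', 'Horse' and 'HorseML'.

```python
import io

import pandas as pd

# Hypothetical rows shaped like the columns the script reads from Results.csv.
sample_results = io.StringIO(
    "RaceDate;RaceTime;RaceCourse;RaceSortID;RaceCount;Horse;HorseFP\n"
    "01/05/2024;14-30;Ascot;1;3;Alpha;1\n"
    "01/05/2024;14-30;Ascot;1;3;Bravo;2D\n"
    "01/05/2024;14-30;Ascot;1;3;Charlie;PU\n"
)

# Semicolon-separated, as in the script's initial state.
results = pd.read_csv(sample_results, sep=';')

# Dates are day-first and times use a hyphen, matching the script's format string.
results['RaceDate_RaceTime'] = pd.to_datetime(
    results['RaceDate'] + ' ' + results['RaceTime'],
    format='%d/%m/%Y %H-%M',
)
print(results[['Horse', 'HorseFP', 'RaceCount', 'RaceDate_RaceTime']])
```

If your dates or times are written differently, the format string in read_results is the first thing to adjust.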