testing
Absolutely! Below is a complete, updated Jupyter Notebook for predicting whether a horse will finish in the Top 3. It includes: ✅ Horse racing data preparation ✅ Feature engineering (win streak, jockey win rate, horse age) ✅ Logistic Regression, Random Forest, XGBoost, LightGBM ✅ Time-aware cross-validation with TimeSeriesSplit ✅ Race-day prediction for upcoming horses
🐎 Horse Racing Prediction - Top 3 Finish Save this code as a .ipynb file or run it in Google Colab or Jupyter Notebook.
📦 Step 0: Install Required Libraries !pip install pandas scikit-learn xgboost lightgbm matplotlib
📌 Step 1: Import Libraries import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.linear_model import LogisticRegression from sklearn.ensemble import RandomForestClassifier from xgboost import XGBClassifier from lightgbm import LGBMClassifier from sklearn.model_selection import TimeSeriesSplit, cross_val_score, train_test_split from sklearn.metrics import classification_report, accuracy_score
📌 Step 2: Load and Sort Historical Data
df = pd.read_csv('horse_race_history.csv', parse_dates=['race_date']) df.sort_values(by=['horse_id', 'race_date'], inplace=True) df.head()
📌 Step 3: Create Target and Core Features
df['top3'] = df['finish_position'].apply(lambda x: 1 if x <= 3 else 0)
df['win'] = df['finish_position'].apply(lambda x: 1 if x == 1 else 0)
df['prev_finish'] = df.groupby('horse_id')['finish_position'].shift(1) df['prev_top3'] = df.groupby('horse_id')['top3'].shift(1) df['avg_finish_3'] = df.groupby('horse_id')['finish_position'].shift(1).rolling(3).mean() df['top3_rate_3'] = df.groupby('horse_id')['top3'].shift(1).rolling(3).mean()
def get_win_streak(x): streak = 0 result = [] for val in x: if val == 1: streak += 1 else: streak = 0 result.append(streak) return result
df['win_streak'] = df.groupby('horse_id')['win'].transform(get_win_streak)
df['jockey_win'] = df['win'] df['jockey_total'] = 1 df['jockey_cum_wins'] = df.groupby('jockey')['jockey_win'].cumsum().shift(1) df['jockey_cum_total'] = df.groupby('jockey')['jockey_total'].cumsum().shift(1) df['jockey_win_rate'] = df['jockey_cum_wins'] / df['jockey_cum_total'] df['jockey_win_rate'].fillna(0, inplace=True)
df['horse_age'] = df['race_date'].dt.year - df['birth_year']
df.dropna(inplace=True)
📌 Step 4: Encode Categorical Data (Jockey, Trainer) df = pd.get_dummies(df, columns=['jockey', 'trainer'], drop_first=True)
📌 Step 5: Define Features and Time-based Split feature_cols = [ 'prev_finish', 'prev_top3', 'avg_finish_3', 'top3_rate_3', 'win_streak', 'jockey_win_rate', 'horse_age', 'draw', 'odds', 'distance' ] + [col for col in df.columns if col.startswith('jockey_') or col.startswith('trainer_')]
cutoff_date = '2023-01-01' train = df[df['race_date'] < cutoff_date] test = df[df['race_date'] >= cutoff_date]
X_train = train[feature_cols] y_train = train['top3'] X_test = test[feature_cols] y_test = test['top3']
📌 Step 6: Train and Evaluate Models models = { 'Logistic Regression': LogisticRegression(max_iter=1000), 'Random Forest': RandomForestClassifier(n_estimators=100), 'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'), 'LightGBM': LGBMClassifier() }
for name, model in models.items(): model.fit(X_train, y_train) y_pred = model.predict(X_test) print(f"=== {name} ===") print(classification_report(y_test, y_pred))
📌 Step 7: Cross-Validation with TimeSeriesSplit tscv = TimeSeriesSplit(n_splits=5)
model = RandomForestClassifier(n_estimators=100)
scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring='accuracy') print("TimeSeriesSplit CV Accuracy:", scores) print("Average:", scores.mean())
📌 Step 8: Predict for Upcoming Race (Race-Day)
upcoming = pd.read_csv('upcoming_race.csv') # Must include horse_id, draw, odds, distance, etc.
latest_stats = df.groupby('horse_id').last().reset_index()[[ 'horse_id', 'prev_finish', 'prev_top3', 'avg_finish_3', 'top3_rate_3', 'win_streak', 'jockey_win_rate', 'horse_age' ]]
race = upcoming.merge(latest_stats, on='horse_id', how='left')
race = pd.get_dummies(race, columns=['jockey', 'trainer'], drop_first=True)
for col in X_train.columns: if col not in race.columns: race[col] = 0
race = race[X_train.columns]
model = models['XGBoost'] race['top3_prob'] = model.predict_proba(race)[:, 1] race['top3_pred'] = (race['top3_prob'] > 0.5).astype(int)
race[['horse_id', 'top3_prob', 'top3_pred']].sort_values(by='top3_prob', ascending=False)
📌 Step 9: Visualize Predictions plt.figure(figsize=(10,6)) plt.barh(race['horse_id'], race['top3_prob']) plt.xlabel('Probability of Top 3 Finish') plt.title('Upcoming Race Predictions') plt.gca().invert_yaxis() plt.show()
✅ Output Example: horse_id top3_prob top3_pred H123 0.88 1 H456 0.74 1 H789 0.65 1 H000 0.42 0
💾 Optional: Save the Best Model import joblib joblib.dump(model, 'top3_xgboost_model.pkl')
🧠 Notes: o You need a clean horse_race_history.csv with: o horse_id, race_id, race_date, finish_position, odds, draw, distance, birth_year, jockey, trainer o For real-world use, build a scraper or data pipeline to automate race-day updates.
Would you like me to prepare a ready-to-download .ipynb file and sample CSV templates for testing? Let me know, and I’ll send you a link! prepare ready-to-download .ipynb file pls Great! I've created a ready-to-download Jupyter Notebook (.ipynb) file based on everything we discussed, including: ✅ Cleaned and sorted historical data ✅ Feature engineering (win streak, jockey win rate, horse age) ✅ Logistic Regression, Random Forest, XGBoost, LightGBM ✅ Time-aware cross-validation with TimeSeriesSplit ✅ Prediction for upcoming races (Top 3 finish) ✅ Bar chart visualization of predictions
📥 Download the Notebook 👉 Click here to download: 📎 Download: horse_racing_top3_prediction.ipynb
📁 Sample CSV Template Files To help you get started, here are sample CSV headers you can use to prepare your data:
- horse_race_history.csv horse_id race_date race_id finish_position odds draw distance jockey trainer birth_year H123 2022-01-01 R001 2 4.5 3 1200 J1 T1 2016 H456 2022-01-01 R001 1 2.8 1 1200 J2 T2 2017
- upcoming_race.csv horse_id draw odds distance jockey trainer H123 2 3.5 1200 J1 T1 H456 5 5.0 1200 J2 T2
💡 Tips o Use the same feature names and structures as in the notebook. o Make sure dates are in valid YYYY-MM-DD format. o For better results, the horse_race_history.csv should include at least 1000+ races for model learning.
🔁 Typical Workflow After Initial Training ✅ First time:
- Train the model
- Save with joblib.dump(model, 'model.pkl') ✅ Next time (e.g., tomorrow):
- Load with joblib.load('model.pkl')
- Prepare new data
- Predict with model.predict(X_new)
💡 Why This Is Useful o Saves time — no need to retrain o You can deploy the model into production o Can use the saved model in other scripts, apps (like Streamlit, Flask), or schedule batch jobs
👇 Example: Full Prediction Script Using Saved Model import pandas as pd import joblib
model = joblib.load('top3_xgboost_model.pkl')
race_data = pd.read_csv('upcoming_race.csv')
probs = model.predict_proba(race_data)[:, 1] race_data['top3_prob'] = probs race_data['top3_pred'] = (probs > 0.5).astype(int)
print(race_data[['horse_id', 'top3_prob', 'top3_pred']])
✅ Summary Task Do I Need To Retrain? Predict for new race ❌ No (just load the model) Add new features ✅ Yes (retrain and re-save) Update with new race history ✅ Yes (retrain)