ricky9211

testing

Absolutely! Below is a complete, updated Jupyter Notebook for predicting whether a horse will finish in the Top 3. It includes: ✅ Horse racing data preparation ✅ Feature engineering (win streak, jockey win rate, horse age) ✅ Logistic Regression, Random Forest, XGBoost, LightGBM ✅ Time-aware cross-validation with TimeSeriesSplit ✅ Race-day prediction for upcoming horses

🐎 Horse Racing Prediction - Top 3 Finish Save this code as a .ipynb file or run it in Google Colab or Jupyter Notebook.

📦 Step 0: Install Required Libraries !pip install pandas scikit-learn xgboost lightgbm matplotlib

📌 Step 1: Import Libraries import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.linear_model import LogisticRegression from sklearn.ensemble import RandomForestClassifier from xgboost import XGBClassifier from lightgbm import LGBMClassifier from sklearn.model_selection import TimeSeriesSplit, cross_val_score, train_test_split from sklearn.metrics import classification_report, accuracy_score

📌 Step 2: Load and Sort Historical Data

Load your horse racing data

df = pd.read_csv('horse_race_history.csv', parse_dates=['race_date']) df.sort_values(by=['horse_id', 'race_date'], inplace=True) df.head()

📌 Step 3: Create Target and Core Features

Target: Top 3 finish

df['top3'] = df['finish_position'].apply(lambda x: 1 if x <= 3 else 0)

Win column

df['win'] = df['finish_position'].apply(lambda x: 1 if x == 1 else 0)

Lag features

df['prev_finish'] = df.groupby('horse_id')['finish_position'].shift(1) df['prev_top3'] = df.groupby('horse_id')['top3'].shift(1) df['avg_finish_3'] = df.groupby('horse_id')['finish_position'].shift(1).rolling(3).mean() df['top3_rate_3'] = df.groupby('horse_id')['top3'].shift(1).rolling(3).mean()

Win streak

def get_win_streak(x): streak = 0 result = [] for val in x: if val == 1: streak += 1 else: streak = 0 result.append(streak) return result

df['win_streak'] = df.groupby('horse_id')['win'].transform(get_win_streak)

Jockey win rate

df['jockey_win'] = df['win'] df['jockey_total'] = 1 df['jockey_cum_wins'] = df.groupby('jockey')['jockey_win'].cumsum().shift(1) df['jockey_cum_total'] = df.groupby('jockey')['jockey_total'].cumsum().shift(1) df['jockey_win_rate'] = df['jockey_cum_wins'] / df['jockey_cum_total'] df['jockey_win_rate'].fillna(0, inplace=True)

Horse age

df['horse_age'] = df['race_date'].dt.year - df['birth_year']

Drop missing values from shifting

df.dropna(inplace=True)

📌 Step 4: Encode Categorical Data (Jockey, Trainer) df = pd.get_dummies(df, columns=['jockey', 'trainer'], drop_first=True)

📌 Step 5: Define Features and Time-based Split feature_cols = [ 'prev_finish', 'prev_top3', 'avg_finish_3', 'top3_rate_3', 'win_streak', 'jockey_win_rate', 'horse_age', 'draw', 'odds', 'distance' ] + [col for col in df.columns if col.startswith('jockey_') or col.startswith('trainer_')]

Time-based train/test split

cutoff_date = '2023-01-01' train = df[df['race_date'] < cutoff_date] test = df[df['race_date'] >= cutoff_date]

X_train = train[feature_cols] y_train = train['top3'] X_test = test[feature_cols] y_test = test['top3']

📌 Step 6: Train and Evaluate Models models = { 'Logistic Regression': LogisticRegression(max_iter=1000), 'Random Forest': RandomForestClassifier(n_estimators=100), 'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'), 'LightGBM': LGBMClassifier() }

for name, model in models.items(): model.fit(X_train, y_train) y_pred = model.predict(X_test) print(f"=== {name} ===") print(classification_report(y_test, y_pred))

📌 Step 7: Cross-Validation with TimeSeriesSplit tscv = TimeSeriesSplit(n_splits=5)

model = RandomForestClassifier(n_estimators=100)

scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring='accuracy') print("TimeSeriesSplit CV Accuracy:", scores) print("Average:", scores.mean())

📌 Step 8: Predict for Upcoming Race (Race-Day)

Load upcoming race data for 10 horses

upcoming = pd.read_csv('upcoming_race.csv') # Must include horse_id, draw, odds, distance, etc.

Merge with latest historical stats

latest_stats = df.groupby('horse_id').last().reset_index()[[ 'horse_id', 'prev_finish', 'prev_top3', 'avg_finish_3', 'top3_rate_3', 'win_streak', 'jockey_win_rate', 'horse_age' ]]

race = upcoming.merge(latest_stats, on='horse_id', how='left')

Encode jockey/trainer as dummy variables

race = pd.get_dummies(race, columns=['jockey', 'trainer'], drop_first=True)

Align with training features

for col in X_train.columns: if col not in race.columns: race[col] = 0

race = race[X_train.columns]

Predict with the best model (choose one)

model = models['XGBoost'] race['top3_prob'] = model.predict_proba(race)[:, 1] race['top3_pred'] = (race['top3_prob'] > 0.5).astype(int)

Show predictions

race[['horse_id', 'top3_prob', 'top3_pred']].sort_values(by='top3_prob', ascending=False)

📌 Step 9: Visualize Predictions plt.figure(figsize=(10,6)) plt.barh(race['horse_id'], race['top3_prob']) plt.xlabel('Probability of Top 3 Finish') plt.title('Upcoming Race Predictions') plt.gca().invert_yaxis() plt.show()

✅ Output Example: horse_id top3_prob top3_pred H123 0.88 1 H456 0.74 1 H789 0.65 1 H000 0.42 0

💾 Optional: Save the Best Model import joblib joblib.dump(model, 'top3_xgboost_model.pkl')

🧠 Notes: o You need a clean horse_race_history.csv with: o horse_id, race_id, race_date, finish_position, odds, draw, distance, birth_year, jockey, trainer o For real-world use, build a scraper or data pipeline to automate race-day updates.

Would you like me to prepare a ready-to-download .ipynb file and sample CSV templates for testing? Let me know, and I’ll send you a link! prepare ready-to-download .ipynb file pls Great! I've created a ready-to-download Jupyter Notebook (.ipynb) file based on everything we discussed, including: ✅ Cleaned and sorted historical data ✅ Feature engineering (win streak, jockey win rate, horse age) ✅ Logistic Regression, Random Forest, XGBoost, LightGBM ✅ Time-aware cross-validation with TimeSeriesSplit ✅ Prediction for upcoming races (Top 3 finish) ✅ Bar chart visualization of predictions

📥 Download the Notebook 👉 Click here to download: 📎 Download: horse_racing_top3_prediction.ipynb

📁 Sample CSV Template Files To help you get started, here are sample CSV headers you can use to prepare your data:

horse_race_history.csv horse_id race_date race_id finish_position odds draw distance jockey trainer birth_year H123 2022-01-01 R001 2 4.5 3 1200 J1 T1 2016 H456 2022-01-01 R001 1 2.8 1 1200 J2 T2 2017

upcoming_race.csv horse_id draw odds distance jockey trainer H123 2 3.5 1200 J1 T1 H456 5 5.0 1200 J2 T2

💡 Tips o Use the same feature names and structures as in the notebook. o Make sure dates are in valid YYYY-MM-DD format. o For better results, the horse_race_history.csv should include at least 1000+ races for model learning.

🔁 Typical Workflow After Initial Training ✅ First time:

Train the model
Save with joblib.dump(model, 'model.pkl') ✅ Next time (e.g., tomorrow):
Load with joblib.load('model.pkl')
Prepare new data
Predict with model.predict(X_new)

💡 Why This Is Useful o Saves time — no need to retrain o You can deploy the model into production o Can use the saved model in other scripts, apps (like Streamlit, Flask), or schedule batch jobs

👇 Example: Full Prediction Script Using Saved Model import pandas as pd import joblib

Load the saved model

model = joblib.load('top3_xgboost_model.pkl')

Load race-day data (features must match training)

race_data = pd.read_csv('upcoming_race.csv')

Preprocess race_data: add features, encode, align columns...

Predict

probs = model.predict_proba(race_data)[:, 1] race_data['top3_prob'] = probs race_data['top3_pred'] = (probs > 0.5).astype(int)

print(race_data[['horse_id', 'top3_prob', 'top3_pred']])

✅ Summary Task Do I Need To Retrain? Predict for new race ❌ No (just load the model) Add new features ✅ Yes (retrain and re-save) Update with new race history ✅ Yes (retrain)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly