8000 ricky9211 · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
View ricky9211's full-sized avatar

Block or report ricky9211

Block user

Prevent this user from interacting with your repositories and sending you notifications. Learn more about blocking users.

You must be logged in to block users.

Please don't include any personal information such as legal names or email addresses. Maximum 100 characters, markdown supported. This note will be visible to only you.
Report abuse

Contact GitHub support about this user’s behavior. Learn more about reporting abuse.

Report abuse
ricky9211/readme.md

testing

Absolutely! Below is a complete, updated Jupyter Notebook for predicting whether a horse will finish in the Top 3. It includes: ✅ Horse racing data preparation ✅ Feature engineering (win streak, jockey win rate, horse age) ✅ Logistic Regression, Random Forest, XGBoost, LightGBM ✅ Time-aware cross-validation with TimeSeriesSplit ✅ Race-day prediction for upcoming horses


🐎 Horse Racing Prediction - Top 3 Finish Save this code as a .ipynb file or run it in Google Colab or Jupyter Notebook.


📦 Step 0: Install Required Libraries !pip install pandas scikit-learn xgboost lightgbm matplotlib


📌 Step 1: Import Libraries import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.linear_model import LogisticRegression from sklearn.ensemble import RandomForestClassifier from xgboost import XGBClassifier from lightgbm import LGBMClassifier from sklearn.model_selection import TimeSeriesSplit, cross_val_score, train_test_split from sklearn.metrics import classification_report, accuracy_score


📌 Step 2: Load and Sort Historical Data

Load your horse racing data

df = pd.read_csv('horse_race_history.csv', parse_dates=['race_date']) df.sort_values(by=['horse_id', 'race_date'], inplace=True) df.head()


📌 Step 3: Create Target and Core Features

Target: Top 3 finish

df['top3'] = df['finish_position'].apply(lambda x: 1 if x <= 3 else 0)

Win column

df['win'] = df['finish_position'].apply(lambda x: 1 if x == 1 else 0)

Lag features

df['prev_finish'] = df.groupby('horse_id')['finish_position'].shift(1) df['prev_top3'] = df.groupby('horse_id')['top3'].shift(1) df['avg_finish_3'] = df.groupby('horse_id')['finish_position'].shift(1).rolling(3).mean() df['top3_rate_3'] = df.groupby('horse_id')['top3'].shift(1).rolling(3).mean()

Win streak

def get_win_streak(x): streak = 0 result = [] for val in x: if val == 1: streak += 1 else: streak = 0 result.append(streak) return result

df['win_streak'] = df.groupby('horse_id')['win'].transform(get_win_streak)

Jockey win rate

df['jockey_win'] = df['win'] df['jockey_total'] = 1 df['jockey_cum_wins'] = df.groupby('jockey')['jockey_win'].cumsum().shift(1) df['jockey_cum_total'] = df.groupby('jockey')['jockey_total'].cumsum().shift(1) df['jockey_win_rate'] = df['jockey_cum_wins'] / df['jockey_cum_total'] df['jockey_win_rate'].fillna(0, inplace=True)

Horse age

df['horse_age'] = df['race_date'].dt.year - df['birth_year']

Drop missing values from shifting

df.dropna(inplace=True)


📌 Step 4: Encode Categorical Data (Jockey, Trainer) df = pd.get_dummies(df, columns=['jockey', 'trainer'], drop_first=True)


📌 Step 5: Define Features and Time-based Split feature_cols = [ 'prev_finish', 'prev_top3', 'avg_finish_3', 'top3_rate_3', 'win_streak', 'jockey_win_rate', 'horse_age', 'draw', 'odds', 'distance' ] + [col for col in df.columns if col.startswith('jockey_') or col.startswith('trainer_')]

Time-based train/test split

cutoff_date = '2023-01-01' train = df[df['race_date'] < cutoff_date] test = df[df['race_date'] >= cutoff_date]

X_train = train[feature_cols] y_train = train['top3'] X_test = test[feature_cols] y_test = test['top3']


📌 Step 6: Train and Evaluate Models models = { 'Logistic Regression': LogisticRegression(max_iter=1000), 'Random Forest': RandomForestClassifier(n_estimators=100), 'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss'), 'LightGBM': LGBMClassifier() }

for name, model in models.items(): model.fit(X_train, y_train) y_pred = model.predict(X_test) print(f"=== {name} ===") print(classification_report(y_test, y_pred))


📌 Step 7: Cross-Validation with TimeSeriesSplit tscv = TimeSeriesSplit(n_splits=5)

model = RandomForestClassifier(n_estimators=100)

scores = cross_val_score(model, X_train, y_train, cv=tscv, scoring='accuracy') print("TimeSeriesSplit CV Accuracy:", scores) print("Average:", scores.mean())


📌 Step 8: Predict for Upcoming Race (Race-Day)

Load upcoming race data for 10 horses

upcoming = pd.read_csv('upcoming_race.csv') # Must include horse_id, draw, odds, distance, etc.

Merge with latest historical stats

latest_stats = df.groupby('horse_id').last().reset_index()[[ 'horse_id', 'prev_finish', 'prev_top3', 'avg_finish_3', 'top3_rate_3', 'win_streak', 'jockey_win_rate', 'horse_age' ]]

race = upcoming.merge(latest_stats, on='horse_id', how='left')

Encode jockey/trainer as dummy variables

race = pd.get_dummies(race, columns=['jockey', 'trainer'], drop_first=True)

Align with training features

for col in X_train.columns: if col not in race.columns: race[col] = 0

race = race[X_train.columns]

Predict with the best model (choose one)

model = models['XGBoost'] race['top3_prob'] = model.predict_proba(race)[:, 1] race['top3_pred'] = (race['top3_prob'] > 0.5).astype(int)

Show predictions

race[['horse_id', 'top3_prob', 'top3_pred']].sort_values(by='top3_prob', ascending=False)


📌 Step 9: Visualize Predictions plt.figure(figsize=(10,6)) plt.barh(race['horse_id'], race['top3_prob']) plt.xlabel('Probability of Top 3 Finish') plt.title('Upcoming Race Predictions') plt.gca().invert_yaxis() plt.show()


✅ Output Example: horse_id top3_prob top3_pred H123 0.88 1 H456 0.74 1 H789 0.65 1 H000 0.42 0


💾 Optional: Save the Best Model import joblib joblib.dump(model, 'top3_xgboost_model.pkl')


🧠 Notes: o You need a clean horse_race_history.csv with: o horse_id, race_id, race_date, finish_position, odds, draw, distance, birth_year, jockey, trainer o For real-world use, build a scraper or data pipeline to automate race-day updates.


Would you like me to prepare a ready-to-download .ipynb file and sample CSV templates for testing? Let me know, and I’ll send you a link! prepare ready-to-download .ipynb file pls Great! I've created a ready-to-download Jupyter Notebook (.ipynb) file based on everything we discussed, including: ✅ Cleaned and sorted historical data ✅ Feature engineering (win streak, jockey win rate, horse age) ✅ Logistic Regression, Random Forest, XGBoost, LightGBM ✅ Time-aware cross-validation with TimeSeriesSplit ✅ Prediction for upcoming races (Top 3 finish) ✅ Bar chart visualization of predictions


📥 Download the Notebook 👉 Click here to download: 📎 Download: horse_racing_top3_prediction.ipynb


📁 Sample CSV Template Files To help you get started, here are sample CSV headers you can use to prepare your data:

  1. horse_race_history.csv horse_id race_date race_id finish_position odds draw distance jockey trainer birth_year H123 2022-01-01 R001 2 4.5 3 1200 J1 T1 2016 H456 2022-01-01 R001 1 2.8 1 1200 J2 T2 2017

  1. upcoming_race.csv horse_id draw odds distance jockey trainer H123 2 3.5 1200 J1 T1 H456 5 5.0 1200 J2 T2

💡 Tips o Use the same feature names and structures as in the notebook. o Make sure dates are in valid YYYY-MM-DD format. o For better results, the horse_race_history.csv should include at least 1000+ races for model learning.

🔁 Typical Workflow After Initial Training ✅ First time:

  1. Train the model
  2. Save with joblib.dump(model, 'model.pkl') ✅ Next time (e.g., tomorrow):
  3. Load with joblib.load('model.pkl')
  4. Prepare new data
  5. Predict with model.predict(X_new)

💡 Why This Is Useful o Saves time — no need to retrain o You can deploy the model into production o Can use the saved model in other scripts, apps (like Streamlit, Flask), or schedule batch jobs


👇 Example: Full Prediction Script Using Saved Model import pandas as pd import joblib

Load the saved model

model = joblib.load('top3_xgboost_model.pkl')

Load race-day data (features must match training)

race_data = pd.read_csv('upcoming_race.csv')

Preprocess race_data: add features, encode, align columns...

Predict

probs = model.predict_proba(race_data)[:, 1] race_data['top3_prob'] = probs race_data['top3_pred'] = (probs > 0.5).astype(int)

print(race_data[['horse_id', 'top3_prob', 'top3_pred']])


✅ Summary Task Do I Need To Retrain? Predict for new race ❌ No (just load the model) Add new features ✅ Yes (retrain and re-save) Update with new race history ✅ Yes (retrain)

Popular repositories Loading

  1. ricky9211 ricky9211 Public

    Config files for my GitHub profile.

0