# Predicting Hall of Fame Inductees using FanGraphs Data (Decade Loop)

## Objective
Build an XGBoost classifier to predict Baseball Hall of Fame induction using **FanGraphs** data via `pybaseball`. 

**Note:** Requesting too many seasons at once can trigger server-side errors (HTTP 500), so we fetch batting statistics in **decade-long chunks** (e.g., 1970-1979, 1980-1989) and concatenate the results.

## Tech Stack
*   **Python & Pandas**: Data processing
*   **Pybaseball**: Access to FanGraphs data
*   **XGBoost**: Classification model
*   **Scikit-Learn**: Metrics

In [None]:
# Install libraries
!pip install pybaseball xgboost scikit-learn matplotlib seaborn pandas numpy

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
import pybaseball
from pybaseball import batting_stats
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import time

plt.style.use('fivethirtyeight')

## 1. Data Loading (Loop by Decade)
We fetch data from 1970 to 2025 in windows of up to 10 years. The final window (2020-2025) is shorter because it is clipped at the end year.
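
As a minimal sketch of the windowing arithmetic, the year ranges for 1970-2025 can be listed without any network call. The helper name `decade_windows` is illustrative only and is not part of `pybaseball`.

```python
def decade_windows(start_year, end_year, step=10):
    """Return inclusive (first, last) year pairs covering start_year..end_year.

    Hypothetical helper for illustration; it only computes the ranges
    and does not fetch anything.
    """
    windows = []
    for first in range(start_year, end_year + 1, step):
        # Clip the last window so it never runs past end_year
        last = min(first + step - 1, end_year)
        windows.append((first, last))
    return windows


windows = decade_windows(1970, 2025)
for first, last in windows:
    print(f"{first}-{last}")
```

Each pair maps to one `batting_stats(first, last, qual=0)` request, so this range produces six requests in total.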

In [None]:
def fetch_data_by_decade(start_year, end_year):
    """Fetch season-level batting stats in chunks of up to 10 years and concatenate them."""
    all_data = []
    
    # Iterate through decades
    for year in range(start_year, end_year + 1, 10):
        # Define the window (e.g., 1970-1979)
        decade_start = year
        decade_end = min(year + 9, end_year)
        
        print(f"Fetching data for {decade_start}-{decade_end}...")
        
        try:
            # Fetch stats for this chunk
            # qual=0 ensures we get all players, not just qualified ones
            df_chunk = batting_stats(decade_start, decade_end, qual=0)
            all_data.append(df_chunk)
            
            # Be polite to the server
            time.sleep(2)
            
        except Exception as e:
            # A failed chunk is skipped, so careers that span it will be undercounted
            print(f"Error fetching {decade_start}-{decade_end}: {e}")
            
    if not all_data:
        raise ValueError("No data could be fetched!")
        
    # Combine all chunks
    return pd.concat(all_data, ignore_index=True)

# Execute Fetch
try:
    df_batting = fetch_data_by_decade(1970, 2025)
    print(f"\nTotal Data Shape: {df_batting.shape}")
except Exception as e:
    print(f"Failed to load data: {e}")

## 2. Preprocessing & Aggregation
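`batting_stats` returns one row per player per season, so we aggregate to **Career Totals**. Because the fetch starts in 1970, careers that began earlier are only partially counted.

Note: Counting columns such as 'G', 'AB', 'H', and 'HR' are summed per player. 'WAR' is additive across seasons, so it is summed as well to get career WAR.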
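
To illustrate why a rate stat is recomputed from summed counts rather than averaged across seasons, here is a small sketch on made-up season rows for two hypothetical players. The values are not real FanGraphs data. Only the column names (`IDfg`, `Name`, `AB`, `H`, `AVG`) follow the ones used in this notebook.

```python
import pandas as pd

# Made-up season rows for two hypothetical players (not real data)
seasons = pd.DataFrame({
    "IDfg": [1, 1, 2, 2],
    "Name": ["Player A", "Player A", "Player B", "Player B"],
    "AB":   [500, 100, 400, 600],
    "H":    [150, 20, 100, 180],
})
seasons["AVG"] = seasons["H"] / seasons["AB"]

# Sum the counting stats per player; also keep the naive mean of season AVGs
career = seasons.groupby("IDfg").agg(
    Name=("Name", "first"),
    AB=("AB", "sum"),
    H=("H", "sum"),
    mean_of_season_avg=("AVG", "mean"),
).reset_index()

# Career AVG is total hits over total at-bats
career["AVG"] = career["H"] / career["AB"]

print(career)
```

For Player A, the mean of the two season averages (.300 and .200) is .250, while the career average from the summed counts (170 hits in 600 at-bats) is about .283. The short season pulls the naive mean down more than it should.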

In [None]:
if 'df_batting' in locals() and not df_batting.empty:
    # Counting stats (plus WAR) that are additive across seasons
    cols_to_sum = ['G', 'AB', 'PA', 'H', '1B', '2B', '3B', 'HR', 'R', 'RBI', 'SB', 'BB', 'SO', 'WAR']
    
    # Coerce each column to numeric; values that cannot be parsed become 0
    for col in cols_to_sum:
        if col in df_batting.columns:
            df_batting[col] = pd.to_numeric(df_batting[col], errors='coerce').fillna(0)

    # Aggregate by Player ID (FanGraphs uses 'IDfg')
    career_stats = df_batting.groupby('IDfg').agg({
        'Name': 'first',  # Keep the name
        **{col: 'sum' for col in cols_to_sum if col in df_batting.columns}
    }).reset_index()

    # Recompute career AVG from the summed counts (averaging season AVGs would be wrong).
    # OBP/SLG are skipped: recomputing them accurately needs HBP/SF columns, which may be missing.
    # Division by zero at-bats yields NaN or inf, so both are mapped to 0.
    career_stats['AVG'] = (career_stats['H'] / career_stats['AB']).replace([np.inf, -np.inf], np.nan)
    career_stats = career_stats.fillna(0)
    
    # Filter: Decent career length (> 2000 AB)
    career_stats = career_stats[career_stats['AB'] > 2000]
    
    print(f"Filtered Career Players: {len(career_stats)}")
    display(career_stats.head())

## 3. Creating Labels (HOF Status)
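Since we fetched raw stats without HOF labels, we'll engineer a **'Hall of Fame Standard'** label for training.

Real-world data science often relies on 'Silver Labels' when Gold Labels (here, an actual HOF induction database) aren't easily linkable. We'll mark a player as `is_hof` if they meet high statistical benchmarks widely accepted for induction. Note that this label is a deterministic function of 'WAR', 'H', and 'HR', which are also model features, so the classifier learns to recover these benchmarks rather than actual voting outcomes.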
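
As a rough sketch of how such a rule behaves, the snippet below applies the thresholds used in this notebook (60+ WAR, 3000+ hits, 500+ home runs) to three made-up stat lines. The helper `meets_hof_standard` and the players are hypothetical and exist only for illustration.

```python
def meets_hof_standard(war, hits, home_runs):
    """Return True if any one of the three benchmarks is met (logical OR)."""
    return war >= 60 or hits >= 3000 or home_runs >= 500


# Made-up career lines for hypothetical players (not real data)
players = [
    {"name": "Player A", "WAR": 72.0, "H": 2600, "HR": 310},  # qualifies on WAR
    {"name": "Player B", "WAR": 45.0, "H": 3050, "HR": 120},  # qualifies on hits
    {"name": "Player C", "WAR": 52.0, "H": 2400, "HR": 380},  # meets none
]

labels = {
    p["name"]: int(meets_hof_standard(p["WAR"], p["H"], p["HR"]))
    for p in players
}
print(labels)
```

Because the conditions are combined with OR, a single benchmark is enough for a positive label. Player B is labeled positive despite a WAR well below 60.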

In [None]:
if 'career_stats' in locals():
    # Benchmarks for 'Automatic' or near-automatic induction
    # 1. 60+ WAR (Borderline is 50-60, but 60 is safe for training positives)
    # 2. 3000+ Hits
    # 3. 500+ HR
    career_stats['is_hof'] = (
        (career_stats['WAR'] >= 60) |
        (career_stats['H'] >= 3000) |
        (career_stats['HR'] >= 500)
    ).astype(int)

    print("Players meeting the HOF standard:", career_stats['is_hof'].sum())

## 4. Model Training

In [None]:
if 'career_stats' in locals():
    features = ['G', 'AB', 'PA', 'R', 'H', 'HR', 'RBI', 'SB', 'BB', 'AVG', 'WAR']
    X = career_stats[features]
    y = career_stats['is_hof']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
    model = xgb.XGBClassifier(eval_metric='logloss', random_state=42)
    model.fit(X_train, y_train)
    
    print("Model trained.")

## 5. Evaluation

In [None]:
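if 'model' in locals():
    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    # Rows are true classes, columns are predicted classes
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred))

    # Feature importance (by default, how often each feature is used in a split)
    xgb.plot_importance(model, max_num_features=10)
    plt.title("Key Stats for HOF Prediction")
    plt.show()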
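
Players meeting the HOF standard are a small minority, so accuracy alone can look strong even when no positives are found. The sketch below uses made-up labels (5 positives, 95 negatives) and a trivial predictor that always says "no". `confusion_counts` is a hypothetical helper written for this example, not a scikit-learn function.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# Made-up, imbalanced labels: 5 positives and 95 negatives
y_true = [1] * 5 + [0] * 95
# A trivial predictor that never predicts the positive class
y_pred = [0] * 100

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The trivial predictor reaches 95% accuracy with zero recall, which is why the per-class precision and recall in the classification report are more informative here than the accuracy figure.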

## 6. Predictions for Recent Stars
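Let's see who the model likes. We score every player in the filtered career table, including those used for training, and list the 20 with the highest predicted probability.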

In [None]:
if 'model' in locals():
    # Predict probabilities for everyone
    career_stats['HOF_Prob'] = model.predict_proba(career_stats[features])[:, 1]
    
    # Show top candidates
    top_candidates = career_stats.sort_values(by='HOF_Prob', ascending=False).head(20)
    print("Top HOF Candidates (Model Probability):")
    display(top_candidates[['Name', 'WAR', 'H', 'HR', 'HOF_Prob']])