
Signal Miner

Revolutionizing Staking: Aligning users and the fund through unique models.

This repository houses code and notebooks to mine (or systematically search for) machine learning models that aim to beat benchmarks for the Numerai Classic tournament. By automating the process of iteratively training, evaluating, and retaining high-performing models, Signal Miner gives you a quickstart for generating models that potentially produce better-than-benchmark performance on historical data.


Table of Contents

  1. Background
  2. Installation & Setup
  3. Usage Overview
  4. Performance Plot & Randomness
  5. Contributing
  6. License

Background

This notebook addresses the Numerai Classic data science tournament and aims to align incentives for generic staking on the tournament. Ideally, when more people stake, the hedge fund’s meta model improves because it can incorporate a diversity of unique signals. However, under the current setup, generic stakers often rely on pre-existing models—either Numerai’s example models or paid models from NumerBay—which limits the potential for fresh, unique alpha.

Make Staking Great Again:
The core idea of this project is that every staker should be able to contribute unique alpha to Numerai Classic. Why? Because unique alpha:

  • Has a better chance of producing positive MMC (Meta Model Contribution).
  • Potentially earns higher payouts than staking on widely used example models.
  • Doesn’t compromise on performance (only models that exceed the specified benchmark metrics, such as correlation and Sharpe, are retained).

By automatically searching for and refining these distinct models, we increase the variety of signals feeding into Numerai’s meta model, benefiting both stakers (via higher potential rewards) and Numerai (via more robust, diversified signals).

Signal Miner extends this idea by creating a pipeline to search for robust models in an automated fashion, focusing on:

  • Unique: Emphasizing uncommon or orthogonal predictions that add new information.
  • Transparent: Offering clear performance metrics at each mining iteration.
  • Efficient: Letting your machine handle the computational tasks while you focus on analysis.

The result is a win-win:

  • Stakers are happy because they can generate new signals and potentially earn more.
  • Numerai’s hedge fund is happy because it gains new, non-redundant alpha from the community.

Installation & Setup

  1. Clone the repository:
    git clone https://github.com/jefferythewind/signal_miner.git
    cd signal_miner
    
  2. Create (and activate) a virtual environment (optional but recommended):
    python -m venv venv
    source venv/bin/activate  # on Linux or macOS
    # or
    venv\Scripts\activate     # on Windows
    
  3. Install required dependencies:
    pip install -r requirements.txt
  4. Install Jupyter (if you want to use the notebook):
    pip install jupyter

That’s it! Once done, you’re ready to either run the code directly (e.g., via Python scripts) or explore the Jupyter notebooks.

Usage Overview

Recommended: See Model Miner.ipynb for a complete end-to-end example. It’s best run from top to bottom using Python 3.10.

Below is a high-level summary of how you might use Signal Miner in practice:

  1. Load your data as usual, e.g., reading a CSV or Parquet file into a Pandas DataFrame (a minimal loading sketch follows this list).
  2. Define a benchmark configuration to compare against (e.g., a standard LightGBM model).
  3. Create a parameter dictionary (hyperparameters to be sampled or searched).
  4. Set up time-series cross-validation with an embargo or gap (important to avoid leakage in financial data).
  5. Launch the asynchronous mining process (which iterates through parameter combinations and evaluates them across the defined cross-validation folds).
  6. Check progress periodically and see how many configurations have run.
  7. Evaluate results relative to your benchmark on the validation and test folds (e.g., correlation, Sharpe).
  8. Export or ensemble any configuration(s) that exceed your benchmark.
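
For the data-loading step, a minimal sketch is shown below. The file name, column prefixes, and variable names are assumptions chosen to line up with the later excerpts; the notebook itself may load the data differently.

import pandas as pd

# Hypothetical file name; point this at wherever you downloaded the Numerai Classic data.
data = pd.read_parquet("train.parquet")

# Numerai columns follow a naming convention: features start with "feature",
# targets start with "target", and "era" marks the time period of each row.
feature_cols = [c for c in data.columns if c.startswith("feature")]
targets = [c for c in data.columns if c.startswith("target")]
eras = data["era"].unique()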

Step-by-Step Example

Below are excerpts from the notebook demonstrating these steps:

1. Define the benchmark configuration:

benchmark_cfg = {
    "colsample_bytree": 0.1,
    "max_bin": 5,
    "max_depth": 5,
    "num_leaves": 2**4 - 1,
    "min_child_samples": 20,
    "n_estimators": 2000,
    "reg_lambda": 0.0,
    "learning_rate": 0.01,
    "target": 'target'  # Using the first target for simplicity
}
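
The keys in this dictionary map directly onto LightGBM hyperparameters, with "target" selecting which target column to train on. As a rough sketch of how such a configuration might be turned into a model (not necessarily the notebook's actual get_model implementation):

from lightgbm import LGBMRegressor

def make_model(cfg):
    # Everything except 'target' is passed straight to LightGBM.
    params = {k: v for k, v in cfg.items() if k != "target"}
    return LGBMRegressor(**params)

model = make_model(benchmark_cfg)
# model.fit(data[feature_cols], data[benchmark_cfg["target"]])  # sketch only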

2. Create the parameter dictionary to search:

import numpy as np

param_dict = {
    'colsample_bytree': list(np.linspace(0.001, 1, 100)),
    'reg_lambda': list(np.linspace(0, 100_000, 10000)),
    'learning_rate': list(np.linspace(0.00001, 1.0, 1000, dtype='float')),
    'max_bin': list(np.linspace(2, 5, 4, dtype='int')),
    'max_depth': list(np.linspace(2, 12, 11, dtype='int')),
    'num_leaves': list(np.linspace(2, 24, 15, dtype='int')),
    'min_child_samples': list(np.linspace(1, 250, 250, dtype='int')),
    'n_estimators': list(np.linspace(10, 2000, 1990, dtype='int')),
    'target': targets  # list of target column names from the Numerai data
}
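
Each key lists the candidate values for one hyperparameter, so a mined configuration is simply one value drawn per key. A minimal sketch of that sampling step (the notebook's actual search logic may differ, and the count of 2,000 is purely illustrative):

import random

def sample_config(param_dict):
    # Draw one candidate value per hyperparameter to form a single configuration.
    return {k: random.choice(v) for k, v in param_dict.items()}

configurations = [sample_config(param_dict) for _ in range(2000)]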

3. Set up time-series cross-validation (with a gap/embargo to avoid leakage across eras):

from sklearn.model_selection import TimeSeriesSplit

ns = 2  # number of splits
# eras: array of unique era identifiers, in chronological order
all_splits = list(TimeSeriesSplit(n_splits=ns, max_train_size=100_000_000, gap=12).split(eras))

Here, we use two folds. The first fold acts as “validation” and the second as a “test” set, ensuring no overlap.
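
Because the split is computed over eras rather than rows, the era-level indices need to be mapped back onto the rows of the DataFrame. One way that mapping might look (assuming data has an era column and eras holds the unique era identifiers, as in the loading sketch above):

# Each split is a (train_indices, holdout_indices) pair over positions in the eras array.
(_, val_idx), (_, test_idx) = all_splits
val_mask = data["era"].isin(eras[val_idx])
test_mask = data["era"].isin(eras[test_idx])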


4. Launch the mining process (asynchronous job pool) to train multiple configurations:

start_mining()

This begins training across the folds for each parameter combination. The process runs in the background, so you can continue using the notebook.
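
Under the hood, this is the familiar asynchronous worker-pool pattern. A stripped-down sketch of that pattern (not the repository's actual implementation) using multiprocessing:

from multiprocessing import Pool

def train_one(job):
    cfg, (train_idx, holdout_idx) = job
    # Train a model on the training eras and score it on the holdout eras,
    # returning whatever metrics the evaluation step needs (sketch only).
    ...

jobs = [(cfg, split) for cfg in configurations for split in all_splits]
pool = Pool(processes=4)
async_results = [pool.apply_async(train_one, (job,)) for job in jobs]
# The notebook stays responsive while the pool trains in the background.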

5. Periodically check progress:

check_progress()
# Example Output:
# Progress: 122.0/2002 (6.09%)

This lets you know how many configurations have completed.

6. Evaluate results once you’ve accumulated sufficient runs:

res_df = evaluate_completed_configs(
    data, configurations, mmapped_array, done_splits, all_splits, ns
)
# Label any benchmark configuration
res_df['is_benchmark'] = (res_df.index == BENCHMARK_ID)

print("Benchmark Results:")
res_df[res_df['is_benchmark']]

You’ll see metrics such as validation_corr, test_corr, whole_corr, validation_shp, etc., alongside your benchmark.
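
For reference, these correlation and Sharpe-style metrics are conventionally computed per era. A compact sketch of that kind of calculation, using Spearman correlation as a simplified stand-in for Numerai's official scoring (column names are assumptions):

def era_scores(df, pred_col="prediction", target_col="target"):
    # Rank correlation between predictions and the target within each era.
    per_era = df.groupby("era").apply(
        lambda g: g[pred_col].corr(g[target_col], method="spearman")
    )
    # "Sharpe" here is the mean per-era correlation divided by its volatility.
    return per_era.mean(), per_era.mean() / per_era.std()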

7. Compare models to the benchmark to find superior configurations:

print("Better Than Benchmark Results:")
compare_to_benchmark(res_df)
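
Conceptually, this boils down to filtering res_df for rows whose metrics beat the benchmark row. A sketch of such a filter (column names taken from the metrics listed above; the actual function may apply different or additional criteria):

bench = res_df[res_df["is_benchmark"]].iloc[0]
better = res_df[
    (res_df["validation_corr"] > bench["validation_corr"])
    & (res_df["validation_shp"] > bench["validation_shp"])
]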

8. Export any top-performing models for deployment:

to_export = [res_df.sort_values('whole_shp').iloc[-1].name]  # pick the best by Sharpe
evaluate_and_ensemble(
    to_export, configurations, mmapped_array, data,
    all_splits, feature_cols, get_model, save_name="model"
)
# Example output:
# Predict function saved as predict_model_full.pkl

The above snippet creates an ensemble (even if it’s a single model) and saves a .pkl file suitable for future inference or Numerai submission.
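
Once exported, the pickled predict function can be loaded elsewhere for inference. A minimal sketch (the exact call signature of the saved function is an assumption, and live_data is a hypothetical DataFrame of live features):

import pickle

with open("predict_model_full.pkl", "rb") as f:
    predict = pickle.load(f)

predictions = predict(live_data[feature_cols])  # hypothetical usage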


That’s the overall usage flow of Signal Miner. For the most up-to-date code and additional detail, please refer to the Model Miner notebook.

Performance Plot & Randomness

Below is a scatter plot illustrating the relationship between past performance (cross-validation / in-sample Sharpe) and future performance (test-fold / out-of-sample Sharpe):

[Figure: sharpe_scatter (validation Sharpe vs. test Sharpe for each mined configuration)]

Key Takeaway: The best model on historical (validation) data is not necessarily the best model for unseen data. There’s inherent randomness in the modeling process, and no amount of backtesting can completely guarantee out-of-sample success.

In our example plot, each dot represents a model configuration:

  • The x-axis is the validation Sharpe (past fold).
  • The y-axis is the test Sharpe (future fold).
  • The benchmark model is shown as a star, and a fitted regression line highlights the strong linear relationship.

Some observations:

  1. Not Perfect: The top-performing validation model isn’t the top performer on the test set, confirming that overfitting or luck can play a role in “winning” the validation stage.
  2. Benchmark Surprises: The benchmark ranks near the top in validation, yet multiple models outperformed it on the test set.
  3. Encouraging Correlation: Despite the inevitable randomness, there is a strong positive correlation between past and future performance—meaning high validation Sharpe often translates to high test Sharpe.
  4. What If the Plot Looked Random?: If, instead, you saw a circular or completely random distribution, that would mean your model selection is mostly noise. In such cases, “chasing” the top validation model yields little to no real out-of-sample edge.

This dynamic mirrors the transition from training to live deployment: even the best backtested model might not be the best performer going forward. But a solid positive correlation provides some confidence that better in-sample results can lead to better out-of-sample performance.
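
If you want to reproduce a figure like this from your own mining run, it can be built directly from res_df with matplotlib. A sketch (the test_shp column name and plotting details are assumptions; the notebook's plotting code may differ):

import matplotlib.pyplot as plt
import numpy as np

x = res_df["validation_shp"]
y = res_df["test_shp"]

plt.scatter(x, y, alpha=0.5, label="mined configurations")

# Highlight the benchmark with a star marker.
bench = res_df[res_df["is_benchmark"]]
plt.scatter(bench["validation_shp"], bench["test_shp"], marker="*", s=200, label="benchmark")

# Simple least-squares best-fit line.
slope, intercept = np.polyfit(x, y, 1)
xs = np.linspace(x.min(), x.max(), 100)
plt.plot(xs, slope * xs + intercept, label="best fit")

plt.xlabel("validation Sharpe")
plt.ylabel("test Sharpe")
plt.legend()
plt.show()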

Contributing

We welcome contributions! Whether it’s:

  • Bug fixes or clarifications
  • Additional model-mining techniques
  • Expanded plotting and diagnostic tools

Feel free to open a Pull Request or Issue.

License

This project is licensed under the MIT License. You’re free to use and modify this code for your own modeling adventures.

Namaste, and happy mining!
