Home

Welcome to the 2dv50e wiki!

Data

Four datasets are provided:

Heart Disease
Breast Cancer Wisconsin (Diagnostic)
Pima Indian Diabetes
Vehicle Silhouettes

Dataset structure

Each dataset includes the following files:

dataset.csv - original CSV file with all features
target.csv - CSV file with the target class of each instance
topModels.csv - top 55 models (5 models per base learning algorithm)
The following base classifiers (with their respective hyperparameter alternatives) are used:
- K-Nearest Neighbor: {'n_neighbors': list(range(1, 25)), 'metric': ['chebyshev', 'manhattan', 'euclidean', 'minkowski'], 'algorithm': ['brute', 'kd_tree', 'ball_tree'], 'weights': ['uniform', 'distance']}
- Support Vector Machine: {'C': list(np.arange(0.1,4.43,0.11)), 'kernel': ['rbf','linear', 'poly', 'sigmoid']}
- Gaussian Naive Bayes: {'var_smoothing': list(np.arange(0.00000000001,0.0000001,0.0000000002))}
- Multilayer Perceptron: {'alpha': list(np.arange(0.00001,0.001,0.0002)), 'tol': list(np.arange(0.00001,0.001,0.0004)), 'max_iter': list(np.arange(100,200,100)), 'activation': ['relu', 'identity', 'logistic', 'tanh'], 'solver' : ['adam', 'sgd']}
- Logistic Regression: {'C': list(np.arange(0.5,2,0.075)), 'max_iter': list(np.arange(50,250,50)), 'solver': ['lbfgs', 'newton-cg', 'sag', 'saga'], 'penalty': ['l2', 'none']}
- Linear Discriminant Analysis: {'shrinkage': list(np.arange(0,1,0.01)), 'solver': ['lsqr', 'eigen']}
- Quadratic Discriminant Analysis: {'reg_param': list(np.arange(0,1,0.02)), 'tol': list(np.arange(0.00001,0.001,0.0002))}
- Random Forests: {'n_estimators': list(range(60, 140)), 'criterion': ['gini', 'entropy']}
- Extra Trees: {'n_estimators': list(range(60, 140)), 'criterion': ['gini', 'entropy']}
- Adaptive Boosting: {'n_estimators': list(range(40, 80)), 'learning_rate': list(np.arange(0.1,2.3,1.1)), 'algorithm': ['SAMME.R', 'SAMME']}
- Gradient Boosting: {'n_estimators': list(range(85, 115)), 'learning_rate': list(np.arange(0.01,0.23,0.11)), 'criterion': ['friedman_mse', 'mse', 'mae']}
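
To get a feel for the size of these search spaces, the sketch below expands the K-Nearest Neighbor alternatives listed above into individual candidate configurations. The wiki does not say how the alternatives were sampled or searched, so exhaustive enumeration is only an assumption here, and `expand_grid` is a hypothetical helper name, not part of the project.

```python
from itertools import product

# K-Nearest Neighbor alternatives, copied from the list above.
knn_grid = {
    'n_neighbors': list(range(1, 25)),
    'metric': ['chebyshev', 'manhattan', 'euclidean', 'minkowski'],
    'algorithm': ['brute', 'kd_tree', 'ball_tree'],
    'weights': ['uniform', 'distance'],
}


def expand_grid(grid):
    """Return every hyperparameter combination in the grid as a dict.

    Hypothetical helper: assumes an exhaustive search, which the wiki
    does not confirm.
    """
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


knn_candidates = expand_grid(knn_grid)
print(len(knn_candidates))  # 24 * 4 * 3 * 2 = 576
print(knn_candidates[0])
```

Each resulting dict has the same shape as the keyword arguments of the corresponding classifier, so it could be compared against the "params" column of topModels.csv.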

Each instance (row) represents one model, with its model_id, algorithm id, all calculated metrics, and overall performance. Overall performance is a single value, calculated as the average of all 8 metrics. The "params" column lists the hyperparameters used for that particular model.
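
The overall performance calculation can be sketched as follows. The metric names and values are placeholders invented for illustration, since the wiki does not list the 8 metrics or the exact column names in topModels.csv; `overall_performance` is likewise a hypothetical function name.

```python
from statistics import mean

# Placeholder metrics for one model. The names and values are
# illustrative only; the real columns in topModels.csv may differ.
model_metrics = {
    'metric_1': 0.80,
    'metric_2': 0.82,
    'metric_3': 0.84,
    'metric_4': 0.86,
    'metric_5': 0.88,
    'metric_6': 0.90,
    'metric_7': 0.92,
    'metric_8': 0.94,
}


def overall_performance(metrics):
    """Average all metric values into a single score (equal weights)."""
    return mean(metrics.values())


score = overall_performance(model_metrics)
print(round(score, 4))
```

Every metric contributes equally here, which matches the description of a plain average over all 8 metrics.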

topModelsProbabilities.csv - CSV file with the predicted class probabilities of all 55 top models

Each row holds the class probabilities per instance of the target variable for every model.
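
As a rough illustration of how class probabilities relate to class predictions, the sketch below picks the most probable class for each instance. The in-memory layout (one list of probabilities per instance, for a single model) and the values are assumptions made for this example; the actual layout of topModelsProbabilities.csv is not documented here and may differ.

```python
# Hypothetical probabilities from one model for three instances of a
# two-class problem. Illustrative values, not taken from the datasets.
probabilities = [
    [0.90, 0.10],
    [0.30, 0.70],
    [0.45, 0.55],
]


def predicted_class(class_probabilities):
    """Return the index of the class with the highest probability."""
    indices = range(len(class_probabilities))
    return max(indices, key=lambda i: class_probabilities[i])


predicted = [predicted_class(row) for row in probabilities]
print(predicted)  # [0, 1, 1]
```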
