forked from tailhq/DynaML

Scala Library/REPL for scalable Machine Learning

Koldh/DynaML

 
 


DynaML


Aim

DynaML is a Scala library/REPL for implementing and working with general machine learning models. Machine Learning/AI applications make heavy use of entities such as graphs, vectors, and matrices, as well as classes of mathematical models that deal with broadly three kinds of tasks: prediction, classification, and clustering.

The aim is to build a robust set of abstract classes and interfaces, which can be extended easily to implement advanced models for small and large scale applications.

The library can also be used as an educational/research tool for multi-scale data analysis.

Currently DynaML has implementations of the Least Squares Support Vector Machine (LS-SVM) for binary classification and regression. The LS-SVM is equivalent to ridge regression/Tikhonov regularization; for further background, consult Wikipedia or the [book](http://www.amazon.com/Least-Squares-Support-Vector-Machines/dp/9812381511).
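For context, a standard statement of the primal LS-SVM problem (in its regression form; the classification variant modifies the constraint) minimizes a regularized squared-error objective under equality constraints:

```latex
\min_{w,\,b,\,e} \; \frac{1}{2} w^{\top} w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^{2}
\qquad \text{subject to} \qquad y_i = w^{\top} \phi(x_i) + b + e_i, \quad i = 1, \dots, N
```

Eliminating the slack variables e_i recovers a ridge-type (Tikhonov regularized) least-squares problem, which is why the two models are equivalent.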

A good general introduction to probabilistic models for machine learning can be found in David Barber's textbook. The LS-SVM is equivalent to the class of models discussed in Chapter 18 (Bayesian Linear Models) of the book.

Installation

Prerequisites: Maven

  • Clone this repository
  • Run the following.
  mvn clean compile
  mvn package
  • Make sure you give execution permission to DynaML in the target/bin directory.
  chmod +x target/bin/DynaML
  target/bin/DynaML

You should get the following prompt.

    ___       ___       ___       ___       ___       ___   
   /\  \     /\__\     /\__\     /\  \     /\__\     /\__\  
  /::\  \   |::L__L   /:| _|_   /::\  \   /::L_L_   /:/  /  
 /:/\:\__\  |:::\__\ /::|/\__\ /::\:\__\ /:/L:\__\ /:/__/   
 \:\/:/  /  /:;;/__/ \/|::/  / \/\::/  / \/_/:/  / \:\  \   
  \::/  /   \/__/      |:/  /    /:/  /    /:/  /   \:\__\  
   \/__/               \/__/     \/__/     \/__/     \/__/  

Welcome to DynaML v 1.2
Interactive Scala shell

DynaML>

Getting Started

The data/ directory contains a few sample data sets, and the root directory also has example scripts which can be executed in the shell.

  • First we create a linear classification model on a CSV data set. We assume that the last column in each line of the file is the target value, and build an LS-SVM model.
	val config = Map("file" -> "data/ripley.csv", "delim" -> ",", "head" -> "false", "task" -> "classification")
	val model = GaussianLinearModel(config)
  • We can now (optionally) apply a kernel to the model to create a generalized linear Bayesian model.
  val rbf = new RBFKernel(1.025)
  model.applyKernel(rbf)
15/06/25 22:30:57 INFO SVMKernel$: Constructing key-value representation of kernel matrix.
15/06/25 22:30:57 INFO SVMKernel$: Dimension: 13 x 13
15/06/25 22:30:57 INFO SVMKernelMatrix: Eigenvalue decomposition of the kernel matrix using JBlas.
15/06/25 22:30:57 INFO SVMKernelMatrix: Eigenvalue stats: 0.09797818213131776 =< lambda =< 3.178218421049352
15/06/25 22:30:57 INFO GaussianLinearModel: Applying Feature map to data set
15/06/25 22:30:57 INFO GaussianLinearModel: DONE: Applying Feature map to data set
bayeslearn>
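For intuition, the RBF kernel assigns a similarity that decays with the squared distance between two points. Below is a minimal self-contained sketch, assuming the common parameterization K(x, y) = exp(-||x - y||² / (2σ²)); DynaML's RBFKernel may scale its bandwidth parameter differently.

```scala
// Minimal sketch of an RBF (Gaussian) kernel, assuming the common
// parameterization K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
// Illustrative only; DynaML's RBFKernel may differ in scaling.
object RBFSketch {
  def rbf(sigma: Double)(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length, "vectors must have the same dimension")
    val sqDist = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
    math.exp(-sqDist / (2.0 * sigma * sigma))
  }

  def main(args: Array[String]): Unit = {
    // Identical points have maximal similarity: prints 1.0
    println(rbf(1.025)(Array(0.0, 0.0), Array(0.0, 0.0)))
  }
}
```

Note how similarity falls off smoothly as points move apart, with the bandwidth sigma controlling how quickly.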
  • Now we can solve the optimization problem posed by the LS-SVM in the parameter space. Since the LS-SVM problem is equivalent to ridge regression, we have to specify a regularization constant.
  model.setRegParam(1.5).learn
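To see what setting a regularization constant and learning amount to conceptually, here is a toy one-dimensional illustration of the ridge regression closed form (this is not DynaML's actual solver, which works on the full kernelized problem):

```scala
// Toy illustration of ridge regression in one dimension (no intercept):
// w = (sum x_i * y_i) / (sum x_i^2 + lambda).
// NOT DynaML's solver; just the closed form that the LS-SVM objective
// reduces to in the simplest linear case.
object RidgeSketch {
  def ridge1D(xs: Array[Double], ys: Array[Double], lambda: Double): Double = {
    val xy = xs.zip(ys).map { case (x, y) => x * y }.sum
    val xx = xs.map(x => x * x).sum
    xy / (xx + lambda)
  }

  def main(args: Array[String]): Unit = {
    val xs = Array(1.0, 2.0, 3.0)
    val ys = Array(2.0, 4.0, 6.0)  // exactly y = 2x
    println(ridge1D(xs, ys, 0.0))  // lambda = 0 recovers the slope: 2.0
    println(ridge1D(xs, ys, 1.5))  // regularization shrinks the slope
  }
}
```

Larger regularization constants shrink the coefficients toward zero, trading training fit for stability.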
  • We can now predict the value of the target variable given a new point consisting of a Vector of features using model.predict().

  • Evaluating models is easy in DynaML. You can create an evaluation object as follows.

	val configtest = Map("file" -> "data/ripleytest.csv", "delim" -> ",", "head" -> "false")
	val met = model.evaluate(configtest)
	met.print
  • The object met has a print() method which dumps some performance metrics to the shell. You can also generate plots using the generatePlots() method.
15/06/25 22:35:06 INFO BinaryClassificationMetrics: Classification Model Performance
15/06/25 22:35:06 INFO BinaryClassificationMetrics: ============================
15/06/25 22:35:06 INFO BinaryClassificationMetrics: Area under PR: NaN
15/06/25 22:35:06 INFO BinaryClassificationMetrics: Area under ROC: 0.8160130718954248
met.generatePlots

Image of Plots
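The "Area under ROC" figure reported above can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A small self-contained sketch of that pairwise definition (not DynaML's BinaryClassificationMetrics implementation):

```scala
// AUC-ROC as the fraction of (positive, negative) pairs where the
// positive example receives the higher score; ties count as 0.5.
// Illustrative only; not DynaML's metrics code.
object AucSketch {
  def auc(scores: Array[Double], labels: Array[Int]): Double = {
    val pos = scores.zip(labels).collect { case (s, 1) => s }
    val neg = scores.zip(labels).collect { case (s, 0) => s }
    val wins = for (p <- pos; n <- neg)
      yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.length * neg.length)
  }

  def main(args: Array[String]): Unit = {
    val scores = Array(0.1, 0.4, 0.35, 0.8)
    val labels = Array(0, 0, 1, 1)
    println(auc(scores, labels)) // prints 0.75
  }
}
```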

  • Although kernel based models allow great flexibility in modeling non-linear behavior in data, they are highly sensitive to the values of their hyper-parameters. For example, if we use a Radial Basis Function (RBF) kernel, finding the best values of the kernel bandwidth and the regularization constant is a non-trivial problem.

  • In order to find the best hyper-parameters for a general kernel based supervised learning model, we use methods from gradient-free global optimization. This is relevant because the cost (objective) function for the hyper-parameters is not smooth in general; in most common scenarios it is defined in terms of some kind of cross-validation performance.

  • DynaML has a robust global optimization API. Currently Coupled Simulated Annealing and Grid Search algorithms are implemented; the API in the package org.kuleuven.esat.optimization can be extended to implement general gradient-based or gradient-free optimization methods.
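To make the idea concrete, here is a minimal generic grid search over two hyper-parameters, scoring each pair with an arbitrary (possibly non-smooth) objective. This mirrors the shape of the problem DynaML's optimization API addresses, but it is not DynaML's actual interface, and the objective used here is a hypothetical stand-in for a cross-validation score.

```scala
// Minimal generic grid search over two hyper-parameters.
// `objective` plays the role of a cross-validation score to minimize.
// Not DynaML's API; a sketch of the underlying idea.
object GridSearchSketch {
  def gridSearch(bandwidths: Seq[Double], regParams: Seq[Double])
                (objective: (Double, Double) => Double): (Double, Double) = {
    val candidates = for (b <- bandwidths; r <- regParams) yield (b, r)
    candidates.minBy { case (b, r) => objective(b, r) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical objective with a known minimum at (2.0, 0.5).
    val obj = (b: Double, r: Double) =>
      (b - 2.0) * (b - 2.0) + (r - 0.5) * (r - 0.5)
    val best = gridSearch(Seq(0.5, 1.0, 2.0, 4.0), Seq(0.1, 0.5, 1.0))(obj)
    println(best) // prints (2.0,0.5)
  }
}
```

Grid search evaluates every candidate pair; methods like Coupled Simulated Annealing instead sample the space stochastically, which scales better as the number of hyper-parameters grows.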

  • Let's tune an RBF kernel on the Ripley data.

import com.tinkerpop.blueprints.Graph
import com.tinkerpop.frames.FramedGraph
import io.github.mandar2812.dynaml.graphutils.CausalEdge
val (optModel, optConfig) = KernelizedModel.getOptimizedModel[FramedGraph[Graph],
      Iterable[CausalEdge], model.type](model, "csa",
      "RBF", 13, 7, 0.3, true)

We see a long list of logs which ends in something like the snippet below: the Coupled Simulated Annealing procedure gives us a set of hyper-parameters and their values.

optModel: io.github.mandar2812.dynaml.models.GaussianLinearModel = io.github.mandar2812.dynaml.models.GaussianLinearModel@6adcc6d9
optConfig: scala.collection.immutable.Map[String,Double] = Map(bandwidth -> 4.292522306297284, RegParam -> 7.56099893666852)

To inspect the performance of this kernel model on an independent test set, we can use the evaluate() method. But before that we must train this 'optimized' kernel model on the training set.

optModel.setMaxIterations(2).learn()
val met = optModel.evaluate(configtest)
met.print()
met.generatePlots()

And the evaluation results follow ...

15/06/25 23:49:32 INFO BinaryClassificationMetrics: Classification Model Performance
15/06/25 23:49:32 INFO BinaryClassificationMetrics: ============================
15/06/25 23:49:32 INFO BinaryClassificationMetrics: Area under PR: NaN
15/06/25 23:49:32 INFO BinaryClassificationMetrics: Area under ROC: 0.8696078431372549

Documentation

You can refer to the project home page or the documentation to get started with DynaML. Bear in mind that this project is still in its infancy, and there will be many more improvements/tweaks in the future.
