8000 Home · brainstorm/pdc Wiki · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
knyl edited this page Sep 22, 2010 · 9 revisions

Introduction

Using the knowledge gained on PDC summer school 2010, we choose to parallelize part of a phylogenetic analysis algorithm suite so that the speedups and parallelization would be of great use to researchers interested on phylogenetics. Fastphylo, the software package we're working on, is currently published under the GPL license on sourceforge projects repository.

The experimental setup we will base our development and tests on is PDC's Ferlin cluster.

Phylogenetic Background

Phylogeny consists in building biological trees of organisms and analyze their relationships over time. Several methods and phylogenetic tools exist, among them, fastphylo, and more specifically fastprot. Those tools analyze the mutations that have happened over time on particular biological datasets. Typically, a important value used to assess how organisms evolved is the "distance" or how much they differ from each other. We briefly explain two methods used in this software package were we spotted parallelization options.

Maximum Likelihood

Maximum Likelihood is a method that tries to find the distance that maximizes the likelihood function [ L(t) = P(N|t) ]. The likelihood represents the probability that a certain amount of replacements has taken place given a distance.

Expected Distance

Expected distance is another method for estimating the evolutionary distance by calculating [\ E(t|N) ].

Performance model

The performance model will be outlined on next iterations.

What does FastPhylo do overall

TODO Little bit of theory with paper references

Profiling the serial program

We used a script that did multiple runs on different lengths of input sequences and different parameters (namely "speed") and collected times using callgrind profiling tool, part of the bigger well known valgrind suite. In order to visualize the resulting runs, we used KCachegrind, the following screenshot shows the results for the bigger dataset:

We immediately identified were most of the time is spent. Those are the functions: "posterior_probability" which calls "elem_mult" 76000 times. The strategy envisioned is to apply OpenMP to the for loop calling elem_mult and use MPI to parallelize the calculations of distances between sequences later on, seen in the code:

   // XXX Splits the resulting matrix "dm" depending on input size
   // Rank will determine the dm chunk assigned to worker

   // XXX MPI bag of tasks

    for (int i=0; i<sv.size(); i++) {
      for (int j=i+1; j<sv.size(); j++){
        Matrix N = count_replacements(sv[i], sv[j]);
        double distance = calculate_ed(N);
        dm.setDistance(i, j, distance); 
      }
    }
  }

The profiling run for the maximum likelihood (screenshot not included) was discarded as a plausible optimization target provided that it is already optimized using the [[LAPACK|http://software.intel.com/en-us/articles/parallelism-in-the-intel-math-kernel-library/ ]] library.

Parallelization using OpenMP

Using the OpenMP directive "parallel for" on the elem_mult routine

Parallelization using MPI

Parallelized by using the "bag of tasks" paradigm

Algorithm description

if (rank == 0) {
    initialize();
    for (i = 1 .. nr_processes)
        MPI_SEND // Send translation model to process i
}

if (rank != 0)
    // Receive translation model from master

if (rank == 0) {
  // Set up tasks
    while (/* Still tasks left */) {
        // Receive task request
        // Send task to process
        //
    }
}

if (rank != 0) {
    // Send request for task
    // Receive task
    while (/*received task != end_task*/) {
        // Do computations
        // Send result to master
    }
}

Results

Speedup and comparison

TOTELLONMAIL Security audits and:

a07c01n08$ spq -u $USER Q JID USER STATE CAC RESOURCE TIME
87 092115320949 knyl wait summer-2010-03 1A 0h10
88 092115315663 knyl wait summer-2010-03 1A 0h10
Note: Operator is enforcing an allocation pause. Note: no jobs will start, execution of dequeue requests delayed.

Wed Sep 22 18:48:44 CEST 2010 speed 1 threads 1 231.58 real 228.98 user 0.16 sys threads 2 199.63 real 388.64 user 0.98 sys threads 4 173.57 real 685.63 user 2.77 sys threads 8 173.51 real 1352.01 user 8.24 sys speed 4 threads 1 58.76 real 58.67 user 0.08 sys threads 2 53.76 real 106.71 user 0.42 sys threads 4 54.32 real 202.60 user 5.05 sys threads 8 62.12 real 466.13 user 9.10 sys speed 8 threads 1 25.54 real 25.45 user 0.08 sys threads 2 26.75 real 51.58 user 1.03 sys threads 4 28.33 real 106.21 user 2.69 sys threads 8 36.82 real 266.96 user 8.50 sys

0