Welcome to the CD-HIT Project Main Page

News (September 2009)
The CD-HIT web server is now available to run cd-hit or download pre-calculated clusters.

CD-HIT stands for Cluster Database at High Identity with Tolerance. The program (cd-hit) takes a FASTA-format sequence database as input and produces a set of 'non-redundant' (nr) representative sequences as output. In addition, cd-hit outputs a cluster file documenting the sequence 'groupies' for each nr representative sequence. The idea is to reduce the overall size of the database without removing any sequence information, by removing only 'redundant' (highly similar) sequences. This is why the resulting database is called non-redundant (nr). Essentially, cd-hit produces a set of closely related protein families from a given FASTA sequence database.

CD-HIT uses a 'longest sequence first' list removal algorithm to remove sequences above a certain identity threshold. Additionally, the algorithm implements a very fast heuristic to find high-identity segments between sequences, and so can avoid many costly full alignments.

With recent developments, the cd-hit package offers new programs for clustering DNA sequences and for comparing two databases. It also has many new options for controlling clustering.

CD-HIT was originally written by Weizhong Li and is now an open source project!
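
The 'longest sequence first' idea described above can be illustrated with a small Python sketch. This is not the cd-hit source code: the function names, the use of a simple ungapped comparison in place of a real alignment, and the shared-word count used as the fast pre-filter are all assumptions made for illustration. Sequences are processed from longest to shortest; each one either joins the first existing representative it matches at or above the identity threshold, or becomes a new representative.

```python
import math


def min_matches(length, threshold):
    """Smallest number of identical positions that reaches the threshold."""
    return math.ceil(threshold * length - 1e-9)


def shared_words(short, long_words, k):
    """Count word positions in `short` whose k-letter word occurs in the longer sequence."""
    return sum(1 for i in range(len(short) - k + 1) if short[i:i + k] in long_words)


def best_ungapped_matches(short, long):
    """Toy stand-in for an alignment: slide `short` along `long` without gaps."""
    best = 0
    for offset in range(len(long) - len(short) + 1):
        matches = sum(1 for a, b in zip(short, long[offset:]) if a == b)
        best = max(best, matches)
    return best


def greedy_cluster(seqs, threshold=0.9, k=2):
    """Cluster {name: sequence}, longest first; returns {representative: [members]}."""
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    reps = []      # (name, sequence, set of k-letter words) for each representative
    clusters = {}
    for name in order:
        seq = seqs[name]
        need = min_matches(len(seq), threshold)
        # Each mismatch can destroy at most k words, so a true hit must keep
        # at least this many words in common with the representative.
        word_floor = len(seq) - k + 1 - k * (len(seq) - need)
        for rep_name, rep_seq, rep_words in reps:
            if shared_words(seq, rep_words, k) < word_floor:
                continue  # cheap filter: skip the costly comparison
            if best_ungapped_matches(seq, rep_seq) >= need:
                clusters[rep_name].append(name)
                break
        else:
            words = {seq[i:i + k] for i in range(len(seq) - k + 1)}
            reps.append((name, seq, words))
            clusters[name] = [name]
    return clusters


example = {
    "a": "MKTAYIAKQRQISFVK",
    "b": "MKTAYIAKQRQISFV",
    "c": "GGGGGGGGGG",
}
print(greedy_cluster(example))  # {'a': ['a', 'b'], 'c': ['c']}
```

In this toy input, sequence "b" is absorbed by the longer "a", while "c" shares no two-letter words with "a" and is rejected by the word filter before any position-by-position comparison is made.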

Bugs
There are a number of outstanding bugs in the current implementation. We are always looking for hard-working and enthusiastic volunteers (people like Luc Ducazu) to shoot these problems down.

Sub Projects
The CD-HIT project provides a number of opportunities for interesting research activities. If one of these sub-projects takes your interest, why not join up and take part? We are especially keen to work closely with bioinformatics MSc students on their MSc projects.

Related Resources
For related resources, please see (or update) sequence clustering.

Thanks
Many thanks are due.

Comments and suggestions