WO2006008575A2

WO2006008575A2 - Construction of a comparative database and identification of virulence factors through comparison of polymorphic regions in clinical isolates of infectious organisms

Info

Publication number: WO2006008575A2
Application number: PCT/IB2004/002598
Authority: WO
Inventors: Villoo Morawala Patell; K. R. Rajyashri; Marc Rodrigue; Guy Vernet
Original assignee: Avestha Gengraine Technologies Pvt. Ltd.; Biomerieux Sa
Priority date: 2004-07-12
Filing date: 2004-07-12
Publication date: 2006-01-26
Also published as: EP1789577A4; CN101421415A; EP1789577A2; WO2006008575A9

Abstract

The present invention is directed to novel nucleotide sequences to be used for diagnosis, identification of the strain, typing of the strain and giving orientation to its potential degree of virulence, infectivity and/or latency for all infectious diseases, more particularly tuberculosis. The present invention also includes a method for the identification and selection of polymorphisms associated with virulence and/or infectivity in infectious diseases, more particularly in tuberculosis, by a comparative genomic analysis of the sequences of different clinical isolates/strains of infectious organisms. The regions of polymorphism can also act as potential drug targets and vaccine targets. More particularly, the invention also relates to identifying virulence factors of M. tuberculosis strains and other infectious organisms to be included in a diagnostic DNA chip allowing identification of the strain, typing of the strain and finally giving orientation to its potential degree of virulence. Although the present invention has been illustrated with specific reference to the polymorphic regions in Mycobacterium tuberculosis, the said invention is not to be understood and construed as being limited to tuberculosis but is applicable to all infectious diseases.

Description

Construction of a comparative database and identification of virulence factors through comparison of polymorphic regions in clinical isolates of Infectious Organisms

Field of Invention

The present invention is directed to novel nucleotide sequences to be used for diagnosis, identification of the strain, typing of the strain and giving orientation to its potential degree of virulence, infectivity and/or latency for all infectious diseases including tuberculosis. The present invention also includes a method for the identification and selection of polymorphisms associated with virulence and/or infectivity in infectious diseases by a comparative genomic analysis of the sequences of different clinical isolates/strains of infectious organisms. The regions of polymorphism can also act as potential drug targets and vaccine targets. More particularly, the invention also relates to identifying virulence factors of M. tuberculosis strains and other infectious organisms to be included in a diagnostic DNA chip allowing identification of the strain, typing of the strain and finally giving orientation to its potential degree of virulence.

Although the present invention has been illustrated with specific reference to the polymorphic regions in Mycobacterium tuberculosis, the said invention is not to be understood and construed as being limited to tuberculosis but is applicable to all infectious diseases.

Background of the Invention

Microbial pathogens use a variety of complex strategies to subvert host cellular functions to ensure their multiplication and survival. Some pathogens that have co-evolved or have had a long-standing association with their hosts utilize finely tuned host-specific strategies to establish a pathogenic relationship.

During infection, pathogens encounter different conditions, and respond by expressing virulence factors that are appropriate for the particular environment, host, or both.

Although antibiotics have been effective tools in treating infectious disease, the emergence of drug resistant pathogens is becoming problematic in the clinical setting. New antibiotic or antipathogenic molecules are therefore needed to combat such drug resistant pathogens. Accordingly, there is a need in the art for screening methods aimed not only at identifying and characterizing potential antipathogenic agents, but also for identifying and characterizing the virulence factors that enable pathogens to infect and debilitate their hosts.

The mycobacteria are rod-shaped, acid-fast, aerobic bacilli that do not form spores. Several species of mycobacteria are pathogenic to humans and/or animals, and the factors associated with their virulence are of considerable interest. Tuberculosis is a worldwide health problem, which causes approximately 3 million deaths each year, yet little is known about the molecular basis of tuberculosis pathogenesis. The disease is caused by infection with Mycobacterium tuberculosis; tubercle bacilli are inhaled and then ingested by alveolar macrophages. As is the case with most pathogens, infection with M. tuberculosis does not always result in disease. The infection is often arrested by developing cell-mediated immunity (CMI), resulting in the formation of microscopic lesions, or tubercles, in the lung. If CMI does not limit the spread of M. tuberculosis, caseous necrosis, bronchial wall erosion, and pulmonary cavitation may occur. The factors that determine whether infection with M. tuberculosis results in disease are not well understood.

The tuberculosis complex is a group of four mycobacterial species that are so closely related genetically that it has been proposed that they be combined into a single species. Three important members of the complex are Mycobacterium tuberculosis, the major cause of human tuberculosis; Mycobacterium africanum, a major cause of human tuberculosis in some populations; and Mycobacterium bovis, the cause of bovine tuberculosis. None of these mycobacteria is restricted to being pathogenic for a single host species. For example, M. bovis causes tuberculosis in a wide range of animals including humans, in which it causes a disease that is clinically indistinguishable from that caused by M. tuberculosis. Human tuberculosis is a major cause of mortality throughout the world, particularly in less developed countries. It accounts for approximately eight million new cases of clinical disease and three million deaths each year. Bovine tuberculosis, as well as causing a small percentage of these human cases, is a major cause of animal suffering and large economic costs in the animal industries.

Antibiotic treatment of tuberculosis is very expensive and requires prolonged administration of a combination of several anti-tuberculosis drugs. Treatment with single antibiotics is not advisable as tuberculosis organisms can develop resistance to the therapeutic levels of all antibiotics that are effective against them. Strains of M. tuberculosis that are resistant to one or more anti-tuberculosis drugs are becoming more frequent and treatment of patients infected with such strains is expensive and difficult. In a small but increasing percentage of human tuberculosis cases the tuberculosis organisms have become resistant to the two most useful antibiotics, isoniazid and rifampicin. Treatment of these patients presents extreme difficulty and in practice is often unsuccessful. In the current situation there is clearly an urgent need to develop new methods for detecting virulent strains of mycobacteria and to develop tuberculosis therapies.

There is a recognized vaccine for tuberculosis, which is an attenuated form of M. bovis known as BCG. This is very widely used but it provides incomplete protection. The development of BCG was completed in 1921 but the reason for its avirulence was and has continued to remain unknown. Methods of attenuating tuberculosis strains to produce a vaccine in a more rational way have been investigated but have not been successful for a variety of reasons. However, in view of the evidence that dead M. bovis BCG was less effective in conferring immunity than live BCG, there exists a need for attenuated strains of mycobacteria that can be used in the preparation of vaccines.

A variety of compounds have been proposed as virulence factors for tuberculosis but, despite numerous investigations, good evidence to support these proposals is lacking. Nevertheless, the discovery of a virulence factor or factors for tuberculosis is very important and is an active area of current research. Such a discovery would not only enable the possible development of a new generation of tuberculosis vaccines but might also provide a target for the design or discovery of new or improved anti-tuberculosis drugs or therapies.

Present methods for the identification and characterization of mycobacteria in samples from human and animal diseases are Ziehl-Neelsen staining, in vitro and in vivo culture, biochemical testing and serological typing. These methods are generally slow and do not readily discriminate between closely related mycobacterial strains and species, for example Mycobacterium paratuberculosis and Mycobacterium avium. Mycobacteria are widespread in the environment, and rapid methods do not exist for the identification of specific pathogenic strains from amongst the many environmental strains, which are generally non-pathogenic. Difficulties with existing methods of mycobacterial identification and characterization have increased relevance for the analysis of microbial isolates from Crohn's disease (regional ileitis) in humans and Johne's disease in animals (particularly cattle, sheep and goats), as well as for M. avium strains from AIDS patients with mycobacterial superinfections. Although the causative agents of human leprosy and tuberculosis are clearly recognized, clinico-pathological forms of each disease exist, such as the tuberculoid form of leprosy, in which mycobacterial tissue abundance is low and identification correspondingly difficult. Improvements in the specific recognition and characterization of mycobacteria may also increase in relevance if current evidence linking diseases such as rheumatoid arthritis to mycobacterial antigens is substantiated. Emerging drug resistance in mycobacteria, including M. avium isolates from AIDS patients and Mycobacterium tuberculosis from TB patients, is an increasing problem.

There is no data or technical information in the prior art that permits the specific selection of potential new targets and protective antigens for new drugs and vaccine compositions to treat and prevent infectious diseases, particularly tuberculosis. Furthermore, there is a need for the development of new tools for the selection of genes which encode essential proteins or regulatory nucleotide sequences involved in the survival or infection of mycobacterium species and which are useful for the design of anti-tuberculosis drugs and vaccines based on the knowledge of comparative mycobacterial genomics.

A method of using DNA probes for the precise identification of mycobacteria and discrimination between closely related mycobacterial strains and species by genotype characterization is essential. The method of genotypic analysis is further applicable to the rapid identification of phenotypic properties such as drug resistance and pathogenicity.

The invention aids in fulfilling these needs in the art. The method according to the invention has the advantage of drastically reducing the number of potential new targets and protective antigens by giving for the first time an exhaustive description of conserved SNPs in M. tuberculosis. The isolated polynucleotides described in the present invention, which are highly conserved in genomic sequences of both virulent and avirulent strains, are by this characteristic essential for the survival or the virulence of these mycobacteria in the host. The identification of antigens and potentially therapeutic targets has been made by a method of comparative genomic analysis.

Prior Art

Patent application WO 02074903 describes a method of selecting purified nucleotide sequences or polynucleotides encoding proteins, or parts of proteins, that carry at least one essential function for the survival or the virulence of mycobacterium species, by a comparative genomic analysis of the genome sequence of M. tuberculosis aligned against the genome sequence of M. leprae. M. tuberculosis and M. leprae marker polypeptides, nucleotides encoding the polypeptides, and methods for using the nucleotides and the encoded polypeptides are also disclosed.

US patent no. 6,228,575 provides oligonucleotide-based arrays and methods for speciating and phenotyping organisms, for example, using oligonucleotide sequences based on the Mycobacterium tuberculosis rpoB gene. The groups or species to which an organism belongs may be determined by comparing hybridization patterns of target nucleic acid from the organism to hybridization patterns in a database.

Patent application no. WO9954487 and US patent no. 6,492,506 describe a method for isolating a polynucleotide of interest that is present or is expressed in a genome of a first mycobacterium strain and that is absent or altered in a genome of a second mycobacterium strain which is different from the first mycobacterium strain, using a bacterial artificial chromosome (BAC) vector. That invention further relates to a polynucleotide isolated by this method and the recombinant BAC vector used in this method. In addition, it comprises a method and kit for detecting the presence of mycobacteria in a biological sample.

US patent no. 5,783,386 describes polynucleotides associated with virulence in mycobacteria, and particularly a fragment of DNA isolated from M. bovis that contains a region encoding a putative sigma factor. Also provided are methods for a DNA sequence or sequences associated with virulence determinants in mycobacteria, and particularly in M. tuberculosis and M. bovis. In addition, the invention provides a method for producing strains with altered virulence or other properties, which can themselves be used to identify and manipulate individual genes.

US patent no. 5,955,077 relates to novel antigens from mycobacteria capable of evoking early (within 4 days) immunological responses from T-helper cells in the form of gamma-interferon release in memory immune animals after rechallenge infection with mycobacteria of the tuberculosis complex. The antigens of the invention are believed useful especially in vaccines, but also in diagnostic compositions, especially for diagnosing infection with virulent mycobacteria. Also disclosed are nucleic acid fragments encoding the antigens as well as methods of immunizing animals/humans and methods of diagnosing tuberculosis.

US patent no. 6,596,281 describes two genes for proteins of M. tuberculosis that have been sequenced. The DNAs and their encoded polypeptides can be used for immunoassays and vaccines. Cocktails of at least three purified recombinant antigens, and cocktails of at least three DNAs encoding them, can be used for improved assays and vaccines for bacterial pathogens and parasites.

US patent no. 5,700,683 provides specific genetic deletions that result in an avirulent phenotype of a mycobacterium. These deletions may be used as phenotypic markers, providing a means for distinguishing between disease-producing and non-disease-producing mycobacteria.

US Patent no. 5,225,324 relates to a family of DNA insertion sequences (ISMY) of mycobacterial origin and other DNA probes which may be used as probes in assay methods for the identification of mycobacteria and the differentiation between closely related mycobacterial strains and species. The use of ISMY, and of proteins and peptides encoded by ISMY, in vaccines, pharmaceutical preparations and diagnostic test kits is also disclosed.

WO0066157 patent application provides for polypeptides encoded by open reading frames present in the genome of Mycobacterium tuberculosis but absent from the genome of BCG and diagnostic and prophylactic methodologies using these polypeptides.

US 6,458,366 discloses compounds and methods for diagnosing tuberculosis. The compounds provided include polypeptides that contain at least one antigenic portion of one or more M. tuberculosis proteins, and DNA sequences encoding such polypeptides. Diagnostic kits containing such polypeptides or DNA sequences and a suitable detection reagent may be used for the detection of M. tuberculosis infection in patients and biological samples. Antibodies directed against such polypeptides are also provided.

S. T. Cole has sequenced the complete genome of the best-characterized strain of Mycobacterium tuberculosis, H37Rv. The sequence has been analyzed in order to improve our understanding of the biology of this slow-growing pathogen and to help the conception of new prophylactic and therapeutic interventions. [Nature 393, 537 - 544 (1998)]

A multicomponent analysis to determine the association of polymorphisms with the degree of virulence and infectivity is in progress. These polymorphisms constitute a set of putative virulence markers that are being validated in 120 clinical isolates of tuberculosis. The study results in a set of virulence markers, which could be used in predicting the degree of virulence and infectivity of Mycobacterium infections.

There is no data or technical information in the prior art that permits the specific selection of potential new targets and protective antigens for new drugs and vaccine compositions to treat and prevent infectious diseases including mycobacterial diseases, particularly tuberculosis and leprosy.

SUMMARY OF THE INVENTION

The object of the present invention is to identify genes which encode essential proteins or regulatory nucleotide sequences involved in the survival or infection of mycobacterium species, as well as of the organisms causing other infectious diseases, and which could be useful for the design of drugs and vaccines based on the knowledge of comparative genomics. Yet another object of the present invention is to provide for the identification of strains including mycobacterium in disease samples, for the specific recognition of pathogenic strains, for precisely distinguishing closely related strains including mycobacterial strains and for defining virulence and resistance patterns.

The method according to the invention has the advantage of drastically reducing the number of potential new targets and protective antigens by giving for the first time an exhaustive description of conserved SNPs in different M. tuberculosis strains, which cause tuberculosis. The isolated polynucleotides described in the present invention, which are highly conserved in genomic sequences of virulent strains, are essential for the survival or the virulence of these strains, in particular mycobacteria, in the host. The identification of antigens and potentially therapeutic targets has been made by a method of comparative genomic analysis.

The invention is directed to identifying virulence factors in M. tuberculosis & other infectious diseases, using both strands of DNA, RNA and/or proteins associated with the virulence factors, allowing identification of the strain, typing of the strain and finally giving orientation to its potential degree of virulence, infectivity and/or latency.

Accordingly, this invention provides nucleotide sequences having SEQ ID Nos. 1 to 2531 for diagnosis, identification of the strain, typing of the strain and giving orientation to its potential degree of virulence, infectivity and/or latency in all infectious diseases.

The invention is further directed to a method comprising aligning the genomic sequences of different mycobacteria species to:

a. Select a polynucleotide sequence that is highly conserved amongst the virulent strains and corresponds to an essential gene for the survival or the virulence of mycobacterium species.

b. Select polymorphisms between virulent and avirulent strains to identify genes and regions conferring virulence to the former strains.

c. Optionally, test the selected polynucleotide for its capacity for virulence or its involvement in the survival of a mycobacterium species, said testing being based on the activation or inactivation of said polynucleotide in a bacterial host, or on the activity of the product of expression of said polynucleotide in vivo or in vitro.

The invention further comprises the identification of the following polymorphisms, which have the potential to be used as reagents and in diagnostics, drug and vaccine development for infectious diseases:

i. An identical nucleotide in virulent strains/species, but a different nucleotide in avirulent strains/species at the same position.

ii. Some of the virulent strains differ in the nucleotide sequence at specific positions and share the nucleotide sequence with that of avirulent strains.

The invention relates to the identification and analysis of non-synonymous SNPs to predict conservative and non-conservative amino acid substitutions. The effect of the substitution on the function of the encoded proteins provides a powerful insight in predicting SNPs correlating with virulence and infectivity in infectious diseases, for example those caused by M. tuberculosis.

The invention further relates to proteins, RNA, DNA and metabolites encoded by the regions carrying the polymorphisms in tuberculosis and other infectious disease-causing organisms, which can be utilized for developing drugs and vaccines effective against tuberculosis and other infectious diseases, and which play an important role in gene therapy, RNAi technology and imaging.

The invention is also directed to a process for the production of recombinant polypeptides and chimeric polypeptides comprising them, antibodies generated against these polypeptides, immunogenic or vaccine compositions comprising at least one polypeptide useful as a protective antigen or capable of inducing a protective response in vivo or in vitro against Mycobacterium infections, immunotherapeutic compositions comprising at least one such polypeptide according to the invention, and the use of such nucleic acids and polypeptides in diagnostic methods, vaccines, kits, or antimicrobial therapy.

SEQ ID Nos. 1 to 1829 are single nucleotide polymorphisms. SEQ ID Nos. 1830 to 2286 are insertions/deletions (indels). SEQ ID Nos. 2287 to 2531 are regions of long polymorphism.

The present invention also includes primer sequences for amplifying the regions around the polymorphisms of SEQ ID Nos. 1 to 2531.

The nucleotide sequences flanking the polymorphisms of SEQ ID Nos. 1 to 2531 to a length of 35 nucleotides on either side are used in reagents and in diagnostics, drug development, RNAi, gene therapy and other such technologies.
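The flank extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `flanking_sequences`, the 0-based half-open coordinate convention and the in-memory genome string are hypothetical and are not taken from the scripts of the invention.

```python
def flanking_sequences(genome, start, end, flank=35):
    """Return the (left, right) flanks of a polymorphism.

    genome     -- genome sequence as a plain string (assumed to fit in memory)
    start, end -- 0-based, half-open interval [start, end) of the polymorphism
    flank      -- number of nucleotides to keep on either side (35 in the text)
    """
    if not (0 <= start <= end <= len(genome)):
        raise ValueError("polymorphism coordinates fall outside the genome")
    # Flanks are truncated when the polymorphism lies near a sequence end.
    left = genome[max(0, start - flank):start]
    right = genome[end:end + flank]
    return left, right


# Toy example: a single-base polymorphism at position 40 of an 81-base sequence.
toy_genome = "A" * 40 + "G" + "C" * 40
left, right = flanking_sequences(toy_genome, 40, 41)
print(len(left), len(right))
```

A circular genome would need wrap-around handling at the sequence ends; the sketch simply truncates the flank there.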

SEQ ID Nos 1 to 2531 are used as targets for drug design using bioinformatics and other tools, drug development, for gene therapy and vaccine development. This invention also includes the use of proteins, RNA, DNA and metabolites encoded by the region carrying the polymorphisms having a SEQ ID Nos. 1 to 2531 for RNAi technology and antisense technologies.

This invention also includes a database for identification and selection of the polymorphisms having SEQ ID Nos. 1 to 2531.

Brief description of the figures and tables:

Fig 1 describes Entity Relationship Model.

Fig 2 illustrates the identification of SNPs in M. tuberculosis strains H37Rv, CDC1551 and M. bovis BCG. A total of 1829 SNPs have been identified in the three genomes. Of these, 1825 SNPs are identical in H37Rv and CDC1551, with a different nucleotide in BCG. 1579 of these are in ORFs while the rest (246) are in non-coding regions. The SNPs in the ORFs are categorized into synonymous and non-synonymous SNPs. The latter are further categorized on the basis of the change in primary structure of the protein that results: conservative for no change and non-conservative for a changed primary structure of the protein encoded.

Figure 3 illustrates the identification of indels in M. tuberculosis strains H37Rv, CDC1551 and M. bovis BCG. A total of 794 indels have been identified in the three genomes. Of these, 237 are present in both H37Rv and CDC1551 with respect to BCG, 178 in ORF and 59 are outside the ORF.

Figure 4 illustrates the identification of long polymorphisms in M. tuberculosis strains H37Rv, CDC1551 and M. bovis BCG. 136 polymorphisms are present in the three genomes, 30 of them being identical in CDC1551 and H37Rv. 22 of these polymorphisms are present in the ORFs while 8 are outside the ORFs.

Figure 5 shows a region of 10 kb of the BCG genome with three types of annotations: BCG ORFs, SNPs in H37Rv, and SNPs in CDC1551.

Figure 6 shows the comparative genomics browser displaying BCG in the upper panel and H37Rv in the bottom panel. The segments labeled MUM-* are the perfect matches generated by the MUMmer tool, and the vertical lines show the alignment of the MUM segments in both genomes. The color coding of the ORF's is used to indicate the length of the ORF. This is very helpful to researchers because if an ORF in H37 aligns with an ORF in BCG but they have different colors, then there is a mutation that makes them have different lengths (see for example the genes in the MUMrl280 region).

Figures 7.1 to 7.25 list the primers used to amplify the regions encompassing the polymorphisms.

Table 1 gives the list of Single Nucleotide Polymorphisms in Mycobacterium tuberculosis/ M. bovis BCG.

Table 2 gives the list of Insertions/deletions (Indels) in Mycobacterium tuberculosis/ M. bovis BCG.

Table 3 gives the list of long polymorphisms in Mycobacterium tuberculosis/ M. bovis BCG.

Table 4 lists Polymorphisms in genes involved in cell wall synthesis.

Table 5 lists Polymorphisms in transcription factors.

Table 6 lists Polymorphisms in genes involved in lipid metabolism

Table 7 lists Polymorphisms in genes encoding membrane transport proteins

Table 8 lists Polymorphisms in genes implicated in virulence

Detailed Description of the invention:

The Mycobacterium tuberculosis complex comprises several species, including M. tuberculosis, M. bovis, M. canettii, M. microti and M. africanum. Of these, the genomes of two different strains of M. tuberculosis, which are virulent and infective to humans, have been completely sequenced, and the complete genome of M. bovis BCG, which is non-virulent and non-infective, has also been sequenced. Only partial sequences are available for the other species. All Mycobacterium sequences available in the NCBI, EMBL, GenBank, Sanger and TIGR databases were retrieved and compiled.

The total numbers of sequences retrieved are as follows:

Species name                  No. of sequences retrieved
Mycobacterium africanum       16
Mycobacterium canettii        03
Mycobacterium microti         24
Mycobacterium tuberculosis    121 A
Mycobacterium bovis           183

The complete genomes of Mycobacterium tuberculosis strains H37Rv (referred to as H37Rv) and CDC1551 (referred to as CDC1551), both of which are virulent and infective to humans, and of Mycobacterium bovis BCG (referred to as BCG), which is non-virulent and non-infective in humans, were aligned and a database was constructed. The structure of the database is given in figure 1.

Sequences were aligned using the pairwise alignment tool "MUMmer-3.08" (www.tigr.org).

The use of MUMmer required three distinct steps:

1. running MUMmer for each of the target genomes (CDC1551 and H37Rv) against the reference genome (BCG)

2. parsing the MUMmer output to produce a list of polymorphisms, and loading these data into a polymorphism database.

3. generating feature files for visualization, and loading these features into a feature database.

BCG was chosen as the reference genome, and the two tuberculosis strains, CDC1551 and H37Rv, were compared against this reference. MUMmer uses fasta files as input and was run using the following command line:

$ run-mummer1 bovis.fasta CDC1551.fasta BCG-CDC

which takes the format: program <reference> <query> <output>

The BCG-CDC parameter provides the file name prefix for the output files, the bovis.fasta parameter is the reference fasta file, and the CDC1551.fasta parameter is the name of the query fasta sequence file.

The database is generated using the following scripts.

Parsing the MUMmer .align file to extract polymorphism data

The file is parsed to extract useful information, which is stored in a much simpler tab-delimited text file format. A custom Perl script named mum-parse.pl is used for this purpose; it relies on the Perl module Parse::RecDescent to create a recursive descent parser based on the grammar contained in the custom file Mummer.pm. The script is run with the following command line:

$ perl ./mum-parse.pl --mummer1 --outfile=../mummer/BCG-CDC ../mummer/BCG-CDC.align

This creates three output files:

1. BCG-CDC.gaps - this is the initial output file that simply lists the location of all exact matches in the two sequences.

2. BCG-CDC.errorgaps - this is a processed version of the gaps file.

3. BCG-CDC.align - this is the fully annotated file that is used to locate all polymorphisms.

Pairwise alignments of BCG-H37Rv and BCG-CDC1551 were performed using the BCG genomic sequence as reference. Results of the alignment identified three types of polymorphisms:

1. SNPs - single nucleotide polymorphisms in one or more of the sequences aligned.

2. indels - insertion or deletion of one or more bases in the sequences aligned.

3. Long polymorphic regions - regions with numerous changes in the sequences aligned.
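The three categories above can be illustrated with a short Python sketch. It is not the mum-parse.pl logic: it assumes the alignment is already available as two equal-length gapped strings (with `-` marking a gap), and the `long_min` threshold for calling a run of changes a long polymorphic region is a hypothetical parameter, since the text does not state how many changes count as "numerous".

```python
def classify_polymorphisms(ref_aln, qry_aln, long_min=10):
    """Classify differences between two gapped, equal-length alignment rows.

    Returns (kind, start, end) tuples, where kind is "snp", "indel" or "long"
    and [start, end) are 0-based alignment columns.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have the same length")
    events = []
    i, n = 0, len(ref_aln)
    while i < n:
        if ref_aln[i] == qry_aln[i]:
            i += 1
            continue
        # Extend over the whole run of consecutive differing columns.
        j = i
        while j < n and ref_aln[j] != qry_aln[j]:
            j += 1
        ref_run, qry_run = ref_aln[i:j], qry_aln[i:j]
        if j - i >= long_min:
            # Many adjacent changes: report as one long polymorphic region.
            events.append(("long", i, j))
        elif set(ref_run) == {"-"} or set(qry_run) == {"-"}:
            # The run is a pure gap in one sequence: one insertion/deletion.
            events.append(("indel", i, j))
        else:
            # Short mixed run: report each column on its own.
            for k in range(i, j):
                kind = "indel" if "-" in (ref_aln[k], qry_aln[k]) else "snp"
                events.append((kind, k, k + 1))
        i = j
    return events


# Toy alignment: one substitution at column 2 and a 2-base deletion at columns 6-7.
print(classify_polymorphisms("ACGTACGTAC", "ACCTAC--AC"))
```

Coordinates here are alignment columns; mapping them back to genome positions would require subtracting the gaps that precede each column in the respective sequence.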

Inserting the Annotation of the complete genomes into the database

The gene annotation downloaded from either genbank or EMBL is included into the database by running the following script

$ /work/mtb/scripts/annot.pl --seq=[filename] --dbname=[NAME] --user=[NAME] --password=[PASS]

where filename indicates either the GenBank or the EMBL gene annotation file.

Inserting the Data into the DB

To insert the CDC1551 SNP's into the DB the following command is run:

$ perl /work/mtb/scripts/snp-insert.pl --snp=../mummer/BCG-CDC.snp --user=[NAME] --password=[PASS] --query_acc=NC_002755

To insert the H37Rv SNPs into the DB the following command is run:

$ perl /work/mtb/scripts/snp-insert.pl --snp=../mummer/BCG-H37.snp --user=[NAME] --password=[PASS] --query_acc=NC_000962

To determine whether SNPs are synonymous or non-synonymous, it is first determined whether they lie within or outside an open reading frame. All SNPs that lie within an ORF are taken and the amino acid for the codon containing the SNP is determined.
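As an illustration of the codon comparison just described, the following Python sketch classifies a single SNP inside an ORF. It is a minimal sketch and not the synomous.pl script: the function name `classify_snp` and its arguments are hypothetical, the ORF is assumed to be given on its coding strand starting at the first base of the start codon, and the standard codon assignments are used.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(orf_seq, snp_offset, alt_base):
    """Classify a SNP as synonymous or non-synonymous.

    orf_seq    -- coding-strand ORF sequence, in frame from its first base
    snp_offset -- 0-based position of the SNP within orf_seq
    alt_base   -- nucleotide found at that position in the compared genome
    Returns (kind, reference amino acid, alternative amino acid).
    """
    pos_in_codon = snp_offset % 3
    codon_start = snp_offset - pos_in_codon
    ref_codon = orf_seq[codon_start:codon_start + 3].upper()
    if len(ref_codon) < 3:
        raise ValueError("SNP falls in an incomplete codon")
    alt_codon = ref_codon[:pos_in_codon] + alt_base.upper() + ref_codon[pos_in_codon + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


# Toy ORF ATG GCT GAA encodes Met-Ala-Glu.
print(classify_snp("ATGGCTGAA", 5, "C"))  # third base of the GCT codon
print(classify_snp("ATGGCTGAA", 6, "A"))  # first base of the GAA codon
```

An ORF on the reverse strand would need to be reverse-complemented before being passed in, together with the alternative base.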

To determine if the BCG locations lie within ORFs the following command is run:

$ perl /work/mtb/scripts/snp-orf-ref.pl --ref_seq=../seqs/bovis.fasta --user=[NAME] --password=[PASS]

All BCG locations within ORFs must have their amino acids determined. To do so, the following command is run:

$ perl /work/mtb/scripts/ref-aa.pl --ref_seq=../seqs/bovis.fasta --user=[NAME] --password=[PASS]

Next, the H37Rv and CDC1551 locations are mapped. To assign the CDC1551 ORFs the following command is run:

$ perl /work/mtb/scripts/snp-orf2.pl --query_seq=../seqs/CDC1551.fasta --user=[NAME] --password=[PASS]

To assign the H37Rv ORFs the following command is run:

$ perl scripts/snp-orf2.pl --query_seq=../seqs/H37Rv.fasta --user=[NAME] --password=[PASS]

To determine whether the CDC1551 SNPs are synonymous or non-synonymous the following command is run:

$ cd /work/mtb/scripts

$ perl /work/mtb/scripts/synomous.pl --bcg_file=../seqs/bovis.fasta --query_seq=../seqs/CDC1551.fasta --user=[NAME] --password=[PASS]

To determine whether the H37Rv SNPs are synonymous or non-synonymous the following command is run:

$ cd /work/mtb/scripts

$ perl /work/mtb/scripts/synomous.pl --bcg_file=../seqs/bovis.fasta --query_seq=../seqs/H37Rv.fasta --user=[NAME] --password=[PASS]
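The synonymous versus non-synonymous decision made in this step can be illustrated with a short sketch. It is not the code of synomous.pl; the function name `classify_snp` is hypothetical, and the standard codon assignments are assumed (these are also the assignments used by the bacterial translation table).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(ref_codon: str, index: int, query_base: str):
    """Return (ref_aa, query_aa, label); label is 'S' or 'NS'.

    ref_codon  : reference codon containing the SNP
    index      : position (0-2) of the SNP within the codon
    query_base : base observed in the query strain at that position
    """
    query_codon = ref_codon[:index] + query_base + ref_codon[index + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    query_aa = CODON_TABLE[query_codon]
    label = "S" if ref_aa == query_aa else "NS"
    return ref_aa, query_aa, label


print(classify_snp("GCT", 2, "C"))   # third-position change within alanine codons
print(classify_snp("TGG", 2, "A"))   # tryptophan codon changed to a stop codon
```

A query amino acid of `*` marks a stop codon, which corresponds to the truncated-protein case reported later in this description.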

A set of summary columns is used to coalesce all the SNP data in one place. To do this, the following command is run:

$ perl /work/mtb/scripts/compare-snps.pl --user=[NAME] --password=[PASS]

To insert data into the SNP analysis table, the SNP data from the SNP, SEQ_SNP and gene ontology tables are fetched and entered into the SNP_analysis table. This step also identifies the conservative and non-conservative amino acid substitutions.

To do this, the following program is run:

$ run.sh /work/mtb/scripts/

The SNP data in the database is thus complete.

Analysis of SNPs

The SNPs identified were of two kinds:

i. Identical nucleotide in CDC1551 and H37Rv, but a different nucleotide in BCG at the same position.

ii. One of the three sequences is polymorphic; the nucleotides of CDC1551 and H37Rv are different from each other and one of them is identical to the BCG nucleotide at the same position.
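The two kinds of SNP defined above can be told apart from the three bases at an aligned position. The following is a minimal sketch under that reading; the function name `snp_kind` and the "other" label for positions where all three bases differ (a case the text does not describe) are assumptions.

```python
from typing import Optional


def snp_kind(bcg: str, h37rv: str, cdc1551: str) -> Optional[str]:
    """Classify an aligned position by the bases seen in the three genomes."""
    if bcg == h37rv == cdc1551:
        return None                      # no polymorphism at this position
    if h37rv == cdc1551:
        return "i"                       # both queries agree, BCG differs
    if bcg in (h37rv, cdc1551):
        return "ii"                      # queries differ, one matches BCG
    return "other"                       # all three differ (not covered by the text)


for bases in [("A", "G", "G"), ("A", "A", "G"), ("A", "A", "A")]:
    print(bases, snp_kind(*bases))
```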

The SNPs thus identified were categorized according to their location in Open Reading Frames. SNPs falling within the ORF of both BCG and H37Rv were identified. The results were validated by determining if the SNPs were present in the ORFs of BCG and CDC1551.

The SNPs falling in ORFs were further categorized into synonymous and non-synonymous SNPs. A SNP was said to cause a non-synonymous change if:

1) It occurs in an ORF

2) It occurs in the *same* ORF in the genome it is being compared to.

In some cases a SNP can be in one ORF in the reference sequence but in another ORF in the comparison sequence, e.g. due to a frame-shift mutation earlier in the sequence. So, before SNPs were assigned to the 'Non Synonymous' or 'Synonymous' groupings, all SNPs which either did not fall in an ORF, or fell into different ORFs on the reference and comparison sequences, were eliminated. The BCG and H37Rv genomes have been annotated with respect to one another. However, CDC1551 has not been so thoroughly annotated, so it was not possible to immediately assess whether an ORF in BCG was the corresponding ORF in CDC1551. Therefore, a metric was devised to eliminate spurious comparisons.

The non-synonymous SNPs thus identified were analysed to predict conservative and non-conservative amino acid substitutions. The effect of each substitution on the function of the encoded protein was predicted. This helps to predict which SNPs correlate with virulence and infectivity in M. tuberculosis.
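One way to separate conservative from non-conservative substitutions is to compare the physico-chemical class of the two amino acids. The sketch below assumes one common five-class grouping; this description does not state which grouping or scoring scheme was used, so the classes, the `GROUPS` table and the function name are illustrative assumptions only.

```python
# Assumed physico-chemical classes (one common textbook grouping).
GROUPS = {
    "nonpolar": "GAVLIMP",
    "aromatic": "FWY",
    "polar": "STCNQ",
    "positive": "KRH",
    "negative": "DE",
}
CLASS_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def classify_substitution(ref_aa: str, query_aa: str) -> str:
    """Label an amino acid change; '*' denotes a stop codon."""
    if ref_aa == query_aa:
        return "synonymous"
    if query_aa == "*":
        return "truncation"              # premature stop in the query
    if CLASS_OF.get(ref_aa) is not None and CLASS_OF.get(ref_aa) == CLASS_OF.get(query_aa):
        return "conservative"
    return "non-conservative"


print(classify_substitution("L", "I"))
print(classify_substitution("D", "K"))
print(classify_substitution("W", "*"))
```

A substitution matrix such as BLOSUM62 could be used in place of fixed classes; the choice changes which borderline substitutions count as conservative.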

Below is an example of the output obtained from the database.

The above figure describes the SNP details, which are as follows:

■ Bovis_pos - Bovis position having a SNP.

■ Bovis_ORF - Yes indicates that the SNP in bovis is in a bovis ORF. No indicates that it is not in an ORF.

■ Bovis_base - Indicates the bovis base at the SNP position.

■ Bovis_AA - Displays the bovis amino acid after the codon translation.

■ Qry_name - Displays the name of a strain, for example H37Rv or microti.

■ Qry_pos - Displays the position of a SNP in either CDC1551 or H37Rv with respect to the bovis SNP position.

■ Qry_ORF - Displays Yes if the SNP falls in an ORF of the query (H37Rv or CDC1551).

■ Qry_base - Displays the query SNP.

■ Qry_AA - Displays the amino acid of the query (H37Rv or CDC1551).

■ Is_nsSNP - Displays whether the SNP is synonymous (S), non-synonymous (NS) or in a non-coding region (NC).

■ Conservative_subst - Displays homologous substitution in H37Rv and CDC1551.

■ Fun_annotation - Displays the functional annotation of the query.

A list of Single nucleotide polymorphisms identified in the manner described above is given in Table 1.

A total of 1829 SNPs have been identified in the three genomes. Of these, 1825 SNPs have the same nucleotide in H37Rv and CDC1551, with a different nucleotide in BCG. Of the 1829 SNPs, 1579 are in ORFs while the rest (246) are in non-coding regions. 811 H37Rv SNPs and 810 CDC1551 SNPs are synonymous, while 1282 H37Rv and 1219 CDC1551 SNPs are non-synonymous. Out of 1219 CDC1551 nsSNPs, 312 have a conservative amino acid substitution, 888 have a non-conservative substitution and 19 result in truncated proteins. Out of 1282 H37Rv non-synonymous SNPs, 304 have a conservative amino acid substitution, 954 have a non-conservative substitution and 24 result in truncated proteins. (Figure 2)

Analysis of indels (insertions and deletions):

Indels are insertions and deletions in the sequence with respect to the BCG sequence. These indels can be of one or more nucleotides. Taking BCG as the reference sequence, the indels in both strains of M. tuberculosis, H37Rv and CDC1551, were identified.
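The notion of an indel with respect to a reference can be made concrete with a small sketch that scans a gapped pairwise alignment for gap runs. It assumes the alignment is available as two equal-length strings using '-' for gaps, which is a simplification and not the MUMmer .align format; the function name `find_indels` and the toy sequences are hypothetical.

```python
from typing import List, Tuple


def find_indels(ref_aln: str, qry_aln: str) -> List[Tuple[str, int, int]]:
    """Return (kind, ref_pos, length) for each gap run in a pairwise alignment.

    'insertion': bases present in the query but absent from the reference;
                 ref_pos is the 1-based reference base just before the event.
    'deletion' : reference bases missing from the query;
                 ref_pos is the 1-based position of the first missing base.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    indels = []
    ref_pos = 0                               # reference bases consumed so far
    i, n = 0, len(ref_aln)
    while i < n:
        r, q = ref_aln[i], qry_aln[i]
        if r == "-" and q != "-":             # gap in reference: insertion
            j = i
            while j < n and ref_aln[j] == "-" and qry_aln[j] != "-":
                j += 1
            indels.append(("insertion", ref_pos, j - i))
            i = j
        elif q == "-" and r != "-":           # gap in query: deletion
            j = i
            while j < n and qry_aln[j] == "-" and ref_aln[j] != "-":
                j += 1
            indels.append(("deletion", ref_pos + 1, j - i))
            ref_pos += j - i
            i = j
        else:                                 # aligned column (match or mismatch)
            if r != "-":
                ref_pos += 1
            i += 1
    return indels


# Toy alignment for illustration only.
print(find_indels("ACG--TACGT", "ACGTTTA--T"))
```

The length returned for each event could then be used to separate short indels from long polymorphic regions, using whatever length cut-off is chosen.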

To insert the indels from the .align file of the mummer output into the database, the following Java program is run:

$ java /work/mtb/scripts/indel

To enter functional annotation from the gene ontology database into the indels table, the following program is run:

$ java /work/mtb/scripts/indfunction

The list of indels identified is given in Table 2.

A total of 794 indels have been identified in the three genomes. Of these, 237 (H37Rv) and 237 (CDC1551) indels are present in both H37Rv and CDC1551 with respect to BCG. Of these, 178 are in ORFs and 59 are outside ORFs. (Figure 2)

Analysis of Long polymorphs:

Long polymorphs are insertions or deletions of long stretches of nucleotides with respect to BCG sequence.

To insert the long polymorphs from the .align file of the mummer output into the database, the following Java program is run:

$ java /work/mtb/scripts/indel

To enter the functional annotation from the gene ontology database into the long polymorph table, the following Java program is run:

$ java /work/mtb/scripts/indfunction

A table listing the long polymorphisms is given in Table 3.

A total of 136 long polymorphisms have been identified in the three genomes. Of these, 30 (H37Rv) and 30 (CDC1551) long polymorphisms are present in both H37Rv and CDC1551 with respect to BCG. Of these, 22 are in ORFs and 8 are outside ORFs. (Figure 3)

Functional annotation of the polymorphisms identified

In order to identify polymorphisms with a putative functional association, a tool was built using the Gene Ontology DB (GO). The EMBL sequence DB has made putative GO assignments to most of the ORFs in the three TB genomes, so a local installation of GO was used together with the EMBL cross reference tables to identify TB polymorphisms based on their putative functional classification.

An annotation table was constructed, consisting of the GenBank features of the genes, such as coding region, database reference and product information, to name a few.

To insert the gene ontology features, such as term definition and name, from the gene ontology database into the indels and long polymorph tables, the following program is run:

$ java /work/mtb/scripts/indfunctionl

The following is the list of attributes in the annotation table.

Accession no - This indicates the accession number of the sequences

Gene_start - This indicates the start of the coding region

Gene_end - This indicates the end of the coding region

Locus_tag -

db_xref - This indicates the gene indices representation of the gene

db_xref_GOA - This indicates the gene ontology identity of the gene

product id - This indicates the gene annotation

type -

strand - This indicates the forward or reverse strand of the sequence that is stored in GenBank

gene_name - This indicates the gene name

gene_link - This provides a hyperlink to the gene features from GenBank

note - This provides the general information and the protein information of the gene.

A front-end was constructed as an essential part of the database.

Front end of the database:

The front-end displays the results of the alignment as follows:

The annotation table consists of GenBank annotation about the genes in bovis, H37Rv and CDC1551. It specifies details including the coding region of a gene and its database reference.

The annotation id for the SNPs, indels and long polymorphs has been hyperlinked to obtain all the records pertaining to a particular gene.

The data pertaining to indels and long polymorphs have also been added to the front-end.

Description of the queries:

The database is made queryable to retrieve the required features of SNPs, indels and long polymorphs.

The main options to query the SNP information are:

Select SNPs

■ ALL - This displays all the records which satisfy the features below.

■ Identical in both queries - This query indicates that SNPs are present in BCG with respect to H37Rv and CDC1551.

■ Different bases in both queries - This query indicates different nucleotides in H37Rv and CDC1551.

■ Having SNPs in BCG-H37 only - This query specifies SNPs in BCG and H37Rv only and not in CDC1551.

■ Having SNPs in BCG-CDC only - This query specifies SNPs in BCG and CDC1551 only and not in H37Rv.

■ BCG-H37 SNPs - This query indicates that SNPs are present in H37Rv with respect to the BCG position and may or may not be present in CDC1551 at that particular position.

■ BCG-CDC SNPs - This query indicates that SNPs are present in CDC1551 with respect to the BCG position and may or may not be present in H37Rv at that particular position.

The other options considered are:

■ Select BCG ORF - This provides an option to select the presence of BCG SNPs in BCG ORF or outside the BCG ORF.

■ Select query ORF - This provides an option to select the presence of query SNPs in query ORF or outside the query ORF.

■ Select synonymous - This provides an option to select if the SNP is synonymous or non-synonymous.

■ Select Conservative - This provides an option to select if the non-synonymous SNP results in conservative, non-conservative substitution or truncated protein.

■ Select function - This provides an option to select a required function, which includes cell wall synthesis, Transcription factor, Lipid metabolism, Membrane transport and Surface proteins.

An example of a query to extract SNP information from the database is shown below.

The result obtained from the above query is shown below:

The query has been designed in a similar way for both indels and long polymorphs.
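The query options described above could be translated into a parameterized SQL filter along the following lines. This is a sketch only: the database engine is not named in this description, so an in-memory SQLite database is assumed, and the table layout, the `build_query` helper and the rows are hypothetical examples modelled on the SNP output fields listed earlier.

```python
import sqlite3


def build_query(bovis_orf=None, is_nssnp=None, conservative=None, function=None):
    """Build a parameterized SELECT over a hypothetical SNP_analysis table."""
    clauses, params = [], []
    options = (
        ("Bovis_ORF", bovis_orf),
        ("Is_nsSNP", is_nssnp),
        ("Conservative_subst", conservative),
        ("Fun_annotation", function),
    )
    for column, value in options:
        if value is not None:              # unset options do not restrict the result
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT Bovis_pos, Qry_name FROM SNP_analysis"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY Bovis_pos, Qry_name", params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SNP_analysis (Bovis_pos INTEGER, Bovis_ORF TEXT, Qry_name TEXT, "
    "Is_nsSNP TEXT, Conservative_subst TEXT, Fun_annotation TEXT)"
)
# Made-up rows for illustration; they are not results from the database.
rows = [
    (100, "Yes", "H37Rv", "NS", "No", "Lipid metabolism"),
    (100, "Yes", "CDC1551", "NS", "No", "Lipid metabolism"),
    (250, "Yes", "H37Rv", "S", None, "Cell wall synthesis"),
    (400, "No", "CDC1551", "NC", None, None),
]
conn.executemany("INSERT INTO SNP_analysis VALUES (?, ?, ?, ?, ?, ?)", rows)

sql, params = build_query(bovis_orf="Yes", is_nssnp="NS")
result = conn.execute(sql, params).fetchall()
print(result)
```

Only values are passed as bound parameters; the column names come from a fixed list inside the helper, so user input never reaches the SQL text directly.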

The SNP analysis includes a functional annotation id, which is hyperlinked to the functional annotation of the gene carrying the polymorphism. The functional annotation id is one of the Swiss-Prot, SPTrEMBL or gene ontology ids. Similarly, the indels and long polymorphs are also functionally annotated.

Genes with known involvement in virulence of Mycobacterium tuberculosis can also be accessed from the SNP database query or from the Long polymorphs database query respectively.

Polymorphisms involved in the following functions have been identified:

1. Cell wall synthesis

2. Transcription factor

3. Lipid metabolism

4. Membrane transport

5. Surface proteins.

6. Virulence genes

One such query for cell wall synthesis function is shown below

The output of the above query is shown below

The polymorphisms detected in genes involved in cell wall synthesis are listed in Table 4.

Visualization tools

To increase the utility of the SNP data, two tools to visualize the Tuberculosis SNP data have been created: the first tool was based on the Generic Genome Browser developed at Cold Spring Harbor Lab (CSHL). This visualization tool could show a single TB genome along with any annotations, e.g. SNP locations for all other genomes.

The details of the browser are as follows:

■ The output displays the polymorphs in the region of interest.

■ Alternatively, the output can be obtained by specifying the region of interest in the text box labeled "landmark or region". In the case of SNPs, the gene start and the gene end have to be specified, and in the case of indels or long polymorphs, the BCG start and BCG end must be specified.

■ By clicking the ruler at the region of interest across the genome, the view can be re-centered.

■ The display can also be zoomed in or out by selecting the required number of base pairs in the scroll-down menu.

The required features can be displayed by selecting the options in the tracks checkbox as shown in Figure 4. Figure 4 shows a region of 10 kb of the BCG genome with three types of annotations: BCG ORFs, SNPs in H37Rv, and SNPs in CDC1551.

To compare multiple genomes, a second tool based on the WormBase synteny browser was built. This tool can visualize two TB genomes at one time and was very useful in validating the polymorphisms in the CDC1551 genome, as shown in Figure 5.

Figure 5 shows the comparative genomics browser displaying BCG in the upper panel and H37Rv in the bottom panel. The segments labeled MUM-* are the perfect matches generated by the MUMmer tool, and the vertical lines show the alignment of the MUM segments in both genomes. The color coding of the ORFs is used to indicate the length of the ORF. This is very helpful to researchers because if an ORF in H37 aligns with an ORF in BCG but they have different colors, then there is a mutation that makes them have different lengths (see for example the genes in the MUM-1280 region).

A methodical screening of all the regions of polymorphism identified above, in clinical isolates with known disease profiles, is in progress to further home in on the polymorphisms associated with virulence and/or infectivity in M.tuberculosis.

2. Screening of regions of polymorphisms

A set of five Mycobacterium tuberculosis strains with known virulence is being screened for the polymorphisms identified above.

Strains chosen: The following strains have been chosen for the study:

a. H37Rv - a reference laboratory strain known to be infective to mice, but only mildly infective in humans. It has undergone a number of passages in the lab since its isolation. It is the standard used in studies on tuberculosis in different laboratories across the world.

b. Beijing strain - a clinical isolate with known virulence and infectivity in humans. 70% of the patients with tuberculosis in certain areas of India and China are infected with this strain. The strain was isolated from a patient in the Western Indian state of Mumbai.

c. S.I - a South Indian strain with only mild virulence and infectivity in humans, isolated from a patient residing in the South Indian state of Hyderabad.

d. N.I.F - a fatal North Indian strain isolated at Safderjung hospital, Delhi, where the patient developed pulmonary tuberculosis and died.

e. N.I.NF - a non-fatal North Indian strain isolated at Safderjung hospital, Delhi. The clinical progression of disease in the patient is known.

Primers have been designed to encompass the regions of polymorphisms. The list of the primers used for the amplification is given in Fig. 6.1-6.25.

Amplification and sequencing of regions around the polymorphisms: DNA from the five strains has been amplified under optimal conditions determined for each primer pair. The amplified fragments have been sequenced and the sequences obtained from different strains compared.

A few examples are given below:

Sequencing of the region from H-590622 to H-591026. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; NINF: non-lethal North Indian strain; BS: Beijing strain; NIF: Lethal North Indian strain. The gene coding for oxidoreductase activity is a virulence gene which does not show any differences between the M.tuberculosis strains, but has a conservative polymorphism with M.bovis BCG.

Sequencing of the region from H-138548 to H-139067. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: Lethal North Indian strain. The insertion in BCG leads to a shorter protein with a different carboxyl terminal compared to the transcription factor encoded by the tuberculosis strains.

Sequencing of the region from H-3283171 to H-3283585. Two SNPs, one indel and a long polymorphism characterize this region. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. All the polymorphisms occur in fadD28, a virulence gene involved in fatty acid synthesis. They result in a non-conservative substitution and probably have an important role in the degree of virulence imparted to the strain.

Sequencing of the region from H-2051784 to H-2052209. This region is characterized by a SNP between M.bovis BCG and the tuberculosis strains and a second SNP common to the Asian strains and to BCG, but different from H37Rv and CDC1551. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. The SNP common to all the tuberculosis strains results in a conservative substitution in the PPE33b gene and does not affect the function of this gene. However, the A to G substitution results in the truncation of the protein encoded by BCG.

Sequencing of the region from H-3006917 to H-3007246. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; M18: non-lethal North Indian strain. This region encloses a long polymorphism of 106 bp inserted into a gene encoding an integral membrane protein in BCG and the Asian strains. This results in a longer integral membrane product in these strains as compared to H37Rv and CDC1551. The SNP also results in the introduction of a stop codon in H37Rv and CDC1551, further reducing the length of the membrane protein encoded by the latter.

Sequencing of the region from H-3247737 to H-3248224. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. All the polymorphisms observed occur in ppsA, the polyketide synthase gene, and are synonymous substitutions. All three Asian strains show identity to BCG in this region.

Sequencing of the region from H-2052524 to H-2052863. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: Lethal North Indian strain. A single nucleotide polymorphism occurring in the proton transport gene PPE33b results in the introduction of a stop codon and hence truncation of the protein in BCG.

Sequencing of the region from H-1468644 to H-1469150. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. An insertion of 47 bp is seen in all the tuberculosis strains in Mb1346c, a gene with DNA binding activity. A second polymorphism (SNP) is also seen immediately adjacent to the insertion in the same gene. The SNP results in splitting the gene into two genes while there is a single long gene in the M.tuberculosis strains.

Sequencing of the region from H-455094 to H-455468. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. The region is characterized by the occurrence of two indels and two SNPs in a transcription regulator. All the tuberculosis strains appear to be identical in this region while BCG has a different amino-acid sequence in the region.

Sequencing of the region from H-466229 to H-466536. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: Lethal North Indian strain. The C to T transition occurs in a gene of unknown function and results in a synonymous substitution. However, the C to A change occurs in a transcription factor (Mb0393) and is a non-conservative substitution resulting in a slightly different protein in BCG.

Sequencing of the region from H-560625 to H-561248. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: Lethal North Indian strain. A synonymous SNP occurs in a virulence gene and is identical in all the tuberculosis strains.

Sequencing of the region from H-2046394 to H-2046928. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: Lethal North Indian strain. The SNP in BCG results in splitting the gene PE-PGRS32 into two parts with the latter being truncated.

Sequencing of the region from H-1373629 to H-1374101. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M. tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: Lethal North Indian strain. The two polymorphisms observed occur in a transcription factor and result in non-conservative substitutions.

Sequencing of the region from H-1622821 to H-1623282. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: North Indian Fatal. The polymorphisms observed occur in a non-coding region outside the ORF.

Sequencing of the region from H-2295752 to H-2296046. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. The polymorphism observed occurs in the pks12 gene and results in a non-conservative substitution.

Sequencing of the region from H-3086111 to H-3086539. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. The SNP seen in H37Rv occurs in a non-coding region while the deletion in BCG leads to truncation of the transcription regulator.

Sequencing of the region from H-2295062 to H-2295633. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; A2313: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: North Indian Fatal. The SNP observed occurs in the pks12 gene and results in a non-conservative substitution.

Sequencing of the region from H-162341 to H-162761. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain. The deletion in BCG occurs in the region corresponding to a gene with putative enzyme activity and results in a loss of function in BCG.

Sequencing of the region from H-1478664 to H-1479140. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: North Indian Fatal. The first T to C transition results in the truncation of the bacterial regulatory protein in BCG.

Sequencing of the region from H-2296260 to H-2296692. Sequences are amplified from different strains. BCG: M.bovis BCG; H37Rv: M.tuberculosis strain H37Rv sequence from NCBI database; CDC: CDC1551; S.I: South Indian strain A2313; BS: Beijing strain; NINF: non-lethal North Indian strain; NIF: North Indian Fatal strain. The long polymorphism observed occurs in the pks12 gene but does not alter the activity of the polyketide synthase enzyme.

A total of 2755 polymorphisms including 1779 in ORFs and 313 in regions outside the ORF are being screened for association to virulence and/or infectivity in tuberculosis. A multicomponent analysis to determine the association of polymorphism to the degree of virulence and infectivity is in progress. The polymorphisms which constitute a set of virulence markers are further being validated in 120 clinical isolates of tuberculosis.

The virulence factors thus identified could be used:

i. As diagnostic markers in prediction of disease and its progress in the patient.

ii. As drug targets for development of new and effective treatments for TB.

iii. As candidate genes/sequences in DNA vaccines.

iv. In development of siRNA technology for combating tuberculosis.

Claims

1. Nucleotide sequences for diagnosis, identification of the strains, typing of the strains and giving orientation to its potential degree of virulence, infectivity and/or latency of all infectious diseases having a SEQ ID nos 1 to 2531

2. Nucleotide sequences as claimed in claim 1 for diagnosis, identification of the strains, typing of the strains and giving orientation to its potential degree of virulence, infectivity and/or latency of all strains of Mycobacteria having SEQ ID nos 1 to 2531.

3. Nucleotide sequence as claimed in claim 1 or 2 wherein the said sequence is a single nucleotide polymorphism having SEQ ID Nos. 1 to 1829

4. Nucleotide sequence as claimed in claim 1 or 2 wherein the said sequence is an insertion/deletion (indel) having SEQ ID Nos.1830 to 2286

5. Nucleotide sequence as claimed in claim 1 or 2 wherein the said sequence are regions of long polymorphism having a SEQ ID No 2287 to 2531.

6. Primer sequences for amplifying the region around the polymorphism SEQ ID nos 1 to 2531

7. Nucleotide sequences flanking the polymorphisms of SEQ ID Nos. 1 to 2531 as claimed in claim 1 to a length of 35 nucleotides on either side for use in reagents and in diagnostics, drug development, RNAi, gene therapy and other such technologies.

8. Use of the sequences encompassing nucleotide sequence having a SEQ ID Nos 1 to 2531 as targets for drug design using bioinformatics and other tools, drug development, for gene therapy and vaccine development

9. Use of the sequences encompassing single nucleotide polymorphism having a SEQ ID Nos 1 to 1829 as claimed in claim 3 as targets for drug design using bioinformatics and other tools, drug development, for gene therapy and vaccine development

10. Use of the sequences encompassing insertion/deletion (indel) having a SEQ ID Nos. 1830 to 2286 as claimed in claim 4 as targets for drug design using bioinformatics and other tools, drug development, for gene therapy and vaccine development

11. Use of the regions of long polymorphism having a SEQ ID Nos. 2287 to 2531 as claimed in claim 5 as targets for drug design using bioinformatics and other tools, drug development, for gene therapy and vaccine development

12. Use of proteins, RNA, DNA and metabolites encoded by the region carrying the polymorphisms having a SEQ ID NOs. 1 to 2531 as claimed in claim 1 for drug design using bioinformatics and other tools, for development of drugs effective against infectious diseases including tuberculosis.

13. Use of proteins, RNA, DNA and metabolites encoded by the region carrying the polymorphisms having a SEQ ID NOs. 1 to 2531 as claimed in claim 1 for vaccine development against infectious diseases including tuberculosis.

14. Use of proteins, RNA, DNA and metabolites encoded by the region carrying the polymorphisms having a SEQ ID Nos. 1 to 2531 as claimed in claim 1 for RNAi technology and antisense technologies.

15. The method for generating and developing a database for identification and selection of the polymorphisms having SEQ ID nos 1 to 2531 as claimed in claim 1

16. A method as claimed in claim 15 wherein the said database is generated using the algorithms as herein described

17. Use of the database as claimed in claim 15, for identification of the polymorphisms across organisms.

18. A diagnostic kit for diagnosis, identification of the strain, typing of the strain and giving orientation to its potential degree of virulence, infectivity and/or latency of all infectious diseases having a SEQ ID nos 1 to 2531 as claimed in claim 1

19. A diagnostic kit as claimed in 19 for diagnosis, identification of the strain, typing of the strain and giving orientation to its potential degree of virulence, infectivity and/or latency of all strain of Mycobacteria having SEQ ID nos 1 to 2531 as claimed in claim 1

20. A diagnostic kit as claimed in 19 wherein the said sequence is a single nucleotide polymorphism having SEQ ID Nos. 1 to 1829 as claimed in claim 3

21. A diagnostic kit as claimed in 19 wherein the said sequence is an insertion/deletion (indel) having SEQ ID Nos.1830 to 2286 as claimed in claim 4

22. A diagnostic kit as claimed in 19 wherein the said sequence are regions of long polymorphism having a SEQ ID No 2287 to 2531 as claimed in claim 5

23. Use of nucleotide sequences having a SEQ ID nos 1 to 2531 as claimed in claim 1 as probes in assay methods for the identification of strains for infectious diseases including mycobacterium

24. Use as claimed in claim 23 wherein the said sequence is a single nucleotide polymorphism having SEQ ID Nos. 1 to 1829

25. Use as claimed in claim 23 wherein the said sequence is an insertion/deletion (indel) having SEQ ID Nos.1830 to 2286

26. Use as claimed in claim 24 wherein the said sequence are regions of long polymorphism having a SEQ ID No 2287 to 2531.