Identifying Active Molecules using Physico-chemical Parameters
Field of the invention
The present invention relates to methods and apparatus for identifying physico-chemical or topological parameters of molecules which are associated with biochemical activity (e.g. molecules which are suitable for use as a pharmaceutical for a certain task). The invention further relates to molecules which are identified as being active or potentially active on the basis of the identified physico-chemical or topological parameters.
Background of the invention
Modern high-speed computational techniques are having an ever greater influence on the design of new drugs. In particular, a variety of techniques have been proposed to identify computationally compounds which are likely to have a certain biochemical activity (that is, to be suitable for a particular biological use, e.g. a pharmaceutical use). The purpose of this is that in vitro or in vivo testing can subsequently be carried out on those molecules which have a relatively high likelihood of being biologically active. That is, in vitro and in vivo testing (which are both relatively expensive to carry out) can be concentrated on drugs which have the maximal chance of being active.
Some such computational techniques for predicting the activity of molecules require and exploit a biochemical understanding of why a certain molecule is active. For example, such a method may create a candidate molecule purely as a representation within a computer. Such a molecule, which does not yet necessarily have physical existence, will be referred to herein as a "virtual molecule". The method may then attempt to predict the activity of the virtual molecule by molecular modelling, taking into account biochemical understanding of the role the molecule has to play in order to be active (for example, what chemical bonds the molecule would have to be capable of forming).
By contrast, so-called QSAR (Quantitative Structure-Activity Relationship) methods attempt to infer whether a given molecule is active, without specific biochemical understanding. These methods make use of the fact that certain molecules ("lead compounds") may already be known to exhibit the desired activity, at least to a certain degree. The activity of a new molecule can then be inferred based on a comparison between measured and/or calculated physico-chemical properties of the lead compounds and the new molecule. In other words, in QSAR the activity of a new candidate molecule is predicted not on the basis of biological insight, or at least not exclusively on that basis, but rather by comparing known (or easily derivable) physico-chemical properties of the candidate molecule to the properties of lead compounds.
In this document the term physico-chemical parameters will be used to include any physical or chemical property of a molecule, including topological parameters of the molecule such as its folding conformations. It includes both properties which are "static" (at least in a time-averaged sense) such as the dipole moment of a molecule, and also "dynamic" properties of a molecule, such as ones characterising the range of conformations through which the molecule may flex over a period of time. In the case of some molecules, the flexing of the molecule over time can be determined with high accuracy using modern molecular modelling techniques. Each physico-chemical parameter of a molecule partially describes that molecule, and for this reason in this document the term "descriptor" will be used interchangeably with the term "physico-chemical parameter".
The pioneering work initiating the QSAR concept was by Hansch et al. (1962).1 This work demonstrated that biological activity can be linked quantitatively to some physico-chemical parameters and it also introduced the idea that the activity can be described by more than one parameter (i.e. multiple regression). Following this, some methods have been developed using regression analysis such as partial least squares and multiple linear regression analysis.2,3 Other QSAR multivariate statistical methods use principal component analysis4 and discriminant analysis.5 These methods proceed by reducing the space of initial variables to a two- or three-dimensional space. They allow the classification of the compounds by synthetic linear variables and link the obtained classes to the observed activity.
Although these techniques appear extremely useful, transparent and easy to interpret, they present two major disadvantages.
First, they assume the continuity of the descriptor space (i.e. information carried by the physico-chemical descriptors) with the biological space (biological activity information).
Second, the activity of the molecules is taken as a linear combination of the molecular physico-chemical descriptors.
The continuity between chemical and biological space is only true if we consider molecules in the same chemical series of compounds. Such a series is based on a definition of a common skeleton, with a general shape and fundamental functionalities. The members of these series are then generated by relatively small variations on this common theme. Thus, the use of chemical series restricts the above QSAR techniques to modelling relatively small and continuous variations of activity over members of the series.6 It is therefore not surprising that QSAR models can predict the activity of these series reasonably well by combining their physico-chemical parameters linearly.
However, in a more general class of compounds (e.g. a biological set of compounds), we usually find a discontinuity between the chemical and biological spaces. This arises from the fact that these series contain a collection of different chemical species sharing the same biological message (i.e. same mechanism of action) but which do not obey the common skeleton rule. Consequently, the classical linear QSAR methods are not able to provide accurate predictions of the activity of biological series of compounds. Furthermore, the information derivable from such approaches is offset by the low robustness of the methods: the constraints (in terms of the range of molecules) within which the predictions are accurate are very strict. Thus, these known QSAR methods, even if they are adequate for lead optimization, are clearly unsatisfactory for de novo design, especially when molecular diversity exploration is a concern.
To deal with the complexity of relationships between biologically active molecules, some known non-linear QSAR methods are based on optimization algorithms using artificial neural networks or genetic algorithms. These methods attempt to produce quantitative models which are more robust (valid over a wider range of compounds) at the expense of transparency.
Recently Grassy et al10 (a citation which is incorporated herein by reference) have suggested a technique which significantly improves on the above QSAR methods. Whereas in QSAR the function f is derived only from lead compounds which are biologically active, Grassy et al also exploited lead compounds which were known to be inactive. Also, whereas classical QSAR uses a simple linear function, Grassy et al employed a more rigorous understanding of the relationship between activity and the space of physico-chemical parameters.
Specifically, rather than combining a number of different descriptors into a single linear function, they determined for each descriptor a range which was associated with biological activity. For example, for the task of causing immunosuppressive activity in a certain environment, it turned out that having a dipole moment in the range 34.23 to 80.79 was associated with activity (i.e. most molecules (e.g. of a certain class) which were already known to be active had a dipole moment in this range, whereas most molecules which were already known to be inactive had a dipole moment outside this range). Candidate molecules having a descriptor value in this range thus have a higher likelihood of biological activity. A candidate molecule for which the values of each of several descriptors are inside a respective range associated with activity is predicted to be active. Grassy et al only used descriptors which can be calculated by computer. This meant that a very large number of virtual molecules could be screened to determine whether they fall into the descriptor ranges, without it being necessary to actually fabricate them. By analogy with in vitro and in vivo screening, Grassy et al refer to their technique as "in silico" screening.
Due to the non-linearity of the structure-activity models of Grassy et al (the descriptor ranges have hard limits), their techniques are particularly applicable to predicting the activity of compounds which are not in the same chemical class as the lead compounds.
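By way of a non-limiting illustration, the range-based screening described above can be sketched in a few lines of Python. Only the dipole moment range (34.23 to 80.79) is taken from the text; the second descriptor, its range, the candidate values and all names (ACTIVITY_RANGES, passes_filters) are hypothetical and serve only to show the hard-limit character of the filter.

```python
from typing import Dict, Tuple

# Descriptor ranges associated with activity. The dipole moment range is the
# one quoted in the text; "log_p" and its range are invented for illustration.
ACTIVITY_RANGES: Dict[str, Tuple[float, float]] = {
    "dipole_moment": (34.23, 80.79),
    "log_p": (-1.0, 3.0),  # hypothetical descriptor and range
}


def passes_filters(descriptors: Dict[str, float],
                   ranges: Dict[str, Tuple[float, float]] = ACTIVITY_RANGES) -> bool:
    """Predict 'active' only if every descriptor lies inside its range."""
    return all(low <= descriptors[name] <= high
               for name, (low, high) in ranges.items())


# Hypothetical virtual molecules described only by calculated descriptors.
candidates = {
    "virtual_1": {"dipole_moment": 50.0, "log_p": 1.2},
    "virtual_2": {"dipole_moment": 90.0, "log_p": 1.2},
}
predicted_active = [name for name, d in candidates.items() if passes_filters(d)]
```

Because each range has hard limits, a candidate failing a single filter is rejected regardless of its other descriptor values, which is what makes the model non-linear.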
Note that in vitro or in vivo screening allow one to derive relatively valuable information about a molecule, but only at the considerable expense of chemically fabricating it. By contrast, a single descriptor value obtained computationally tends to give relatively little information about the activity of the molecule. In other words, to achieve accurate predictions of activity several descriptors must be calculated, and information from them combined. Grassy et al showed that combining computationally-calculated descriptor data relating to both static and dynamic descriptors permits biological activity to be predicted relatively well without the need for chemical fabrication. This therefore permits a large number of chemicals to be screened, compared to in vivo or in vitro screening.
Even if we restrict ourselves to descriptors which can be calculated for virtual molecules, the number of possible descriptors which can be envisaged is almost limitless. If a large number of virtual molecules are to be screened, a selection must be made of which descriptors to use so that the computational time required to calculate the descriptors for each molecule does not rise excessively. In addition to, or instead of, computationally calculable descriptors, many known QSAR methods use descriptors which can only be measured by chemical testing, and again a great variety of such descriptors can be envisaged. Therefore any QSAR technique must choose which descriptors to use. This choice may be made on a case-by-case basis, on the basis of biochemical intuition or entirely at random. Preferably the physico-chemical parameters which are chosen are not highly correlated with each other (e.g. the values of any two descriptors are not highly correlated as measured over a large number of molecules) so that there is maximum information per descriptor value.
Attempts to rationally select descriptors have been made before, for example by So and Karplus24 who employed a "genetic algorithm" (GA). GA is one of the recent optimization methods based on the natural selection concept.8,13-17 To perform a genetic algorithm, two requirements must be satisfied: the encoding of the possible solution of the problem, and the quantitative evaluation of a given solution by an objective function. So and Karplus24 considered the problem of selecting descriptors from a large space of D descriptors (including both descriptors which can be calculated and descriptors which must be found by chemical testing). They considered a string of D components (i.e. with a component for each respective descriptor), each component being either 1 (to indicate that the descriptor is worth using in QSAR) or -1 (to indicate that the descriptor should not be used). Selection of a set of descriptors is encoded as optimising the string within these constraints. To do this, they defined an objective function of the string (i.e. a measure of the badness of the string), and tried to minimise the objective function with respect to possible strings by a GA.
To define an objective function, they constructed for any string a neural network which uses as inputs the descriptors which are +1 in the string, and which is trained to predict biochemical activity of certain lead compounds ("the learning set"). The objective function is then defined as the error rate of the neural network in predicting the activity of other lead compounds ("the test set"), i.e. a cross-validation. In other words the objective function measured the quality of a certain set of descriptors on the basis of the error rate in predicting biological activity by a neural network using those descriptors. A major disadvantage of this technique is that the time taken to train a neural network is very long. Thus, each evaluation of the objective function is computationally expensive. This means that for the GA to work in a reasonable time the total number of descriptors from which the method selects a subset of descriptors must be small. So and Karplus began with only 10 descriptors (of which 3 were calculable and 7 were experimental) and selected 6 using 57 lead compounds. Even this took A days of CPU time on a fast workstation. Using the 6 selected descriptors (including descriptors which can only be evaluated by experiment), So and Karplus obtained 93% prediction accuracy.
However, as pointed out above, to achieve screening in silico only descriptors which can be determined by calculation can be used, and for reasonable prediction success the number of such descriptors must be greater than 6 (e.g. the best 10 descriptors out of the 100 calculable descriptors which are most easily envisaged). For this reason the technique of So and Karplus is inapplicable to selection of descriptors for in silico screening.
Summary of the present invention
The present invention aims to permit a faster rational selection of descriptors, in particular a selection of descriptors which is suitable for use in in silico screening.
The invention may further aim to identify active candidate molecules on the basis of the selected descriptors.
In its most general terms the present invention proposes that a subset of descriptors (e.g. about 10) is selected from a larger set of possible descriptors (e.g. at least 100) based on the statistical significance of correlations between the values of those descriptors for active lead compounds.
For example, in the space of a certain subset of descriptors, the descriptor values of the active lead compounds may be highly correlated (e.g. in comparison to the inactive molecules). This set of values is said to form a tight cluster in the space. The statistical significance of this cluster can be quantified, and this provides a very efficient method of deciding whether the set of descriptors is associated with biochemical activity (i.e. whether the set of descriptors well encodes the biochemical significance of the lead compounds).
This cluster significance analysis (CSA) can be performed relatively quickly (e.g. much more quickly than building a neural network), and therefore descriptors associated with the activity can be selected quickly, even from a large set of possible descriptors. The concept of considering the clustering of the active molecules among the lead compounds (e.g. correlations between the inactive molecules may optionally just be used to evaluate the statistical significance of the clustering of the active ones) is based on the biochemical realisation that there are a variety of reasons why inactive molecules fail, and thus the clustering of the active molecules is of greater significance than clustering of the inactive ones.
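As a hedged illustration of how the significance of a cluster of active compounds might be quantified, the following Python sketch compares the mean squared distance among the actives with that of randomly drawn groups of lead compounds of the same size. The text does not fix a particular statistic at this point, so the use of mean squared distance, the random-sampling estimate, and all names and data below are assumptions rather than the method of the invention.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def mean_squared_distance(points: Sequence[Point]) -> float:
    """Mean squared Euclidean distance over all pairs of points."""
    total, pairs = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            pairs += 1
    return total / pairs


def cluster_significance(values: List[Point], active_idx: List[int],
                         n_trials: int = 2000, seed: int = 0) -> float:
    """Estimate the probability that a random group of lead compounds of the
    same size is at least as tightly clustered as the actives (small = significant)."""
    rng = random.Random(seed)
    observed = mean_squared_distance([values[i] for i in active_idx])
    hits = 0
    for _ in range(n_trials):
        sample = rng.sample(values, len(active_idx))
        if mean_squared_distance(sample) <= observed:
            hits += 1
    return hits / n_trials


# Hypothetical one-descriptor data: compounds 0-2 are tightly grouped.
values = [(0.0,), (0.1,), (0.05,), (5.0,), (-4.0,), (9.0,), (-7.0,), (3.0,)]
p_tight = cluster_significance(values, [0, 1, 2])
p_loose = cluster_significance(values, [3, 4, 5])
```

Each evaluation needs only pairwise distances, which is consistent with the remark above that such an analysis is far cheaper than training a neural network for every candidate subset.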
Specifically, the invention proposes a method of identifying physico-chemical and/or topological parameters which are associated with biological activity, the method using data relating to a set of lead molecules including active molecules, of which said activity is known to be at least a predetermined level, and inactive molecules, of which said activity is known to be below said predetermined level, and a predetermined set of physico-chemical and/or topological parameters of which the values are known or obtainable for each of the lead molecules, the method further using a function (f) which is defined for any subset of said parameters and which depends on the statistical significance (p) of correlations between the values of that subset of parameters for the active molecules in comparison to the values of that subset of parameters for other of said molecules, the method comprising the steps of:
(i) selecting a plurality of first subsets of said parameters from among said set of parameters;
(ii) determining the value of said function for each first subset of parameters; and
(iii) selecting at least one second subset of said parameters from among said set of parameters based on the values of said function for the respective first subsets of parameters, whereby the or each second subset of parameters is more closely associated with said activity than the first subsets of parameters.
For example, when a high (low) value of said function (f) is associated with a high value of said statistical significance (p), the method (e.g. iteratively) may select a second subset of parameters with a high (low) value of (f).
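The three steps above can be sketched, purely for illustration, as follows. The function f used here is a toy stand-in (it is not a statistical significance), a high value of f is assumed to be desirable, and the simple one-parameter-swap search used to derive the second subset is an assumption chosen for brevity; the names select_second_subset and toy_f are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Subset = Tuple[int, ...]


def select_second_subset(n_params: int, f: Callable[[Subset], float],
                         n_first: int = 20, subset_size: int = 3,
                         seed: int = 0) -> Tuple[Subset, List[Subset]]:
    rng = random.Random(seed)
    # Step (i): select a plurality of first subsets (here, at random).
    first = [tuple(sorted(rng.sample(range(n_params), subset_size)))
             for _ in range(n_first)]
    # Step (ii): determine the value of f for each first subset.
    scores = [f(s) for s in first]
    # Step (iii): derive a second subset from the best first subset by
    # swapping in single parameters whenever that raises f.
    best = first[scores.index(max(scores))]
    best_score = f(best)
    for position in range(subset_size):
        for candidate in range(n_params):
            if candidate in best:
                continue
            trial = tuple(sorted(best[:position] + (candidate,) + best[position + 1:]))
            trial_score = f(trial)
            if trial_score > best_score:
                best, best_score = trial, trial_score
    return best, first


def toy_f(subset: Subset) -> float:
    """Toy stand-in for f: favours parameter indices close to 5."""
    return -float(sum(abs(i - 5) for i in subset))


second, first_subsets = select_second_subset(10, toy_f)
```

In the embodiments described later the search is carried out by a genetic algorithm rather than by this greedy swap.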
The set of parameters are preferably ones which can be determined for a given molecule computationally (e.g. without in vitro or in vivo testing). It may include for example at least 100, at least 200 or at least 500 descriptors. Each subset of descriptors may for example be no more than 50 or 100 descriptors.
In further aspects the invention relates to using the selected descriptors to develop criteria for screening candidate molecules (e.g. by building filters from them based on ranges, in the way described in Grassy et al10), to ways of generating candidate molecules to test using the criteria, to molecules which have been thus derived (and pharmaceuticals based on them), and to apparatus for carrying out all the methods.
Following identification of a molecule in accordance with the method of the present invention, the substance may be investigated further. Furthermore, it may be manufactured and/or used in preparation, i.e. manufacture or formulation, of a composition such as a medicament, pharmaceutical composition or drug. It may itself be used in the generation of mimetic molecules according to the method disclosed herein or any other suitable technique known to those skilled in the art.
The designing of molecules according to the method of the invention might be desirable where it is difficult or expensive to synthesise known molecules having the desired behaviour or where a known molecule is unsuitable for a particular method of administration, e.g. peptides are not well suited as active agents for oral compositions as they tend to be quickly degraded by proteases in the alimentary canal. The methods disclosed herein may be used to avoid randomly screening large numbers of molecules for a target property.
The molecule or composition may be used in a variety of contexts depending upon the criteria set (e.g. biological activity, physico-chemical properties) in the method of the invention. By way of example the molecules may be used:
(i) as pharmaceuticals, e.g. anti-bacterial molecules, anti-fungal molecules, anti-viral molecules, antibiotics, immuno-stimulatory molecules, e.g. for use in vaccines or immuno-suppressants;
(ii) as cosmetics, e.g. new molecules with a deodorant effect. Undesirable body odours are caused by bacteria, typically gram positive aerobic bacteria, e.g. Corynebacterium xerosis, or coagulase-negative anaerobic micrococci (S. epidermidis). Using the method of the present invention it is possible to design new antibacterial molecules (e.g. peptide-based) whose antibacterial effect is targeted specifically to the odour-causing bacteria;
(iii) in veterinary applications. New molecules may be designed to protect bodily fluids (e.g. semen) from microbial infection during storage. For example, pig semen is typically stored at a relatively high temperature (approximately 20°C). At such temperatures bacteria proliferate. Some of the bacterial strains are resistant to known antibiotics. Using the method of the present invention antibiotic molecules may be designed which have broad spectrum anti-bacterial activity (including anti-gram negative and anti-gram positive activities) whilst not exhibiting significant spermicidal activity;
(iv) as agrochemicals, e.g. mimics of natural peptides having antifungal, antibacterial or antiviral activity may be designed. Such peptides are preferably non-toxic to humans and may be expressed directly in genetically modified plants;
(v) as biomaterials. Molecules may be designed which favour certain dialysis membrane properties. Starting from a known polymeric membrane which is used for dialysis purposes (e.g. human dialysis), new molecules may be designed with improved permeability or dialysis properties; such molecules can be used as additives to the polymeric membrane.
Thus, the present invention extends in various further aspects to a molecule identified or defined in accordance with a method of the present invention, and also a pharmaceutical composition, medicament, drug or other composition comprising such a molecule, a method comprising administration of such a composition to a patient, e.g. for antibiotic/anti-fungal treatment, which may include preventative treatment, use of such a substance in manufacture of a composition for administration, e.g. for antibiotic/antifungal treatment, and a method of making a pharmaceutical composition comprising admixing such a substance with a pharmaceutically acceptable excipient, vehicle or carrier, and optionally other ingredients.
A substance identified using a method of the present invention may be peptide or non-peptide in nature. Non-peptide "small molecules" are often preferred for many in vivo pharmaceutical uses.
A convenient way of producing a polypeptide is to express nucleic acid encoding it. This may conveniently be achieved by growing in culture a host cell containing the nucleic acid under appropriate conditions which cause or allow expression of the polypeptide. The nucleic acid may be introduced alone or as part of a vector, and may be extragenomic or integrated into the genome. Polypeptides may also be expressed in in vitro systems, such as reticulocyte lysate.
Systems for cloning and expression of a polypeptide in a variety of different host cells are well known. Suitable host cells include bacteria, eukaryotic cells such as mammalian and yeast, and baculovirus systems. Mammalian cell lines available in the art for expression of a heterologous polypeptide include Chinese hamster ovary cells, HeLa cells, baby hamster kidney cells, COS cells and many others. A common, preferred bacterial host
Suitable vectors can be chosen or constructed, containing appropriate regulatory sequences, including promoter sequences, terminator fragments, polyadenylation sequences, enhancer sequences, marker genes and other sequences as appropriate. Vectors may be plasmids, viral e.g. 'phage, or phagemid, as appropriate. For further details see, for example, Molecular Cloning: a Laboratory Manual: 2nd edition, Sambrook et al., 1989, Cold Spring Harbor Laboratory Press. Many known techniques and protocols for manipulation of nucleic acid, for example in preparation of nucleic acid constructs, mutagenesis, sequencing, introduction of DNA into cells and gene expression, and analysis of proteins, are described in detail in Current Protocols in Molecular Biology, Ausubel et al. eds., John Wiley & Sons, 1992.
The introduction of DNA may employ any available technique. For eukaryotic cells, suitable techniques may include calcium phosphate transfection, DEAE-Dextran, electroporation, liposome-mediated transfection and transduction using retrovirus or other virus, e.g. vaccinia or, for insect cells, baculovirus. For bacterial cells, suitable techniques may include calcium chloride transformation, electroporation and transfection using bacteriophage. Following production by expression, a polypeptide may be isolated and/or purified from the host cell and/or culture medium, as the case may be.
Peptides can also be generated wholly or partly by chemical synthesis. The well-established, standard liquid or solid-phase peptide synthesis methods can be used, general descriptions of which are broadly available (see, for example, J.M. Stewart and J.D. Young, Solid Phase Peptide Synthesis, 2nd edition, Pierce Chemical Company, Rockford, Illinois (1984); M. Bodanzsky and A. Bodanzsky, The Practice of Peptide Synthesis, Springer Verlag, New York (1984); and Applied Biosystems 430A Users Manual, ABI Inc., Foster City, California), or they may be prepared in solution, by the liquid phase method or by any combination of solid-phase, liquid phase and solution chemistry, e.g. by first completing the respective peptide portion and then, if desired and appropriate, after removal of any protecting groups being present, by introduction of the residue X by reaction of the respective carbonic or sulfonic acid or a reactive derivative thereof.
Pharmaceutical compositions according to the present invention, and for use in accordance with the present invention, may include, in addition to active ingredient, a pharmaceutically acceptable excipient, carrier, buffer, stabiliser or other materials well known to those skilled in the art. Such materials should be non-toxic and should not interfere with the efficacy of the active ingredient. The precise nature of the carrier or other material will depend on the route of administration, which may be oral, or by injection, e.g. cutaneous, subcutaneous or intravenous.
Pharmaceutical compositions for oral administration may be in tablet, capsule, powder or liquid form. A tablet may include a solid carrier such as gelatin or an adjuvant. Liquid pharmaceutical compositions generally include a liquid carrier such as water, petroleum, animal or vegetable oils, mineral oil or synthetic oil. Physiological saline solution, dextrose or other saccharide solution or glycols such as ethylene glycol, propylene glycol or polyethylene glycol may be included. For intravenous, cutaneous or subcutaneous injection, or injection at the site of affliction, the active ingredient will be in the form of a parenterally acceptable aqueous solution which is pyrogen-free and has suitable pH, isotonicity and stability. Those of relevant skill in the art are well able to prepare suitable solutions using, for example, isotonic vehicles such as Sodium Chloride Injection, Ringer's Injection, Lactated Ringer's Injection. Preservatives, stabilisers, buffers, antioxidants and/or other additives may be included, as required.
A molecule defined by a method of the present invention, or a composition containing such a molecule may be provided in a kit, e.g. sealed in a suitable container which protects its contents from the external environment. Such a kit may include instructions for use.
The invention may alternatively be expressed as a method of identifying physico-chemical and/or topological parameters which are associated with biological activity, the method using data relating to a set of lead molecules including active molecules, of which said activity is known to be at least a predetermined level, and inactive molecules, of which said activity is known to be below said predetermined level, and a predetermined set of physico-chemical and/or topological parameters of which the values are known or obtainable for each of the lead molecules, the method further using a function (f) which is defined for any subset of said parameters and which depends on the statistical significance (p) of correlations between the values of that subset of parameters for the active molecules in comparison to the values of that subset of parameters for other of said molecules, the method comprising the steps of iteratively determining subsets of said parameters having a respective value of said function (f) which is progressively (i.e. on successive iterations) more highly associated with a high value of said statistical significance.
Although the method has been defined above in relation to the selection of a minimal set of pertinent descriptors (e.g. physico-chemical and/or topological descriptors) characterising a molecule, in fact the invention is in principle applicable more generally (e.g. outside the field of chemistry) for the selection of a minimal set of pertinent independent variables in order to discriminate between populations of individuals (elements) with regard to observed, estimated or calculated features. The method of the invention is thus relevant to numerous fields including econometrics, agronomy, opinion polling, marketing, criminology, etc. That is, the method may be expressed as a method of identifying parameters associated with a feature, the method using data (observed, estimated or calculated) distinguishing, among a plurality of lead individuals, active individuals which have that feature from inactive individuals which do not have that feature, and a set of parameters known or obtainable for each lead individual, the method using a function defined for any subset of parameters and which depends on the statistical significance of correlations between the parameter values of that subset of the active individuals in relation to the inactive individuals, the method comprising determining (e.g. iteratively) a subset of parameters having a value of said function (f) associated with a high value of said statistical significance. The iterative method is preferably by a genetic algorithm as described below, but may alternatively in principle be by any other iterative algorithm (e.g. simulated annealing).
Embodiments of the invention will now be described as non-limiting examples, with reference to the accompanying drawings.
Brief description of the figures
Fig. 1 illustrates the principle of the in silico screening approach. The active and inactive classes of molecules are represented as distributions in a given physico-chemical parameter.
Fig. 2 shows the iterative process in which, once descriptor filters are set, virtual molecules obeying the filters are generated in an embodiment of the invention.
Fig. 3 shows six descriptor maps based on benzodiazepine affinity.
Fig. 4 shows experimental vs. predicted values for benzodiazepine affinity (Fig. 4(a)), and immunosuppressive peptide activity (Fig. 4(b)), for NN models built using descriptor subsets selected by the embodiment.
Fig. 5 illustrates encoding of genetic chromosomes and steps used to select descriptor subsets by a genetic algorithm combined with cluster significance analysis, according to the invention.
Fig. 6 shows evolution of the maximum and minimum fitness (A), and variance and standard deviation (B) during the optimization.
Fig. 7 shows evolution of the cluster significance and normalised mean squared distance NMSD (A) and correlation percentage (NC2) in the descriptor subsets (B) during the optimization.
Fig. 8 shows the correlation matrix of the descriptor subset before (A) and after (B) the GA-CSA selection by the embodiment of the invention.
Fig. 9 shows experimental vs. predicted log IC50 values for NN models built by descriptor subset S-21 selected by the embodiment of the invention for the benzodiazepine data set.
Description of embodiments
Immunosuppressive peptides derived from HLA Class 1 protein, which are flexible molecules, were chosen to illustrate an application of the different capabilities of the invention: descriptor selection, building activity filters, molecular de novo design, and screening of new candidates. As a shorthand we will refer to the embodiment as "Oasis". Oasis involves two principal parts: a first part which treats the rational selection of pertinent descriptors by using a module we will refer to as "VarSelect", and a second part which involves four interconnected modules for lead optimisation and de novo design. We will refer to these modules as the "Generator", "Builder", "Descriptor" and "Evaluator" modules.
Principle of VarSelect Module
The module VarSelect of Oasis was used to select the pertinent subset of descriptors from the initial set. This module is based on a method according to the invention which combines genetic algorithms (GA)11 and cluster significance analysis (CSA).12
Descriptors
The method uses N lead compounds, including some (a number Na) which are known to be active (e.g. in the sense of having an activity level above a predetermined level) and some which are known to be inactive (e.g. in the sense of having an activity level below the predetermined level).
We consider a set of D possible descriptors from which a subset of d descriptors is to be chosen. Preferably, the value of d is not predetermined, but rather is optimised during the algorithm. The values of all descriptors for all N lead compounds are assumed to be known (or determinable).
For statistical reasons the descriptor values are normalised. That is, for each given descriptor, the descriptor value is pre-adjusted so that its average over the lead compounds is 0, and its standard deviation is 1.
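As an illustration of this normalisation, the following minimal Python sketch rescales each descriptor column to zero mean and unit standard deviation. The function name `normalise_columns`, the use of the population standard deviation, and the treatment of zero-variance descriptors are assumptions made for the example; they are not taken from the embodiment.

```python
from statistics import mean, pstdev


def normalise_columns(table):
    """Rescale every descriptor (column) to mean 0 and standard deviation 1.

    `table` holds one row of descriptor values per lead compound.
    Assumption: a descriptor with zero variance is mapped to 0.0 everywhere.
    """
    columns = list(zip(*table))          # one tuple per descriptor
    scaled = []
    for col in columns:
        mu = mean(col)
        sigma = pstdev(col)              # population standard deviation (assumption)
        scaled.append([(x - mu) / sigma if sigma > 0 else 0.0 for x in col])
    # Transpose back to one row per compound.
    return [list(row) for row in zip(*scaled)]
```

Descriptors with zero variance carry no information and, in the applications described later, are removed before selection.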
Genetic Algorithm
As mentioned above, GA is one of the recent optimization methods based on the natural selection concept⁸,¹³⁻¹⁷ (these citations are incorporated herein by reference). To perform a genetic algorithm, two requirements must be satisfied: the encoding of the possible solutions of the problem, and the quantitative evaluation of a given solution by an objective function. In this embodiment, the objective function employs cluster significance analysis (CSA). To encode the problem, each descriptor subset is expressed as a binary string of D digits (a "chromosome"). Each digit is either one or zero to indicate the presence or absence of the respective descriptor in the subset. The length of each string is the same and is equal to the total starting number of descriptors.
At the beginning of the process, a randomly generated population of descriptor subsets is evaluated by CSA (see below). A pair of chromosomes with high scores is randomly selected by the roulette wheel selection method to serve as parents, and a pair of children is generated by randomly performing a cross-over of the parents' genes, so that each child is derived from part of each parent. Both child chromosomes are subjected to single-point mutation, that is, a randomly selected one (or zero) is changed to zero (or one), and are then evaluated; those that have high scores replace the old chromosomes. The genetic operation is repeated until a predefined maximum number of steps or a predefined convergence criterion is reached. As the convergence criterion, we use the variance of the fitness function in the evolved population: the calculations are terminated when the variance is equal to zero (or a predefined low value).
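The genetic operators just described can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, the cross-over is assumed to use a single random cut point, and the scores passed to the roulette wheel are assumed to be non-negative with higher values being better.

```python
import random


def random_chromosome(n_descriptors, rng):
    """Binary string of D digits: 1 = descriptor present, 0 = absent."""
    return [rng.randint(0, 1) for _ in range(n_descriptors)]


def roulette_pick(population, scores, rng):
    """Roulette wheel selection: pick one chromosome with probability
    proportional to its score (assumed non-negative, higher = better)."""
    return rng.choices(population, weights=scores, k=1)[0]


def crossover(parent_a, parent_b, rng):
    """Single cut-point cross-over (assumption); needs at least 2 digits."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def mutate(chromosome, rng):
    """Single-point mutation: flip one randomly selected digit."""
    child = list(chromosome)
    pos = rng.randrange(len(child))
    child[pos] = 1 - child[pos]
    return child
```

Passing an explicit `random.Random` instance keeps a run reproducible, which is convenient when comparing runs started from different initial populations.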
In summary, each parent in our method represents a combination of randomly chosen descriptors, and the purpose of the calculation is to evolve the initial set of descriptors into a population with the highest significance value of the CSA.
Objective function
This is a function of any string of D elements, and is evaluated in the following way.
CSA measures the statistical significance p of a classification of data molecules with a given subset of descriptors (i.e. the statistical significance of the classification of the molecules by the d descriptors of the string which have value +1). This is done by calculating the mean squared distance (MSD) among the Na active molecules, i.e. the sum of the squared distances d between each pair of active molecules, divided by the number C of pairs of active molecules.
$$\mathrm{MSD} = \frac{1}{C} \sum_{i<j} d^2(i,j), \qquad (1)$$
where i and j are integers summed over 1,...,Na with i<j, and d(i,j) is a sum over the d descriptors (i.e. the descriptors which have value +1 in the string) of the differences in the value of that descriptor between the i-th and j-th molecules among the active molecules.
This quantity MSD may then be calculated for all other possible subsets of Na molecules from the set of N lead compounds (i.e. both active and inactive groups). The proportion of such subsets which have an MSD higher than the MSD for the active set is called p.
Alternatively (if the number of molecules or the number of descriptors increases, so that this process becomes very lengthy) the quantity MSD is calculated using 1000 randomly selected subsets of Na molecules from among the N active and inactive molecules. In this case, p is defined as the proportion of randomly selected subsets that have MSDs higher than the MSD of the active group. p measures the statistical significance of the cluster of active molecules in the space defined by the d descriptors (i.e. the probability that a cluster as tight as the one of active molecules has not arisen only by chance). In general we are seeking a set of descriptors for which the value of p is high.
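The random-subset estimate of p can be sketched as below. The sketch assumes that the squared distance between two molecules is the squared Euclidean distance over the selected (normalised) descriptors; the function names `msd` and `csa_p`, the fixed seed, and the data layout are likewise assumptions for illustration only.

```python
import random
from itertools import combinations


def msd(rows):
    """Mean squared distance over all pairs of rows.

    Assumption: squared Euclidean distance over the selected descriptors.
    Requires at least two rows.
    """
    pairs = list(combinations(rows, 2))
    total = sum(sum((a - b) ** 2 for a, b in zip(r, s)) for r, s in pairs)
    return total / len(pairs)


def csa_p(values, active_idx, n_draws=1000, seed=0):
    """Proportion of random Na-sized subsets whose MSD is higher than
    the MSD of the active set.

    `values[i]` lists the selected descriptor values of lead compound i;
    `active_idx` lists the indices of the Na active compounds.
    """
    rng = random.Random(seed)
    active_msd = msd([values[i] for i in active_idx])
    n, na = len(values), len(active_idx)
    higher = 0
    for _ in range(n_draws):
        subset = rng.sample(range(n), na)   # Na compounds drawn without replacement
        if msd([values[i] for i in subset]) > active_msd:
            higher += 1
    return higher / n_draws
```

With a tightly clustered active set, almost every random subset has a larger MSD, so p approaches 1.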
A problem with using a function of p on its own as the objective function is that two subsets of descriptors (each with an equal number d of descriptors) could have the same p value although their MSD quantities are different. To take this into account we added to the objective function a term depending on the MSD value.
This creates a further problem: the quantity MSD for the active molecules depends on the number of descriptors d used in the distance calculations. Thus, when comparing two descriptor subsets for the same classification, we normalised MSD for the active molecules by dividing it by the number of +1 descriptors in the string. The resultant quantity we called NMSD.
An additional objective of the GA is to reduce the number of correlated descriptor pairs within each descriptor subset. To take this into account, we added to our objective function a term which tends to produce lower correlation, using a variable NC defined as the total number of pairs of descriptors in the subset represented by the string for which the values of the two descriptors are correlated (e.g. to above a predetermined level), averaged over the lead compounds.
Taking all these points into account, our objective function preferably contained 3 parameters: the normalised mean squared distance NMSD (to be minimised); the statistical significance p (to be maximised); and the number of correlated descriptors NC (to be minimised). Consequently, our fitness function (objective function) F is
$$F = \mathrm{NMSD}^2 + (1/p)^2 + \mathrm{NC}^2 \qquad (2)$$
Note that the exact form of this expression may be varied within the scope of the invention.
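By way of example only, one possible form of such a fitness function is sketched below in Python. It assumes that the three terms NMSD, 1/p and NC enter as a sum of squares, which is one reading of expression (2), and that two descriptors count as correlated when the absolute value of their Pearson coefficient exceeds a threshold of 0.7; the function names and the threshold default are assumptions for the illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def count_correlated_pairs(columns, threshold=0.7):
    """NC: number of descriptor pairs with |r| above the threshold (assumed 0.7)."""
    return sum(1 for x, y in combinations(columns, 2)
               if abs(pearson(x, y)) > threshold)


def fitness(msd_active, p, columns, threshold=0.7):
    """Assumed form F = NMSD^2 + (1/p)^2 + NC^2; lower is better.

    `columns` holds the values of each selected descriptor over the lead
    compounds; `p` must be greater than zero.
    """
    nmsd = msd_active / len(columns)     # normalise MSD by the number of descriptors
    nc = count_correlated_pairs(columns, threshold)
    return nmsd ** 2 + (1.0 / p) ** 2 + nc ** 2
```

Because this F is to be minimised, a GA that selects on high scores would need to convert it to a score first, for example by using its reciprocal.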
Principles of In Silico Screening:
The following text explains how, once descriptors have been chosen, they are used to identify candidate molecules predicted to have high activity.
Mapping of the activity
As described above, In Silico Screening is a qualitative technique consisting of an evaluation of the distribution (global or percentwise) of the active and inactive molecules as a function of the values of given parameters (descriptors). For example, Fig. 1 shows on the horizontal axis the values of a descriptor. The upper shaded area (light shading) shows the range of values for that descriptor of lead compounds which are known to be active. The lower shaded area (dark shading) shows the range of values for that descriptor of lead compounds which are known to be inactive. For each class, the limiting values are shown. Min_a and Max_a are the minimum and maximum values for the active class. Min_in and Max_in are the minimum and maximum values for the inactive class. From Fig. 1 the limiting values of this descriptor associated with activity can be deduced. This is called a "filter". By combining a plurality of filters, activity can be predicted.
This method, which is easy and fast, gives a diagnosis of the qualitative non-linear dependencies between the activity and molecular properties, and so allows the building of physico-chemical filters for the activity of interest.
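A minimal Python sketch of deriving such a filter for one descriptor is given below. It assumes the simplest reading of Fig. 1, namely that a candidate passes the filter when its descriptor value lies within the range spanned by the active lead compounds; the function names and the dictionary keys are hypothetical.

```python
def extract_filter(active_values, inactive_values):
    """Limiting values of one descriptor for the active and inactive classes."""
    return {
        "min_a": min(active_values),
        "max_a": max(active_values),
        "min_in": min(inactive_values),
        "max_in": max(inactive_values),
    }


def passes(filter_, value):
    """Assumption: a value passes when it lies inside the active range."""
    return filter_["min_a"] <= value <= filter_["max_a"]


# Example with made-up descriptor values for one descriptor.
example = extract_filter([2.0, 2.5, 3.1], [0.4, 1.0, 2.2])
inside = passes(example, 2.8)    # within [2.0, 3.1]
outside = passes(example, 0.9)   # below the active range
```

Where the active and inactive ranges overlap, a narrower zone may be preferable to the full active range.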
The experimental activity database is produced by the Descriptor module of OASIS, which associates a numerical vector with each molecule in the database. The vector components are the values of calculated physico-chemical parameters. Then, the activity values of the compounds are divided into a user-defined number of classes, and the numerical values of the descriptors associated with each activity class are either listed or converted to maps with the Evaluator module of OASIS.
Extraction of Activity Filters
After the activity maps have been built for each activity class of compounds with the various descriptors selected by the VarSelect module, each descriptor is taken separately in order to extract the limiting values of the activity. This extraction is based on a comparison of the density of points associated with active and inactive molecules in the given map. Although Fig. 1 showed an ideal case in which this process is relatively straightforward, different situations are possible according to the distributions of the active and inactive molecules. This is illustrated in Table 1, which shows (in the left column) 10 descriptor maps drawn in a way analogous to Fig. 1.
For each map, a label (middle column; the lower row) is derived which represents the way the activity interval (i.e. the range of active molecules) is positioned with regard to the inactivity interval. For example, the symbol ">" is attributed to a map where the activity interval extends to the right past the end of the inactivity interval. The symbol ">|--|<" signifies that the inactivity interval extends beyond the activity interval at both the right and left ends.
The maps and the associated symbols determine segments to explore for each parameter (e.g. the values of the descriptor which a good candidate molecule being screened should have). Information is given by each map indicating the zone to be explored (in Table 1, the right column; upper row), and the expected usefulness ("validity") of the filter in predicting activity (the right column; lower row). For example, when a map possesses the symbol ">", this means that during the screening step, the zone to be explored is at the right of the parameter interval, and that this filter is considered to be a good one ("good"). A descriptor for which both ends of the active range are displaced in the same direction from the inactive range is labelled "best".
In practice, the filter extraction and the symbol attribution are performed by the Evaluator module of OASIS.
Screening New Candidates
The extracted filters are used to screen new candidates for the activity of interest. These new molecules can be generated during a virtual combinatorial explosion (e.g. exploring all possible compounds within a given range, which is the technique used in Grassy et al¹⁰) or by a non-systematic approach (such as a genetic algorithm). OASIS, in its Generator module, uses a genetic algorithm to generate a population of molecules. The same genetic algorithm engine (software section) as that used in the descriptor selection step is used here with a different encoding process. The screening of the GA-proposed molecules is performed in a cyclic way as shown in Fig. 2. A population of new candidate virtual molecules is built by the Builder module of OASIS. This module converts the GA-encoded molecules to SMILES-encoded molecules¹⁸,¹⁹, and then uses this encoding to deduce 3-D structures via the Corina software²⁰.
The built population of molecules is transmitted to the Descriptor module, which uses the descriptors contained in the filters to describe the molecules. When this step is completed, the described molecules are evaluated by the Evaluator module using the filters. Based on the filters, a satisfaction value ("score") is attributed to each molecule in the population, and this information is returned to the Generator module, which in the next iteration generates new candidate molecules using a GA.
During each screening iteration, the Generator module attempts to maximise the score. The cyclic process stops when the variance in the scores of molecules in the generated population is equal to zero.
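The scoring and stopping rule of this cycle might be sketched as follows. The sketch assumes that the score of a molecule is the percentage of filters it satisfies and that each filter is a simple (low, high) interval; the function names, the descriptor names and the numerical values are hypothetical.

```python
from statistics import pvariance


def score(descriptor_values, filters):
    """Percentage of filters satisfied by one molecule (assumed definition).

    `filters` maps a descriptor name to a (low, high) interval and
    `descriptor_values` maps the same names to the molecule's values.
    """
    hits = sum(1 for name, (low, high) in filters.items()
               if low <= descriptor_values[name] <= high)
    return 100.0 * hits / len(filters)


def converged(scores, tolerance=0.0):
    """Stop when the variance of the population's scores is zero
    (or no greater than a small tolerance)."""
    return pvariance(scores) <= tolerance


# Hypothetical filters and one hypothetical candidate molecule.
filters = {"logP": (1.0, 3.0), "mass": (200.0, 400.0)}
candidate = {"logP": 2.0, "mass": 500.0}
candidate_score = score(candidate, filters)   # one of two filters satisfied
```

A molecule satisfying every filter receives a score of 100, which corresponds to the candidates predicted to be active.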
Computation Details
The OASIS program is implemented in ANSI C code and runs on an SGI Iris workstation. The graphical user interface is based on the Xt Intrinsics/Motif libraries. The architecture of OASIS integrates five interconnected modules: VarSelect, Generator, Builder, Descriptor, and Evaluator. As input/output, the program reads and produces ASCII files.
APPLICATIONS OF OASIS
1. Benzodiazepine Data Set
Benzodiazepines are well known as anxiolytics, tranquillisers, and anticonvulsants in epilepsy treatment. In this work, we used a set of 54 benzodiazepine analogues whose biological activities (IC50 values) are derived from the work of Haefely et al. (1985)²¹.
A set of 766 descriptors was computed for the benzodiazepine data set using the MolconnZ 3.15 and TSAR V3.1 software packages²²,²³. The descriptors with null variance were removed from the analysis, resulting in 312 descriptors.
Fig. 3 shows maps for a selection of 6 of the remaining descriptors. Again, in each map the lower row shows descriptor values for the inactive molecules, while the upper row shows descriptor values for the active molecules. The bright dot in each of the upper rows indicates the active molecule having the highest activity.
In order to apply the VarSelect module in descriptor selection, the benzodiazepine derivatives were divided into two classes according to their logIC50 values: a class with logIC50 values below 0.8 representing the active molecules, and a class with logIC50 values higher than 0.8 corresponding to the inactive molecules. The data set containing the molecules of the two activity classes, described by the 312 descriptors, was submitted to the VarSelect module of OASIS. The GA of the VarSelect module was run with a population size of 100 descriptor subsets. The population was evolved until its fitness variance reached a value of zero.
The optimisation process by the VarSelect module converged after 60 generations of the GA evolution, resulting in only 53 non-correlated descriptors. These selected parameters were submitted to different QSAR analysis techniques to build a predictive model for the benzodiazepine data set. These techniques were principal component analysis (PCA), multiple linear regression (MLR), partial least squares (PLS), and backpropagation artificial neural networks (NN). In PCA, the first 3 principal components contained only 37.46% of the total variance, with 29.17% explained by the first 2 principal components. The whole variance is explained by the first 44 components.
The use of PLS and MLR techniques to build a QSAR model for the benzodiazepine data set gave regression terms R² of 0.57 and 0.76, and cross-validated terms R²(CV) of 0.42 and 0.31 respectively.
Finally, we used a backpropagation neural network (NN) that contained 53 inputs, 2 hidden layers, and 1 output (logIC50 value). In this NN, weights and biases were optimized by a Monte Carlo algorithm; the training step was achieved in 807 cycles with a best root-mean-square fit of 0.032. The best model had a training term R² of 0.94 and a cross-validated term R²(CV) of 0.85. A plot of predicted vs. experimental logIC50 values for the best model is shown in Fig. 4(a). The predictive power of the model was determined by using 30% of the initial data for the cross-validation.
The attempt to build QSAR models with the descriptors selected by the VarSelect module failed when using the linear methods (PCA, PLS, and MLR). By contrast, we obtained a satisfactory QSAR model by using artificial NNs.
2. Immunosuppressive Peptide Data Set
In Grassy et al¹⁰, we successfully applied the In Silico screening method to identify a new immunosuppressive peptide to prevent allograft rejection in mice. Our descriptor selection was based on knowledge of the immunosuppressive peptide biology and chemistry.
By contrast, in the present work, we used the VarSelect module of OASIS to perform the rational choice of QSAR descriptors without any knowledge of peptide biology or chemistry. The biological activities of the peptides used are shown in Table 2.
Initially, the structural models of the immunosuppressive peptides were built by the Builder module of OASIS. Physicochemical and topological descriptors were generated by TSAR V3.1²³, resulting in 83 descriptors representing 19 peptides. The VarSelect module was used to select a pertinent subset of descriptors. A set of five additional peptides (RDP1257, RDP1259, RDP1271, RDP1277 and RDP1258) with known immunosuppressive activities was kept to test the validity of the predictive power of the QSAR model.
The optimisation process of the VarSelect module converged after 254 iterations by evolving a population of 50 chromosomes (i.e. 50 descriptor subsets). The best descriptor subset contained 22 uncorrelated and significantly clustering descriptors.
The 22 VarSelect-selected descriptors were used as static filters to screen the test-set peptides by using their maps in the Evaluator module of OASIS.
This screening resulted in 5 peptides satisfying 100% of the filters, which are therefore predicted to be active. The activity prediction by the OASIS static filters is consistent with the experimental activity of the peptides, except for the RDP1277 peptide.
Once more, the use of linear methods failed to build a good QSAR model for immunosuppressive peptide activity, whereas non-linear methods provided a powerful predictive model. In fact, PCA on the 22 selected descriptors resulted in only 56.97% of explained variance in the first 3 principal components. MLR and PLS resulted in poor predictive models, since the regression terms R² were about 0.46 and the cross-validated R²(CV) terms were about 0.11. On the other hand, the use of backpropagation NNs with a 22-2-1 architecture resulted in a training term R² of 0.88 and a cross-validated term R²(CV) of 0.77. A plot of calculated vs. observed activities for the final model is shown in Fig. 4(b).
As an additional test of the validity of the NN model using the selected 22 descriptors, we predicted the immunosuppressive activity of the compounds RDP1257, RDP1259, RDP1271, RDP1277, and RDP1258. The result of these predictions is summarised in Table 2. The predicted activities were highly consistent with the experimental activities except for compound RDP1277. Although the predicted value for compound RDP1277 is different from the observed one, it remains higher than that of the inactive molecules. It is noteworthy that the 22 selected descriptors are of the same nature as those used in our previous study¹⁰, which were selected on the basis of biological and chemical knowledge of the immunosuppressive peptides. These results show that the choice of descriptors by OASIS makes good chemical and biological sense, and can be used to build highly predictive QSAR models in conjunction with artificial neural networks.
3. A second benzodiazepine example
In this example we used a set of 60 benzodiazepine analogues (Table 3) whose biological activity is derived from the work of Haefely et al²¹. Oasis software was used to generate chemical structures in SMILES code form, and the generated structures were introduced into TSAR 3.1 software (produced by the company Oxford Molecular), and a set of 105 two-dimensional (2D) molecular descriptors was computed. Descriptors with zero variance were removed, and GA-CSA was then performed to reduce the number of descriptors to 21 (shown in Table 4). The calculations were performed on a Silicon Graphics Origin 200 workstation.
In the genetic process the population size is important for obtaining a good solution. If the population size is too small, there is not enough genetic diversity to make a good solution evolve. A larger population can broaden the genetic diversity, which may evolve into a much higher fitness score, but this will require more time. In this case, the population containing the descriptor subsets was set to 40, and the population was evolved until its fitness variance reached a value of substantially zero.
The GA is illustrated in Fig. 5. The minimum, the maximum, the variance and the standard deviation of the fitness are shown in Fig. 6. The evolution of these parameters indicated that the minimum score remains almost unchanged after the 85th generation. This convergence can be seen in the decay of the fitness variance of the population during the genetic step (Fig. 6(b)). The remaining fluctuations of the variance are due to the use of elitism by the GA, which retains the best descriptor subset and continues to create new subsets randomly at every generation.
Fig. 7 shows the evolution of the statistical significance (darker line in Fig. 7(a)), the normalized mean squared distance (lighter line in Fig. 7(a)), and the correlation percentage (Fig. 7(b)). Here the horizontal axis numbers chromosomes in the order of their generation, 40 per iteration. The inset graphs show the behaviour during the first 800 chromosomes. NMSD undergoes a slow decay and reaches a stable plateau after 2400 solutions, whereas the statistical significance presents an inverse evolution and a rapid stabilization in only 1000 tested solutions. These inverted evolutions suggest that the lower the NMSD and the higher the statistical significance, the better the descriptor subset. The best emergent descriptor subset (S-21) from the GA-CSA algorithm contained 21 uncorrelated descriptors out of 105 initial descriptors, as shown in Fig. 8(b). It is clear that the GA-CSA algorithm converged to a solution containing fewer correlated descriptors (4%) than the initial descriptor set (Fig. 8(a)). The remaining 9 correlations in the best 21-descriptor solution were in the interval ]-0.7,0.7[.
The S-21 descriptor subset selected by the GA-CSA method was submitted to a backpropagation neural network (NN) that contained 21 inputs, 2 hidden layers, and 1 output (logIC50 value). In this NN, weights and biases were optimized by a Monte Carlo algorithm; the training step was achieved in 1972 cycles with a best root-mean-square fit of 0.031. The best model had a training term R² of 0.933 and a cross-validated term R²(CV) of 0.87. A plot of predicted vs. experimental logIC50 values for the best model is shown in Fig. 9. The predictive power of the model was determined by using 30% of the initial data for the cross-validation.
The selected descriptors contain information on the nature of the substituents, the molecular shape, the charge, the hydrophobicity, the connectivity, the topology, and some atomic elements (Table 4). All these descriptors describe essential interactions such as steric, electronic, and hydrophobic parameters, which are dominant factors in receptor-drug interactions. These results show that the choice of descriptors by GA-CSA makes good chemical sense.
The final NN model was compared to 10 NN models built by a random selection of descriptors with either the same number of descriptors (R21-1, R21-2, R21-3, R21-4 and R21-5) or different numbers of descriptors (R8, R15, R30, R50, and R100). We also compared the final NN model to a model built using all 105 initial descriptors (1-105), and to one built using the 84 descriptors (Rm-84) which were removed during the genetic process. The R² and R²(CV) values of these NN models are shown in Table 5. From this comparison, the best NN model, built by the S-21 descriptor subset selected by GA-CSA, showed the highest R² and R²(CV) values. The randomly selected descriptors did not allow a powerful NN to be built.
In order to test the reproducibility of the GA-CSA, which belongs by its nature to random search techniques, different random sets of initial populations were used (RS1, RS2, RS3, RS4, RS5 and RS6), and evolved until the fitness variance in the population reached a value of zero. The results are summarized in Table 6. The different descriptor subsets are compared on the basis of the fitness value and the R² and R²(CV) terms obtained for QSAR models built by NNs. Similar fitness values, which ranged from 4.98 to 5.12, were observed. The number of descriptors selected by GA-CSA varied between 21 and 25. However, the solution possessing the minimum value of the fitness function corresponded to the best R² and R²(CV) values. The best solution RS1 contained descriptors of the same nature as the S-21 subset analysed above. Descriptors differing in solutions RS1 and S-21 are correlated with a correlation coefficient higher than 0.8.
Many variations are possible on the embodiment described above without departing from the scope of the invention. For example, although in the embodiment, after the descriptors have been selected, candidate molecules are derived using a genetic algorithm with successive screening of generations, an alternative is to predefine a class of molecules (e.g. by a "combinatorial explosion" in which all possible combinations of possible atomic selections are considered) and use the selected descriptors to screen all of them.
Also, whereas in the embodiment the GA only seeks to optimise the subset of descriptors which is selected, it is possible for additional variables to be simultaneously optimised, for example a numerical parameter of each respective descriptor which indicates the importance of that descriptor in determining or predicting activity.
Furthermore, as described above, once the descriptors are selected there are a variety of ways in which screening can be performed using them, for example using filters derived on the basis of active and inactive intervals (as illustrated in Fig. 1, and as used in Grassy et al), or by using a neural network to predict activity levels of candidate molecules.
REFERENCES
The disclosure of all the following documents is incorporated herein by reference.
1. Hansch, C.; Muir, R.M.; Fujita, T.; Maloney, P.P.; Geiger, F.; Streich, M. The correlation of biological activity of plant growth regulators and chloromycetin derivatives with Hammett constants and partition coefficients. J. Am. Chem. Soc. 1963, 85, 2817-2824.
2. Stahle, L.; and Wold, S. Progress in Medicinal Chemistry, eds Ellis, G.P. and West, G.B. Elsevier 1988, Vol 25.
3. Draper, N.R., and Smith, H., Applied Regression Analysis, 2nd edition, John Wiley & Sons, 1981.
4. Chatfield, C., and Collins, A.K. Introduction to Multivariate Analysis. Chapman and Hall, London, 1980.
5. Manly, B.F.J. Multivariate Statistical Methods: A Primer. Chapman and Hall, London, 1986.
6. Hansch, C. On the structure of medicinal chemistry. J. Med. Chem. 1976, 19, 1-6.
7. So, S.-S., and Richards, W.G. Application of neural networks: Quantitative structure-activity relationships of the derivatives of 2,4-diamino-5-(substituted-benzyl)pyrimidines as DHFR inhibitors, J. Med. Chem. 1992, 35, 3201.
8. Goldberg, D.E. Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, MA, 1989.
9. Holland, J.H. Genetic Algorithms, Scientific American 1992, 267, 66-72; Forrest, S., Genetic Algorithms: Principles of Natural Selection Applied to Computation. Science 1993, 261, 872-878.
10. Grassy, G., Calas, B., Yasri, A., Lahana, R., Woo, J., Iyer, S., Kaczorek, M., Floc'h, R., and Buelow, R. Computer-assisted rational design of immunosuppressive compounds, Nat. Biotech. 1998, 16, 748-752.
11. Yasri, A., and Lahana, R., Rational selection of QSAR descriptors by using genetic algorithms combined to cluster significance analysis. Application to benzodiazepine (unpublished).
12. McFarland, J.W.; and Gans, D.J. On the significance of clusters in the graphical display of structure-activity data, J. Med. Chem. 1986, 29, 505-514.
13. Rogers, D.; Hopfinger, A.J. Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships, J. Chem. Inf. Comput. Sci. 1994, 34, 854-866.
14. Davis, L. Handbook of Genetic Algorithms; Van Nostrand Reinhold; New York, 1991.
15. Hibbert, D.B. Generation and display of chemical structures by genetic algorithms, Chemom. Intell. Lab. Syst. 1993, 20, 35-43.
16. Hibbert, D.B. Genetic algorithms in chemistry, Chemom. Intell. Lab. Syst. 1993, 19, 277-293.
17. Leardi, R.; Boggia, R.; Terrile, M. Genetic algorithms as a strategy for feature selection, J. Chemom. 1992, 6, 267-281.
18. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J. Chem. Inf. Comput. Sci. 1988, 28, 31-36.
19. Daylight Software Manual. Daylight Chemical Information Systems: Santa Fe, NM, USA, 1993.
20. Gasteiger, J., Rudolph, C., Sadowski, J. Automatic generation of 3D-atomic coordinates for organic molecules. Tetrahedron Comput. Methodol. 1990, 3, 537-547.
21. Haefely, W., Kyburz, E., Gerecke, M., Möhler, H. Recent advances in the molecular pharmacology of benzodiazepine receptors and the structure-activity relationships of their agonists and antagonists, Adv. Drug Res. 1985, 14, 165-322.
22. Molconn-Z, Molconn software version 3.15, Lowell H. Hall, copyright 1998.
23. Oxford Molecular Group, The Medawar Centre, Oxford Science Park, Oxford OX4 4GA, UK. 1997.
24. So, S.-S. and Karplus, M. Genetic Neural Networks for Quantitative Structure-Activity Relationships: Improvements and Applications of Benzodiazepine Affinity for Benzodiazepine/GABAA Receptors, J. Med. Chem. 1996, 39, 5246-5256.
Table 1: The different maps built in OASIS analysis. Each map represents the distribution of active (□) and inactive (■) molecules in the interval of a given descriptor.

| OASIS activity map (label) | Zone to be explored ("Range") | Limit values of □ and ■ classes | Validity of filter |
|---|---|---|---|
| > -- < | [min(□), max(□)] | min(□) > min(■); max(□) < max(■) | Good |
| > | > max(■) | min(□) = min(■); max(□) > max(■) | Good |
| | < min(■) | min(□) < min(■); max(□) = max(■) | Good |
| | > max(■) | min(□) > min(■); max(□) = max(■) | Explore right |
| | < min(■) | min(□) = min(■); max(□) < max(■) | Explore left |
Table 2: Biological activity of the initial and predicted peptides tested in a heterotopic allograft model of mouse

| # | Name | Peptide Sequence * | Experimental MST ** | Predicted MST ** |
|---|---|---|---|---|
| 0 | untreated | | 79 | |
| 1 | 270275-84 | RENLRIALRY | 114 | 1337 |
| 2 | 270284-75 | YRLAIRLNER | 121 | 1213 |
| 3 | D270275-84 | rennalry | 114 | 1143 |
| 4 | D270284-75 | yrlairlner | 132 | 133 |
| 5 | P2 | RVNLRIALRY | 115 | 1152 |
| 6 | RP2 | YRLAIRLNVR | 125 | 1247 |
| 7 | D2 | rvnlrialry | 131 | 1302 |
| 8 | RD2 | yrlairlnvr | 122 | 1223 |
| 9 | P15 | NLRIALRYYW | 118 | 1179 |
| 10 | Kk 75-84 | RVNLRTALRY | 85 | 951 |
| 11 | Dk 75-84 | RVDLRTLLRY | 72 | 717 |
| 12 | Kb 75-84 | RVDKRTLLGY | 78 | 777 |
| 13 | Db 75-84 | RVSLRNLLGY | 78 | 762 |
| 14 | 0775-84 | RESLRLLRGY | 74 | 758 |
| 15 | 270575-84 | REDLRTLLRY | 77 | 774 |
| 16 | 270276-83 | ENLRIALR | 85 | 851 |
| 17 | D2702(E>V,R>P) | rvnlpialry | 95 | 952 |
| 18 | E 75-84 | RVNLRTLRJRY | 80 | 798 |
| 19 | G 75-84 | RMNLQTLRGY | 77 | 771 |
| 20 | RDP1257 | RLLLRLLLGY | 131 | 1331 |
| 21 | RDP1259 | RVLLRLLLGY | 131 | 1182 |
| 22 | RDP1271 | RWLLRLLLGY | 113 | 1225 |
| 23 | RDP1277 | RYLLRLLLGY | 90 | 1334 |
| 24 | RDP1258 | RiiLiiLnLRnLnLnLGY | 127 | 1331 |

* nL = Norleucine
** MST: Mouse survival time (days)
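Table 3: Structure and chemical groups of benzodiazepine derivatives. Experimental (Exp.) and predicted (Pred.) logIC50 values.

| Name | R7 | R1 | R2' | R6' | R3 | R8 | Exp. logIC50 | Pred. logIC50 |
|---|---|---|---|---|---|---|---|---|
| clonazepam | NO2 | H | Cl | H | H | H | 0.255 | 0.020 |
| delorazepam | Cl | H | Cl | H | H | H | 0.255 | 0.073 |
| diazepam | Cl | Me | H | H | H | H | 0.908 | 0.822 |
| flunitrazepam | NO2 | Me | F | H | H | H | 0.580 | 0.649 |
| halazepam | Cl | CH2CF3 | H | H | H | H | 1.964 | 2.031 |
| lorazepam | Cl | H | Cl | H | OH | H | 0.544 | 0.532 |
| meclonazepam | NO2 | H | Cl | H | Me | H | 0.079 | 0.075 |
| nitrazepam | NO2 | H | H | H | H | H | 1.000 | 0.990 |
| nordazepam | Cl | H | H | H | H | H | 0.973 | 1.086 |
| oxazepam | Cl | H | H | H | OH | H | 1.255 | 0.665 |
| Ro 05-2904 | CF3 | H | H | H | H | H | 1.114 | 1.062 |
| Ro 05-2921 | H | H | H | H | H | H | 2.544 | 2.430 |
| Ro 05-3061 | F | H | H | H | H | H | 1.602 | 1.723 |
| Ro 05-3072 | NH2 | H | H | H | H | H | 2.587 | 2.648 |
| Ro 05-3367 | Cl | H | F | H | H | H | 0.301 | 0.102 |
| Ro 05-3418 | NH2 | Me | H | H | H | H | 2.663 | 2.500 |
| Ro 05-3590 | NO2 | H | CF3 | H | H | H | 0.544 | 0.525 |
| Ro 05-4082 | NO2 | Me | Cl | H | H | H | 0.342 | 0.569 |
| Ro 05-4336 | H | H | F | H | H | H | 1.322 | 0.800 |
| Ro 05-4336 | H | H | F | H | H | H | 0.342 | 0.575 |
| Ro 05-4435 | NO2 | H | F | H | H | H | 1.322 | 0.800 |
| Ro 05-4520 | H | Me | F | H | H | H | 1.146 | 1.193 |
| Ro 05-4528 | CN | Me | H | H | H | H | 2.580 | 2.000 |
| Ro 05-4608 | H | Me | Cl | H | H | H | 0.580 | 1.168 |
| Ro 05-4619 | NH2 | H | Cl | H | H | H | 1.875 | 1.852 |
| Ro 05-4865 | F | Me | H | H | H | H | 1.230 | 1.186 |
| Ro 05-6820 | F | H | F | H | H | H | 0.869 | 0.765 |
| Ro 05-6822 | F | Me | F | H | H | H | 0.708 | 0.734 |
| Ro 07-2750 | Cl | (CH2)2OH | F | H | H | H | 1.389 | 1.402 |
| Ro 07-3953 | Cl | H | F | F | H | H | 0.204 | 0.355 |
| Ro 07-4065 | Cl | Me | F | F | H | H | 0.613 | 0.372 |
| Ro 07-4419 | H | H | F | F | H | H | 1.279 | 1.080 |
| Ro 07-5193 | Cl | H | Cl | F | H | H | 0.477 | 0.620 |
| Ro 07-5220 | Cl | Me | Cl | Cl | H | H | 0.740 | 0.437 |
| Ro 07-6198 | H | H | F | F | H | Cl | 1.447 | 1.690 |
| Ro 07-9957 | I | Me | F | H | H | H | 0.462 | 0.509 |
| Ro 07-4878 | Cl | H | F | H | Me | H | 0.544 | 0.473 |
| Ro 07-6896 | NO2 | Me | F | H | Me | H | 0.845 | 0.978 |
| Ro 13-3780 | Br | Me | F | F | H | H | 0.380 | 0.356 |
| Ro 14-3074 | N3 | H | F | H | H | H | 0.724 | 1.200 |
| Ro 20-1310 | Cl | C(CH3)3 | H | H | H | H | 2.792 | 2.675 |
| Ro 20-1815 | NH2 | Me | F | H | H | H | 1.813 | 1.870 |
| Ro 20-2533 | Et | H | H | H | H | H | 1.566 | 1.566 |
| Ro 20-2541 | CN | Me | F | H | H | H | 1.477 | 1.319 |
| Ro 20-3053 | COMe | H | F | H | H | H | 1.255 | 1.261 |
| Ro 20-5397 | CHO | H | H | H | H | H | 1.633 | 1.636 |
| Ro 20-5747 | CH=CH2 | H | H | H | H | H | 1.380 | 1.397 |
| Ro 20-7078 | Cl | H | F | H | Cl | H | 0.724 | 0.645 |
| Ro 20-8065 | Cl | H | F | H | H | Cl | 0.556 | 0.646 |
| Ro 20-8552 | Me | H | F | H | H | Cl | 1.146 | 1.138 |
| Ro 20-8895 | H | H | F | H | H | Me | 1.279 | 0.973 |
| Ro 22-3294 | Cl | H | Cl | Cl | H | H | 0.845 | 0.429 |
| Ro 22-4683 | NO2 | C(CH3)3 | F | H | H | H | 2.477 | 2.518 |
| Ro 22-6762 | Cl | Me | H | H | H | Cl | 1.602 | 1.610 |
| temazepam | Cl | Me | H | H | OH | H | 1.204 | 1.761 |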
Table 4: Nature and number of descriptors used for the benzodiazepine dataset before and after selection by GA-CSA
Descriptor                          Nature            Before GA-CSA   After GA-CSA
Verloop for substituents            Steric            36              5
Molecular mass                      Mass              1               0
Molecular surface area              Surface           1               0
Molecular volume                    Volume            1               0
Moments of inertia                  Volume            6               2
Ellipsoidal volume                  Volume            1               0
Dipole moments                      Electronic        4               3
LogP and lipole moments             Lipophilicity     5               3
Molecular refractivity              Refractivity      1               0
Kier ChiV indices                   Connectivity      20              2
Kappa and flexibility               Shape             7               1
Wiener, Balaban and Randic index    Topology          3               1
Sum of E-state indices              Electrotopology   1               0
Atom count                          Atomic            11              3
H-bond donor and acceptor           H-bond            2               1
Group count                         Chemical groups   5               0
Total                                                 105             21
Table 5: R2 and R2(CV) values of NN models built with descriptor subsets of different sizes, compared to the final NN model built with the S-21 descriptor subset selected by GA-CSA
NN model   Number of descriptors   R2      R2(CV)
S-21       21                      0.933   0.87
R21-1      21                      0.51    0.24
R21-2      21                      0.43    0.3
R21-3      21                      0.46    0.14
R21-4      21                      0.52    0.2
R21-5      21                      0.62    0.41
R8         8                       0.24    0.11
R15        15                      0.5     0.25
R30        30                      0.64    0.49
R50        50                      0.5     0.29
R100       100                     0.54    0.28
1105       105                     0.56    0.31
Rm84       84                      0.57    0.29
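The R2 and R2(CV) columns can be illustrated with a minimal sketch. It assumes that R2 is the usual coefficient of determination on the fitted data and that R2(CV) is obtained by leave-one-out cross-validation; the tables do not state which cross-validation scheme was used. A least-squares linear model stands in for the neural network, and the data are synthetic, so none of the numbers below correspond to the tabulated values. All function names are hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_linear(X, y):
    """Least-squares linear model with intercept (assumed stand-in for the NN)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply the fitted coefficients (last entry is the intercept)."""
    A = np.column_stack([X, np.ones(len(X))])
    return A @ coef

def loo_r_squared(X, y):
    """Leave-one-out R2: each sample is predicted by a model fitted without it."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_linear(X[keep], y[keep])
        preds[i] = predict_linear(coef, X[i:i + 1])[0]
    return r_squared(y, preds)

# Synthetic descriptors and activities, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -0.5, 0.25]) + 0.3 * rng.normal(size=40)

r2_fit = r_squared(y, predict_linear(fit_linear(X, y), X))
r2_cv = loo_r_squared(X, y)
print(f"R2 = {r2_fit:.3f}, R2(CV) = {r2_cv:.3f}")
```

Under these assumptions the cross-validated value cannot exceed the fitted one, which matches the pattern in every row of the table: a large gap between R2 and R2(CV) suggests a model that fits its training data without generalising.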
Table 6: Fitness, R2 and R2(CV) values of NN models built with descriptor subsets selected by GA-CSA runs started from different GA random seeds, compared to the final NN model built with the S-21 descriptor subset
NN model   Number of descriptors   Fitness   R2      R2(CV)
S-21       21                      0.885     0.933   0.87
RS1        21                      0.901     0.931   0.86
RS2        22                      0.910     0.925   0.86
RS3        23                      1.086     0.928   0.83
RS4        24                      1.180     0.922   0.79
RS5        25                      1.208     0.911   0.78
RS6        25                      1.211     0.899   0.76