
CN108319984A - Method for constructing a model that predicts leaf morphological traits and photosynthetic characteristics of woody plants from DNA methylation level, and prediction method using the model - Google Patents

Info

Publication number
CN108319984A
Authority
CN
China
Prior art keywords
leaf
blade
sample
dna methylation
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810120969.0A
Other languages
Chinese (zh)
Other versions
CN108319984B (en
Inventor
张晓宇
胡梦瑶
吉心悦
宋跃朋
张德强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Forestry University
Original Assignee
Beijing Forestry University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Forestry University
Priority to CN201810120969.0A
Publication of CN108319984A
Application granted
Publication of CN108319984B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G06F18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Biotechnology (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioethics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides a method for constructing a model that predicts leaf traits and photosynthetic characteristics of woody plants from DNA methylation level, and a prediction method using the model, and belongs to the field of biological detection technology. A random forest is used to select the important feature variables that reflect differences between geographic locations, which yields 7 leaf feature variables; the optimal number of clusters is then determined, and the leaf samples of each cluster are obtained with an improved FCM clustering algorithm. From the correlations between variables and the importance of restriction-fragment combinations given by gradient boosted trees, the important restriction-fragment combinations in each cluster of leaf samples are obtained. With the DNA methylation level of these combinations as the regression variable, an LS-SVM regression prediction model is built with a Gaussian radial basis function. Inputting the DNA methylation level of the important restriction-fragment combinations then predicts the leaf shape factor, leaf area and net photosynthetic rate. The method is used to predict phenotypic traits and photosynthetic characteristics of woody plants and to screen for superior woody plant individuals.

Description

Method for constructing a model that predicts leaf morphological traits and photosynthetic characteristics of woody plants from DNA methylation level, and prediction method using the model
Technical field
The invention belongs to the field of bioinformatics, and specifically relates to a method for constructing a model that predicts leaf morphological traits and photosynthetic characteristics of woody plants from DNA methylation level, and to a prediction method using the model.
Background art
DNA methylation occurs most often on the fifth carbon atom of cytosine in dinucleotides and is an important epigenetic modification. It suppresses the expression of transposable elements, pseudogenes, repetitive sequences and individual genes, and plays a key role in many biological processes such as gene expression, embryonic development, cell proliferation, differentiation and chromosome stability. In plants, DNA methylation occurs at CG, CHG and CHH sites (H denotes any base other than guanine), and in higher plant cells methylated cytosines can account for up to 50% of all cytosines. DNA methylation controls plant growth and development, in particular through the regulation of gene expression and DNA replication. Studying DNA methylation in plants therefore helps us understand how it regulates plant growth and development, and is of practical significance.
Previous research on DNA methylation has concentrated on qualitative analysis and lacks quantitative study, and research on plant DNA methylation has concentrated on herbaceous plants, with comparatively little work on woody plants. For example, in 2015 Wanneng Yang et al. used linear regression to study how genome-wide DNA methylation affects leaf traits of rice (leaf width, leaf length, leaf area, etc.). Also in 2015, Dong Ci et al. reported a genome-wide screen of DNA methylation sites in a natural population of Populus simonii using methylation-sensitive amplified polymorphism (MSAP), obtaining polymorphic methylation sites; they also used principal component analysis (PCA) and the STRUCTURE software to resolve the epigenetic population variation of the natural Populus simonii population. They found that genes at DNA methylation sites may play an important role in the regulation of leaf development and photosynthesis, and that these sites are associated with plant traits (including leaf shape and photosynthesis) and have some influence on them. That research nevertheless remains at the level of qualitative analysis and lacks a quantitative treatment of DNA methylation. In addition, current research on DNA methylation in forest trees relies on genome-wide DNA methylation scanning, which is costly, and the very large data volume leads to poor accuracy of the results.
Summary of the invention
In view of this, the object of the present invention is to provide a method for constructing a model that predicts leaf morphological traits and photosynthetic characteristics of woody plants from DNA methylation level, and a prediction method using the model, with high prediction accuracy.
To achieve this object, the present invention provides the following technical solution:
The present invention provides a method for constructing a model that predicts leaf morphological traits and photosynthetic characteristics of woody plants from DNA methylation level, comprising the following steps:
1) collecting leaves of a single woody plant species across its natural distribution nationwide to obtain representative leaf samples;
2) measuring the photosynthetic characteristics and phenotypic traits of the representative leaf samples to obtain leaf photosynthetic characteristic data and leaf morphological trait data;
The phenotypic traits comprise leaf area, leaf length, leaf width, leaf perimeter, length-to-width ratio and leaf shape factor;
The photosynthetic characteristics comprise net photosynthetic rate, stomatal conductance, intercellular CO2 concentration and water use efficiency;
3) determining the DNA methylation state of the restriction fragments in the representative leaf samples to obtain the DNA methylation level of the restriction fragments, and calculating the genome-wide average DNA methylation level of each representative leaf sample;
4) taking the leaf photosynthetic characteristic data, the leaf morphological trait data and the genome-wide average DNA methylation level of the leaves as candidate variables, and screening with a random forest model for the important feature variables that differ between geographic locations, which yields the leaf morphological trait data and the genome-wide average DNA methylation level;
5) using the leaf morphological trait data and the genome-wide average DNA methylation level as feature variables, determining the optimal number of clusters of the representative leaf samples from the within-group sum of squared errors together with the 26 cluster evaluation indices in the NbClust package;
6) inputting the optimal number of clusters of the representative leaf samples into an improved fuzzy C-means clustering algorithm and calculating the membership matrix of each cluster of samples;
The improved fuzzy C-means clustering algorithm is as follows:
a. Given the optimal number of clusters c and the number of samples n, set the iteration stopping threshold ε = 10^-5 and the maximum number of iterations to 300, randomly initialize the membership matrix U, and set the iteration counter t = 0;
Let X = {x1, x2, ..., xn} be a finite set whose elements each have m feature variables; X is expressed as the following n × m matrix:
where m is the number of feature variables and n is the number of representative leaf samples;
The n samples of matrix X are divided into c groups (2 ≤ c ≤ n), and the fuzzy clustering matrix U of the c groups is:
In the matrix U, μij denotes the degree of membership of sample xj in class i, and
The c cluster centres are:
The minimum sum of squared errors is chosen as the clustering criterion, and the objective function of the cluster analysis is given by formula (1):
Adding the constraint gives formula (2):
Solving gives formula (3):
b. Update the fuzzy clustering matrix and the cluster centre matrix according to formula (3);
c. If ‖P(t) − P(t−1)‖ < ε, stop and output the fuzzy clustering matrix U and the cluster centre matrix P; otherwise set t = t + 1 and return to step b until U and P are output;
7) based on the membership matrix of each cluster of samples obtained in step 6), calculating, within each cluster, the correlation between the genome-wide average DNA methylation level and each leaf feature variable, together with the importance of the restriction-fragment combinations obtained from gradient boosted trees, to obtain the important restriction-fragment combinations of each cluster of samples;
The leaf feature variables comprise leaf area, net photosynthetic rate and leaf shape factor;
8) using the DNA methylation level of the important restriction-fragment combinations obtained in step 7) as the regression variable, establishing an LS-SVM regression prediction model with a Gaussian radial basis function, to obtain the leaf morphological trait and photosynthetic characteristic model shown in formula (9);
Steps 2) and 3) may be carried out in either order.
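Steps a to c of the clustering procedure can be sketched in code. The following is a minimal pure-Python sketch of the standard (textbook) fuzzy C-means iteration, using the stopping threshold 1e-5 and the 300-iteration cap stated above. The specific modification that makes the algorithm "improved" is not recoverable from this text, so it is not represented here; the function name `fuzzy_c_means`, the `fuzzifier` exponent (assumed to be 2) and the `seed` argument are illustrative assumptions.

```python
import random

def fuzzy_c_means(X, c, fuzzifier=2.0, eps=1e-5, max_iter=300, seed=0):
    """Standard fuzzy C-means (not the patent's improved variant).

    X is a list of n samples, each a list of m feature values.
    Returns (U, P): the c x n membership matrix and the c x m cluster centres.
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    # Step a: random membership matrix, each column normalised to sum to 1.
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for j in range(n):
        s = sum(U[i][j] for i in range(c))
        for i in range(c):
            U[i][j] /= s
    P_old = None
    for _ in range(max_iter):
        # Step b, part 1: each centre is a membership-weighted mean of the samples.
        P = []
        for i in range(c):
            w = [U[i][j] ** fuzzifier for j in range(n)]
            tot = sum(w)
            P.append([sum(w[j] * X[j][k] for j in range(n)) / tot for k in range(m)])
        # Step b, part 2: memberships from squared distances to the centres.
        for j in range(n):
            d2 = [sum((X[j][k] - P[i][k]) ** 2 for k in range(m)) for i in range(c)]
            zeros = [i for i in range(c) if d2[i] == 0.0]
            if zeros:
                # The sample coincides with one or more centres.
                for i in range(c):
                    U[i][j] = 1.0 / len(zeros) if i in zeros else 0.0
            else:
                inv = [d ** (-1.0 / (fuzzifier - 1.0)) for d in d2]
                s = sum(inv)
                for i in range(c):
                    U[i][j] = inv[i] / s
        # Step c: stop once no centre coordinate moves by eps or more.
        if P_old is not None:
            shift = max(abs(P[i][k] - P_old[i][k]) for i in range(c) for k in range(m))
            if shift < eps:
                break
        P_old = P
    return U, P
```

Each column of the returned U gives one sample's degrees of membership across the c clusters, which is the per-cluster membership information that step 7) builds on.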
Preferably, the random forest screening in step 4) selects as important feature variables those variables for which the mean of Mean Decrease Accuracy and Mean Decrease Gini is 15 or more.
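The selection rule can be illustrated with a few lines of Python. This is only a sketch of the thresholding step, assuming the two importance scores have already been produced by a random forest; the function name `select_important` and all scores below are hypothetical and are not data from the patent.

```python
def select_important(importance, threshold=15.0):
    """Keep variables whose mean of the two importance scores reaches the threshold.

    importance maps a variable name to (MeanDecreaseAccuracy, MeanDecreaseGini).
    """
    selected = []
    for name, (mda, mdg) in importance.items():
        if (mda + mdg) / 2.0 >= threshold:
            selected.append(name)
    return selected

# Hypothetical importance scores, for illustration only.
scores = {
    "leaf_area": (22.0, 18.5),
    "stomatal_conductance": (9.1, 12.3),
    "methylation_level": (16.0, 14.5),
}
print(select_important(scores))  # ['leaf_area', 'methylation_level']
```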
Preferably, the method for establishing the LS-SVM regression prediction model with a Gaussian radial basis function comprises the following steps:
In SVM, the training set is denoted T. In the training set T, xi is the input variable of the i-th training sample, yi is the output variable of the i-th training sample, R denotes the real numbers and n is the number of input samples; the regression function is given by formula (4):
In formula (4), w and b are the regression parameters, φ(·) is the feature map, and x is the input variable of the training set;
In LS-SVM the solution is converted into the problem of formula (5):
In formula (5), γ is the regularization parameter, ξ is the slack variable, and I_{n×1} = (1, 1, ..., 1)′ is the n × 1 vector of ones;
The Lagrange function is constructed as follows:
where α is the Lagrange multiplier; next, the saddle point of L(w, b, ξ, α), i.e. the optimum, is found;
Eliminating w and ξ from formula (7) gives:
In formula (8), Ω = (xi x′j), i, j = 1, 2, ..., n, and E is the n × n identity matrix.
α and b are obtained by solving formula (8), and the estimation function of the least squares support vector machine is then:
where k(xi, x′j) is the kernel function; the Gaussian radial basis kernel function is chosen.
The input variables are standardized before regression, and parameter optimization gives γ = 10 and σ = 1.
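The fitting procedure can be sketched as follows. This is a minimal pure-Python sketch of textbook LS-SVM regression, assuming that formula (8) is the usual linear system [[0, 1′], [1, Ω + E/γ]] [b; α] = [0; y] and that the kernel is k(x, z) = exp(−‖x − z‖² / (2σ²)). The exact scaling conventions in the patent's formulas are not visible in this text, so both are assumptions, as are the function names.

```python
import math

def rbf(x, z, sigma):
    """Gaussian radial basis kernel (the 2*sigma^2 scaling is an assumption)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2.0 * sigma ** 2))

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Return (b, alpha) from the assumed LS-SVM linear system."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [rbf(X[i], X[j], sigma) + (1.0 / gamma if i == j else 0.0)
                          for j in range(n)])
    sol = solve(A, [0.0] + list(y))
    return sol[0], sol[1:]

def lssvm_predict(x, X, b, alpha, sigma=1.0):
    """Estimation function: kernel-weighted sum over the training samples plus b."""
    return b + sum(a * rbf(x, xi, sigma) for a, xi in zip(alpha, X))
```

The defaults γ = 10 and σ = 1 follow the values stated above; inputs are expected to be standardized before `lssvm_fit` is called.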
Preferably, the method for determining the DNA methylation state of the restriction fragments in the representative leaf samples in step 3) comprises the following steps:
31) digesting the genomic DNA of the representative leaf samples with the EcoRI/HpaII and EcoRI/MspI restriction enzymes to obtain restriction fragments;
32) subjecting the restriction fragments to pre-amplification and then selective amplification, typing the selective amplification products obtained, and deriving the DNA methylation state of the restriction fragments from the typing results.
Preferably, the typing consists of electrophoresing the selective amplification products and scoring the resulting electrophoretic bands in a binary character matrix, with "0" denoting an absent band and "1" a present band; CNG (1,0) represents the hemimethylated state, CG (0,1) the fully methylated state, (1,1) the unmethylated state, and (0,0) an unknown methylation state. The genome-wide methylation level is calculated by formula (10):
Genome-wide DNA methylation level = (hemimethylated sites + fully methylated sites + sites of unknown methylation state) / (hemimethylated sites + fully methylated sites + sites of unknown methylation state + unmethylated sites)    formula (10).
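The scoring scheme and formula (10) translate directly into code. The sketch below assumes that each site is scored as a pair (H, M), where H is the EcoRI/HpaII band and M the EcoRI/MspI band; the function name and the example band vectors are illustrative, not data from the patent.

```python
# Band pattern (H, M) -> methylation state, as defined in the scoring scheme.
STATES = {
    (1, 0): "hemimethylated",
    (0, 1): "fully_methylated",
    (1, 1): "unmethylated",
    (0, 0): "unknown",
}

def methylation_level(h_bands, m_bands):
    """Formula (10): every state except 'unmethylated' counts towards the numerator."""
    counts = {state: 0 for state in STATES.values()}
    for h, m in zip(h_bands, m_bands):
        counts[STATES[(h, m)]] += 1
    total = sum(counts.values())
    methylated = total - counts["unmethylated"]
    return methylated / total

# Illustrative band scores for five sites.
h = [1, 1, 0, 0, 1]
m = [1, 0, 1, 0, 1]
print(methylation_level(h, m))  # 0.6
```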
Preferably, the number of representative leaf samples is 200 or more.
Preferably, the leaf shape factor is calculated by formula (11):
The present invention also provides a prediction method using the model that predicts leaf morphological traits and photosynthetic characteristics of woody plants from DNA methylation level: the DNA methylation level of the important restriction-fragment combinations is input into the leaf morphological trait and photosynthetic characteristic model built by the above method to predict the leaf shape factor, leaf area and leaf net photosynthetic rate;
The DNA methylation level of the important restriction-fragment combinations is calculated for the important restriction-fragment combinations obtained in the construction method, using the genome-wide methylation level formula.
Preferably, marginal effect plots are drawn to show how strongly the DNA methylation level of the restriction-fragment combinations influences the leaf shape factor, leaf area and net photosynthetic rate.
The marginal effect plots are drawn by taking the DNA methylation level of the important restriction-fragment combinations as the input variables and then calling the marginal effect plotting function provided in gbm, which plots the marginal effect of the more important restriction fragments on the leaf shape factor, leaf area and leaf net photosynthetic rate.
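What a marginal effect (partial dependence) curve computes can be shown with a brute-force sketch. This is not gbm's internal implementation; it assumes only that some fitted model is available as a `predict` function, and every name in it is illustrative.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction over the data with one feature fixed at each grid value.

    predict: function taking one sample (a list of feature values).
    X:       list of samples.
    feature: index of the feature whose marginal effect is wanted.
    grid:    values at which to evaluate that feature.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)
            modified[feature] = value  # fix the feature, keep the others as observed
            total += predict(modified)
        curve.append(total / len(X))
    return curve
```

Plotting `grid` against the returned curve for one restriction fragment gives the kind of marginal effect figure described above, with the methylation level on the horizontal axis and the averaged predicted trait on the vertical axis.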
In the construction method provided by the present invention, photosynthetic characteristics and phenotypic traits are selected with a random forest before the representative leaf samples are clustered, which yields the feature variables that have a large influence on the geographic distribution differences of the woody plant. When the representative leaf samples are clustered, traditional clustering methods (K-Means clustering, PAM clustering, etc.) are not used; an improved fuzzy C-means clustering algorithm is used instead, whose output has a probabilistic meaning and retains the uncertainty of the differences between individual woody plants. Before the prediction model is established, the more important restriction-fragment combinations are screened first, which reduces the complexity of the prediction model and of practical operation and makes the prediction more accurate. The present invention also establishes, for the first time, a research strategy for the epigenetic population genetics of forest trees, resolves the epigenetic structure of a forest tree population, and proposes that DNA methylation may influence the phenotypic traits and photosynthetic characteristics of the natural Populus simonii population.
Further, the method provided by the present invention draws marginal effect plots of the DNA methylation level of the screened important restriction-fragment combinations against phenotypic traits and photosynthetic characteristics, enabling quantitative study of the influence of DNA methylation on phenotypic traits and photosynthetic characteristics.
Description of the drawings
Fig. 1 shows the screening of Populus simonii feature variables with the random forest model in Embodiment 1;
Fig. 2 shows the determination of the optimal number of clusters of the Populus simonii samples based on SSE and the 26 cluster evaluation indices in Embodiment 1;
Fig. 3 shows the individuals in each class of samples determined with the improved FCM algorithm in Embodiment 1;
Fig. 4 shows the clustering of the correlations between the feature variables in Embodiment 1;
Fig. 5 shows the importance of the restriction fragments for the three variables in the first cluster of Populus simonii samples in Embodiment 1;
Fig. 6 shows the importance of the restriction fragments for the three variables in the first cluster of Populus simonii samples in Embodiment 1;
Fig. 7 shows the marginal effect of the DNA methylation level of the important restriction-fragment combinations on leaf area in Embodiment 2; Fig. 7-1 is the first cluster of Populus simonii samples and Fig. 7-2 is the second cluster of Populus simonii samples;
Fig. 8 shows the marginal effect of the DNA methylation level of the important restriction-fragment combinations on net photosynthetic rate in Embodiment 2; Fig. 8-1 is the first cluster of Populus simonii samples and Fig. 8-2 is the second cluster of Populus simonii samples;
Fig. 9 shows the marginal effect of the DNA methylation level of the important restriction-fragment combinations on the leaf shape factor in Embodiment 2; Fig. 9-1 is the first cluster of Populus simonii samples and Fig. 9-2 is the second cluster of Populus simonii samples;
Fig. 10 shows the leaf morphology of the two poplar subpopulations in Embodiment 2.
Detailed description of the embodiments
The present invention provides a method for constructing a model that predicts leaf morphological traits and photosynthetic characteristics of woody plants from DNA methylation level, comprising the following steps:
1) collecting leaves of one woody plant species across its natural distribution nationwide to form representative leaf samples;
2) measuring the photosynthetic characteristics and phenotypic traits of the representative leaf samples to obtain leaf photosynthetic characteristic data and leaf morphological trait data;
The phenotypic traits comprise leaf area, leaf length, leaf width, leaf perimeter, length-to-width ratio and leaf shape factor;
The photosynthetic characteristics comprise net photosynthetic rate, stomatal conductance, intercellular CO2 concentration and water use efficiency;
3) determining the DNA methylation state of the restriction fragments in the representative leaf samples to obtain the DNA methylation level of the restriction fragments, and calculating the genome-wide average DNA methylation level of each representative leaf sample;
4) taking the leaf photosynthetic characteristic data, the leaf morphological trait data and the genome-wide average DNA methylation level of the leaves as candidate variables, and screening with a random forest model for the important feature variables that differ between geographic locations, which yields the leaf morphological traits and the genome-wide average DNA methylation level;
5) using the leaf morphological trait data and the genome-wide average DNA methylation level as feature variables, determining the optimal number of clusters of the representative leaf samples from the within-group sum of squared errors together with the 26 cluster evaluation indices in the NbClust package;
6) inputting the optimal number of clusters of the representative leaf samples into an improved fuzzy C-means clustering algorithm and calculating the membership matrix of each cluster of samples;
The improved fuzzy C-means clustering algorithm is as follows:
a. Given the optimal number of clusters c and the number of samples n, set the iteration stopping threshold ε = 10^-5 and the maximum number of iterations to 300, randomly initialize the membership matrix U, and set the iteration counter t = 0;
Let X = {x1, x2, ..., xn} be a finite set whose elements each have m feature variables; X is expressed as the following n × m matrix:
where m is the number of feature variables and n is the number of representative leaf samples;
The n samples of matrix X are divided into c classes (2 ≤ c ≤ n), and the fuzzy clustering matrix U is:
where μij denotes the degree of membership of sample xj in class i, and
The c cluster centres are:
The minimum sum of squared errors is chosen as the clustering criterion, and the objective function of the cluster analysis is:
Adding the constraint gives:
Solving gives:
b. Update the fuzzy clustering matrix and the cluster centre matrix according to formula (3);
c. If ‖P(t) − P(t−1)‖ < ε, stop and output the fuzzy clustering matrix U and the cluster centre matrix P; otherwise set t = t + 1 and return to step b until the fuzzy clustering matrix U and the cluster centre matrix P are obtained;
7) based on the membership matrix of each cluster of samples, calculating, within each cluster, the correlation between the genome-wide average DNA methylation level and each leaf feature variable, together with the importance of the restriction-fragment combinations obtained from gradient boosted trees, to obtain the important restriction-fragment combinations of each cluster of samples;
The leaf feature variables comprise leaf area, net photosynthetic rate and leaf shape factor;
8) using the DNA methylation level of the important restriction-fragment combinations as the regression variable, establishing an LS-SVM regression prediction model with a Gaussian radial basis function, and inputting the DNA methylation level of the important restriction-fragment combinations into the LS-SVM regression prediction model to predict the leaf shape factor, leaf area and leaf net photosynthetic rate.
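The matrices and formulas (1) to (3) referred to in step 6) do not appear in this text. As a reading aid, the block below gives the textbook fuzzy C-means formulation written with the symbols defined above. It is an assumption that the patent's formulas take this general form, and whatever modification makes the algorithm "improved" is not shown. The fuzzifier exponent q > 1 and the multipliers λ_j are notation introduced here, since m is already used for the number of feature variables.

```latex
% Data matrix (n samples, m feature variables) and fuzzy clustering matrix
X = \begin{pmatrix} x_{11} & \cdots & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nm} \end{pmatrix},
\qquad
U = (\mu_{ij})_{c \times n}, \quad \mu_{ij} \in [0,1], \quad \sum_{i=1}^{c} \mu_{ij} = 1 \ \ (j = 1, \dots, n)

% Cluster centres
P = (p_1, p_2, \dots, p_c)^{\top}

% Objective function, cf. formula (1)
J(U, P) = \sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij}^{\,q} \, \lVert x_j - p_i \rVert^2

% Objective with the membership constraint added, cf. formula (2)
L(U, P, \lambda) = J(U, P) + \sum_{j=1}^{n} \lambda_j \Bigl( \sum_{i=1}^{c} \mu_{ij} - 1 \Bigr)

% Stationary point: update rules, cf. formula (3)
\mu_{ij} = \Biggl[ \sum_{l=1}^{c} \Bigl( \frac{\lVert x_j - p_i \rVert}{\lVert x_j - p_l \rVert} \Bigr)^{\frac{2}{q-1}} \Biggr]^{-1},
\qquad
p_i = \frac{\sum_{j=1}^{n} \mu_{ij}^{\,q} \, x_j}{\sum_{j=1}^{n} \mu_{ij}^{\,q}}
```

Under this reading, step b alternates the two update rules and step c stops once the cluster centres change by less than ε between iterations.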
The present invention collects leaves of a single woody plant species across its natural distribution nationwide to obtain representative leaf samples.
In the present invention, the leaves are preferably collected in Shaanxi, Qinghai, Hebei, Henan, Ningxia, Shanxi, Beijing and Inner Mongolia. The species of woody plant is not particularly limited, and the method provided by the present invention is applicable to all woody plants. The woody plant is preferably of the genus Populus, most preferably Populus simonii. The number of leaf samples of the woody plant is preferably 1000 or more. The representative leaf samples are selected from the collected leaves of the woody plant, the selection criterion being that they cover the entire geographic distribution of the natural Populus simonii population. The number of representative leaf samples is 200~500.
After the representative leaf samples are obtained, the present invention measures their photosynthetic characteristics and phenotypic traits to obtain leaf photosynthetic characteristic data and leaf morphological trait data. The leaf morphological traits comprise leaf area, leaf length, leaf width, leaf perimeter, length-to-width ratio and leaf shape factor; the photosynthetic characteristics comprise net photosynthetic rate, stomatal conductance, intercellular CO2 concentration and water use efficiency.
In the present invention, the five phenotypic traits leaf area, leaf length, leaf width, leaf perimeter and length-to-width ratio are measured with a laser leaf area meter. The laser leaf area meter is not particularly limited, and any leaf area meter known in the art may be used. In the embodiments of the present invention, the leaf area meter is a portable laser leaf area meter (CI-202). In the present invention, the leaf shape factor is calculated by formula (11):
In the present invention, the photosynthetic characteristics are measured with a portable gas exchange system (Li-6400xt, LiCor). To obtain the maximum instantaneous net photosynthetic rate, the photosynthetic photon flux density (PPFD) is set to 1600 and the CO2 concentration to 400. Net photosynthetic rate, stomatal conductance, intercellular CO2 concentration and water use efficiency (WUE) are all recorded during the net photosynthetic rate measurement.
The present invention determines the DNA methylation state of the restriction fragments in the representative leaf samples, obtains the DNA methylation level of the restriction fragments, and calculates the genome-wide average DNA methylation level of each representative leaf sample.
In the present invention, the method for determining the DNA methylation state of the restriction fragments in the representative leaf samples preferably comprises the following steps:
31) digesting the genomic DNA of the representative leaf samples separately with the EcoRI/HpaII and EcoRI/MspI restriction enzymes to obtain restriction fragments;
32) subjecting the restriction fragments to pre-amplification and then selective amplification, typing the selective amplification products obtained, and deriving the DNA methylation state of the restriction fragments from the typing results.
The present invention preferably extracts the genomic DNA of the representative leaf samples.
In the present invention, the method for extracting the genomic DNA of the representative leaf samples is not particularly limited, and any genomic DNA extraction method known in the art may be used. In the embodiments of the present invention, the genomic DNA of the representative leaf samples is extracted with a DNA extraction kit. The kit used is the DNA Plant Mini Kit (Qiagen China, Shanghai).
After the genomic DNA of the representative leaf samples is obtained, the present invention digests it separately with the EcoRI/HpaII and EcoRI/MspI restriction enzymes to obtain restriction fragments.
In the present invention, the source of the EcoRI/HpaII and EcoRI/MspI restriction enzymes is not particularly limited, and any enzyme source known in the art may be used. The digestion method is likewise not particularly limited, and any digestion method for the EcoRI/HpaII and EcoRI/MspI restriction enzymes known in the art may be used.
In the present invention, the endonuclease bamhi carries out amplification and selective amplification in advance, obtains selective amplification PCR product. The present invention is not particularly limited the pre- amplification and selective amplification, the Variation in delivered using Dong Ci et al. genomic methylation in natural populations of Populus simonii is associated With leaf shape and photosynthetic traits (Journal of Experimental Botany, Vol.67, No.3pp.723-737,2016).
After obtained selective amplification product, the present invention carries out parting to the selective amplification product, according to parting knot Fruit obtains the methylation state of DNA of endonuclease bamhi.
In the present invention, when the selective amplification product is EcoRI/HpaII digestion with restriction enzyme product bands It is indicated with H;The selective amplification product is indicated when being EcoRI/MspI digestion with restriction enzyme product bands with M.If choosing After the PCR product of selecting property amplification carries out electrophoresis, HM has band, then illustrates that the two cut place does not all methylate, i.e. CCGG sequences Row;If H has band, when M is without band, be hemimethylation also have be it is outer methylate, i.e., 5 ' mCCGG sequences;And H is without band, when M has band, Permethylated, also say be it is interior methylate, i.e. 5 ' CmCGG sequences.HpaII cannot cut any full methyl of double-strand cytimidine Change, can only methylate outside cutting single-chain.As for MspI, it can cut inside methylation sites, either it is permethylated, It can also be hemimethylation.
After obtained selective amplification product, the present invention carries out parting to the selective amplification product, according to parting knot Fruit obtains the methylation state of DNA of endonuclease bamhi.
In the present invention, the typing is preferably carried out by subjecting the selective amplification products to electrophoresis and scoring the resulting bands in a binary character matrix, with "0" denoting absence of a band and "1" denoting presence of a band. Written as (H, M), the pattern (1,0) represents the hemimethylated state (CNG), (0,1) represents the fully methylated state (CG), (1,1) represents the unmethylated state, and (0,0) represents an unknown methylation state. The whole-genome methylation level is calculated according to formula (10):
Whole-genome DNA methylation level = (number of hemimethylated sites + number of fully methylated sites + number of sites of unknown methylation state) / (number of hemimethylated sites + number of fully methylated sites + number of sites of unknown methylation state + number of unmethylated sites)   formula (10).
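The scoring rule and formula (10) can be sketched in a few lines of Python. This is only an illustration: the function name `methylation_level`, the state labels and the five-site example are assumptions, and the (1,1) pattern is treated as unmethylated, as in the band interpretation given for H and M.

```python
from collections import Counter

# (H, M) band pattern -> methylation state, following the coding described in the text.
STATE_BY_PATTERN = {
    (1, 0): "hemi",          # band in the EcoRI/HpaII lane only
    (0, 1): "full",          # band in the EcoRI/MspI lane only
    (1, 1): "unmethylated",  # band in both lanes
    (0, 0): "unknown",       # no band in either lane
}

def methylation_level(h_bands, m_bands):
    """Apply formula (10) to one sample's binary H and M band vectors."""
    if len(h_bands) != len(m_bands):
        raise ValueError("H and M vectors must cover the same sites")
    counts = Counter(STATE_BY_PATTERN[(h, m)] for h, m in zip(h_bands, m_bands))
    numerator = counts["hemi"] + counts["full"] + counts["unknown"]
    total = numerator + counts["unmethylated"]
    return numerator / total if total else 0.0

# Hypothetical band scores for five sites of one sample (not data from the patent).
h = [1, 0, 1, 0, 1]
m = [0, 1, 1, 0, 1]
level = methylation_level(h, m)  # one hemi, one full, one unknown, two unmethylated
```

Note that formula (10) counts sites of unknown state in the numerator, so the sketch does the same.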
After the photosynthetic characteristic data, the leaf phenotypic characteristic data and the average whole-genome DNA methylation level of the leaves are obtained, the present invention takes them as candidate variables and uses a random forest model to screen the important feature variables that account for differences between geographic locations, obtaining the leaf phenotypic characteristics and the whole-genome average DNA methylation level.
In the present invention, the random forest screening takes as important feature variables those variables for which the mean of Mean Decrease Accuracy and Mean Decrease Gini is 15 or more. In this application, with Populus simonii as the sample, the important feature variables obtained are the leaf phenotypic characteristic data and the average whole-genome DNA methylation level of the leaves.
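As a small illustration of the threshold rule, the following Python sketch filters variables by the mean of the two importance measures. The function name `select_important` and all importance values are hypothetical; the patent obtains the two measures from a random forest model, which is not reproduced here.

```python
def select_important(importance, threshold=15.0):
    """Keep variables whose mean of the two importance measures reaches the threshold.

    `importance` maps a variable name to (MeanDecreaseAccuracy, MeanDecreaseGini).
    """
    selected = {}
    for name, (mda, mdg) in importance.items():
        score = (mda + mdg) / 2.0
        if score >= threshold:
            selected[name] = score
    # Highest mean importance first.
    return dict(sorted(selected.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical importance values, for illustration only.
importance = {
    "width": (20.0, 18.0),
    "Dmavg": (16.0, 14.0),
    "Pn": (9.0, 12.0),
}
selected = select_important(importance)
```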
After the leaf phenotypic characteristics and the whole-genome average DNA methylation level are obtained, the present invention takes the leaf phenotypic characteristic data and the whole-genome average DNA methylation level as feature variables and determines the optimal number of cluster groups for the representative leaf samples using the within-group sum of squared errors together with the 26 clustering evaluation indices in the NbClust package.
In the present invention, when the within-group sum of squared errors (SSE) is used for the assessment, the number of clusters with the smallest slope is selected; when the 26 clustering evaluation indices are used, the number of clusters supported by the largest number of indices (Number of Criteria) is selected as the optimal number of cluster groups.
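The "largest number of indices" rule amounts to a majority vote. A minimal sketch in Python, assuming each evaluation index has already proposed a cluster count; the list of proposals is hypothetical and the tie-break toward the smaller count is an assumption, not something the text specifies.

```python
from collections import Counter

def best_cluster_count(recommendations):
    """Return the cluster count proposed by the most indices, plus the vote tally."""
    votes = Counter(recommendations)
    top = max(votes.values())
    # Assumed tie-break: prefer the smaller number of clusters.
    best = min(k for k, v in votes.items() if v == top)
    return best, votes

# Hypothetical proposals from eight indices (the patent uses 26).
proposals = [2, 2, 3, 2, 4, 2, 3, 2]
best, votes = best_cluster_count(proposals)
```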
The present invention inputs the optimal number of clusters of the representative leaf samples into the improved fuzzy C-means clustering algorithm and calculates the membership matrix of each cluster of samples;
The improved fuzzy C-means clustering algorithm is as follows:
a. Given the optimal number of cluster groups c and the number of samples n, set the iteration stopping threshold to ε = 10^-5 and the maximum number of iterations to 300, randomly initialize the membership matrix U, and set the iteration counter t = 0;
Let the finite set X = {x1, x2, ..., xn}, where each member of X has m feature variables; X is expressed as an n × m matrix as follows:
where m denotes the number of feature variables and n denotes the number of representative leaf samples;
The n samples of matrix X are divided into c classes (2 ≤ c ≤ n), and the fuzzy clustering matrix U is:
where μij denotes the membership of sample xj in class i, and 0 ≤ μij ≤ 1.
The c cluster centres are:
The minimum sum of squared errors is selected as the clustering criterion, and the objective function of the clustering is:
Adding the constraint gives:
Solving gives:
b. Compute the updated fuzzy clustering matrix and cluster-centre matrix according to formula (3);
c. If ‖P(t) − P(t−1)‖ < ε, stop the computation and output the fuzzy clustering matrix U and the cluster-centre matrix P; otherwise set t = t + 1 and return to step b until the fuzzy clustering matrix U and the cluster-centre matrix P are obtained.
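Steps a to c follow the usual alternating scheme of fuzzy C-means. The sketch below implements the textbook algorithm in plain Python, not the patent's improved variant, whose formulas are not reproduced in this text. The fuzzifier value of 2, the seed, the stopping test on the largest centre coordinate change and the toy data are all assumptions.

```python
import random

def fuzzy_c_means(X, c, fuzzifier=2.0, eps=1e-5, max_iter=300, seed=0):
    """Textbook fuzzy C-means; returns (U, P): c x n memberships and c centres."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    # Step a: random membership matrix whose columns (one per sample) sum to 1.
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for j in range(n):
        col = sum(U[i][j] for i in range(c))
        for i in range(c):
            U[i][j] /= col
    P_old = None
    power = 1.0 / (fuzzifier - 1.0)
    for _ in range(max_iter):
        # Step b, part 1: centres are membership-weighted means of the samples.
        P = []
        for i in range(c):
            w = [U[i][j] ** fuzzifier for j in range(n)]
            total = sum(w)
            P.append([sum(w[j] * X[j][k] for j in range(n)) / total for k in range(m)])
        # Step b, part 2: memberships from squared distances to the centres.
        for j in range(n):
            d2 = [sum((X[j][k] - P[i][k]) ** 2 for k in range(m)) for i in range(c)]
            if min(d2) == 0.0:
                # Sample coincides with a centre: full membership there.
                hits = [i for i in range(c) if d2[i] == 0.0]
                for i in range(c):
                    U[i][j] = 1.0 / len(hits) if i in hits else 0.0
            else:
                inv = [(1.0 / d) ** power for d in d2]
                s = sum(inv)
                for i in range(c):
                    U[i][j] = inv[i] / s
        # Step c: stop once the centres move by less than eps.
        if P_old is not None:
            shift = max(abs(P[i][k] - P_old[i][k]) for i in range(c) for k in range(m))
            if shift < eps:
                break
        P_old = P
    return U, P

# Hypothetical two-feature samples forming two well-separated groups.
X = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]]
U, P = fuzzy_c_means(X, 2)
labels = [max(range(2), key=lambda i: U[i][j]) for j in range(len(X))]
```

Each column of `U` holds one sample's memberships across the clusters, so assigning a sample to the cluster with the largest membership gives a hard partition such as the two classes reported in Embodiment 1.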
After the membership matrix of each cluster of samples is obtained, the present invention, based on that membership matrix, calculates for each cluster the correlation between the whole-genome average DNA methylation level and each leaf feature variable, and obtains the importance of the enzyme combinations from a gradient boosted tree.
In the present invention, the importance of the enzyme combinations is determined as follows:
First, the correlation between the variables is analysed as follows:
To analyse the correlations and joint effects of the variables, the correlation coefficient is converted into a distance metric using d = 1 − |r|, where d is the metric distance and r is the correlation coefficient. The formula for r is $r = \dfrac{\operatorname{Cov}(X, Y)}{\sqrt{D(X)\,D(Y)}}$, where Cov(X, Y) is the covariance of X and Y, and D(X) and D(Y) are the variances of X and Y respectively. Multiscale bootstrap resampling gives a p-value for each cluster of the hierarchical clustering, which is used to assess the uncertainty of the hierarchical clustering. In this way, hierarchical clustering was performed on 51 variables (leaf area, leaf length, leaf width, leaf perimeter, length-to-width ratio, leaf shape factor, net photosynthetic rate, stomatal conductance, CO2 concentration, water use efficiency, and the DNA methylation levels of 41 enzyme combinations), which gave the correlations among the variables; strongly correlated variables are marked with red boxes.
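The conversion from correlation to distance can be illustrated with a short Python sketch. The function names and the three example vectors are assumptions; the hierarchical clustering and the multiscale bootstrap resampling themselves are not reproduced.

```python
from math import sqrt

def pearson_r(x, y):
    """Correlation coefficient r = Cov(X, Y) / sqrt(D(X) * D(Y))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / sqrt(var_x * var_y)

def correlation_distance(x, y):
    """Distance d = 1 - |r|: 0 for perfectly (anti)correlated, 1 for uncorrelated."""
    return 1.0 - abs(pearson_r(x, y))

# Hypothetical variables: y falls linearly with x, z is uncorrelated with x.
x = [1.0, 2.0, 3.0, 4.0]
y = [8.0, 6.0, 4.0, 2.0]
z = [1.0, -1.0, -1.0, 1.0]
d_xy = correlation_distance(x, y)
d_xz = correlation_distance(x, z)
```

Because the absolute value is taken, strongly negative and strongly positive correlations both map to small distances, so variables that move together in either direction end up in the same branch of the clustering.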
Second, the importance of the enzyme combinations is obtained from the gradient boosted tree, as follows:
The gradient boosted tree is trained with the gbm package in R,
with the following parameter settings:
distribution = "gaussian",
n.trees = 10000,
shrinkage = 0.01,
interaction.depth = 5,
bag.fraction = 0.5,
cv.folds = 10.
From the trained model, the gbm package calculates the importance of each input variable to the response variable; the higher the score, the greater the influence on the response variable. This yields the important restriction-fragment combinations of each cluster of samples; combinations with an importance score of 2 or more are selected.
In the present invention, the marginal effect plots are drawn with the marginal-effect plotting function of the gbm package in R: after the gradient boosted tree has been trained and the influential enzyme combinations have been obtained, the marginal effect plots are drawn.
The parameters are set as follows during plotting:
distribution = "gaussian",
n.trees = 10000,
shrinkage = 0.01,
interaction.depth = 5,
bag.fraction = 0.5,
cv.folds = 10.
From the resulting marginal effect plots, the marginal effects of the DNA methylation level on leaf area, net photosynthetic rate and leaf shape factor can be analysed for each class of Populus simonii, and hence the influence of DNA methylation on the leaf traits and photosynthetic characteristics of Populus simonii. Finally, after the important restriction-fragment combinations are obtained, the present invention uses the DNA methylation levels of these combinations as regression variables and establishes an LS-SVM regression prediction model with a Gaussian radial basis function, obtaining the leaf phenotypic characteristic model shown in formula (9);
In the present invention, in the SVM, the training sample set is assumed to be T, with input variables xi and output variables yi (i = 1, 2, ..., n), and the regression function is:
where w and b are the regression parameters and φ is the feature mapping.
In the LS-SVM, solving is converted into the following problem:
where γ is the regularization parameter, ξ is the slack variable, and I n×1 = (1, 1, ..., 1)′;
The Lagrange function is constructed as follows:
where α is the Lagrange multiplier; the saddle point of L(w, b, ξ, α), i.e. the optimum, is then sought;
Eliminating w and ξ from formula (7) gives:
In formula (8), Ω = (xi x′j), i, j = 1, 2, ..., n, and E is the n × n identity matrix.
α and b are obtained by solving formula (8); the estimation function of the least squares support vector machine is then:
where k(xi, x′j) is the kernel function; the Gaussian radial basis function (RBF) kernel is chosen. The input variables are standardized before regression and, after parameter optimization, γ = 10 and σ = 1.
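The LS-SVM fit reduces to solving one linear system for b and α. The plain-Python sketch below follows the standard LS-SVM formulation, with the block system built from a row of ones, the kernel matrix Ω and the term E/γ; it should not be read as the patent's exact formulas (4) to (9), which are not reproduced in this text. The kernel form exp(−‖x − z‖² / (2σ²)), the helper names and the toy data are assumptions.

```python
from math import exp

def rbf(x, z, sigma=1.0):
    # Assumed kernel form: exp(-||x - z||^2 / (2 * sigma^2)).
    return exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / (2.0 * sigma ** 2))

def solve(A, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve [[0, 1'], [1, Omega + E/gamma]] [b, alpha]' = [0, y]' for b and alpha."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf(X[i], X[j], sigma) for j in range(n)]
        row[i + 1] += 1.0 / gamma  # E / gamma on the diagonal
        A.append(row)
    sol = solve(A, [0.0] + list(y))
    return sol[0], sol[1:]

def lssvm_predict(x, X, b, alpha, sigma=1.0):
    """Estimation function f(x) = sum_i alpha_i * k(x, x_i) + b."""
    return sum(a * rbf(x, xi, sigma) for a, xi in zip(alpha, X)) + b

# Hypothetical standardized inputs and responses, for illustration only.
X_train = [[0.0], [0.5], [1.0], [1.5], [2.0]]
y_train = [0.0, 0.4, 0.9, 1.4, 2.1]
b, alpha = lssvm_fit(X_train, y_train, gamma=10.0, sigma=1.0)
fitted = [lssvm_predict(x, X_train, b, alpha) for x in X_train]
```

With γ = 10 the training residuals are not forced to zero; in this formulation each residual equals the corresponding αi divided by γ, which is how the regularization parameter trades fit against smoothness.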
The present invention provides a prediction method using the woody plant leaf phenotypic characteristic and photosynthetic characteristic prediction model based on DNA methylation level: the DNA methylation levels of the important restriction-fragment combinations are input into the leaf phenotypic characteristic and photosynthetic characteristic model constructed by the above method to predict the leaf shape factor, leaf area and leaf net photosynthetic rate;
The DNA methylation level of the important restriction-fragment combinations is the level calculated, according to the whole-genome methylation level formula, for the important restriction-fragment combinations obtained in the construction method.
In the present invention, marginal effect plots are drawn to show the degree of influence of the DNA methylation level of the enzyme combinations on the leaf shape factor, leaf area and net photosynthetic rate.
The marginal effect plots are drawn by taking the DNA methylation levels of the important restriction-fragment combinations as input variables and then calling the marginal-effect plotting function provided in gbm to plot the marginal effect of the more important restriction fragments on the leaf shape factor, leaf area and net photosynthetic rate of each leaf. From the marginal effect plots, the influence of the DNA methylation level of the enzyme combinations on the leaf shape factor, leaf area and net photosynthetic rate can be obtained. Restriction fragments with a larger influence can thus be identified directly and the gene loci on them studied further, which reduces time and cost.
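The idea behind a marginal effect curve can be shown with a brute-force Python sketch: fix one input at each grid value, average the model's predictions over the data, and read off the curve. The gbm package computes these curves with its own routine, so the function below is only the defining idea; the function name, the toy linear model and the data are hypothetical.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over `rows` with column `feature` fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)       # copy so the data are left untouched
            modified[feature] = value  # overwrite the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve

# Hypothetical stand-in for a trained model: rises with input 0, falls with input 1.
def toy_model(row):
    return 2.0 * row[0] - row[1]

# Hypothetical methylation levels of two enzyme combinations in two samples.
rows = [[0.1, 0.2], [0.3, 0.4]]
curve = partial_dependence(toy_model, rows, feature=0, grid=[0.0, 0.5, 1.0])
```

A rising curve corresponds to a trait that increases with the methylation level of that enzyme combination, and a falling curve to one that decreases, which is how the plots in Embodiment 2 are read.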
The construction method and prediction method of the model for predicting woody plant leaf shape and photosynthetic characteristics from DNA methylation level provided by the present invention are described in detail below with reference to embodiments, but these embodiments should not be construed as limiting the scope of the present invention.
Embodiment 1
The samples below are 235 Populus simonii individuals collected across the country.
Acquisition of experimental data:
The genomic DNA methylation levels of the 235 sample Populus simonii individuals were calculated from the restriction-fragment methylation data disclosed by Dong Ci et al. in "Variation in genomic methylation in natural populations of Populus simonii is associated with leaf shape and photosynthetic traits" (Journal of Experimental Botany, Vol. 67, No. 3, pp. 723-737, 2016); the results are shown in Table 1 (CC: Chicheng County; ZJK: Zhangjiakou; FX: Fu County, Shaanxi; LY: Linyou County; LX: Langao County; LC: Luochuan County; GQ: Gaoling County; HZ: Danma, Huzhu County; XH: Xinghai County; W: Dulan County; MY: Menyuan County; SX: Song County; YC: Yichuan County; JL: Zhongning County; NM: Baotou City; NW: Ningwu County; TRT: Taoranting Park, Beijing). At the same time, the leaf phenotypic traits were measured with a portable laser leaf area meter (CI-202): leaf width (abbreviated width), perimeter (perim), leaf area (area), leaf shape factor (fact), leaf length (length) and length-to-width ratio (ratio); the average DNA methylation level is abbreviated Dmavg. The photosynthetic characteristics of the leaves were measured with a portable gas exchange system (LI-6400XT, LI-COR). To obtain the maximum instantaneous net photosynthetic rate, the photosynthetic photon flux density (PPFD) was set to 1600 and the CO2 concentration to 400. Net photosynthetic rate, stomatal conductance, intercellular CO2 concentration and water use efficiency (WUE) were recorded. The photosynthetic characteristic data are shown in Table 2 (place-name abbreviations as above).
1. Classification of the Populus simonii samples with the improved FCM algorithm.
1.1 Selection, based on random forest, of the important feature variables that reflect differences between geographic locations.
Fig. 1 shows the important feature variables reflecting geographic-location differences that were selected with the random forest. As Fig. 1 shows, the differences among the Populus simonii samples depend mainly on seven variables: leaf width (width), perimeter (perim), leaf area (area), leaf shape factor (fact), leaf length (length), length-to-width ratio (ratio) and average DNA methylation level (Dmavg).
1.2 Determination of the optimal number of clusters for the Populus simonii samples. Fig. 2 shows the determination of the optimal number of clusters. As can be seen from Fig. 2, when the within-group sum of squared errors (SSE) is used for the assessment, the number of clusters with the smallest slope is 2; when the 26 clustering evaluation indices are used, the number of clusters supported by the largest number of indices (Number of Criteria) is 2. Combining SSE and the 26 indices, the recommended optimal number of clusters is therefore 2.
1.3 Determination of the individuals in each class with the improved FCM algorithm. As can be seen from Fig. 3, the first class contains 139 Populus simonii individuals and the second class contains 97 Populus simonii individuals.
2. Selection of the regression variables of the prediction model
2.1 Study of the correlations between variables
As can be seen from Fig. 4, the DNA methylation levels of four groups of enzyme combinations are very strongly correlated: H44E4 and H47E4; H80E1 and H80E14; H80E9 and H86E9; and a fourth group of 15 enzyme combinations (H65E7, H63E9, H86E7, H60E3, H60E1, H46E4, H31E1, H60E15, H65E8, H47E6, H63E11, H65E6, H34E3, H46E10, H46E11). These results indicate that DNA methylation is region-specific in the poplar genome, and that strongly correlated DNA methylation levels of enzyme combinations may indicate similar DNA methylation modifications in the corresponding regions.
2.2 Regression variables were selected according to the enzyme-combination importance obtained with the machine learning algorithm (gradient boosted tree). Because this machine learning method involves some error, the important restriction fragments selected differ slightly from one training run to the next. To reduce this error, the restriction fragments that appeared most often over multiple training runs were taken and ranked by importance.
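One way to realise "appeared most often over multiple runs, then ranked by importance" is sketched below in Python. The function name `stable_fragments`, the labels A to D, the scores, the requirement of at least two appearances and the ranking by mean score are all assumptions made for illustration; only the score threshold of 2 comes from the description.

```python
from collections import defaultdict

def stable_fragments(runs, min_score=2.0, min_runs=2):
    """Rank fragments that pass the score threshold in at least `min_runs` runs."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for run in runs:
        for name, score in run.items():
            if score >= min_score:
                counts[name] += 1
                totals[name] += score
    kept = [name for name in counts if counts[name] >= min_runs]
    # Most frequent first; ties broken by mean importance over the qualifying runs.
    return sorted(kept, key=lambda n: (-counts[n], -totals[n] / counts[n]))

# Hypothetical importance scores from three training runs (labels are placeholders).
runs = [
    {"A": 9.1, "B": 6.4, "C": 1.5},
    {"A": 8.7, "B": 2.2, "D": 3.0},
    {"A": 9.5, "C": 2.4, "B": 5.9},
]
ranked = stable_fragments(runs)
```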
As Fig. 5 shows, in the first class of Populus simonii samples, the DNA methylation levels of enzyme combinations H31E1, H65E12, H80E14, H65E5, H80E2, H46E12, H80E12, H65E4 and H80E1 have a large influence on the leaf shape factor; those of H31E3, H46E11, H60E3, H63E10, H44E4, H80E2, H65E5, H63E11 and H60E2 have a large influence on leaf area; and those of H60E3, H31E3, H63E12, H46E4, H65E7, H86E16, H82E5, H80E1 and H63E11 have a large influence on net photosynthetic rate.
As Fig. 6 shows, in the second class of Populus simonii samples, the DNA methylation levels of enzyme combinations H60E15, H60E1, H63E10, H65E6, H34E5, H65E7, H80E13, H82E5 and H60E2 have a large influence on the leaf shape factor; those of H80E13, H60E2, H82E5, H60E15, H63E11, H86E7, H80E9, H86E16, H65E6 and H44E4 have a large influence on leaf area; and those of H65E8, H60E1, H80E2, H86E7, H46E11, H44E15, H63E12, H82E5 and H46E4 have a large influence on net photosynthetic rate.
3. With the selected enzyme combinations as regression variables, the values of the phenotypic and photosynthetic characteristics of Populus simonii (leaf shape factor, leaf area, net photosynthetic rate) were predicted with LS-SVM, and the same characteristics were measured; the results are shown in Tables 3 to 8.
In the first class of Populus simonii samples, the prediction accuracies for leaf shape factor, leaf area and net photosynthetic rate were 96.26%, 94.42% and 96.88%, respectively. In the second class of Populus simonii samples, the prediction accuracies for leaf shape factor, leaf area and net photosynthetic rate reached 81.27%, 92.1% and 95.8%, respectively.
Embodiment 2
Relationship between DNA methylation level and the leaf phenotypic and photosynthetic characteristics of Populus simonii.
To explore the influence of DNA methylation level on the phenotypic traits (leaf shape factor, leaf area and net photosynthetic rate), the marginal effect values were analysed.
In the first class of Populus simonii samples, the DNA methylation levels of enzyme combinations H31E3, H60E3, H63E10, H44E4, H80E2, H65E5, H63E11 and H60E2 have a clear marginal effect on leaf area (see Fig. 7-1). Leaf area decreases as the DNA methylation level of H60E3, H44E4 and H60E2 rises, and increases as the DNA methylation level of H31E3, H63E10, H80E2, H65E5 and H63E11 rises. In the second class of Populus simonii samples, the DNA methylation levels of enzyme combinations H80E13, H60E2, H82E5, H60E15, H63E11, H86E7, H80E9, H86E16, H65E6 and H44E4 have a very clear marginal effect on leaf area (see Fig. 7-2); except for H80E13, H60E15, H63E11 and H86E7, high DNA methylation levels of the other enzyme combinations reduce leaf area.
For net photosynthetic rate (Fig. 8), in the first class of Populus simonii samples the DNA methylation levels of enzyme combinations H60E3, H31E3, H63E12, H46E4, H65E7, H86E16, H82E5, H80E1 and H63E11 have a clear marginal effect. Net photosynthetic rate decreases as the DNA methylation level of H31E3, H63E12 and H65E7 rises, and increases as the DNA methylation level of H60E3, H46E4, H86E16, H82E5, H80E1 and H63E11 rises. In the second class of samples, the DNA methylation levels of enzyme combinations H65E8, H80E2 and H86E7 have a clear marginal effect: net photosynthetic rate decreases as the DNA methylation level of H65E8 and H86E7 rises, and increases as the DNA methylation level of H80E2 rises.
As can be seen from Fig. 9, in the first class of Populus simonii samples the DNA methylation levels of enzyme combinations H31E1, H65E12, H80E14, H65E5, H80E2, H46E12, H80E12, H65E4 and H80E1 have a very clear marginal effect on the leaf shape factor. The higher the DNA methylation level of H31E1, H65E12, H80E14, H65E5, H46E12 and H65E4, the smaller the leaf shape factor, whereas the leaf shape factor increases as the DNA methylation level of H80E2, H80E12 and H80E1 rises. In the second class of Populus simonii samples, the higher the DNA methylation level of H82E5 and H60E2, the larger the leaf shape factor, whereas the leaf shape factor decreases as the DNA methylation level of H60E15, H60E1, H63E10, H65E6, H34E5, H65E7 and H80E13 rises.
Analysis of the two classes of Populus simonii samples shows that a rise in DNA methylation level may reduce the leaf shape factor. Species of Populus are phenotypically diverse, and DNA methylation contributes considerably to the plasticity of Populus. The leaf shape factor can therefore serve as an important reference for phenotypic diversity. The closer the leaf shape factor is to 1, the closer the leaf is to a circle. Populus simonii leaves mainly take the following two forms (see Fig. 10). Calculation of their leaf shape factors shows that for the first leaf form the leaf shape factor is larger, about 0.698-0.7853. For Populus simonii of the second leaf form the leaf shape factor is generally less than 0.5, and the larger its value, the smaller the leaf curvature. Long-term experiment and research have shown that Populus simonii of the second type has stronger stress resistance. It is therefore inferred that DNA methylation may be a cause of the enhanced stress resistance of this type of Populus simonii.
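To make "closer to 1 means closer to a circle" concrete, the Python sketch below uses one common definition of a shape factor, 4πA / P², with A the area and P the perimeter. This definition is an assumption made for illustration: formula (11) of the patent is not reproduced in this text, and the function name and test shapes are hypothetical.

```python
from math import pi

def shape_factor(area, perimeter):
    """Assumed definition: 4 * pi * area / perimeter**2 (equals 1 for a circle)."""
    return 4.0 * pi * area / perimeter ** 2

# A circle of radius 2: area = pi * r^2, perimeter = 2 * pi * r.
circle = shape_factor(pi * 2.0 ** 2, 2.0 * pi * 2.0)

# A square of side 3: less compact than a circle, so the factor drops below 1.
square = shape_factor(3.0 * 3.0, 4.0 * 3.0)

# A long, narrow 1 x 10 rectangle: the factor drops much further.
strip = shape_factor(1.0 * 10.0, 2.0 * (1.0 + 10.0))
```

Under this assumed definition, elongated or deeply indented outlines score well below compact ones, which matches the qualitative reading of the two leaf forms described above.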
The present invention provides a theoretical basis for studying the growth and development of Populus simonii under the action of DNA methylation.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.
Table 4. Measured and predicted values of net photosynthetic rate for the second cluster of leaf samples
Table 5. Measured and predicted values of leaf shape factor for the second cluster of leaf samples
Sample  Measured value of Pn  Predicted value  Sample  Measured value of Pn  Predicted value
CC27    0.09                  0.10210262       FX63    0.65                  0.61308217
CC40    0.15                  0.15696717       FX7     0.1                   0.11401913
FX104   0.05                  0.0656351        FX76    0.6                   0.56567484
Table 6. Measured and predicted values of leaf shape factor for the first cluster of leaf samples
Sample  Measured value  Predicted value  Sample  Measured value  Predicted value
CC10    0.63            0.6242989        FX61    0.59            0.5846074
CC11    0.61            0.6045739        FX64    0.7             0.6869076
CC12    0.6             0.5941265        FX65    0.64            0.636514
Table 7. Measured and predicted values of leaf area for the first cluster of leaf samples
Sample  Measured value  Predicted value  Sample  Measured value  Predicted value
CC10    26.98           27.45045         GQ54    10.78           12.71664
CC11    17.85           19.23766         GQ6     11.85           13.69531
CC12    15.07           16.25292         HZ19    18.23           19.44564
Table 8. Measured and predicted values of net photosynthetic rate for the first cluster of leaf samples
Sample  Measured value  Predicted value  Sample  Measured value  Predicted value
CC10    18.1            17.869152        FX64    15.33           15.324247
CC11    17.07           16.932437        FX65    10.58           10.968846
CC12    17.03           17.026104        FX66    12.21           12.541189

Claims (9)

1. A method for constructing a prediction model of the leaf phenotypic characteristics and photosynthetic characteristics of a woody plant based on DNA methylation level, comprising the following steps:
1) collecting leaves of woody plants of the same species naturally distributed across the country to obtain representative leaf samples;
2) measuring the phenotypic characteristics and photosynthetic characteristics of the representative leaf samples to obtain the photosynthetic characteristic data and the leaf phenotypic characteristic data of the leaves;
The leaf phenotypic characteristics include leaf area, leaf length, leaf width, leaf perimeter, length-to-width ratio and leaf shape factor;
The photosynthetic characteristics include net photosynthetic rate, stomatal conductance, CO2 concentration and water use efficiency;
3) determining the DNA methylation state of the restriction fragments in the representative leaf samples to obtain the DNA methylation level of the restriction fragments, and calculating the whole-genome average DNA methylation level of each representative leaf sample;
4) taking the photosynthetic characteristic data, the leaf phenotypic characteristic data and the average whole-genome DNA methylation level of the leaves as candidate variables, and screening with a random forest model the important feature variables that account for differences between geographic locations, to obtain the leaf phenotypic characteristic data and the whole-genome average DNA methylation level;
5) taking the leaf phenotypic characteristic data and the whole-genome average DNA methylation level as feature variables, and determining the optimal number of cluster groups of the representative leaf samples using the within-group sum of squared errors and the 26 clustering evaluation indices in the NbClust package;
6) inputting the optimal number of cluster groups of the representative leaf samples into an improved fuzzy C-means clustering algorithm and calculating the membership matrix of each cluster of samples;
The improved fuzzy C-means clustering algorithm is as follows:
a. Given the optimal number of cluster groups c and the number of samples n, set the iteration stopping threshold to ε = 10^-5 and the maximum number of iterations to 300, randomly initialize the membership matrix U, and set the iteration counter t = 0;
Let the finite set X = {x1, x2, ..., xn}, where each member of X has m feature variables; X is expressed as an n × m matrix as follows:
where m denotes the number of feature variables and n denotes the number of representative leaf samples;
The n samples of matrix X are divided into c groups (2 ≤ c ≤ n), and the fuzzy clustering matrix U of the c groups is:
In the matrix U, μij denotes the membership of sample xj in class i, and 0 ≤ μij ≤ 1; the c cluster centres are:
The minimum sum of squared errors is selected as the clustering criterion, and the objective function of the clustering is shown in formula (1):
Adding the constraint gives the formula shown in formula (2):
Solving gives the formula shown in formula (3):
b. Compute the updated fuzzy clustering matrix and cluster-centre matrix according to formula (3);
c. If ‖P(t) − P(t−1)‖ < ε, stop the computation and output the fuzzy clustering matrix U and the cluster-centre matrix P; otherwise set t = t + 1 and return to step b until the matrices U and P are output;
7) based on the membership matrix of each cluster of samples obtained in step 6), calculating for each cluster the correlation between the whole-genome average DNA methylation level and each leaf feature variable, and the enzyme-combination importance obtained from a gradient boosted tree, to obtain the important restriction-fragment combinations of each cluster of samples;
The leaf feature variables include leaf area, net photosynthetic rate and leaf shape factor;
8) taking the DNA methylation levels of the important restriction-fragment combinations obtained in step 7) as regression variables, and establishing an LS-SVM regression prediction model with a Gaussian radial basis function, to obtain the leaf phenotypic characteristic and photosynthetic characteristic model shown in formula (9);
There is no restriction on the chronological order of steps 2) and 3).
2. The construction method according to claim 1, characterized in that, in step 4), the screening with the random forest model selects as important feature variables those variables for which the mean of Mean Decrease Accuracy and Mean Decrease Gini is 15 or more.
3. The method according to claim 1 for predicting the leaf phenotypic characteristics and photosynthetic characteristics of a woody plant, characterized in that establishing the LS-SVM regression prediction model with a Gaussian radial basis function comprises the following steps:
In the SVM, the training sample set is assumed to be T; in the training sample set T, xi is the input variable of the i-th sample, yi is the output variable of the i-th sample, R denotes the field of real numbers, and n denotes the number of input samples; the regression function is
In formula (4), w and b are the regression parameters, φ is the feature mapping, and x is the input variable of the training sample set;
In the LS-SVM, solving is converted into the following problem:
In formula (5), γ is the regularization parameter, ξ is the slack variable, and I n×1 = (1, 1, ..., 1)′ is a column vector of n ones;
The Lagrange function is constructed as follows:
where α is the Lagrange multiplier; the saddle point of L(w, b, ξ, α), i.e. the optimum, is then sought;
Eliminating w and ξ from formula (7) gives:
In formula (8), Ω = (xi x′j), i, j = 1, 2, ..., n, and E is the n × n identity matrix.
α and b are obtained by solving formula (8); the estimation function of the least squares support vector machine is then:
where k(xi, x′j) is the kernel function, and the Gaussian radial basis kernel function is chosen
The input variables are standardized before regression and, after parameter optimization, γ = 10 and σ = 1.
4. The construction method according to claim 1, characterized in that the method for determining the DNA methylation state of the restriction fragments in the representative leaf samples in step 3) comprises the following steps:
31) digesting the genomic DNA of the representative leaf samples with the EcoRI/HpaII and EcoRI/MspI restriction enzymes to obtain restriction fragments;
32) subjecting the restriction fragments to pre-amplification and then selective amplification, typing the resulting selective amplification products, and obtaining the DNA methylation state of the restriction fragments from the typing result.
5. The construction method according to claim 4, characterized in that the typing consists of subjecting the selective amplification products to electrophoresis and scoring the resulting bands in a binary character matrix, with "0" denoting absence of a band and "1" denoting presence of a band; CNG (1,0) represents the hemimethylated state, CG (0,1) represents the fully methylated state, (1,1) represents the unmethylated state, and (0,0) represents an unknown methylation state; the whole-genome methylation level is calculated as shown in formula (10):
Whole-genome DNA methylation level = (number of hemimethylated sites + number of fully methylated sites + number of sites of unknown methylation state) / (number of hemimethylated sites + number of fully methylated sites + number of sites of unknown methylation state + number of unmethylated sites)   formula (10).
6. construction method according to claim 1, which is characterized in that the quantity of the blade representative sample is 200 or more.
7. construction method according to claim 1, which is characterized in that calculation formula such as formula (11) institute of the leaf shape factor Show:
8. A prediction method using the woody plant leaf morphological feature and photosynthetic characteristic prediction model based on DNA methylation level, characterized in that the DNA methylation level of the important restriction fragment combination obtained by the method of any one of claims 1 to 7 is input into the leaf morphological feature and photosynthetic characteristic model constructed by the method of any one of claims 1 to 7, in order to predict the leaf shape factor, leaf area and leaf net photosynthetic rate;
The DNA methylation level of the important restriction fragment combination is the DNA methylation level calculated, according to the genome-wide methylation level formula, for the important restriction fragment combination obtained in the construction method of any one of claims 1 to 7.
9. The prediction method according to claim 8, characterized in that marginal effect plots are drawn for the degree of influence of the DNA methylation level of the restriction fragment combination on the leaf shape factor, leaf area and net photosynthetic rate;
The marginal effect plots are drawn by taking the DNA methylation level of the important restriction fragment combination as the input variable and then calling the marginal effect plotting function provided in gbm to plot the marginal effect of the more important restriction fragments on the leaf shape factor, leaf area and leaf net photosynthetic rate of each leaf.
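The idea behind a marginal effect curve can be illustrated with a short Python sketch. It is not the gbm plotting function named in the claim. It only shows the usual averaging procedure (often called partial dependence): fix one input at each grid value, leave the other inputs as observed, and average the model's predictions. The function `marginal_effect`, the stand-in model `toy_predict` and the sample values are all hypothetical.

```python
def marginal_effect(predict, samples, feature_index, grid):
    """Average prediction with one feature fixed at each grid value.

    predict       : callable taking one feature row and returning a number
    samples       : list of feature rows (e.g. methylation levels per fragment)
    feature_index : position of the feature whose effect is examined
    grid          : values at which that feature is fixed
    """
    effects = []
    for value in grid:
        total = 0.0
        for row in samples:
            modified = list(row)          # copy so the sample is not changed
            modified[feature_index] = value
            total += predict(modified)
        effects.append(total / len(samples))
    return effects


def toy_predict(row):
    """Hypothetical stand-in for a fitted model (not the patent's model)."""
    return 2.0 * row[0] + row[1] * row[1]


# Made-up methylation levels of two fragments in three samples.
samples = [[0.2, 0.5], [0.4, 0.1], [0.6, 0.9]]
grid = [0.0, 0.5, 1.0]
curve = marginal_effect(toy_predict, samples, 0, grid)
```

Plotting `curve` against `grid` gives the marginal effect of the first fragment on the predicted trait. In practice the stand-in function would be replaced by the prediction function of the fitted model.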
CN201810120969.0A 2018-02-06 2018-02-06 The construction method and prediction technique of xylophyta leaf morphology feature and photosynthesis characteristics prediction model based on DNA methylation level Expired - Fee Related CN108319984B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810120969.0A CN108319984B (en) 2018-02-06 2018-02-06 The construction method and prediction technique of xylophyta leaf morphology feature and photosynthesis characteristics prediction model based on DNA methylation level

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810120969.0A CN108319984B (en) 2018-02-06 2018-02-06 The construction method and prediction technique of xylophyta leaf morphology feature and photosynthesis characteristics prediction model based on DNA methylation level

Publications (2)

Publication Number Publication Date
CN108319984A true CN108319984A (en) 2018-07-24
CN108319984B CN108319984B (en) 2019-07-02

Family

ID=62903048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810120969.0A Expired - Fee Related CN108319984B (en) 2018-02-06 2018-02-06 The construction method and prediction technique of xylophyta leaf morphology feature and photosynthesis characteristics prediction model based on DNA methylation level

Country Status (1)

Country Link
CN (1) CN108319984B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102224247A (en) * 2008-09-24 2011-10-19 巴斯夫植物科学有限公司 Plants having enhanced yield-related traits and a method for making the same
CN103233072A (en) * 2013-05-06 2013-08-07 中国海洋大学 High-flux methylation detection technology for DNA (deoxyribonucleic acid) of complete genome
CN104899474A (en) * 2015-06-09 2015-09-09 大连三生科技发展有限公司 Method and system for rectifying MB-seq methylation level based on ridge regression
CN107025384A (en) * 2015-10-15 2017-08-08 赵乐平 A kind of construction method of complex data forecast model
CN107114235A (en) * 2017-04-10 2017-09-01 中国林业科学研究院林业研究所 A kind of method that utilization DNA methylation inhibitor builds plant population
CN107301330A (en) * 2017-06-02 2017-10-27 西安电子科技大学 A kind of method of utilization full-length genome data mining methylation patterns

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114885163B (en) * 2018-09-02 2024-04-23 Lg电子株式会社 Method for encoding and decoding image signal and computer readable recording medium
CN114885163A (en) * 2018-09-02 2022-08-09 Lg电子株式会社 Method for encoding and decoding image signal and computer readable recording medium
CN109554502A (en) * 2019-01-03 2019-04-02 北京林业大学 A kind of detection DNA methylation site is to the method and its application technology of quantitative character additivity and disconnected partial allel
CN111027612A (en) * 2019-12-04 2020-04-17 国网天津市电力公司电力科学研究院 Energy metering data feature reduction method and device based on weighted entropy FCM
CN111027612B (en) * 2019-12-04 2024-01-30 国网天津市电力公司电力科学研究院 Energy metering data feature reduction method and device based on weighted entropy FCM
CN111915062B (en) * 2020-07-08 2023-06-20 西北农林科技大学 Greenhouse crop water demand regulation and control method with water utilization rate and photosynthesis rate being coordinated
CN111915062A (en) * 2020-07-08 2020-11-10 西北农林科技大学 Greenhouse crop water demand regulation and control method with water utilization rate and photosynthetic rate coordinated
WO2022023208A1 (en) * 2020-07-30 2022-02-03 Evonik Operations Gmbh Dna-methylation-based quality control of the origin of organisms
CN112950571A (en) * 2021-02-25 2021-06-11 中国科学院苏州生物医学工程技术研究所 Method, device and equipment for establishing positive and negative classification model and computer storage medium
CN112950571B (en) * 2021-02-25 2024-02-13 中国科学院苏州生物医学工程技术研究所 Method, device, equipment and computer storage medium for establishing yin-yang classification model
CN114814099B (en) * 2022-04-25 2023-09-12 南京农业大学 Photosynthesis prediction method based on grape leaf shape
CN114814099A (en) * 2022-04-25 2022-07-29 南京农业大学 Photosynthesis prediction method based on grape leaf shape
CN116153437A (en) * 2023-04-19 2023-05-23 乐百氏(广东)饮用水有限公司 Water quality safety evaluation and water quality prediction method and system for drinking water source

Also Published As

Publication number Publication date
CN108319984B (en) 2019-07-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20190702

Termination date: 20210206