Method for tracking honey source by using metagenome and machine learning
Technical Field
The invention relates to the fields of metagenome sequencing, species diversity analysis, machine learning and bioinformatics, in particular to a method for tracking honey source by using metagenome and machine learning.
Background
In the domestic honey market, most of nectars collected by western bees are single-flower honey (single-flower honey), such as sophora flower honey, rape honey and the like, and nectars collected by Chinese bees are multiple-flower honey (all-flower honey). Western bees have better honey production capacity than chinese bees and are therefore first introduced into china around 1900 years. And with the introduction of western bees, a series of modern industrialized management modes are introduced, such as daily cleaning and migration of bee colonies. Bee farmers in a large number of places still have higher enthusiasm for breeding Chinese bees, especially in remote mountainous areas. Nevertheless, the introduction of western bees inevitably threatens the Chinese bee population, resulting in a large drop in population density over the past 100 years. The influence of Chinese bee populations is mostly concentrated in areas with dense agricultural activities, and in mountain areas with sparse population, influence of western bees on local Chinese bee populations is limited. The ecological competition of the two bees has brought about unexpected results in honey production and market sales: consumers generally believe that honey products from rural and mountain areas are obtained through the original production mode and are truly green organic food, so that the honey products are healthier compared with the honey produced in a centralized manner, and the most direct and remarkable result is that the honey products of local bees (namely Chinese bees), especially the honey products collected in the natural ecological protection area, are rated as high-end honey in the market and are priced by 2-8 times more expensive than the honey of western bees. This huge profit difference also stimulates a new honey product fraud: the honey produced by western bees or the honey produced by Chinese bees in non-natural ecological areas is labeled with 'high-end honey' and is mixed with fish eyes in the market.
The pollen characteristics in honey provide a thought and a method for tracing honey sources: since pollen grains are often mixed into honey by worker bees during the storage or taking of nectar/honey, or by beekeepers during honey collection (e.g. by squeezing, filtering or centrifugation), the composition of pollen in honey essentially reflects the composition of the diversity of the species of the local flowering plants. Therefore, honey from the same region, or a geographically close region, should theoretically have a similar pollen composition, whereas the species composition of pollen should have a larger difference compared to honey that is geographically far apart. Therefore, the differences in honey source areas can be theoretically determined by comparing honey samples using the differences in the composition of flowering plants in different geographical areas.
However, pollen identification is not straightforward. The traditional identification method, also known as sporopollenizology, performs species classification and identification by observing the microstructure of pollen grains. This method, like other methods of species identification based on morphological characteristics, requires a significant amount of manpower and requires a professional knowledge base. Furthermore, because of the limited resolution of pollen morphological data, it is also difficult to effectively distinguish siblings within many plant populations. In addition to the above mentioned pollen-based analysis of plant composition of honey, there are some studies to identify and characterize specific species by biochemical-based methods using protein or carbohydrate biomarkers unique to specific plants, but such methods can only detect specific indicator plant species and cannot perform analysis and investigation of various flowering plants at once.
The macro-bar code technology of high-throughput sequencing is utilized to analyze and identify the plant composition in the pollen/honey. These studies utilize sequence characteristics of nucleic acid material (DNA) in pollen for species identification and analysis, which in turn yields flowering plant composition in honey. Although the macro-barcode analysis method based on the high-throughput sequencing method can identify the honey powder source plants, the species identification accuracy of flowering plants still has great difference among different researches, and the species identification difference is caused by the problems of the resolution of genetic markers adopted in different researches, the universality difference of primers, the incompleteness of background plant molecule reference databases in sample collection areas and the like. For example, the trnL-UAA standard fragment or rbcLa commonly used in plant species identification can usually be identified only to the level of family or genus. In addition, the macro-barcode method relies on the PCR amplification process of target DNA fragments, including matK, trnL-UAA, rbcLa, trnH-psbA, etc., while in the PCR amplification process, especially when multiple distant species exist in a sample at the same time, the amplification efficiency of different species is easily affected by the amplification bias, even the "universal primer" can cause significant amplification bias due to the base difference between the primer and the template, which often affects the quantification of species taxa and even the qualitative analysis of specific taxa. Moreover, the macro-barcode approach relies heavily on the integrity and comprehensiveness of the native plant reference sequence (i.e., the plant barcode or chloroplast genome). Indeed, the lack of a local plant reference database presents challenges to all high-throughput sequencing-based pollen identification methods. This challenge is even more acute for honey source tracing, since to obtain the key indicator species for each target honey source region, a thorough investigation of the local plant species is required, and the difficulty of identifying the indicator species is further amplified by the various comparisons with each other. Admittedly, the current problem of lack of molecular databases of local plant diversity is difficult to solve in the short term for most regions of the world, and is particularly prominent when our research targets, honey products, are from remote undeveloped mountains.
Disclosure of Invention
The invention aims to solve the problem of shortage of plant diversity molecule databases, and provides a method for establishing a weight coefficient between a plant sequence and a honey source place by a machine learning method so as to realize the purpose of tracing the honey source place of a honey sample. The method fully considers the problem of shortage of molecular databases of plant species in honey source places, particularly remote mountain areas, analyzes sequence information of nucleic acid substances in honey to further obtain plant sequences contained in the honey, calculates a weight relation between the nucleic acid sequences (without specific species identification results) of the plant sources in the honey and the honey source places through a machine learning method, and tracks honey sources of honey samples by taking the weight relation as a reference database. The method does not depend on the existing honey source plant molecule database, can directly obtain honey source information of honey through plant sequence analysis and machine learning in honey, and has extremely important value for identification of honey products and market specification.
The technical scheme of the invention is as follows:
a honey source tracking method of honey comprises the following steps:
1) collecting honey samples from known honey origins requires that each honey origin comprises a plurality of samples to adequately represent the diversity and heterogeneity of the honey origin. Pollen enrichment and extraction and purification of nucleic acid substances (DNA) were performed on honey samples.
2) DNA sequencing and mass filtration: nucleic acid sequence determination was performed on each honey sample and routine mass filtering of sequencing data, including removal of low quality sequences and removal of sequencing linker sequences.
3) And assembling the sequencing data after the quality filtering, and comparing all the original sequencing data back to an assembling result to obtain the richness information of the assembled sequence. In order to ensure the reliability of subsequent sequence analysis, the method only reserves the assembly sequences with the assembly length of more than 500bp and the richness of more than 3.
4) Annotating species information of the sequence assembled in the step 3), comparing the assembled sequence to a nucleotide sequence database (nt database) of NCBI, and recording species information of an optimal comparison result. In order to ensure the reliability of species information, the comparison coverage reaches more than 80% of the length of the sequence itself, or the comparison result is more than 1,000 bp. According to the annotation information of the species, the sequence of the plant source is extracted for subsequent analysis.
5) Performing pairwise comparison between samples on the plant source sequences obtained in the step 4), firstly integrating the assembly sequences from the same species in the same honey sample according to pairwise comparison results, and recording sequence information shared among different samples. The shared sequence needs to satisfy the sequence similarity of more than or equal to 98 percent and the alignment total length of more than or equal to 1,000 bp.
6) Analyzing and integrating the sequence information shared among the samples obtained in the step 5) to construct a data matrix of different honey samples, wherein the variable of the matrix is the plant source representative sequence of all the honey samples, the honey samples contain the sequence, or the sequence is shared, the value of the plant source sequence variable of the honey sample is 1, and if the variable does not meet the condition, the value is 0. Meanwhile, the honey samples are subjected to fixed value of geographical position variables, probability values of different geographical positions of the honey samples are obtained according to honey source places of the honey samples, the probability value of each honey sample in the honey source place is 100%, and the probability values of other geographical positions are 0%.
7) And 6) performing data training based on a neural network model on the data matrix obtained in the step 6), wherein plant source sequence variables serve as Input layers (Input layers) of the neural network, geographic position variables serve as Output layers (Output layers) of the neural network, and the number of Hidden layers (Hidden layers) and the number of neurons are determined according to the number of the Input Layer variables.
8) Providing input layer variable values for the honey sample to be detected, and estimating the probability of the honey source by using the training model obtained in the step 7). The geographical location with the highest probability value is its honey source.
In the step 1), collected honey samples are ensured to come from different honeycombs and sites as much as possible, and the accuracy and traceability of honey source sites are ensured in the selection of the honey samples, so that the local plant sources can be completely and accurately covered as much as possible in the process of determining the plant source nucleic acid sequences of the subsequent honey samples.
Preferably, raw honey collected by local beekeepers is used in step 1), and the sequencing amount of each honey sample in step 2) is 4-5G of raw data on average.
Preferably, in step 3), a metagenome-optimized assembly algorithm should be adopted for assembly of the sequence, the assembly algorithm in the MitoZ software package may be used for assembly of the sequence, the length of the K-mer is set to be greater than 50, and the minimum value in edge abundance is set to be 3. And then, comparing the original sequencing sequence to the assembly sequence by utilizing the Minimap2, counting the coverage information of each assembly sequence through SamBamba, and carrying out homogenization (taking one million sequencing sequences as a base number) processing on the sample data to further obtain the richness information of the assembly sequence.
In the step 4), for the plant source sequence in each honey sample, selecting 50 with the highest abundance, or selecting the sequence with the abundance ratio of more than 99% in total to improve the effectiveness of the data matrix variable, thereby improving the efficiency and accuracy of the neural network data training. Meanwhile, when the data matrix is constructed in the step 6), the plant source sequence variables can be further filtered, and invalid variables of the honey source places, namely variables which only occur once in the same honey source place, are filtered.
In the step 8), each sample to be detected can be independently operated for multiple detections, and when the highest value of the probability is required to be 10 times higher than the second highest probability value, the sample to be detected is considered to be one-time effective detection, so that the detection reliability is ensured.
The invention combines the honey metagenome sequencing and machine learning methods to track honey source and analyze attribution. Compared with the existing sequencing-based honey source tracing method, the method can directly utilize the plant source assembly sequence obtained from honey as a characteristic variable, does not depend on a honey source background plant molecule database, avoids a large amount of labor cost to search and locate the representative characteristic plant source sequence of a geographic region, and can obtain the weight and contribution ratio of each plant source sequence to the honey source through the training of a known data set. The invention is of great significance to the identification of honey and the standardization of honey industry.
Drawings
FIG. 1 is a flow chart of the honey source tracing method of the present invention.
Fig. 2 shows geographical location information of honey samples in the embodiment, wherein the detailed distribution of sampling points in Sichuan is shown in the enlarged view of the lower left corner.
FIG. 3 is a probability distribution graph of attribution of honey samples of the same honey source in a cross-validation experiment, wherein a histogram shows the result distribution of honey samples of a specific geographical source (X axis) in a cross-validation test, the probability distributions of different geographical positions are distinguished by different color depths, the result of the cross-validation result which accords with the source of the honey source is shown in a dark color, and the result of the honey source which does not accord with the source of the honey source is shown in a light color; wherein, Qinghai: qinghai; beijing: beijing; jilin: jilin; aba: abam; laohegou: old river ditches; guanba: and closing the dam.
FIG. 4 is a graph of the probability distribution of the place of ownership of individual honey samples in a cross-validation experiment, wherein a single cross-check of each honey sample will yield 7 points, corresponding to probability scores of 7 different honey origins, which are distinguished by different color depths, and the results of the cross-validation result corresponding to their origin are shown in dark colors, while the results of non-honey origins are shown in light colors; wherein, Qinghai: qinghai; beijing: beijing; jilin: jilin; aba: abam; laohegou: old river ditches; guanba: and closing the dam.
Detailed Description
The following is a more detailed description of the present invention, and the parameters and specific implementation details thereof are used to explain the feasibility and the implementation of the present invention, and are not to be construed as limiting the present invention.
This example includes 28 honey samples, all 27 honey samples except 1 from the western bees from the chinese bee. Of these, 9 samples were from the old gutter Nature Protect zone (LHG) of Sichuan province, China, which belongs to the world of panda's natural habitat. In order to test whether the honey samples collected in other regions and the honey samples from old ditches can be accurately distinguished by using the local plant composition, two sample points (Guanba-GB and Wanglan-WL) close to the old ditches and honey samples of 5 sample points in other different regions of China are collected, wherein the honey samples comprise Aba-AB, Beijing-BJ, Jilin-JL, Shaanxi-SX and Qinghai-QH. The number of honey samples of the old river ditch, guan dam, Wang Lang, Abao dam, Beijing, Jilin, Shaanxi and Qinghai sample points is 9, 2, 1, 3, 4, 3 and 3 respectively (figure 2).
1. Sample nucleic acid species acquisition and sequencing
50g of honey was taken from each sample, weighed into 4 50mL sterile centrifuge tubes (4X 12.5g), and 30mL of ultrapure water was added to each tube. The honey was then dissolved in a water bath at 65 ℃ for 15 minutes. After that, the mixture was centrifuged at 12,000rpm for 15 minutes, and the supernatant was discarded. Finally, pollen pellets were removed from 4 different centrifuge tubes and pooled into 1 centrifuge tube and the above lysis and centrifugation steps were repeated 1 time. The final pellet was used for DNA extraction. About 0.5g of precipitate (mixed pollen grains) per honey sample was subjected to DNA extraction using a modified Wizard method (Soares, Amaral, Oliveira, & Mafra, 2015). 200ng of DNA per honey sample was used to construct a sequencing library (Dual-indexed, double-labeled) with an insert size of 350bp, followed by 150PE sequencing on the Illumina HiSeq X Ten platform.
2. Sequence assembly and annotation
We assemble each sample using the assembly module of MitoZ to obtain a scaffold sequence (K-mer 31, minimum _ edge _ depth 3, minimum _ output _ length 500). Then, the original sequencing sequence is compared to the assembly sequence by utilizing Minimap2, then the coverage information of each assembly result sequence is counted through SamBamba, and the sample data is subjected to homogenization (taking one million sequencing sequences as a base number) processing, so that the richness information of the assembly result is obtained. We then aligned the assembly results with the nt library (NCBI) using BLASTN to obtain homologous species information for the assembled sequences. Meanwhile, the similarity score and species classification information of the nearest neighbor sequence in the public database are extracted from the comparison result of the assembled sequence and the nt database.
3. Sequence matrix feature construction
We used a mutual optimal alignment method (reciprocal BLAST) to compare pairwise sequence compositions between honey samples by BLAST aligning the assembled sequences of each honey sample with the assembled sequences of all other honey samples, when multiple assembled sequences of the same sample can be mapped to different regions of the same sequence in other samples with similarity greater than or equal to 98% and coverage greater than or equal to 80%, we consider these sequences to be from the same species and combine these sequences and their abundance for downstream analysis. Subsequently, to further improve the computational efficiency, we selected the 50 most abundant plant representative sequences from a single honey sample for subsequent analysis. In comparative analysis among honey samples, BLAST alignments between different sample sequences with similarity of 98% or more and alignment length of 1,000bp or more were defined as the same plant species, shown as 1 in the data matrix, otherwise recorded as 0. Finally, specific sequences in a single honey sample, i.e. assembly sequences that occur only once in a geographical area, which we believe do not contain any geographically valid information, will be deleted in subsequent analyses.
4. Machine learning model training
We train the data feature matrix using the artificial neural network algorithm of Resilient back propagation (Resilient feedback) in the "neural net" R packet, where a single Hidden Layer (Hidden Layer) and half the number of neurons (Neuron) of sequence feature variables are used in the data training process. The calculation method of the neural network takes a plurality of variables (such as the plant-derived sequences in the research) as an Input Layer (Input Layer) of the neural network, then for all honey samples, the probability of the geographical variable of the honey sample is set as 100% according to the geographical attribution of the honey sample, the probability of other geographical variables is set as 0%, and the probability values are taken as an Output Layer (Output Layer) of the neural network. The neural network algorithm estimates the eigenvalue and weight of each sequence variable in the training data set through a plurality of neurons (1/2, the number of plant source sequence variables in the present case) in the hidden layer, and finally obtains the geographical attribution probability of each honey sample in the output layer through optimal fitting, wherein the optimal fitting is realized through mathematical functions contained in the hidden layer, and the mathematical functions perform nonlinear transformation (a resilient backspace algorithm in the method) on each observed value in the input data matrix to obtain the expected result of the observed value.
5. Leave one out cross validation
The accuracy of the training model and the performance of machine learning were evaluated by Leave-One-Out Cross Validation (LOOCV). One-left cross validation is one of exhaustive cross validation methods, and as the name suggests, one observation result is extracted as a validation set, and the rest observation results are used as a training set to validate all the observation results in a data set one by one so as to achieve the purpose of evaluating the accuracy of a neural network algorithm. In each cross validation test, the weight and feature information of variables of an input layer and parameters of mathematical functions in an implied layer in a data model are obtained through machine learning of a training data set, then the variables in the validation data are calculated to obtain the distribution probabilities of the variables in a plurality of geographic positions, the geographic position with the highest probability is considered as a honey source of the test sample, and meanwhile, the maximum probability value of geographic attribution of the test sample is required to exceed 10 times of the second probability value to be used as effective cross validation. For example, we took a sample of an old gutter from the entire data set, and then obtained a training set that included the remaining 8 old gutter samples and all other honey samples. Then, we can use the training set to perform machine learning based on neural network, and calculate the distribution probability of multiple geographic positions for the extracted old river ditch sample through the trained model. We performed a total of 1,000 cross-validation tests and recorded the results for subsequent statistics. Since wang contains only one sample, cross validation will not be performed as a test sample.
6. Honey source tracing place for honey samples
In conclusion, the honey source tracing of each honey sample can be carried out by a training model obtained by machine learning and a result of a plurality of honey source attribution probabilities obtained by analyzing the training model. FIG. 3 is the result of the attribution tracking probabilities of all honey samples from the same honey source, wherein the probabilities of different geographical locations are summarized and shown in bar chart form, and different colors are used to distinguish different geographical locations. For example, honey samples from abba gave higher probability scores (89.38-99.74%) in cross-testing to be assigned to abba, and lower probability scores in other 6 geographical locations. It is noteworthy that samples H0003, H0018 and H0028 contributed most outliers of old river, shanxi and beijing, respectively. Fig. 4 is a summary and demonstration of the probability of honey source attribution for each honey sample independently, each point representing an independent cross-validation result, differentiating different geographical locations with different colors. For example, one cross-check of the H0028 sample will produce probability score points for 7 different honey sources at the X-axis H0028 location. It can be seen that almost all honey samples can be accurately tracked in honey source by using the method, and the resolution of the honey source can reach within hundred kilometers, for example, the distance between a dam and an old ditch sample point is only about 40 km.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and scope of the present invention are intended to be covered thereby.