
WO2023202345A1 - A method for predicting the refining properties of pure components based on hierarchical group construction - Google Patents

A method for predicting the refining properties of pure components based on hierarchical group construction

Info

Publication number
WO2023202345A1
Authority
WO
WIPO (PCT)
Prior art keywords
group
component
groups
hierarchical
model
Prior art date
Application number
PCT/CN2023/085001
Other languages
English (en)
French (fr)
Inventor
王耀宗
陈松航
陈豪
王森林
张剑铭
连明昌
钟浪
刘哲夫
Original Assignee
泉州装备制造研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 泉州装备制造研究所
Publication of WO2023202345A1


Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/30: Prediction of properties of chemical compounds, compositions or mixtures
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70: Machine learning, data mining or chemometrics
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Definitions

  • The invention relates to the field of data analysis in the refining and petrochemical industry, and in particular to a method for predicting the refining properties of pure components based on hierarchical group construction.
  • Combined with a pure-component property prediction model and a mixing-rule model, the molecular-level kinetic model can not only predict the distribution of products in a refining unit (qualitative analysis) but also quantitatively predict the corresponding refining properties of those products. This allows decision makers to design the chemical structure of pure components in products in a targeted way and to optimize unit operating conditions, guiding both refining research and industrial production. The accuracy with which the refining properties of pure components are predicted directly determines the accuracy of product quality assessment, which in turn affects the optimization direction of each operating unit. It is the key point of the molecular-level kinetic model and also determines whether molecular management technology can be applied successfully to refinery optimization.
  • The technical problem to be solved by the present invention is to provide a method for predicting the refining properties of pure components based on hierarchical group construction. The method predicts properties of pure components in petroleum products, such as the octane number of each component in gasoline products and the cetane number of each component in diesel products. The component feature set is constructed in layers. When component descriptors are added to the feature set, Bayes' rule is introduced so that the posterior probability distribution can be estimated. On this basis, hierarchical group construction is introduced to build the group fragments level by level, avoiding the risk of overfitting in the final prediction.
  • The present invention specifically includes the following steps:
  • Step 10: Using SMILES, an encodable simplified notation for components, represent complex component structures as two-dimensional codes and build a predefined group-fragment component library comprising primary, secondary, and tertiary groups.
  • The primary groups are the basic groups that make up a component structure.
  • The secondary groups are combinations of basic groups by linking position, used to distinguish aromatics from paraffins and the corresponding isomers.
  • The tertiary groups are descriptors of component topology.
  • Step 20: Screen primary and secondary groups out of the predefined group-fragment component library according to the molecular structure of the target component. Then, according to the refining property to be predicted for the target component, screen out multiple tertiary groups on the principle of keeping minimal correlation with the primary and secondary groups while keeping maximal information about the property to be predicted. Randomly select any number of tertiary groups and combine them with the screened primary and secondary groups to form multiple component feature sets, then screen these to obtain the component feature set with the highest posterior probability;
  • Step 30: Combine the groups at the different levels in a linear accumulation function for modeling, then solve for the coefficients on the training set to obtain the hierarchical group model;
  • Step 40: Generate multiple candidate models from the component feature set with the highest posterior probability. Based on the hierarchical group model, apply Bayes' rule again to obtain the confidence intervals of all candidate models and, taking the accuracy of each candidate model into account, select by the principle of multi-objective optimization the octane number and cetane number models suited to the actual conditions of the refinery.
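  • Further, in step 20, screening to obtain the component feature set with the highest posterior probability specifically includes:
  • A single model m belongs to the candidate model set M. Each model follows the distribution of the known data set Y, f(y | m, β_m), where the parameter vector β_m ∈ B_m and B_m is the set of possible values of the coefficients of model m.
  • Letting the prior probability of model m be f(m), the posterior probability of m follows from Bayes' rule.
  • In it, f(y | m) is the marginal likelihood, computed from f(y | m) = ∫ f(y | m, β_m) f(β_m | m) dβ_m and f(β_m | m). Its value is approximated by Markov chain Monte Carlo random sampling over the space of (m, β_m), where p is the total number of features.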
  • Step 30 includes a data preprocessing process and a modeling verification process.
  • The data preprocessing process is: transform the data set toward normality with a probabilistic statistical method, then apply an unsupervised learning method to cluster the data set directly and approximately estimate the sparse holes in its feature space, obtaining the training set;
  • In the modeling verification process, the hierarchical groups are modeled with a linear accumulation function, in which:
  • f(Y) is the function of the property to be predicted;
  • C_i is the contribution of the i-th group among the primary groups;
  • N_i is the number of occurrences of group i;
  • δ is the primary-group coefficient;
  • w is the secondary-group coefficient;
  • D_j is the contribution of the j-th group among the secondary groups, and M_j is its number of occurrences;
  • λ is the coefficient of the tertiary component descriptors;
  • f(Y*) is the total contribution of the tertiary descriptors to the given property.
  • When calculating the level coefficients δ, w, λ and the group contributions C_i, D_j, regression is carried out level by level: C_i is obtained by regression on the training set; the secondary-group contributions D_j are then obtained by regression; f(Y*) is computed from the component descriptors and needs no regression.
  • Finally, the group coefficients δ, w, and λ are obtained in a single joint regression. These coefficients are weights that represent the influence of the group fragments at each level on the given property.
  • Group fragments are selected on a mechanistic basis and combined with component descriptors that need no regressed coefficients. This reduces the number of coefficients to be regressed and considerably lowers the dependence on data set size. The method also gives the posterior probability of each feature-subset model, which acts as a "soft" constraint and suits prediction of the refining properties of pure components when data are limited. On this basis, Bayes' rule is introduced again so that the posterior probability distribution of the final model can be estimated, avoiding the risk of overfitting in the final prediction model.
  • Figure 1 is a schematic flow chart of the method of the present invention.
  • Figure 2 is a schematic diagram of the hierarchical groups of the present invention.
  • Figure 3 is a schematic diagram of the component feature set construction and screening process of the present invention.
  • Figure 4 is a schematic diagram of the hierarchical group modeling process of the present invention.
  • Figure 5 is the first schematic diagram of the uncertainty analysis process for the candidate models of the present invention.
  • Figure 6 is the second schematic diagram of the uncertainty analysis process for the candidate models of the present invention.
  • The embodiments of the present invention provide a method for predicting the refining properties of pure components based on hierarchical group construction. The method predicts properties of pure components in petroleum products, such as the octane number of each component in gasoline products and the cetane number of each component in diesel products. The component feature set is constructed in layers. When component descriptors are added to the feature set, Bayes' rule is introduced so that the posterior probability distribution can be estimated. On this basis, hierarchical group construction is introduced to build the group fragments level by level, avoiding the risk of overfitting in the final prediction.
  • This component feature set combines mechanism-based characteristic groups with component descriptors screened by machine learning to characterize the refining properties of components. As the groups and descriptors are constructed they are divided into levels; the higher the level, the more detailed the description of the component.
  • The primary groups comprise the basic groups of a component structure, such as -CH, -CH3, and -CO.
  • Components with simple structures, such as paraffins, can be decomposed and characterized with this level alone.
  • However, this level represents only the basic composition of a component and cannot express where the groups are linked within it, and the linking position has a decisive effect on the refining properties of the component.
  • The secondary groups therefore focus on group blocks, that is, combinations of basic groups, to distinguish aromatics from paraffins and the corresponding isomers.
  • As shown in Figure 2, the primary basic groups include the A6 group, representing the aromatic ring, and the R group, representing CH2.
  • Because the -CH2 group is attached to the benzene ring, it forms a new group block, aC-CH, with the ring carbon to which it is bonded, and this block is represented at the secondary group-block level to characterize the component.
  • The tertiary groups are component descriptors. Component descriptors are numerous, and the accuracy of descriptors based on quantum chemical calculations is still debated in the scientific community, so the focus is on descriptors of component topology, such as the connectivity index.
  • When the refining properties of a target component are to be predicted, SMILES, an encodable simplified notation for components, is used to represent the complex component structure as a two-dimensional code, and the molecular structure of the given component is automatically decomposed into group fragments that match the component library, enabling quantitative analysis.
  • Primary and secondary groups are first screened from the group library according to the molecular structure of the target component.
  • Global optimization algorithms such as simulated annealing or genetic algorithms can be used for this screening.
  • Tertiary groups are then added to the screened feature set.
  • Added tertiary groups will inevitably overlap with the primary and secondary groups, making the feature set redundant. Therefore, drawing on information theory and machine learning, the minimum-correlation, maximum-information concept is introduced: an added tertiary group must keep minimal correlation with the lower-level groups already present while keeping maximal information about the property to be predicted, that is, it should characterize that property as fully as possible.
  • Bayes' rule is then introduced for feature selection, to compute the posterior probability of the candidate models.
  • A single model m belongs to the candidate model set M. Each model follows the distribution of the known data set Y, f(y | m, β_m), where the parameter vector β_m ∈ B_m and B_m is the set of possible values of the coefficients of model m.
  • The prior probability of model m is f(m), and the posterior probability of m follows from Bayes' rule.
  • In it, f(y | m) is the marginal likelihood, which can be computed from f(y | m) = ∫ f(y | m, β_m) f(β_m | m) dβ_m and f(β_m | m).
  • In the vast majority of cases this integral has no analytical solution, so its value is approximated by Markov chain Monte Carlo (MCMC) random sampling over the space of (m, β_m).
  • Feature selection is a sub-problem of model selection: the candidate models are represented with a binomial distribution, where p is the total number of features. This yields, for the model represented by each feature subset, its posterior probability given the known data set Y, which acts as a "soft" constraint.
  • The core of this Bayes-rule-based feature selection method is MCMC sampling.
  • Group fragments are selected on a mechanistic basis and combined with component descriptors that need no regressed coefficients. This reduces the number of coefficients to be regressed and considerably lowers the dependence on data set size. The method also gives the posterior probability of each feature-subset model, which acts as a "soft" constraint and suits prediction of the refining properties of pure components when data are limited.
  • The hierarchical group modeling and coefficient-solving process is shown in Figure 4 and falls into two parts: data preprocessing and modeling verification. Because existing data sets on component refining properties are very sparse, advanced statistical and machine learning methods are needed at the data preprocessing stage to improve the accuracy of small-sample modeling.
  • The distributions of the feature values and experimental values of the components in the database rarely satisfy normality, which degrades the model. They are therefore transformed toward normality with a probabilistic statistical method, namely the Box-Cox log-likelihood function method. Because the feature space is sparse, a randomly selected training set rarely covers the feature space of the test set, so a model built on it extrapolates too far and predicts poorly. The second step therefore uses unsupervised learning, which looks only at the feature set of the data and does not evaluate modeling performance: cluster analysis is applied directly to the data set, and the sparse holes in its feature space are estimated approximately. A training set chosen on this basis covers the feature space of the test samples as far as possible and improves prediction.
  • In the linear accumulation function, f(Y) is the function of the property to be predicted;
  • C_i is the contribution of the i-th group among the primary groups;
  • N_i is the number of occurrences of group i;
  • δ is the primary-group coefficient;
  • w is the secondary-group coefficient;
  • D_j is the contribution of the j-th group among the secondary groups, and M_j is its number of occurrences;
  • λ is the coefficient of the tertiary component descriptors;
  • f(Y*) is the total contribution of the tertiary descriptors to the given property.
  • When calculating the level coefficients δ, w, λ and the group contributions C_i, D_j, regression is carried out level by level.
  • C_i is obtained by regression on the training set;
  • the secondary-group contributions D_j are then obtained by regression. Because f(Y*) is computed from the component descriptors, it needs no regression, which greatly reduces the required training set size.
  • Finally, the group coefficients δ, w, and λ are obtained in a single joint regression.
  • The resulting coefficients δ, w, and λ are weights that represent the influence of the group fragments at each level on the given property.

Landscapes

  • Engineering & Computer Science (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

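The invention discloses a method for predicting the refining properties of pure components based on hierarchical group construction. The method predicts the octane number and cetane number of pure-component compounds in petroleum products. The component feature set is built in layers, and hierarchical group construction is introduced to keep the feature set free of redundancy. When third-level component descriptors are added to the feature set, Bayes' rule is introduced so that the posterior probability distribution of each feature set can be estimated and the feature set with the higher posterior probability is selected, rather than attending to prediction accuracy alone. On this basis, Bayes' rule is introduced a second time so that the posterior probability distribution of the final model can be estimated, avoiding the risk of overfitting in the final prediction model. The invention can be applied to crude oil and product blending units in the petrochemical industry and effectively improves the precision of petroleum refining.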

Description

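A method for predicting the refining properties of pure components based on hierarchical group construction
Technical Field
The invention relates to the field of data analysis in the refining and petrochemical industry, and in particular to a method for predicting the refining properties of pure components based on hierarchical group construction.
Background
Traditional refining unit models, constrained by analytical chemistry and computer hardware, mostly use lumped kinetic models, in which feedstocks and products are divided into lumps according to macroscopic properties such as boiling point or solubility. Examples are the ten-lump and eleven-lump models widely used for catalytic cracking units. However, lumps defined at the macroscopic level are inherently multi-component and cannot characterize component information in detail, which makes such lumped models hard to extend to new feedstock and catalyst systems. Molecular-level lumped models, by contrast, can calculate feedstock composition and properties at the pure-component level and build a reaction network, and can thus accurately predict the properties of the products of a refining unit. Combined with a pure-component property prediction model and a mixing-rule model, the molecular-level kinetic model can not only predict the distribution of products in a refining unit (qualitative analysis) but also quantitatively predict the corresponding refining properties of those products. This allows decision makers to design the chemical structure of pure components in products in a targeted way and to optimize unit operating conditions, guiding both refining research and industrial production. The accuracy with which the refining properties of pure components are predicted directly determines the accuracy of product quality assessment, which in turn affects the optimization direction of each operating unit. It is the key point of the molecular-level kinetic model and also determines whether molecular management technology can be applied successfully to refinery optimization.
Summary of the Invention
The technical problem to be solved by the invention is to provide a method for predicting the refining properties of pure components based on hierarchical group construction. The method predicts properties of pure components in petroleum products, such as the octane number of each component in gasoline products and the cetane number of each component in diesel products. The component feature set is constructed in layers. When component descriptors are added to the feature set, Bayes' rule is introduced so that the posterior probability distribution can be estimated. On this basis, hierarchical group construction is introduced to build the group fragments level by level, avoiding the risk of overfitting in the final prediction.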
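The invention specifically includes the following steps:
Step 10: Using SMILES, an encodable simplified notation for components, represent complex component structures as two-dimensional codes and build a predefined group-fragment component library comprising primary, secondary, and tertiary groups. The primary groups are the basic groups that make up a component structure; the secondary groups are combinations of basic groups by linking position, used to distinguish aromatics from paraffins and the corresponding isomers; the tertiary groups are descriptors of component topology;
Step 20: Screen primary and secondary groups out of the predefined group-fragment component library according to the molecular structure of the target component. Then, according to the refining property to be predicted for the target component, screen out multiple tertiary groups on the principle of keeping minimal correlation with the primary and secondary groups while keeping maximal information about the property to be predicted. Randomly select any number of tertiary groups and combine them with the screened primary and secondary groups to form multiple component feature sets, then screen these to obtain the component feature set with the highest posterior probability;
Step 30: Combine the groups at the different levels in a linear accumulation function for modeling, then solve for the coefficients on the training set to obtain the hierarchical group model;
Step 40: Generate multiple candidate models from the component feature set with the highest posterior probability. Based on the hierarchical group model, apply Bayes' rule again to obtain the confidence intervals of all candidate models and, taking the accuracy of each candidate model into account, select by the principle of multi-objective optimization the octane number and cetane number models suited to the actual conditions of the refinery.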
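Further, in step 20, screening to obtain the component feature set with the highest posterior probability specifically includes:
A single model m belongs to the candidate model set M. Each model follows the distribution of the known data set Y, f(y | m, β_m), where the parameter vector β_m ∈ B_m and B_m is the set of possible values of the coefficients of model m. Letting the prior probability of model m be f(m), the posterior probability is:
where f(y | m) is the marginal likelihood, computed from f(y | m) = ∫ f(y | m, β_m) f(β_m | m) dβ_m and f(β_m | m); its value is approximated by Markov chain Monte Carlo random sampling, with the sampling range being the space of (m, β_m):
where p is the total number of features.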
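Further, step 30 includes a data preprocessing process and a modeling verification process;
The data preprocessing process is: transform the data set toward normality with a probabilistic statistical method, then apply an unsupervised learning method to cluster the data set directly and approximately estimate the sparse holes in its feature space, obtaining the training set;
In the modeling verification process, the hierarchical groups are modeled with a linear accumulation function, with the following formula:
where f(Y) is the function of the property to be predicted; C_i is the contribution of the i-th group among the primary groups, N_i is the number of occurrences of group i, and δ is the primary-group coefficient; w is the secondary-group coefficient, D_j is the contribution of the j-th group among the secondary groups, and M_j is its number of occurrences; λ is the coefficient of the tertiary component descriptors, and f(Y*) is the total contribution of the tertiary descriptors to the given property;
When calculating the level coefficients δ, w, λ and the group contributions C_i, D_j, regression is carried out level by level: C_i is obtained by regression on the training set; the secondary-group contributions D_j are then obtained by regression; f(Y*) is computed from the component descriptors and needs no regression; finally, the group coefficients δ, w, λ are obtained in a single joint regression. These coefficients are weights that represent the influence of the group fragments at each level on the given property.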
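The invention has the following advantages:
Group fragments are selected on a mechanistic basis and combined with component descriptors that need no regressed coefficients. This reduces the number of coefficients to be regressed and considerably lowers the dependence on data set size. The method also gives the posterior probability of each feature-subset model, which acts as a "soft" constraint and suits prediction of the refining properties of pure components when data are limited. On this basis, Bayes' rule is introduced again so that the posterior probability distribution of the final model can be estimated, avoiding the risk of overfitting in the final prediction model.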
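Brief Description of the Drawings
The invention is further described below with reference to the drawings and embodiments.
Figure 1 is a schematic flow chart of the method of the invention;
Figure 2 is a schematic diagram of the hierarchical groups of the invention;
Figure 3 is a schematic diagram of the component feature set construction and screening process of the invention;
Figure 4 is a schematic diagram of the hierarchical group modeling process of the invention;
Figure 5 is the first schematic diagram of the uncertainty analysis process for the candidate models of the invention;
Figure 6 is the second schematic diagram of the uncertainty analysis process for the candidate models of the invention.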
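Detailed Description
The embodiments of the invention provide a method for predicting the refining properties of pure components based on hierarchical group construction. The method predicts properties of pure components in petroleum products, such as the octane number of each component in gasoline products and the cetane number of each component in diesel products. The component feature set is constructed in layers. When component descriptors are added to the feature set, Bayes' rule is introduced so that the posterior probability distribution can be estimated. On this basis, hierarchical group construction is introduced to build the group fragments level by level, avoiding the risk of overfitting in the final prediction.
As shown in Figure 1, the overall approach of the invention is as follows:
S1: Build the predefined group-fragment component library.
To address the shortcomings of existing group-fragment construction methods, a new component feature set is proposed to characterize the refining properties of components in petroleum products. The feature set combines mechanism-based characteristic groups with component descriptors screened by machine learning. As the groups and descriptors are constructed they are divided into levels; the higher the level, the more detailed the description of the component.
The primary groups comprise the basic groups of a component structure, such as -CH, -CH3, and -CO. Components with simple structures, such as paraffins, can be decomposed and characterized with this level alone. However, this level represents only the basic composition of a component and cannot express where the groups are linked within it, and the linking position has a decisive effect on the refining properties of the component.
The secondary groups therefore focus on group blocks, that is, combinations of basic groups, to distinguish aromatics from paraffins and the corresponding isomers. As shown in Figure 2, the primary basic groups include the A6 group, representing the aromatic ring, and the R group, representing CH2. Because the -CH2 group is attached to the benzene ring, it forms a new group block, aC-CH, with the ring carbon to which it is bonded, and this block is represented at the secondary group-block level to characterize the component.
The tertiary groups are component descriptors. Component descriptors are numerous, and the accuracy of descriptors based on quantum chemical calculations is still debated in the scientific community, so the focus is on descriptors of component topology, such as the connectivity index.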
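As an illustration of a topological descriptor, the sketch below computes the first-order Randić connectivity index from a hand-written bond list. The patent names the connectivity index only as an example and does not say which variant it uses, so the choice of the Randić index, the function name `randic_index`, and the atom labels are all assumptions made here.

```python
import math

def randic_index(bonds):
    """First-order Randic connectivity index of a hydrogen-suppressed graph.

    bonds: list of (atom_a, atom_b) pairs between heavy atoms.
    """
    # Degree of each atom = number of heavy-atom neighbours.
    degree = {}
    for a, b in bonds:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    # Sum 1/sqrt(d_a * d_b) over all bonds.
    return sum(1.0 / math.sqrt(degree[a] * degree[b]) for a, b in bonds)

# Carbon skeletons of two C4H10 isomers (hypothetical atom labels).
n_butane = [(1, 2), (2, 3), (3, 4)]
isobutane = [(1, 2), (1, 3), (1, 4)]

print(round(randic_index(n_butane), 4))   # 1.9142
print(round(randic_index(isobutane), 4))  # 1.7321
```

The two isomers contain the same atoms but give different index values, which is the kind of structural information that counting primary groups alone cannot express.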
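S2: Build and screen the component feature set for the target component.
As shown in Figure 3, when the refining properties of a target component are to be predicted, SMILES, an encodable simplified notation for components, is used to represent the complex component structure as a two-dimensional code, and the molecular structure of the given component is automatically decomposed into group fragments that match the component library, enabling quantitative analysis.
Primary and secondary groups are first screened from the group library according to the molecular structure of the target component; global optimization algorithms such as simulated annealing or genetic algorithms can be used for this screening. Tertiary groups are then added to the screened feature set. Added tertiary groups will inevitably overlap with the primary and secondary groups, making the feature set redundant. Therefore, drawing on information theory and machine learning, the minimum-correlation, maximum-information concept is introduced: an added tertiary group must keep minimal correlation with the lower-level groups already present while keeping maximal information about the property to be predicted, that is, it should characterize that property as fully as possible.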
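The minimum-correlation, maximum-information idea can be sketched as a greedy selection loop. This is only an illustration: it uses the absolute Pearson correlation as a simple stand-in for both "correlation" and "information", which the patent does not specify, and the function names and data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def select_descriptors(existing, candidates, target, k):
    """Greedily pick k candidate descriptors.

    Score = relevance to the target minus mean redundancy with the features
    already in the set (existing groups plus descriptors picked so far).
    """
    chosen = []
    pool = dict(candidates)
    selected_cols = list(existing)
    while pool and len(chosen) < k:
        def score(name):
            col = pool[name]
            relevance = abs(pearson(col, target))
            if not selected_cols:
                return relevance
            redundancy = sum(abs(pearson(col, s)) for s in selected_cols)
            return relevance - redundancy / len(selected_cols)
        best = max(pool, key=score)
        chosen.append(best)
        selected_cols.append(pool.pop(best))
    return chosen

# Made-up data: one existing group count and two candidate descriptors.
group_count = [1, 2, 3, 4, 5]
candidates = {
    "d_copy": [2, 4, 6, 8, 10],  # duplicates the information in group_count
    "d_new": [1, 0, 1, 0, 1],    # uncorrelated with group_count
}
target = [4, 2, 6, 4, 8]         # equals group_count + 3 * d_new

print(select_descriptors([group_count], candidates, target, k=1))  # ['d_new']
```

Here `d_copy` is fairly relevant to the target but fully redundant with the existing group count, so the descriptor that carries new information is chosen first.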
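Bayes' rule is then introduced for feature selection, to compute the posterior probability of the candidate models. A single model m belongs to the candidate model set M. Each model follows the distribution of the known data set Y, f(y | m, β_m), where the parameter vector β_m ∈ B_m and B_m is the set of possible values of the coefficients of model m. Letting the prior probability of model m be f(m), the posterior probability is:
where f(y | m) is the marginal likelihood, which can be computed from f(y | m) = ∫ f(y | m, β_m) f(β_m | m) dβ_m and f(β_m | m). In the vast majority of cases, however, this integral has no analytical solution, so its value is approximated by Markov chain Monte Carlo (MCMC) random sampling. The sampling range is the space of (m, β_m):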
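The formulas referred to in the two paragraphs above are not reproduced in this text. Under Bayes' rule and the definitions given there, they would take the following standard form; this is a reconstruction from those definitions, not the patent's own rendering, and the summation index m' is introduced here for illustration.

```latex
% Posterior probability of model m given the data y
f(m \mid y) = \frac{f(y \mid m)\, f(m)}{\sum_{m' \in M} f(y \mid m')\, f(m')}

% Marginal likelihood of model m
f(y \mid m) = \int_{B_m} f(y \mid m, \beta_m)\, f(\beta_m \mid m)\, d\beta_m

% Joint target sampled by MCMC over the space of (m, \beta_m)
f(m, \beta_m \mid y) \propto f(y \mid m, \beta_m)\, f(\beta_m \mid m)\, f(m)
```

Sampling the joint target and recording how often each m is visited gives an estimate of f(m | y) without evaluating the integral in closed form.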
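Feature selection is a sub-problem of model selection: the candidate models are represented with a binomial distribution, where p is the total number of features. This yields, for the model represented by each feature subset, its posterior probability given the known data set Y, which acts as a "soft" constraint. The core of this Bayes-rule-based feature selection method is MCMC sampling.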
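A minimal sketch of MCMC over feature subsets follows. It assumes each candidate model is encoded as a 0/1 inclusion vector of length p and that a caller-supplied function returns the log of f(y|m)f(m) up to a constant, so the coefficients β_m are treated as already integrated out; the patent instead samples in the joint (m, β_m) space. The sampler name, the toy scoring function, and all numbers are hypothetical.

```python
import math
import random

def sample_feature_subsets(log_score, p, n_iter=5000, seed=0):
    """Metropolis sampler over inclusion vectors in {0,1}^p.

    log_score(m) must return log f(y|m) + log f(m) up to an additive constant.
    Returns the visit frequency of each subset, an estimate of f(m|y).
    """
    rng = random.Random(seed)
    current = tuple(0 for _ in range(p))
    current_score = log_score(current)
    visits = {}
    for _ in range(n_iter):
        # Propose flipping one randomly chosen feature in or out (symmetric move).
        j = rng.randrange(p)
        proposal = current[:j] + (1 - current[j],) + current[j + 1:]
        proposal_score = log_score(proposal)
        # Accept with probability min(1, ratio of unnormalised posteriors).
        diff = proposal_score - current_score
        if rng.random() < math.exp(min(0.0, diff)):
            current, current_score = proposal, proposal_score
        visits[current] = visits.get(current, 0) + 1
    return {m: c / n_iter for m, c in visits.items()}

def toy_log_score(m, best=(1, 0, 1)):
    # Hypothetical stand-in: penalise each feature that differs from `best`.
    return -2.0 * sum(a != b for a, b in zip(m, best))

posterior = sample_feature_subsets(toy_log_score, p=3)
top = max(posterior, key=posterior.get)
print(top, round(posterior[top], 2))
```

With this toy score the chain spends most of its time on the subset (1, 0, 1), while the other subsets keep small but non-zero frequencies. That graded ranking, rather than a single hard choice, is what the text calls a "soft" constraint.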
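Group fragments are selected on a mechanistic basis and combined with component descriptors that need no regressed coefficients. This reduces the number of coefficients to be regressed and considerably lowers the dependence on data set size. The method also gives the posterior probability of each feature-subset model, which acts as a "soft" constraint and suits prediction of the refining properties of pure components when data are limited.
S3: Build the hierarchical group model and solve for the coefficients.
The hierarchical group modeling and coefficient-solving process is shown in Figure 4 and falls into two parts: data preprocessing and modeling verification. Because existing data sets on component refining properties are very sparse, advanced statistical and machine learning methods are needed at the data preprocessing stage to improve the accuracy of small-sample modeling.
The distributions of the feature values and experimental values of the components in the database rarely satisfy normality, which degrades the model. They are therefore transformed toward normality with a probabilistic statistical method, namely the Box-Cox log-likelihood function method. Because the feature space is sparse, a randomly selected training set rarely covers the feature space of the test set, so a model built on it extrapolates too far and predicts poorly. The second step therefore uses unsupervised learning, which looks only at the feature set of the data and does not evaluate modeling performance: cluster analysis is applied directly to the data set, and the sparse holes in its feature space are estimated approximately. A training set chosen on this basis covers the feature space of the test samples as far as possible and improves prediction.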
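The Box-Cox step can be sketched as a grid search over the transform parameter that maximises the Box-Cox log-likelihood. The grid, the function names, and the sample data are assumptions for illustration; the patent does not describe how the parameter is searched. The transform parameter is called `lam` here and is unrelated to the descriptor coefficient λ of the group model.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

def box_cox_log_likelihood(values, lam):
    """Profile log-likelihood of lam under a normal model for the transformed data."""
    n = len(values)
    y = box_cox(values, lam)
    mean = sum(y) / n
    var = sum((t - mean) ** 2 for t in y) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)

def best_lambda(values, grid):
    """Pick the grid value of lam with the highest log-likelihood."""
    return max(grid, key=lambda lam: box_cox_log_likelihood(values, lam))

grid = [i / 10 for i in range(-20, 21)]
# Hypothetical right-skewed property values (exponentials of symmetric numbers).
skewed = [math.exp(z) for z in (-2, -1, 0, 1, 2)]
lam = best_lambda(skewed, grid)
print(lam)  # 0.0, i.e. a log transform
```

For these values the search selects `lam = 0`, the log transform, which maps the skewed sample back onto evenly spaced points.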
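For modeling the hierarchical groups, the traditional linear accumulation function is preferred, because it is computationally cheap and yields contribution coefficients for the groups, which to some extent provide richer mechanistic information. Its form is as follows:
where f(Y) is the function of the property to be predicted; C_i is the contribution of the i-th group among the primary groups, N_i is the number of occurrences of group i, and δ is the primary-group coefficient; w is the secondary-group coefficient, D_j is the contribution of the j-th group among the secondary groups, and M_j is its number of occurrences; λ is the coefficient of the tertiary component descriptors, and f(Y*) is the total contribution of the tertiary descriptors to the given property.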
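The formula itself is not reproduced in this text. The symbol definitions above imply a three-term weighted sum, which can be reconstructed as follows; this is an inference from those definitions, in the usual form of multi-level group contribution models, not the patent's own rendering.

```latex
f(Y) = \delta \sum_{i} N_i C_i \;+\; w \sum_{j} M_j D_j \;+\; \lambda\, f(Y^{*})
```

Each term corresponds to one level: primary groups weighted by δ, secondary group blocks weighted by w, and the descriptor contribution f(Y*) weighted by λ.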
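When calculating the level coefficients δ, w, λ and the group contributions C_i, D_j, regression is carried out level by level. C_i is obtained by regression on the training set; the secondary-group contributions D_j are then obtained by regression; because f(Y*) is computed from the component descriptors, it needs no regression, which greatly reduces the required training set size. Finally, the group coefficients δ, w, λ are obtained in a single joint regression. The resulting coefficients δ, w, λ are weights that represent the influence of the group fragments at each level on the given property.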
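Once the contributions and level coefficients are known, a prediction is a short weighted sum. The sketch below assumes the linear accumulation form implied by the symbol definitions, f(Y) = δ ΣN_i C_i + w ΣM_j D_j + λ f(Y*); the function name and every number in it are made up for illustration.

```python
def predict_property(primary, secondary, descriptor_term, delta, w, lam):
    """Evaluate delta*sum(N_i*C_i) + w*sum(M_j*D_j) + lam*f(Y*).

    primary, secondary: lists of (occurrences, contribution) pairs.
    descriptor_term: f(Y*), computed from descriptors rather than regressed.
    """
    first = sum(n * c for n, c in primary)
    second = sum(m * d for m, d in secondary)
    return delta * first + w * second + lam * descriptor_term

# All numbers below are hypothetical.
primary = [(2, 10.0), (3, -4.0)]   # (N_i, C_i)
secondary = [(1, 5.0)]             # (M_j, D_j)
value = predict_property(primary, secondary, descriptor_term=1.5,
                         delta=1.0, w=0.5, lam=2.0)
print(value)  # 13.5
```

Because the three level coefficients multiply whole terms, comparing δ, w, and λ shows at a glance how much each level contributes to the predicted property.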
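S4: Perform uncertainty analysis.
As shown in Figures 5 and 6, uncertainty analysis of the predicted values, that is, estimation of confidence intervals, is essential to practical use of the model. Because the hierarchical group model has an explicit mathematical expression and also includes the probability distribution of each candidate model, applying Bayes' rule again yields the confidence intervals of all candidate models. Combining these with the accuracy of each model, and weighing accuracy against practicality, gives octane number and cetane number models better suited to the actual conditions of the refinery.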
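One simple way to weigh accuracy against interval width is to keep only the candidate models that no other model beats on both criteria. The sketch below does this with a Pareto filter; the patent refers to multi-objective optimization without naming a procedure, so this filter, the model names, and the numbers are all assumptions.

```python
def pareto_front(models):
    """Return the models not dominated on (error, interval_width); lower is better."""
    front = []
    for name, err, width in models:
        dominated = any(
            e <= err and w <= width and (e < err or w < width)
            for _, e, w in models
        )
        if not dominated:
            front.append(name)
    return front

# (name, prediction error, confidence interval width), all hypothetical.
candidates = [
    ("model_a", 1.2, 4.0),
    ("model_b", 1.5, 2.5),
    ("model_c", 1.6, 4.5),
    ("model_d", 2.4, 2.0),
]
print(pareto_front(candidates))  # ['model_a', 'model_b', 'model_d']
```

Here `model_c` is dropped because `model_a` is both more accurate and more certain; the choice among the remaining three depends on how the refinery trades accuracy against interval width.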
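It should be noted that, when carrying out the yield-rate prediction calculation, those skilled in the art may make appropriate modifications and set the corresponding parameters according to the relevant principles. The embodiment described above expresses only one implementation of the invention; its description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent.
A specific embodiment of the invention is as follows: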
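Step 10: Using SMILES, an encodable simplified notation for components, represent complex component structures as two-dimensional codes and build a predefined group-fragment component library comprising primary, secondary, and tertiary groups. The primary groups are the basic groups that make up a component structure; the secondary groups are combinations of basic groups by linking position, used to distinguish aromatics from paraffins and the corresponding isomers; the tertiary groups are descriptors of component topology;
Step 20: Screen primary and secondary groups out of the predefined group-fragment component library according to the molecular structure of the target component. Then, according to the refining property to be predicted for the target component, screen out multiple tertiary groups on the principle of keeping minimal correlation with the primary and secondary groups while keeping maximal information about the property to be predicted. Randomly select any number of tertiary groups and combine them with the screened primary and secondary groups to form multiple component feature sets, then screen these to obtain the component feature set with the highest posterior probability;
Step 30: Combine the groups at the different levels in a linear accumulation function for modeling, then solve for the coefficients on the training set to obtain the hierarchical group model;
Step 40: Generate multiple candidate models from the component feature set with the highest posterior probability. Based on the hierarchical group model, apply Bayes' rule again to obtain the confidence intervals of all candidate models and, taking the accuracy of each candidate model into account, select by the principle of multi-objective optimization the octane number and cetane number models suited to the actual conditions of the refinery.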
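In step 20, screening to obtain the component feature set with the highest posterior probability specifically includes:
A single model m belongs to the candidate model set M. Each model follows the distribution of the known data set Y, f(y | m, β_m), where the parameter vector β_m ∈ B_m and B_m is the set of possible values of the coefficients of model m. Letting the prior probability of model m be f(m), the posterior probability is:
where f(y | m) is the marginal likelihood, computed from f(y | m) = ∫ f(y | m, β_m) f(β_m | m) dβ_m and f(β_m | m); its value is approximated by Markov chain Monte Carlo random sampling, with the sampling range being the space of (m, β_m):
where p is the total number of features.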
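Step 30 includes a data preprocessing process and a modeling verification process;
The data preprocessing process is: transform the data set toward normality with a probabilistic statistical method, then apply an unsupervised learning method to cluster the data set directly and approximately estimate the sparse holes in its feature space, obtaining the training set;
In the modeling verification process, the hierarchical groups are modeled with a linear accumulation function, with the following formula:
where f(Y) is the function of the property to be predicted; C_i is the contribution of the i-th group among the primary groups, N_i is the number of occurrences of group i, and δ is the primary-group coefficient; w is the secondary-group coefficient, D_j is the contribution of the j-th group among the secondary groups, and M_j is its number of occurrences; λ is the coefficient of the tertiary component descriptors, and f(Y*) is the total contribution of the tertiary descriptors to the given property;
When calculating the level coefficients δ, w, λ and the group contributions C_i, D_j, regression is carried out level by level: C_i is obtained by regression on the training set; the secondary-group contributions D_j are then obtained by regression; f(Y*) is computed from the component descriptors and needs no regression; finally, the group coefficients δ, w, λ are obtained in a single joint regression. These coefficients are weights that represent the influence of the group fragments at each level on the given property.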
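The invention selects group fragments on a mechanistic basis and combines them with component descriptors that need no regressed coefficients. This reduces the number of coefficients to be regressed and considerably lowers the dependence on data set size. The method also gives the posterior probability of each feature-subset model, which acts as a "soft" constraint and suits prediction of the refining properties of pure components when data are limited. On this basis, Bayes' rule is introduced again so that the posterior probability distribution of the final model can be estimated, avoiding the risk of overfitting in the final prediction model.
Although specific embodiments of the invention have been described above, those skilled in the art should understand that the specific embodiments described are illustrative only and do not limit the scope of the invention. Equivalent modifications and variations made by those skilled in the art in accordance with the spirit of the invention shall fall within the scope protected by the claims of the invention.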

Claims (3)

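  1. A method for predicting the refining properties of pure components based on hierarchical group construction, characterized by comprising:
    Step 10: using SMILES, an encodable simplified notation for components, representing complex component structures as two-dimensional codes and building a predefined group-fragment component library comprising primary, secondary, and tertiary groups; the primary groups being the basic groups that make up a component structure; the secondary groups being combinations of basic groups by linking position, used to distinguish aromatics from paraffins and the corresponding isomers; the tertiary groups being descriptors of component topology;
    Step 20: screening primary and secondary groups out of the predefined group-fragment component library according to the molecular structure of the target component; then, according to the refining property to be predicted for the target component, screening out multiple tertiary groups on the principle of keeping minimal correlation with the primary and secondary groups while keeping maximal information about the property to be predicted; randomly selecting any number of tertiary groups and combining them with the screened primary and secondary groups to form multiple component feature sets; and then screening to obtain the component feature set with the highest posterior probability;
    Step 30: combining the groups at the different levels in a linear accumulation function for modeling, then solving for the coefficients on the training set to obtain the hierarchical group model;
    Step 40: generating multiple candidate models from the component feature set with the highest posterior probability; based on the hierarchical group model, applying Bayes' rule again to obtain the confidence intervals of all candidate models; and, taking the accuracy of each candidate model into account, selecting by the principle of multi-objective optimization the octane number and cetane number models suited to the actual conditions of the refinery.
  2. The method according to claim 1, characterized in that, in step 20, screening to obtain the component feature set with the highest posterior probability specifically includes:
    a single model m belongs to the candidate model set M; each model follows the distribution of the known data set Y, f(y | m, β_m), where the parameter vector β_m ∈ B_m and B_m is the set of possible values of the coefficients of model m; letting the prior probability of model m be f(m), the posterior probability is:
    where f(y | m) is the marginal likelihood, computed from f(y | m) = ∫ f(y | m, β_m) f(β_m | m) dβ_m and f(β_m | m); its value is approximated by Markov chain Monte Carlo random sampling, with the sampling range being the space of (m, β_m):
    where p is the total number of features.
  3. The method according to claim 1, characterized in that step 30 includes a data preprocessing process and a modeling verification process;
    the data preprocessing process is: transforming the data set toward normality with a probabilistic statistical method, then applying an unsupervised learning method to cluster the data set directly and approximately estimate the sparse holes in its feature space, obtaining the training set;
    in the modeling verification process, the hierarchical groups are modeled with a linear accumulation function, with the following formula:
    where f(Y) is the function of the property to be predicted; C_i is the contribution of the i-th group among the primary groups, N_i is the number of occurrences of group i, and δ is the primary-group coefficient; w is the secondary-group coefficient, D_j is the contribution of the j-th group among the secondary groups, and M_j is its number of occurrences; λ is the coefficient of the tertiary component descriptors, and f(Y*) is the total contribution of the tertiary descriptors to the given property;
    when calculating the level coefficients δ, w, λ and the group contributions C_i, D_j, regression is carried out level by level: C_i is obtained by regression on the training set; the secondary-group contributions D_j are then obtained by regression; f(Y*) is computed from the component descriptors and needs no regression; finally, the group coefficients δ, w, λ are obtained in a single joint regression, these coefficients being weights that represent the influence of the group fragments at each level on the given property.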
PCT/CN2023/085001 2022-04-19 2023-03-30 A method for predicting the refining properties of pure components based on hierarchical group construction WO2023202345A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210411133.2 2022-04-19
CN202210411133.2A CN114708930A (zh) 2022-04-19 2022-04-19 A method for predicting the refining properties of pure components based on hierarchical group construction

Publications (1)

Publication Number Publication Date
WO2023202345A1 true WO2023202345A1 (zh) 2023-10-26

Family

ID=82174562

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/085001 WO2023202345A1 (zh) 2022-04-19 2023-03-30 A method for predicting the refining properties of pure components based on hierarchical group construction

Country Status (2)

Country Link
CN (1) CN114708930A (zh)
WO (1) WO2023202345A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114708930A (zh) * 2022-04-19 2022-07-05 泉州装备制造研究所 A method for predicting the refining properties of pure components based on hierarchical group construction

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
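CN111899795A (zh) * 2020-06-12 2020-11-06 中国石油天然气股份有限公司 Molecular-level whole-process optimization method, apparatus, system, and storage medium for oil refining
CN111899793A (zh) * 2020-06-12 2020-11-06 中国石油天然气股份有限公司 Real-time optimization method, apparatus, system, and storage medium for a molecular-level unit
WO2021234065A1 (en) * 2020-05-22 2021-11-25 Basf Coatings Gmbh Prediction of properties of a chemical mixture
CN113707240A (zh) * 2021-07-30 2021-11-26 浙江大学 Robust soft-sensing method for component parameters based on a semi-supervised nonlinear variational Bayesian mixture model
CN114708930A (zh) * 2022-04-19 2022-07-05 泉州装备制造研究所 A method for predicting the refining properties of pure components based on hierarchical group construction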

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090105167A1 (en) * 2007-10-19 2009-04-23 Duke University Predicting responsiveness to cancer therapeutics
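CN106444672A (zh) * 2016-10-12 2017-02-22 杭州辛孚能源科技有限公司 Molecular-level real-time optimization method for oil refining and petrochemical units
CN108279251B (zh) * 2018-01-26 2019-08-02 中国石油大学(北京) Method and apparatus for simulating petroleum separation processes at the molecular level
CN108763855B (zh) * 2018-05-18 2022-04-26 南京工业大学 Method for determining the combustion performance of biodiesel
CN115831246A (zh) * 2022-12-05 2023-03-21 北京科技大学 Joint optimization method for pharmaceutical chemical reaction synthesis and conversion rate prediction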

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021234065A1 (en) * 2020-05-22 2021-11-25 Basf Coatings Gmbh Prediction of properties of a chemical mixture
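CN111899795A (zh) * 2020-06-12 2020-11-06 中国石油天然气股份有限公司 Molecular-level whole-process optimization method, apparatus, system, and storage medium for oil refining
CN111899793A (zh) * 2020-06-12 2020-11-06 中国石油天然气股份有限公司 Real-time optimization method, apparatus, system, and storage medium for a molecular-level unit
CN113707240A (zh) * 2021-07-30 2021-11-26 浙江大学 Robust soft-sensing method for component parameters based on a semi-supervised nonlinear variational Bayesian mixture model
CN114708930A (zh) * 2022-04-19 2022-07-05 泉州装备制造研究所 A method for predicting the refining properties of pure components based on hierarchical group construction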

Also Published As

Publication number Publication date
CN114708930A (zh) 2022-07-05

Similar Documents

Publication Publication Date Title
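WO2023040512A1 (zh) Simulation and prediction method for catalytic cracking units based on a molecular-level mechanistic model and big data technology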
Schaid Genomic similarity and kernel methods II: methods for genomic information
Song et al. Modeling the hydrocracking process with deep neural networks
Ma et al. MIDIA: exploring denoising autoencoders for missing data imputation
Castro et al. Significant motifs in time series
WO2023202345A1 (zh) A method for predicting the refining properties of pure components based on hierarchical group construction
He et al. Near-infrared spectroscopy for the concurrent quality prediction and status monitoring of gasoline blending
Chen et al. Adaptive modeling strategy integrating feature selection and random forest for fluid catalytic cracking processes
Wang et al. Layer-wise residual-guided feature learning with deep learning networks for industrial quality prediction
Feng et al. Accurate de novo prediction of RNA 3D structure with transformer network
Manfredi et al. ISPRED-SEQ: Deep neural networks and embeddings for predicting interaction sites in protein sequences
Han et al. Energy consumption hierarchical analysis based on interpretative structural model for ethylene production
Zhou et al. TransVAE-DTA: Transformer and variational autoencoder network for drug-target binding affinity prediction
Mei et al. Molecular-based bayesian regression model of petroleum fractions
Bondugula et al. MUPRED: a tool for bridging the gap between template based methods and sequence profile based methods for protein secondary structure prediction
Tan et al. Semantical and Geometrical Protein Encoding Toward Enhanced Bioactivity and Thermostability
Luo et al. Developing soft sensors using hybrid soft computing methodology: a neurofuzzy system based on rough set theory and genetic algorithms
Yan et al. Insights into deep learning framework for molecular property prediction based on different tokenization algorithms
Guan et al. Dual‐objective optimization for petroleum molecular reconstruction based on property and composition similarities
Lai et al. Workload-Aware Query Recommendation Using Deep Learning.
Yu et al. A novel interpretable ensemble learning method for NIR-based rapid characterization of petroleum products
Freitas et al. Descriptors-based machine-learning prediction of cetane number using quantitative structure–property relationship
Zhang et al. Molecular Reconstruction of Crude Oil: Novel Structure-Oriented Homologous Series Lumping with a Cloud Model
Shi et al. Interpretable reconstruction of naphtha components using property-based extreme gradient boosting and compositional-weighted Shapley additive explanation values
Zhao et al. Computational and Mathematical Methods in Medicine Prediction of COVID‐19 in BRICS Countries: An Integrated Deep Learning Model of CEEMDAN‐R‐ILSTM‐Elman

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23791003

Country of ref document: EP

Kind code of ref document: A1