CN103985055A

CN103985055A - Stock market investment decision-making method based on network analysis and multi-model fusion

Info

Publication number: CN103985055A
Application number: CN201410240496.XA
Authority: CN
Inventors: 彭勤科; 钟韬; 关新宇; 王晓; 秦小雨; 朱志博; 孙智
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2014-05-30
Filing date: 2014-05-30
Publication date: 2014-08-13

Abstract

The invention discloses a stock market investment decision-making method based on network analysis and multi-model fusion. Fundamental information is crawled from the web and used to construct network nodes and network connections, forming a complex social network model. An investment portfolio is selected by network analysis, and the data involved in the portfolio are input into a multi-model fusion framework comprising a plurality of sub-models. Each sub-model performs a market trend prediction with its own characteristics, based on technical information with different features crawled from the web, and generates its own predicted value; the predicted values are weighted and summed to obtain a comprehensive market trend prediction, from which a corresponding investment strategy is generated. The method considers, from multiple angles, portfolio risk factors that are generally ignored in existing research, and meets the real-time requirements of the strategy through dimensionality-reducing preselection of fundamental data and technical feature selection, thereby providing more reliable investment strategies.

Description

A stock market investment decision-making method based on network analysis and multi-model fusion

Technical field

The invention belongs to the technical field of stock market investment strategy analysis and relates to a stock market investment decision-making method based on network analysis and multi-model fusion.

Background art

The financial market is the core of a country's economic operation. Securities investment strategy has therefore always been one of the core issues of greatest concern to investment theorists and practitioners in every country. A securities investment strategy reflects our understanding of the laws of the financial market and of investor psychology; it is a system of rules and a plan of action, formulated according to investment objectives, that guides investment behavior. Technical analysis and fundamental analysis are the two main methods of investment analysis. Technical analysis is mainly applied to judging the timing and price range of specific investment operations, whereas fundamental analysis is mainly applied to selecting investment targets, as an important means of improving the effectiveness and reliability of investment analysis.

With the rapid development of Internet technology, a large amount of information related to the financial market spreads over the Internet. This real-time information is huge in scale and varied in form, and it contains important information relevant to investment decisions. How to make comprehensive use of this information for market prediction and analysis is an important problem in financial market investment decision-making.

In recent years, many researchers have studied investment decision-making methods. One relatively new research direction analyzes the relationship between textual information and stock prices. For example, the well-known Arizona State University developed the Arizona Financial Text system, which makes predictions by analyzing financial news and articles about stock prices; in 2009, Schumaker's research team analyzed the feasibility of making predictions from financial news with a text-based system and reached a positive conclusion; in 2012, Nizer's research team went further and studied how to identify which news items have a visible impact on the stock market. These results all indicate that investment strategies guided by automatic analysis can obtain excess returns. However, the analysis methods offered by these text-based studies use limited information, being restricted to texts with a strong sentiment orientation, such as financial news. Moreover, these text-based decision methods usually ignore the data used in technical analysis (such as stock prices and stock indexes), so they are clearly methods that rely on incomplete information.

On the other hand, traditional analysis methods (that is, stock price prediction methods) have been studied for a long time and have produced a series of relatively mature results. MIT financial expert Luo Yaozong said: "Technical analysis is an effective method of extracting useful information from market prices." Some studies from the US Federal Reserve Board and from academia also indicate that evidence supporting technical analysis exists. In China, the research teams of Liu Haiyue (2011), Jiang Long (2012) and Zheng Xiaowei (2013) recently modeled and predicted stock price trends using neural networks, grey RBF networks and LSSVM respectively, and all achieved fairly satisfactory results. However, these prediction methods base their analysis on regularities in historical data, which is the idea of technical analysis; their theory is contradicted by the efficient market hypothesis, and the information they use is limited. In addition, traditional research considers only prediction accuracy. Given how frequently actual stock prices fluctuate, accuracy does not correspond directly to actual investment return, and even high prediction accuracy may yield negative returns. Furthermore, portfolios built in this way may carry high investment risk.

At the same time, traditional stock price trend analysis and prediction methods do not consider the scale and timeliness of data. The amount of information they use is therefore very limited, and training becomes very time-consuming when predicting over large volumes of data. Given the real-time requirements of actual investment decisions, they can no longer meet the needs of stock market analysis and prediction in a networked environment with massive data. Moreover, most existing stock prediction methods ignore the correlation between stocks and assume that the price changes of different stocks are mutually independent, studying and analyzing price trends on that basis. This simplifying assumption clearly contradicts our general understanding of the financial market, because the listed companies behind the stocks are interconnected and influence one another.

As mentioned above, technical analysis is mainly applied to judging the timing and price range of specific investment operations, whereas fundamental analysis is mainly applied to selecting investment targets, as an important means of improving the effectiveness and reliability of investment analysis.

Summary of the invention

The problem to be solved by the present invention is to provide a stock market investment decision-making method based on network analysis and multi-model fusion that combines fundamental analysis and technical analysis to make market investment decisions, and can effectively reduce investment risk and increase investment return.

The present invention is realized through the following technical solution:

A stock market investment decision-making method based on network analysis and multi-model fusion, comprising the following operations:

First, fundamental information is crawled from the web, network nodes and network connections are constructed from it, and a complex social network model is built; an investment portfolio is selected by network analysis, and the data involved in the portfolio are then input into a multi-model fusion framework;

The multi-model fusion framework comprises a plurality of sub-models. Each sub-model performs a market trend prediction with its own characteristics on technical information with different features crawled from the web and generates its own predicted value; the predicted values are then weighted and summed to obtain a comprehensive market trend prediction, from which a corresponding investment strategy is generated;

The feature selectors that supply information to the sub-models, the parameters of the sub-models, and the weights of the sub-models' predicted values are all trained in a wrapper fashion by a univariate distribution estimation algorithm.

The construction of the complex social network model comprises the following operations:

1.1) Network nodes

In the vector space model, a text of fundamental information crawled from the web is represented as a bag of words in binary (feature, weight) feature-vector form:

$\mathit{inf}_i = (\langle t_1, w_{i1} \rangle, \langle t_2, w_{i2} \rangle, \ldots, \langle t_M, w_{iM} \rangle)$

where M is the number of features and $w_{ik}$ is the weight of text feature $t_k$, computed by the tf*idf method; for a fixed feature set this simplifies to $\mathit{inf}_i = (w_{i1}, w_{i2}, \ldots, w_{iM})$;

The text in the fundamental information obtained from the web by data mining is processed as follows:

1) Filtering: remove the useless parts of the information;

2) Word segmentation: split the filtered information into words, store the segmentation result in a vocabulary, and tag the part of speech of each word;

3) Stop-word processing: further process the words in the vocabulary, including removing function words and binding negation words;

After the text features in the fundamental information have been obtained, their weights are computed and the fundamental information is organized into the vector space model; for time-varying information, the vector space model becomes a time-varying vector:

$\mathit{inf}_i(t) = (w_{i1}(t), w_{i2}(t), \ldots, w_{iM}(t))$, where t is the time variable;
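By way of a non-limiting illustration, the weighting of text features could be sketched as follows. The text specifies only "tf*idf"; the particular variant used here (relative term frequency multiplied by the natural logarithm of the inverse document frequency) and the names `tfidf_vectors` and `docs` are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized texts (one token list per company) into tf*idf weight vectors.

    Returns (vocab, vectors), where vectors[i][k] plays the role of w_ik.
    The tf and idf definitions below are one common choice, assumed here.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of texts that contain each term at least once.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for an empty text
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors
```

A term that occurs in every text receives weight zero under this variant, so only terms that distinguish one company from another contribute to the node vector.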

1.2) Network connections

For the network $G(t) = (V(t), E(t))$, the fundamental information of the listed companies is modeled as the network nodes, that is, $V(t) = \{\mathit{inf}_i(t)\}$, $E(t) = \{(i, j, edg_{ij}(t)) \mid i, j \in V(t)\}$;

V(t) is the set of network nodes constructed from fundamental information, and E(t) is the set of triples consisting of two network nodes i and j and the connection strength $edg_{ij}(t)$ between them;

The connection strength is computed with the cosine similarity

$\cos(\mathit{inf}_i(t), \mathit{inf}_j(t)) = \frac{\sum_{t_n \in T_M} w_{in}(t)\, w_{jn}(t)}{\sqrt{\sum_{t_n \in T_M} w_{in}(t)^2 \, \sum_{t_n \in T_M} w_{jn}(t)^2}}$

where $T_M$ is the full set of text features of the fundamental information, and is filtered with a threshold θ, that is,

$edg_{ij}(t) = \begin{cases} 0, & \cos(\mathit{inf}_i(t), \mathit{inf}_j(t)) < \theta \\ \cos(\mathit{inf}_i(t), \mathit{inf}_j(t)), & \cos(\mathit{inf}_i(t), \mathit{inf}_j(t)) \geq \theta \end{cases}$

with θ taken as cos 45°;

The network nodes are described by the vector space model $\mathit{inf}_i(t) = (w_{i1}(t), w_{i2}(t), \ldots, w_{iM}(t))$, giving the required network nodes;

The network nodes are then connected as specified by $V(t) = \{\mathit{inf}_i(t)\}$, $E(t) = \{(i, j, edg_{ij}(t)) \mid i, j \in V(t)\}$, which yields the complex social network model; this model is a dynamic network model.
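For illustration only, the thresholded cosine connection strength can be sketched as below for one time instant. The function names `cosine` and `build_edges`, and the choice to store only the non-zero edges in a dictionary keyed by node index pairs, are assumptions of the example.

```python
import math

THETA = math.cos(math.radians(45))  # threshold cos 45 degrees, as stated in the text

def cosine(u, v):
    """Cosine similarity of two weight vectors; defined as 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_edges(vectors, theta=THETA):
    """Return {(i, j): edg_ij} for i < j, keeping only similarities >= theta."""
    edges = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            strength = cosine(vectors[i], vectors[j])
            if strength >= theta:
                edges[(i, j)] = strength
    return edges
```

Recomputing `build_edges` on the node vectors of each time step gives one snapshot of the dynamic network per step.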

Selecting the investment portfolio by network analysis means choosing the stocks that are least related to one another to form the portfolio, and includes the following diversification partitioning method based on community-detection clustering:

The network is partitioned by community detection, using the Girvan-Newman clustering method for network clustering; its evaluation index is the modularity $Q(t) = \sum_{i} \left( e_{ii}(t) - a_i^2(t) \right)$, where t is the time variable;

Here $e_{ij}(t)$ is the fraction of the total connection weight carried by connections that link network nodes in community i to network nodes in community j; $a_i(t)$ is the fraction of the total connection weight carried by all connections attached to network nodes in community i, covering both connections with both endpoints inside the community and connections with only one endpoint inside it;

The modularity measures how much the partitioned complex social network model differs from a random network model; the community partition with the largest modularity is taken as the best partition;
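As a hedged sketch, the modularity of a given partition can be evaluated as follows. It assumes the usual weighted reading of the formula, in which $e_{ii}$ is the share of total edge weight lying inside community i and $a_i$ is the share of weighted edge ends attached to community i; the function name `modularity` and the edge-dictionary input format are illustrative.

```python
def modularity(edges, communities):
    """Weighted modularity Q = sum_i (e_ii - a_i**2) for one partition.

    edges: {(u, v): weight} for an undirected network.
    communities: list of disjoint sets of nodes covering every endpoint.
    """
    total = sum(edges.values())
    label = {node: k for k, com in enumerate(communities) for node in com}
    inside = [0.0] * len(communities)  # weight of edges with both ends in community k
    ends = [0.0] * len(communities)    # weighted edge ends attached to community k
    for (u, v), w in edges.items():
        ends[label[u]] += w
        ends[label[v]] += w
        if label[u] == label[v]:
            inside[label[u]] += w
    return sum(
        inside[k] / total - (ends[k] / (2 * total)) ** 2
        for k in range(len(communities))
    )
```

Evaluating this function on each partition produced during edge removal and keeping the partition with the largest value corresponds to the selection rule stated above.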

The basic procedure of the Girvan-Newman clustering method is as follows:

Step 1: compute the betweenness of all network connections;

Step 2: remove the network connection with the largest betweenness;

Step 3: recompute the betweenness of the network connections affected by Step 2;

Step 4: if no network connections remain, terminate the algorithm; otherwise go to Step 2;

Finally, one representative is chosen from each network community com(t) of the optimal partition, and together these form the required diversified portfolio;

The strategy model by which fundamental analysis selects the portfolio is then expressed as:

$IFA(t) = \{ n \mid \forall\, com(t),\ n = \arg {\max_{i \in com(t)}}^{N} (rep_i(t)) \}.$

Selecting the investment portfolio by network analysis means choosing the stocks that are least related to one another to form the portfolio, and includes the following diversification partitioning method based on the maximum fully connected unrelated subnetwork:

The Bron-Kerbosch algorithm is used to extract the maximum fully connected subnetwork (maximum clique) of the complement network. Its basic form is a recursive backtracking search, which proceeds as follows:

Bron-Kerbosch algorithm:

Step 1: given three sets (R, P, X), initialize R and X as empty sets and P as the set of all network nodes;

Step 2: if P and X are both empty, output R as a maximal clique;

Step 3: for each network node v taken from P, do the following:

1) add v to R, intersect P and X with the set N(v) of network nodes adjacent to v, and recurse on the resulting R, P, X (go to Step 2);

2) remove v from P and add v to X;

In this case, the strategy model by which fundamental analysis selects the portfolio is expressed as:

that is, the maximum fully connected subnetwork of the complement network generated by the Bron-Kerbosch algorithm.
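The Bron-Kerbosch steps above can be sketched, under stated assumptions, as a basic (non-pivoting) recursion applied to the complement network. The helper names `complement`, `bron_kerbosch` and `largest_unrelated_set` are illustrative, and picking the first largest clique found is an assumption for the case of ties.

```python
def complement(nodes, edges):
    """Adjacency sets of the complement network: u and v are adjacent
    exactly when they are NOT connected in the original network."""
    linked = {frozenset(e) for e in edges}
    return {
        u: {v for v in nodes if v != u and frozenset((u, v)) not in linked}
        for u in nodes
    }

def bron_kerbosch(r, p, x, adj, out):
    """Basic Bron-Kerbosch recursion; appends every maximal clique to out."""
    if not p and not x:
        out.append(set(r))  # Step 2: R is a maximal clique
        return
    for v in list(p):       # Step 3: iterate over a snapshot of P
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}         # move v from P ...
        x = x | {v}         # ... to X

def largest_unrelated_set(nodes, edges):
    """Largest set of mutually unconnected nodes, i.e. the maximum clique
    of the complement network."""
    adj = complement(nodes, edges)
    cliques = []
    bron_kerbosch(set(), set(nodes), set(), adj, cliques)
    return max(cliques, key=len)
```

For a network in which stock 0 is connected to stocks 1, 2 and 3 and no other connections exist, `largest_unrelated_set([0, 1, 2, 3], [(0, 1), (0, 2), (0, 3)])` selects the three mutually unrelated stocks 1, 2 and 3.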

The multi-model fusion framework decomposes a complex system into several subsystems, each corresponding to one sub-model, and then combines the sub-models of these subsystems to describe the same model jointly, so as to improve the goodness of fit;

The sub-models are combined by weighted summation: the output of each sub-model is multiplied by its weight and the results are summed to obtain the final output;

Sub-models can be added, removed or replaced. After training on the underlying data, the overall model adaptively selects among the prediction models by adjusting the weights of its sub-models;
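A minimal sketch of the weighted-sum fusion is given below. The text states only that the fused value is used to generate an investment strategy; the mapping of the fused value to "buy", "sell" or "hold" through a symmetric threshold, and the names `fuse` and `decide`, are assumptions introduced for illustration.

```python
def fuse(predictions, weights):
    """Weighted sum of the sub-model predictions (one weight per sub-model)."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per sub-model prediction")
    return sum(w * p for w, p in zip(weights, predictions))

def decide(score, threshold=0.0):
    """Hypothetical decision rule: map the fused trend value to an action."""
    if score > threshold:
        return "buy"
    if score < -threshold:
        return "sell"
    return "hold"
```

Because a sub-model with weight zero has no effect on the sum, adding, removing or replacing a sub-model only changes the length of the two lists; training the weights then decides how much each sub-model contributes.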

The sub-models include the following:

1) Trend prediction method based on vector symbol sequences

Historical stock price data are first vectorized by least-squares fitting. Let the price on day $x_i$ be $y_i$; minimizing the n-day error $S(a, b) = \sum_{i=0}^{n} (y_i - f(x_i))^2$, $f(x_i) = a x_i + b$, yields a trend characterized by its slope, defined as:

The continuous trend vectors are then discretized in order to extract macroscopic trend information. An unsupervised clustering method is used: in view of the particular nature of the vectorized data, a re-clustering algorithm based on k-means is applied, giving the clustering result and the center vector of each cluster;
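The least-squares vectorization step can be illustrated with the closed-form solution of the line fit. In this sketch the day index 0, 1, 2, ... serves as $x_i$, and splitting the price history into non-overlapping windows of equal length is an assumption, as are the names `trend_slope` and `trend_vectors`.

```python
def trend_slope(prices):
    """Least-squares line f(x) = a*x + b through (0, y0), (1, y1), ...

    Returns (a, b); the slope a characterizes the trend. Needs >= 2 prices.
    """
    n = len(prices)
    mean_x = (n - 1) / 2
    mean_y = sum(prices) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(prices))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def trend_vectors(prices, window):
    """Slopes of consecutive, non-overlapping windows (windowing is assumed)."""
    return [
        trend_slope(prices[k:k + window])[0]
        for k in range(0, len(prices) - window + 1, window)
    ]
```

The resulting list of slopes is the continuous trend sequence that the clustering step then discretizes into symbols.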

2) Trend prediction algorithm based on extraction of turning points from the stock time series

Based on the formal definition of fluctuation feature points of a time series, let the time series be $X = \{x_1, x_2, \ldots, x_N\}$; the fluctuation feature point x of X within its time interval is $\{x \mid VD(x) = \max(VD(x_i)),\ i = 1, 2, \ldots, N\}$, where

$VD(x_i) = \left| x_i - \left( x_1 + (x_N - x_1) \cdot \frac{i - 1}{N - 1} \right) \right|;$

3) Investment recommendation algorithm based on word sentiment orientation

For any word $w_i \in W$, define:

$T(w_i) \triangleq \begin{cases} 1, & \text{if } w_i \text{ is positive}, \\ -1, & \text{if } w_i \text{ is negative}. \end{cases}$

$T(w_i)$ is called the sentiment orientation value of $w_i$. Given a sentiment orientation relation network SORN = (W, C, Q), the sequence of word indices along a path from $w_i$ to $w_j$ in the SORN is written $(p_1, p_2, \ldots, p_s)$, where $2 \leq s \leq n$, $p_1 = i$, $p_s = j$;

If an element $(h_1, h_2, \ldots, h_s)$ of the set $\{(p_1, p_2, \ldots, p_s)\}$ satisfies $q_{h_1 h_2} + q_{h_2 h_3} + \ldots + q_{h_{s-1} h_s} = \min \{ q_{p_1 p_2} + q_{p_2 p_3} + \ldots + q_{p_{s-1} p_s} \}$, then $(h_1, h_2, \ldots, h_s)$ is called the sequence of word indices along a least-noise path from $w_i$ to $w_j$ in the SORN, denoted D(i,j); the sequence (D(i,1), ..., D(i,i-1), D(i,i+1), ..., D(i,n)) of the D(i,j) is denoted D(i); D(i) is computed with Dijkstra's algorithm implemented with adjacency lists and a Fibonacci heap;

The least-noise path between two words is used to determine the sentiment orientation relation of the pair. When the corpus is sufficiently large, D(i,j) can be obtained from Q, and the sentiment orientation relation between $w_i$ and $w_j$ can then be computed from C; in the corpus, positive words always outnumber negative words.
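As an illustrative sketch, the least-noise paths D(i) from one word to all reachable words can be computed with Dijkstra's algorithm. The text specifies a Fibonacci heap; this example substitutes the binary heap of Python's standard `heapq` module, which gives the same paths with a different asymptotic cost. The nested-dictionary representation of Q and the function name `least_noise_paths` are assumptions.

```python
import heapq

def least_noise_paths(q, source):
    """Dijkstra over an adjacency-list noise matrix.

    q: {node: {neighbour: noise}} with non-negative noise values.
    Returns {target: [source, ..., target]} for every reachable target.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry for an already settled node
        done.add(u)
        for v, noise in q.get(u, {}).items():
            nd = d + noise
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    paths = {}
    for target in dist:
        if target == source:
            continue
        path = [target]
        while path[-1] != source:  # walk predecessors back to the source
            path.append(prev[path[-1]])
        paths[target] = path[::-1]
    return paths
```

Each returned list corresponds to one index sequence $(h_1, h_2, \ldots, h_s)$ with $h_1$ the source word.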

In the clustering algorithm, each trend vector is replaced by the vector of the cluster center to which it belongs, which characterizes the type of trend; if this vector is represented by a symbol, the continuously varying trend vectors are finally discretized into a sequence of symbols. The trading day, opening price, highest price, lowest price, closing price, trading volume and turnover of a stock are used as the main data for technical analysis, and the historical information of a stock S is expressed as the symbol sequence $\{S_i\}$, $S_i = \{s_i^d, s_i^o, s_i^h, s_i^l, s_i^c, s_i^v, s_i^t\}$;

where the components correspond respectively to the stock's trading day, opening price, highest price, lowest price, closing price, trading volume and turnover on day i;

A support vector machine (SVM) with a radial basis kernel function is used to learn the symbolic model of historical stock price trends. SVM theory seeks the model with the strongest generalization ability. After the support vectors (SV) have been found, the optimal separating hyperplane (OHP) must be determined, which is a quadratic programming problem;

Let the SVs be the sample points closest to the OHP $(l \cdot x) + b = 0$; SVs of the same class are exactly equidistant from the OHP, while SVs of different classes are not necessarily equidistant from it;

For m training samples $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$, a separating hyperplane is sought that must satisfy the conditions of the optimal separating hyperplane:

To find the optimal separating hyperplane, the original problem is converted, by means of the Lagrange function, into a standard quadratic programming problem:

$\max W(\alpha) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$

$\text{s.t.}\ \sum_{i=1}^{m} \alpha_i y_i = 0,\ \alpha_i \geq 0,\ (i = 1, 2, \ldots, m)$

The key to finding the optimal hyperplane is to find the $\alpha_i$ that satisfy $\alpha_i > 0$ and

The optimal separating hyperplane is then:

$f(x) = \mathrm{sign} \left\{ \sum_{\alpha_i > 0} \alpha_i y_i K(x_i, x) - b_0 \right\}$

where the sample points with $\alpha_i > 0$ are the support vectors. The quadratic programming problem can be solved by an active-set method, a dual method or an interior-point algorithm. The trained support vector machine classifies new data automatically and outputs the symbol of the classification result; this symbol corresponds to the slope of the original cluster center vector, which is the final trend prediction.
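For illustration, the decision function $f(x)$ can be evaluated as below once the multipliers $\alpha_i$ and the offset $b_0$ are known. Solving the quadratic program itself is not shown. The kernel width `gamma`, the convention that a value of exactly zero is classified as +1, and the names `rbf` and `svm_predict` are assumptions of the sketch.

```python
import math

def rbf(u, v, gamma=1.0):
    """Radial basis kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_predict(x, support, alphas, labels, b0, gamma=1.0):
    """Evaluate sign(sum_i alpha_i * y_i * K(x_i, x) - b0) over the support vectors.

    support, alphas and labels list only the samples with alpha_i > 0.
    """
    score = sum(
        a * y * rbf(sv, x, gamma) for sv, a, y in zip(support, alphas, labels)
    ) - b0
    return 1 if score >= 0 else -1
```

In the method described above, the returned class label would index the trend symbol, and therefore the cluster-center slope, that serves as the trend prediction.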

The steps of the time-series fluctuation feature point extraction algorithm are as follows:

Step 1: input the start coordinate start and the end coordinate end of the sequence to be processed, and check whether the distance between start and end is smaller than the minimum interval length of a sub-sequence. If so, go to Step 3. If not, find the point between start and end with the largest VD value, as in the definition of fluctuation feature points; if its VD exceeds the amplitude termination condition of the algorithm, add the point to the result sequence of fluctuation feature points and denote it fp;

Step 2: use fp to split the original sequence into two segments, start to fp and fp to end, and apply Step 1 to each of them;

Step 3: sort the result sequence of fluctuation feature points by time, then save and output it;

Step 4: based on the minimum time-interval threshold for turning points, extract the turning-point seed set from the result sequence of fluctuation feature points by the maximum/minimum principle;

Step 5: based on the turning-point seed set, between any two consecutive turning points search the fluctuation feature point sequence for a turning point by the principle of maximum reverse amplitude, and add it to the seed set; repeat until no new turning point can be found under the configured turning-point extraction parameters;

Step 6: sort the turning-point seed set by time, then save and output it;
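Steps 1 to 3 above (extraction of fluctuation feature points) can be sketched as a recursive split. The sketch assumes that VD is the vertical distance between a point and the straight line joining the two ends of the current segment; the parameter names `min_len` (minimum interval length) and `min_vd` (amplitude termination condition) and the function name `feature_points` are illustrative. Steps 4 to 6 are not covered.

```python
def feature_points(xs, start, end, min_len=3, min_vd=0.0):
    """Indices of fluctuation feature points of xs[start..end], sorted by time."""
    found = []

    def split(lo, hi):
        if hi - lo < min_len:  # Step 1: segment shorter than the minimum interval
            return
        span = hi - lo
        best, best_vd = None, min_vd
        for k in range(lo + 1, hi):
            # Value of the straight line from (lo, xs[lo]) to (hi, xs[hi]) at k.
            line = xs[lo] + (xs[hi] - xs[lo]) * (k - lo) / span
            vd = abs(xs[k] - line)
            if vd > best_vd:   # must exceed the amplitude termination condition
                best, best_vd = k, vd
        if best is None:
            return
        found.append(best)     # record fp
        split(lo, best)        # Step 2: recurse on start..fp
        split(best, hi)        #         and on fp..end

    split(start, end)
    return sorted(found)       # Step 3: order by time
```

Raising `min_vd` keeps only the larger swings, which is how the amplitude condition controls the time scale of the extracted points.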

The turning-point seeds extracted above characterize the stock price trend on a larger time scale. Training a machine learning method such as a neural network or a support vector machine on them yields an optimal classifier, which can automatically classify new data and output whether the corresponding date is a turning point, that is, the prediction of the future trend. The trend prediction associated with the most recent turning point is taken as the final trend prediction.

The investment recommendation algorithm based on word sentiment orientation uses the SWSOA algorithm. Starting from any node of the largest connected subgraph of the SORN, the algorithm classifies the sentiment orientation of all nodes in that subgraph. Its input is a specific SORN, and the algorithm is written as the function SWSOA(SORN), where SORN denotes a SORN variable;

The specific steps of the algorithm are as follows:

Step 1. Use breadth-first traversal to obtain the largest connected subgraph Gs of the SORN; the set of word nodes contained in Gs is denoted $W_{GS}$;

Step 2. Pick any node $w_i$ in $W_{GS}$ and set $W_{GS} = W_{GS} - \{w_i\}$, $U = \{w_i\}$ and $V = \emptyset$;

Step 3. Compute D(i) to obtain the least-noise paths from $w_i$ to every node in $W_{GS}$;

Step 4. For each word node $w_j$ in $W_{GS}$, perform a) to c):

a) from D(i,j) in D(i), compute the number e of adversative-relation edges on the least-noise path from $w_i$ to $w_j$, $e = \sum_{l=1}^{s-1} \delta_{h_l h_{l+1}} c_{h_l h_{l+1}}$;

b) compute the orientation relation between $w_j$ and $w_i$: $T(w_j) = (-1)^e \times T(w_i)$;

c) if $T(w_j) = T(w_i)$, then $U = U \cup \{w_j\}$; otherwise $V = V \cup \{w_j\}$;

Step 5. The WSO classification is complete and the results are stored in U and V. If |U| > |V|, U holds the positive words and V the negative words, and vice versa;

News reports and commentary related to the financial market are extracted and their sentiment orientation is determined with the SWSOA algorithm; listed companies judged positive, and their corresponding stocks, are recommended as candidate stocks to watch;

For listed companies judged positive and their corresponding stocks, the final trend prediction is 1;

For listed companies judged negative and their corresponding stocks, the final trend prediction is -1.

Because multi-model fusion increases the computational load and complexity of the overall model, the method further includes a training, optimization and feature selection method based on a wrapper (encapsulation) method built on univariate distribution estimation:

In the univariate distribution estimation algorithm, the data are encoded by an n-dimensional chromosome (n = N + Dim). The chromosome is divided into two parts; the first part encodes the parameters of the wrapped algorithm and consists of N binary bits. An elite retention process is enabled, and the historical optimum stored separately for each stock is used for initialization, which improves search efficiency and keeps the technical analysis process real-time. Let the population q(t), where t is the time variable, be an m×n matrix that records the chromosomes sorted by fitness value Q9 from high to low. The first r rows [q₁ q₂ … q_r]^T are retained as elites, and the estimated probability distribution matrix is P = [P₁ P₂ … P_n], where R is the number of chromosomes participating in mutation. The steps are as follows:

Step 1: Initialize the population q(0), read the historical optimum and add it to the initial population, and set the iteration count t = 0;

Step 2: Sort the chromosomes of population q(t) in order of decreasing fitness, and retain the first r chromosomes as elites;

Step 3: For the first R chromosomes (R is generally greater than r), compute the probability P_i that each column takes the value 1, forming the probability matrix;

Step 4: With probability P_i, randomly generate a 0-1 distributed number as the i-th bit of a chromosome; generate (m − r) chromosomes in this way, set t = t + 1, and combine these chromosomes with the r retained chromosomes into the new population q(t);

Step 5: Check whether t > N; if so, terminate the algorithm; otherwise return to Step 2;
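The loop in Steps 1 to 5 can be sketched as below. This is an illustrative sketch only: the parameter names (`pop_size` for m, `n_elite` for r, `n_select` for R, `max_iter` for the iteration limit, `seed_chromosomes` for the stored historical optima) and the fixed random seed are assumptions introduced here, and `fitness` is any user-supplied function on a bit list.

```python
import random


def umda(fitness, n_bits, pop_size=20, n_elite=2, n_select=10,
         max_iter=30, seed_chromosomes=(), rng=None):
    """Univariate distribution estimation with elite retention (sketch)."""
    rng = rng or random.Random(0)
    # Step 1: start from the stored historical optima, fill up randomly.
    population = [list(c) for c in seed_chromosomes][:pop_size]
    while len(population) < pop_size:
        population.append([rng.randint(0, 1) for _ in range(n_bits)])
    for _ in range(max_iter):                     # Step 5: iteration limit
        # Step 2: sort by decreasing fitness and keep the first r as elites.
        population.sort(key=fitness, reverse=True)
        elite = [c[:] for c in population[:n_elite]]
        # Step 3: per-column probability of a 1 among the first R chromosomes.
        top = population[:n_select]
        probs = [sum(c[i] for c in top) / len(top) for i in range(n_bits)]
        # Step 4: sample (m - r) new chromosomes bit by bit from P.
        children = [[1 if rng.random() < p else 0 for p in probs]
                    for _ in range(pop_size - n_elite)]
        population = elite + children
    return max(population, key=fitness)
```

Because the elites are copied unchanged into every new population, the best fitness never decreases from one iteration to the next.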

The parameter optimization part applies the wrapper and requires the parameters to be encoded and decoded. The binary encoding and decoding of the sub-model weights in the multi-model fusion is as follows:

1) Weight encoding

The initialized weights form a 0/1 code sequence of length n×m, where m is the number of weights to be optimized and n is the number of encoding bits per weight; every weight has the same encoding length;

2) Weight decoding

The weight w_i takes values in w_i ∈ [0,1]. Suppose the encoding of the i-th weight is [q₁ q₂ … q_n], where q_i ∈ {0,1}, i ∈ {1,2,…,n};

then w'_i = q₁×2⁰ + q₂×2¹ + … + q_n×2^(n−1)

w'_i is then mapped to the interval [0,1]:

$w_{i} = \frac{w_{i}^{'}}{2^{n}}$
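As a small worked illustration of the decoding rule, the sketch below splits a bit sequence into groups of n bits and maps each group to a weight. The function name `decode_weights` is hypothetical.

```python
def decode_weights(bits, n):
    """Decode a flat 0/1 sequence into weights, n bits per weight."""
    if len(bits) % n != 0:
        raise ValueError("bit sequence length must be a multiple of n")
    weights = []
    for start in range(0, len(bits), n):
        q = bits[start:start + n]
        # w' = q_1 * 2^0 + q_2 * 2^1 + ... + q_n * 2^(n-1)
        raw = sum(bit << k for k, bit in enumerate(q))
        weights.append(raw / 2 ** n)          # map to [0, 1)
    return weights


print(decode_weights([1, 0, 0, 0, 1, 1, 1, 1], 4))  # [0.0625, 0.9375]
```

Since w'_i is at most 2^n − 1, the decoded weight is always strictly below 1.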

In the wrapper method, the data set formed from the technical information is randomly split into a training set and a test set. The population is decoded, and the phenotype of the feature genes selects a feature subset for training, yielding a sub-model whose parameters have been set. The same phenotype is used to select features on the test set; the feature-selected test set is fed to the trained sub-model for testing, and fitness is evaluated from the classification performance on the test set. If the termination condition of the univariate distribution estimation is met, the optimal sub-model and feature subset are output; otherwise the probability distribution vector is updated according to the univariate distribution estimation algorithm, and encoding is performed again to generate a new population.

The design of the comprehensive investment strategy from the comprehensive market trend prediction is based on the following correction factor:

$r(t) = \begin{cases} 1, & \text{if } P_{s}(t) \times (1 + Q_{s}(t)) > 1 \\ P_{s}(t) \times (1 + Q_{s}(t)), & \text{otherwise} \end{cases},$

where P and Q are the accuracy and the evaluation value on the test set during training. The correction factor r(t) reflects the investor's confidence in the prediction: the higher r(t), the more confident the investor is in the predicted result; otherwise the investor tends to doubt the prediction and keep the original strategy. Let N(t) be the number of shares held at time t and c(t) the cash held, take the closing price cp(t) as the representative price VAL(t), and let Sig(t) be the operation signal. For a single stock, the investment strategy can then be expressed mathematically as:

$Sig(t) = \begin{cases} Buy, & \text{if } Y_{s}(t) > 0 \\ Sell, & \text{if } Y_{s}(t) < 0 \\ Sig(t - 1), & \text{if } Y_{s}(t) = 0 \end{cases};$

The closing price is used as the operating price above, so the final total assets are a(t) = c(t) + N(t) × cp(t);
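The single-stock rules above translate directly into code. The sketch below is illustrative only; the function names and the string signals "Buy" and "Sell" are assumptions made for the example.

```python
def correction_factor(p, q):
    """r(t): confidence in the prediction, capped at 1."""
    return min(1.0, p * (1.0 + q))


def signal(y, previous_signal):
    """Sig(t): buy on a positive forecast, sell on a negative one,
    otherwise keep the previous signal Sig(t - 1)."""
    if y > 0:
        return "Buy"
    if y < 0:
        return "Sell"
    return previous_signal


def total_assets(cash, shares, close_price):
    """a(t) = c(t) + N(t) * cp(t), valued at the closing price."""
    return cash + shares * close_price
```

For instance, an accuracy of 0.8 with an evaluation value of 0.5 gives 0.8 × 1.5 = 1.2, which is capped to r(t) = 1.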

For an investment portfolio, the stock preselection IFA based on network analysis is further taken into account. The expectation of the investment analysis is defined as E(ITA(t,s)) = Y_s(t) × r_s(t), where Y_s is the comprehensive market trend prediction and r_s is the confidence in the result. The transaction amount for stock s can then be allocated according to [formula missing in source], and the corresponding general investment strategy is expressed as:

$\forall s \in IFA(t),$

$Sig(t, s) = \begin{cases} Buy(s, ITA(t, s)), & \text{if } ITA(t, s) > \theta_{+} \\ Sell(s, |ITA(t, s)|), & \text{if } ITA(t, s) < \theta_{-} \\ Sig(t - 1, s), & \text{otherwise} \end{cases},$

①: Sig(t,s) = Buy(s,r); ②: Sig(t,s) = Sell(s,r)

where r = ITA(t,s)
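The portfolio-level rule can be sketched as a loop over the preselected stocks. This is a hedged illustration: the ITA values are taken as given inputs (their allocation formula is not reproduced in this text), and the function name, the dictionary layout and the tuple form `("Buy", amount)` are assumptions.

```python
def portfolio_signals(ifa, ita, previous, theta_plus, theta_minus):
    """Apply the threshold rule to every stock s in IFA(t).

    ifa: iterable of stock identifiers
    ita: dict mapping stock -> ITA(t, s)
    previous: dict mapping stock -> Sig(t - 1, s)
    """
    signals = {}
    for stock in ifa:
        value = ita[stock]
        if value > theta_plus:
            signals[stock] = ("Buy", value)
        elif value < theta_minus:
            signals[stock] = ("Sell", abs(value))
        else:
            # Inside the band [theta_minus, theta_plus]: keep the old signal.
            signals[stock] = previous.get(stock)
    return signals
```

The band between θ₋ and θ₊ acts as a dead zone, so weak forecasts do not trigger trades.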

Compared with the prior art, the present invention has the following beneficial technical effects:

In the stock market investment decision-making method based on network analysis and multi-model fusion provided by the present invention, the text information of fundamental analysis and the numerical information of technical analysis complement each other, giving a more complete and comprehensive use of information. A complex social network model is used to analyze the network structure of the market, the portfolio risk factors ignored in general research are considered from the perspective of diversification, and the real-time requirements of decision-making are met through preselection on the fundamental side to reduce data dimensionality and feature selection on the technical side, so that more reliable investment strategies are provided.

The method models the financial market with a complex social network model to address the neglect of correlations between listed companies in existing methods. Because the correlations between listed companies are taken into account, portfolio risk is effectively reduced, and calculations show that the Sharpe ratio of the selected portfolio improves markedly.

The method lets the text data in the fundamental information and the numerical data from within the market complement each other, further ensuring the completeness of the information used in the analysis.

The method takes the real-time problem under large data volumes into account and is better suited to processing massive data. Because a preselection process and feature selection are used, the data dimensionality is effectively reduced, and the difficulty of guaranteeing real-time performance under large data volumes is overcome.

The method combines multiple models to provide the corresponding investment strategy automatically, for use or reference in actual investment. The multi-model fusion approach allows various analysis and prediction methods to be combined flexibly and trained through the wrapper method for adaptive optimal selection, so existing analysis and prediction methods can be reused.

Description of drawings

Fig. 1 is a schematic diagram of the investment decision-making method of the present invention;

Fig. 2 shows the re-clustering algorithm based on k-means;

Fig. 3 shows the distribution estimation algorithm with elite retention;

Fig. 4 is the basic flowchart of the wrapper method;

Fig. 5 is an overall schematic diagram of the automated stock market investment decision-making system with multi-model fusion;

Fig. 6 shows the simulated investment results for February 2012;

Fig. 7-1 and Fig. 7-2 show the simulated investment results for March 2012 and their growth rate, respectively.

Detailed description of the embodiments

The present invention is described in further detail below with reference to specific embodiments, which explain rather than limit the invention.

Referring to Fig. 1, fundamental information is first crawled from the network, and network nodes and network connections are built on this basis to form the complex social network model. A portfolio is selected by network analysis, and the data involved in the portfolio are then fed into the multi-model fusion framework;

The multi-model fusion framework comprises several sub-models. Each sub-model performs market trend prediction with its own characteristics on technical information of different features crawled from the network and produces its own predicted value. The predicted values are weighted and summed to give the comprehensive market trend prediction, from which the corresponding investment strategy is generated.

The implementation of the methods involved in the invention is introduced step by step below, ending with the implementation of the complete investment strategy system:

1. Construction of the complex social network model

1.1 Network nodes

Here M is the number of features and w_ik is the weight of text feature t_k, computed by the tf*idf method. For a fixed feature set, this simplifies to inf_i = (w_i1, w_i2, ..., w_iM).

The text in the fundamental information (obtained from the network by data mining) is processed as follows:

1) Filtering: remove the useless parts of the information, including punctuation, separators, time attributes, fixed identifiers and spam;

3) Further stop-word processing of the words in the vocabulary, including removing function words such as particles and conjunctions, and binding negation words;

After filtering, word segmentation, stop-word removal and related steps, the text features in the fundamental information are obtained and their weights computed, so that the fundamental information can be organized into the vector space model above. For time-varying information (such as the monthly and annual reports of listed companies), the vector space model becomes a time-varying vector:

1.2 Network connections

The structure of a financial market closely resembles a complex social network: listed companies are like the members of a society, and the links between companies are like the relationships between those members. For the network G(t) = (V(t), E(t)), the fundamental information of the listed companies is used to model them as network nodes, that is, V(t) = {inf_i(t)}, E(t) = {(i, j, edg_ij(t)) | i, j ∈ V(t)}.

V(t) is the set of network nodes built from the fundamental information, and E(t) is the set of node pairs i, j together with the connection strength edg_ij(t) between them;

The dynamic network model represents the internal structure of the market clearly and intuitively, allows market information that changes in real time to be processed automatically, and provides a range of specific methods for use in fundamental analysis.

The network connection strength is computed using cosine similarity, $\cos(\mathrm{inf}_{i}(t), \mathrm{inf}_{j}(t)) = \frac{\sum_{t_{n} \in T_{M}} w_{in}(t) w_{jn}(t)}{\sqrt{\sum_{t_{n} \in T_{M}} w_{in}(t)^{2} \sum_{t_{n} \in T_{M}} w_{jn}(t)^{2}}}$, where T_M is the full set of text features of the fundamental information. A threshold θ is then applied as a filter: $edg_{ij}(t) = \begin{cases} 0, & \cos(\mathrm{inf}_{i}(t), \mathrm{inf}_{j}(t)) < \theta \\ \cos(\mathrm{inf}_{i}(t), \mathrm{inf}_{j}(t)), & \cos(\mathrm{inf}_{i}(t), \mathrm{inf}_{j}(t)) \geq \theta \end{cases}$. θ can be taken as cos 45°, approximately 0.707.
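A minimal sketch of this edge construction is shown below, assuming each company is already represented by a tf*idf weight vector of equal length. The function names and the dictionary layout are hypothetical, and zero-weight edges are simply omitted rather than stored as 0.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_edges(vectors, theta=math.cos(math.radians(45))):
    """Return {(i, j): edg_ij} for all pairs whose similarity is >= theta."""
    names = sorted(vectors)
    edges = {}
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            i, j = names[a], names[b]
            similarity = cosine(vectors[i], vectors[j])
            if similarity >= theta:
                edges[(i, j)] = similarity
    return edges
```

With the default threshold, two companies are linked only when the angle between their feature vectors is at most 45°.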

In this way, the network nodes are described by the vector space model inf_i(t) = (w_i1(t), w_i2(t), ..., w_iM(t)) and built accordingly; the nodes are then connected as specified by V(t) = {inf_i(t)}, E(t) = {(i, j, edg_ij(t)) | i, j ∈ V(t)}, which yields the complex social network model. This model is a dynamic network model.

2. Selecting the investment portfolio by network analysis

To reduce the computation of automated investment analysis substantially while keeping portfolio risk as low as possible, the invention introduces the idea of a diversification strategy: in stock investment, if stocks that are as mutually unrelated as possible are chosen to form the portfolio, the risk of that portfolio will be lower than the average risk. Two implementations are given below:

2.1 Diversity partitioning based on community detection clustering

The invention partitions the network with a community detection method, because clustering ensures that similarity distances within a community are small while those between communities are large. Specifically, the Girvan-Newman clustering method can be used for network clustering; its evaluation index is the modularity [formula missing in source], with t the time variable. Here e_ij(t) is the fraction of the total connection weight carried by connections between network nodes in community i and community j, and a_i(t) is the fraction of the total connection weight carried by all connections associated with network nodes in community i, covering both connections with both endpoints inside the community and connections with only one endpoint inside it. Modularity describes how much the partitioned complex social network model differs from a random network model, and the community partition with the largest modularity is taken as the best partition. The basic process of the Girvan-Newman clustering method is as follows:

Step 4: If no network connections remain, terminate the algorithm; otherwise return to Step 2.
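Since the modularity formula itself is not reproduced in this text, the sketch below assumes the standard weighted Newman-Girvan definition, Q = Σ_i (e_ii − a_i²), in which an edge between two communities contributes half of its weight to each. Both that choice and the function name are assumptions, not the patent's stated formula.

```python
def modularity(edges, community_of):
    """Weighted modularity Q = sum_i (e_ii - a_i^2) of a partition.

    edges: list of (u, v, weight) for an undirected network
    community_of: dict mapping node -> community label
    """
    total = sum(weight for _, _, weight in edges)
    internal = {}   # weight of edges with both endpoints in the community
    attached = {}   # half the weight of every edge end attached to it
    for u, v, weight in edges:
        cu, cv = community_of[u], community_of[v]
        attached[cu] = attached.get(cu, 0.0) + weight / 2
        attached[cv] = attached.get(cv, 0.0) + weight / 2
        if cu == cv:
            internal[cu] = internal.get(cu, 0.0) + weight
    return sum(internal.get(c, 0.0) / total - (attached[c] / total) ** 2
               for c in attached)
```

For two triangles joined by a single edge and split into the two triangles, this gives Q = 2 × (3/7 − 0.25) = 5/14.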

Finally, one representative is chosen from each network community com(t) of the optimal partition to form the required diversified portfolio. The choice can be based on an importance index rep_i(t) of the network nodes (one of degree, closeness, betweenness, centrality and so on; closeness is recommended). The strategy model by which fundamental analysis selects the portfolio can then be expressed as:

$IFA(t) = \{ n \mid \forall com(t), n = \arg {\max_{i \in com(t)}}^{N} (rep_{i}(t)) \}$
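The selection rule amounts to an argmax within each community. The short sketch below assumes the importance scores rep_i(t) have already been computed; the function name and data layout are hypothetical.

```python
def select_representatives(communities, rep):
    """Pick, from each community, the node with the highest importance score.

    communities: list of lists of node identifiers
    rep: dict mapping node -> importance score rep_i(t)
    """
    return [max(members, key=lambda node: rep[node])
            for members in communities]


portfolio = select_representatives(
    [["A", "B"], ["C"]], {"A": 0.2, "B": 0.9, "C": 0.1})
print(portfolio)  # ['B', 'C']
```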

The portfolio selection above can also be implemented as follows:

2.2 Diversity partitioning based on the maximum fully connected unrelated subnetwork:

Note that clustering attempts, but does not guarantee, to maximize the differences between classes. The invention therefore designs another diversification method: finding the maximum fully connected subnetwork of the complement of the complex social network model. Because the complex social network G(t) = (V(t), E(t)) is modeled on the correlations between listed companies, its complement network represents exactly the uncorrelated relations between the corresponding stocks. A fully connected subnetwork chosen from it guarantees strong mutual independence among the stocks of the resulting portfolio, and the maximum such subnetwork gives the greatest coverage of the original market. The invention uses the Bron-Kerbosch algorithm to extract the maximum fully connected subnetwork (maximum clique) of the complement network; its basic form is a recursive backtracking search, with the following flow:

Bron-Kerbosch algorithm:

Step 1: Given three sets (R, P, X), initialize R and X as empty sets and P as the set of all network nodes (here, the set of network nodes in the complement network);

2) Remove network node {v} from set P and add network node {v} to set X.
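The basic (non-pivoting) form of the Bron-Kerbosch search on the complement network can be sketched as follows. The helper names are hypothetical, and the input is assumed to be the node list plus the list of correlated pairs (the edges of the original network).

```python
def complement(nodes, edges):
    """Adjacency sets of the complement network: link every uncorrelated pair."""
    linked = {frozenset(edge) for edge in edges}
    return {v: {u for u in nodes
                if u != v and frozenset((u, v)) not in linked}
            for v in nodes}


def bron_kerbosch(adj, r=frozenset(), p=None, x=frozenset()):
    """Yield every maximal clique of the graph given by adjacency sets."""
    if p is None:
        p = frozenset(adj)          # initially P holds all nodes, R and X are empty
    if not p and not x:
        yield r                     # R cannot be extended: maximal clique
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}                 # remove v from P ...
        x = x | {v}                 # ... and add it to X


def max_independent_portfolio(nodes, edges):
    """Largest set of mutually uncorrelated stocks (maximum clique of the complement)."""
    return max(bron_kerbosch(complement(nodes, edges)), key=len)
```

The search is exponential in the worst case, so this basic form is practical only for moderately sized networks.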

In this case the strategy model by which fundamental analysis selects the portfolio can be expressed as:

From the clustering method and the properties of the complement network, implementation 2.1 emphasizes the physical meaning of the selection, while implementation 2.2 emphasizes the uncorrelatedness of the partition. Either can be chosen according to practical needs; in general, implementation 2.2 can be used by default. Note also that selecting stocks through network analysis greatly reduces the number of stocks to follow, shrinks the data volume and improves the efficiency of the subsequent analysis.

3. Multi-model fusion framework based on technical information

Given the volume, complexity and nonlinearity of financial market data, a single modeling approach may be unable to describe the system accurately. The usual analysis compares several algorithms, adopts the prediction algorithm that performs relatively best and ignores the rest, which discards a large amount of useful information. In view of this, the invention adopts multi-model fusion modeling. The basic idea is to decompose a complex system into several subsystems, each corresponding to one sub-model, and then combine the sub-models to describe the same model jointly and improve the model fit.

The sub-models in the invention are connected by weighted summation: the output of each sub-model is summed with a given weight to obtain the final output. The weights can usually be obtained by optimization with a genetic algorithm or an artificial neural network. Unlike the "switching" approach, in the "weighted summation" approach the output of every sub-model contributes to the result. This connection is less sensitive to the data classifier, but the computation and complexity of the overall model increase considerably. The connection framework is shown in Figure 2. Let the output of each sub-model at time t be y_i(t) and the trained weight be u_i; the final output is the sum of the outputs y_i(t) weighted by u_i. Several usable sub-models are given below:
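The weighted-summation connection can be written in a few lines. The sketch below is illustrative; the function name `fuse` is hypothetical, and the weights are assumed to have been trained beforehand.

```python
def fuse(predictions, weights):
    """Weighted sum of the sub-model outputs y_i(t) with weights u_i."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per sub-model")
    return sum(u * y for u, y in zip(weights, predictions))


# Three sub-models predicting up (1), down (-1) and a mild rise (0.5).
print(fuse([1, -1, 0.5], [0.5, 0.25, 0.25]))  # 0.375
```

A positive fused value corresponds to a predicted rise and a negative one to a predicted fall, consistent with the sign-based trading rule described earlier.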

3.1 Trend prediction based on vector symbol sequences

First, the historical stock price data are vectorized by least-squares fitting. Let y_i be the price on day x_i; minimizing the n-day error $S(a, b) = \sum_{i=0}^{n} (y_{i} - f(x_{i}))^{2}, \quad f(x_{i}) = a x_{i} + b,$ yields a trend characterized by the slope, defined as:

To further discretize the continuous trend vectors and extract macro trend information, an unsupervised clustering method can be used. Given the particular nature of the vectorized data, the k-means based re-clustering algorithm shown in Figure 3 is applied, giving the clustering result and the center vector of each class.
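The least-squares vectorization that precedes the clustering can be sketched as below, using the closed-form solution for the line f(x) = ax + b. The sliding-window layout and the function names are assumptions made for illustration; the clustering step itself is not shown.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit; returns (slope a, intercept b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def trend_slopes(prices, window):
    """Slope of the fitted line over each consecutive window of prices."""
    days = range(window)
    return [fit_line(days, prices[k:k + window])[0]
            for k in range(len(prices) - window + 1)]
```

Each slope summarizes one window as a single trend value, which is what the subsequent clustering groups into classes such as rise, flat or fall.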

The clustering algorithm shown in Figure 3 automatically handles the abnormal price jumps that ex-rights and ex-dividend events sometimes cause in the securities market, and can therefore effectively remove the influence of such sudden events.

Each trend vector is replaced by the vector of the cluster center it belongs to, which represents the trend type (for example: sharp rise, rise, flat, fall, sharp fall). Representing these types as symbols, the continuously varying trend vectors are finally discretized into a symbol sequence. Technical analysis mainly uses a stock's trading day, opening price, highest price, lowest price, closing price, trading volume and turnover, so the history of stock S can be written as the symbol sequence {S_i}, $S_{i} = \{s_{i}^{d}, s_{i}^{o}, s_{i}^{h}, s_{i}^{l}, s_{i}^{c}, s_{i}^{v}, s_{i}^{t}\}$, where $s_{i}^{d}, s_{i}^{o}, s_{i}^{h}, s_{i}^{l}, s_{i}^{c}, s_{i}^{v}, s_{i}^{t}$ are the trading day, opening price, highest price, lowest price, closing price, trading volume and turnover of the stock on day i. For example, if the closing price is taken as the final trading price and technical analysis uses only the preceding fixed-length m days of data, the training data format is [formula missing in source]

Then, to automate the technical analysis, the invention uses a support vector machine (SVM) model with a radial basis function (RBF) kernel to learn the symbol model of the historical stock price trend. SVM theory seeks the model with the strongest generalization ability, so after the support vectors (SV) are found, the optimal classification hyperplane (OHP) must also be found; solving for it is a quadratic programming problem.

Let the SVs be the sample points closest to the OHP: (l·x) + b = 0. SVs of the same class are exactly equidistant from the OHP, while SVs of different classes are not necessarily equidistant from it.

For m training samples (x₁,y₁), (x₂,y₂), ..., (x_m,y_m), finding the classification hyperplane comes down to finding the coefficients l and b. Because support vector machine theory requires the classification hyperplane to have good properties (small classification error and strong generalization ability), the hyperplane must satisfy the conditions of the optimal classification hyperplane:

To find the optimal classification hyperplane, optimization theory allows the original problem to be converted, by means of the Lagrange function, into a standard quadratic programming problem:

and the optimal classification hyperplane is:

Usually the sample points with α_i > 0 are the support vectors. The quadratic programming problem above can be solved by classical methods such as the active-set method, the dual method and interior-point algorithms. The trained support vector machine automatically classifies new data and outputs the symbol of the classification result (for example: sharp rise, rise, flat, fall, sharp fall); this symbol corresponds to the slope of the original cluster center vector, which is the predicted value of the final trend.

3.2 Trend prediction algorithm based on turning point extraction from stock time series

Time series data typically show frequent short-term fluctuations, heavy noise and non-stationarity, yet in many applications only certain key points of the shape matter; these are the fluctuation feature points of the time series. Building on a formal definition of fluctuation feature points, the invention introduces an extraction algorithm. Given a time series X = {x₁, x₂, …, x_N}, the fluctuation feature point x of X within its time interval is {x | VD(x) = max(VD(x_i)), i = 1, 2, …, N}, where

$VD(x_i) = \left| x_i - \left( x_1 + (x_N - x_1) - \frac{i - 1}{N - 1} \right) \right|$

The steps of the time series fluctuation feature point extraction algorithm are as follows:

Step 1: Input the start coordinate start and the end coordinate end of the sequence to be processed. Check whether the distance between start and end is less than the minimum interval length. If it is, go to Step 3. Otherwise, find the point with the largest VD value between the start and end points, following the definition of a fluctuation feature point. If that VD exceeds the amplitude termination threshold of the algorithm, add the point to the result sequence of fluctuation feature points and denote it fp.

Step 2: Use fp to split the original sequence into two sub-segments, start to fp and fp to end, and apply Step 1 to each of them.

Step 3: Sort the result sequence of fluctuation feature points by time, then save and output it.

Step 4: Using the minimum time interval threshold for turning-point determination, extract the turning-point seed set from the result sequence of fluctuation feature points by the maximum/minimum principle.

Step 5: Starting from the turning-point seed set, search between every two consecutive turning points, within the fluctuation feature point sequence, for a new turning point by the principle of maximum reverse amplitude, and add it to the seed set. Repeat until no new turning point can be found under the configured extraction parameters, that is, until no candidate satisfies the preset thresholds on the time interval and fluctuation amplitude between the two points.
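Steps 1 to 3 can be sketched in Python as a recursive split. This is a minimal illustration, not the patented implementation. It assumes VD is the vertical distance from a point to the straight line joining the two ends of the current sub-sequence, that is, that the minus sign printed before the fraction in the VD formula stands for a multiplication. The names `extract_feature_points`, `min_len`, and `min_vd` are hypothetical.

```python
def vertical_distance(xs, start, end, i):
    """Distance of xs[i] from the chord joining xs[start] and xs[end].

    Assumption: the chord is x_start + (x_end - x_start) * (i - start) / (end - start).
    """
    chord = xs[start] + (xs[end] - xs[start]) * (i - start) / (end - start)
    return abs(xs[i] - chord)


def extract_feature_points(xs, min_len=2, min_vd=0.0):
    """Return the sorted indices of fluctuation feature points (Steps 1 to 3)."""
    found = []

    def split(start, end):
        # Step 1: stop when the sub-sequence is shorter than the minimum length
        # (at least one interior point is needed to search for a maximum).
        if end - start < max(min_len, 2):
            return
        fp = max(range(start + 1, end),
                 key=lambda i: vertical_distance(xs, start, end, i))
        # Keep the point only if its VD exceeds the amplitude threshold.
        if vertical_distance(xs, start, end, fp) <= min_vd:
            return
        found.append(fp)
        # Step 2: apply Step 1 to start..fp and fp..end.
        split(start, fp)
        split(fp, end)

    split(0, len(xs) - 1)
    return sorted(found)  # Step 3: order the result by time
```

For the series `[0, 1, 5, 1, 0]` the peak at index 2 is found first; whether its neighbours are also kept depends on `min_vd`.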

The turning points extracted above characterize the price trend on a larger time scale. They can be used to train a machine learning method such as a neural network or a support vector machine to obtain an optimal classifier. The classifier then automatically classifies new data and outputs whether the corresponding date is a turning point, which is the prediction of the future trend. The trend prediction associated with the most recent turning point is taken as the final trend prediction.

3.3 Investment recommendation algorithm based on word sentiment orientation

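For any word w_i ∈ W, define:

$T(w_i) \,\hat{=}\, \begin{cases} 1, & \text{if } w_i \text{ is positive}, \\ -1, & \text{if } w_i \text{ is negative}. \end{cases}$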

T(w_i) is called the sentiment orientation value of w_i. Given a sentiment orientation relation network SORN = (W, C, Q), the sequence of word indices along a path from w_i to w_j in SORN is written (p₁, p₂, …, p_s), where 2 ≤ s ≤ n, p₁ = i, and p_s = j. If an element (h₁, h₂, …, h_s) of the set {(p₁, p₂, …, p_s)} satisfies $q_{h_1 h_2} + q_{h_2 h_3} + \cdots + q_{h_{s-1} h_s} = \min \{ q_{p_1 p_2} + q_{p_2 p_3} + \cdots + q_{p_{s-1} p_s} \},$ then (h₁, h₂, …, h_s) is called the sequence of word indices along a least-noise path from w_i to w_j in SORN, denoted D(i, j). The sequence (D(i,1), …, D(i,i-1), D(i,i+1), …, D(i,n)) is denoted D(i). D(i) can be computed with Dijkstra's algorithm implemented with adjacency lists and a Fibonacci heap.

Clearly, the sentiment orientation relation between two words should be determined along the path between them that carries the least noise. When the corpus is sufficiently large, D(i, j) can be obtained from Q, and the sentiment orientation relation between w_i and w_j can then be computed from C. In addition, positive words generally outnumber negative words in a corpus. On this basis the SWSOA algorithm is proposed: starting from any node of the largest connected subgraph of SORN, it classifies the sentiment orientation of every node in that subgraph. The input of SWSOA is a concrete SORN. For convenience the algorithm is written as the function SWSOA(SORN), where SORN denotes the SORN variable. Its steps are as follows:

Step 1. Use breadth-first traversal to obtain the largest connected subgraph Gs of SORN. The set of word nodes contained in Gs is denoted W_GS.

Step 3. Compute D(i) to obtain the least-noise path from w_i to every node in W_GS.

Step 4. For each word node w_j in W_GS, perform a) to c):

a) From D(i, j) in D(i), compute the number e of turning-relation edges on the least-noise path from w_i to w_j: $e = \sum_{l=1}^{s-1} \delta_{h_l h_{l+1}} c_{h_l h_{l+1}}.$

b) Compute the orientation relation between w_j and w_i: T(w_j) = (-1)^e × T(w_i).

c) If T(w_j) = T(w_i), then U = U ∪ {w_j}; otherwise V = V ∪ {w_j}.

Step 5. The WSO classification is complete, and the results are stored in U and V. If |U| > |V|, U holds the positive words and V the negative words; otherwise the roles are reversed.
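The SWSOA steps above can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the network is given as a symmetric adjacency dictionary `adj[u][v] = (noise, flips)`, where `noise` plays the role of q and `flips` is 1 for a turning-relation edge and 0 otherwise (standing in for the product δ·c). The sketch uses the binary heap from Python's `heapq` in place of a Fibonacci heap. The function names are hypothetical.

```python
import heapq
from collections import deque


def largest_component(adj):
    """Breadth-first search for the largest connected set of nodes (Step 1)."""
    seen, best = set(), set()
    for root in adj:
        if root in seen:
            continue
        comp, queue = {root}, deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def swsoa(adj):
    """Split the largest component into (positive, negative) word sets."""
    nodes = largest_component(adj)
    start = min(nodes)  # any node works; min() keeps the run deterministic
    # Dijkstra over noise, carrying the parity of turning edges on each path.
    dist, parity = {start: 0.0}, {start: 0}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, (noise, flips) in adj[u].items():
            nd = d + noise
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parity[v] = (parity[u] + flips) % 2
                heapq.heappush(heap, (nd, v))
    same = {w for w in nodes if parity[w] == 0}  # T(w_j) == T(w_i)
    other = nodes - same
    # The larger set is taken to be the positive one (Step 5).
    return (same, other) if len(same) > len(other) else (other, same)
```

Tracking only the parity of e is enough, because T(w_j) = (-1)^e × T(w_i) depends on whether e is even or odd.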

News reports and commentary related to the financial market are extracted, and the SWSOA algorithm is used to determine their sentiment orientation. Listed companies judged positive, and their corresponding stocks, are the candidates recommended for attention.

Note that these sub-models can be added, removed, or replaced. After training on the basic data, the overall model adaptively selects among the prediction models by adjusting the weights of its sub-models.

4. Training, optimization, and feature selection based on a univariate distribution estimation wrapper (encapsulation) method

Multi-model fusion makes the computation and complexity of the overall model very large, and the massive data contain much redundancy and noise, so the time complexity of the computation must be considered. For big-data engineering the main approaches are improving hardware performance, optimizing algorithm performance, and reducing data dimensionality. The invention therefore proposes a wrapper method for feature selection and reduction; the same method is also used to adaptively train the models in the system and optimize their parameters.

The invention uses a univariate distribution estimation algorithm for wrapper training. The univariate distribution estimation algorithm is a kind of genetic algorithm. Each candidate is encoded as an n-dimensional chromosome (n = N + Dim) consisting of two parts. The first part encodes the parameters of the wrapped algorithm, using N binary bits in total. Taking an SVM classifier as an example, N = N_c + N_g, where N_c is the number of bits encoding the SVM parameter C and N_g is the number of bits encoding the SVM parameter γ. The second part is the feature selection part; its length equals the dimensionality of the data to be classified (Dim). A 1 means the feature at the corresponding position of the data set is kept, and a 0 means it is discarded.

The invention also redesigns the univariate distribution estimation algorithm: an elite retention step is enabled, and the historical best value stored separately for each stock is used for initialization, which improves search efficiency and keeps the technical analysis real-time. Let the population q(t), with t the time variable, be an m × n matrix whose rows are chromosomes sorted from high to low by the fitness value Q9 (note that Q9 is an evaluation criterion that does not depend on the data distribution). The first r rows [q₁ q₂ … q_r]^T are retained as elites, and the estimated distribution probability matrix is P = [P₁ P₂ … P_n], where R is the number of chromosomes that take part in mutation. The redesigned distribution estimation algorithm is shown in Figure 4, and its steps are as follows:

Step 2: Sort the chromosomes of the population q(t) in decreasing order of fitness and retain the first r chromosomes as elites;

Step 5: Check whether t > N. If so, end the algorithm; otherwise return to Step 2.
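For orientation, a generic univariate distribution estimation loop with elite retention can be sketched as below. This is a textbook-style sketch, not the redesigned algorithm of Figure 4: the steps not reproduced above are filled with the usual choices (random initialization, per-bit probabilities estimated from the best rows, sampling of new chromosomes), and every parameter name (`pop_size`, `n_elite`, `n_select`, `generations`) is an assumption.

```python
import random


def umda(fitness, n_bits, pop_size=30, n_elite=2, n_select=10,
         generations=40, seed=0):
    """Maximize fitness over 0/1 lists of length n_bits."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Sort by decreasing fitness and keep the first n_elite rows as elites.
        pop.sort(key=fitness, reverse=True)
        elites = [row[:] for row in pop[:n_elite]]
        # Estimate one independent probability per bit from the best rows.
        parents = pop[:n_select]
        probs = [sum(row[j] for row in parents) / n_select
                 for j in range(n_bits)]
        # Sample the rest of the next population from those probabilities.
        children = [[1 if rng.random() < p else 0 for p in probs]
                    for _ in range(pop_size - n_elite)]
        pop = elites + children
    return max(pop, key=fitness)
```

Because the elites are copied unchanged into each new population, the best fitness found never decreases from one generation to the next.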

The parameter optimization part is the wrapped application, so the parameters must be encoded and decoded. The binary encoding and decoding method is illustrated below using the weights of the sub-models in the multi-model fusion:

1) Weight encoding

When the sub-model weights are optimized, they must be encoded before they can take part in the distribution estimation computation. The encoding is as follows: the initial weights form a 0/1 sequence of length n × m, where m is the number of weights to optimize and n is the number of bits used to encode each weight; every weight has the same code length.

2) Weight decoding

The optimization result returned by the wrapper method must pass through a decoding process that mirrors the encoding in order to yield the actual weights. The decoding is as follows: each weight w_i takes values in [0, 1]; suppose the code of the i-th weight is [q₁ q₂ … q_n], where q_i ∈ {0, 1}, i ∈ {1, 2, …, n}; then

$w_i = \frac{w_i'}{2^n}$
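A minimal decoding sketch follows. It assumes that w_i' is the integer value of the n-bit code [q₁ q₂ … q_n] with q₁ as the most significant bit; that definition is an assumption made here for illustration, and `decode_weights` is a hypothetical name.

```python
def decode_weights(bits, n):
    """Split a 0/1 sequence into n-bit groups and map each group to [0, 1)."""
    if n <= 0 or len(bits) % n != 0:
        raise ValueError("length of bits must be a positive multiple of n")
    weights = []
    for k in range(0, len(bits), n):
        value = 0
        for q in bits[k:k + n]:  # q_1 is treated as the most significant bit
            value = value * 2 + q
        weights.append(value / 2 ** n)  # w_i = w_i' / 2^n
    return weights
```

With this divisor the largest decodable weight is (2^n - 1) / 2^n, slightly below 1; for n = 4 the code 1111 decodes to 0.9375.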

Finally, the basic flow of the wrapper method is shown in Figure 5. The data set formed from the technical information is randomly split into a training set and a test set. The population is decoded, and the feature part of each decoded chromosome selects the feature subset used for training, yielding a sub-model with its parameters set. The same feature part is applied to the test set, and the feature-selected test set is fed to the trained sub-model for testing. Fitness is evaluated from the classification performance on the test set. If the termination condition of the univariate distribution estimation is met, the optimal sub-model and feature subset are output; otherwise the probability distribution vector is updated according to the univariate distribution estimation algorithm and a new population is generated by encoding again.

The training set data are used to train the different candidate models with the univariate distribution estimation algorithm, and the test set data are used for 5-fold cross-validation. The best output is selected as the optimal sub-model and feature subset, so that training and parameter optimization of the sub-models in the multi-model fusion framework are carried out adaptively.

5. Designing a comprehensive investment strategy from the comprehensive market trend prediction

To formulate an investment strategy for technical analysis, the invention defines a correction factor:

Here P and Q are, respectively, the accuracy and the evaluation value on the test set during training. The correction factor r(t) reflects the investor's confidence in a given prediction: the higher r(t), the more confident the investor is in the prediction; conversely, the investor tends to doubt the prediction and keep the original strategy. Let N(t) be the number of shares held at time t and c(t) the cash held, take the closing price cp(t) as the price VAL(t), and let Sig(t) be the operation signal. For a single stock, the investment strategy can then be expressed mathematically as:

Since the closing price is used as the operating reference, the final total assets are a(t) = c(t) + N(t) × cp(t).

For the portfolio, the stock pre-selection IFA based on network analysis is also taken into account. The expectation of the investment analysis is defined as E(ITA(t, s)) = Y_s(t) × r_s(t), where Y_s is the comprehensive market trend prediction and r_s is the confidence in the result. The transaction amount for stock s can then be allocated accordingly, and the corresponding general investment strategy can be expressed as:

$\forall s \in IFA(t),$

where r = ITA(t, s)

The invention models the financial market with a complex social network model and takes the correlation between listed companies into account, which effectively reduces portfolio risk. Calculations show that the Sharpe index of the selected portfolio improves markedly, as shown in Table 1 below;

Table 1. Comparison of portfolio Sharpe indices

Because the pre-selection process and feature selection effectively reduce the dimensionality of the data, the invention overcomes the difficulty of guaranteeing real-time performance on large data volumes and is better suited to processing massive data;

By using multi-model fusion, the invention can flexibly combine various analysis and prediction methods and, through wrapper training, achieve adaptive optimized selection. It is compatible with and makes use of existing analysis and prediction methods, and automatically provides the corresponding investment strategies for use or reference in actual investment.

The proposed system was simulated on actual data from the Shanghai Stock Exchange A-share market, using each stock's trading day, opening price, highest price, lowest price, closing price, trading volume, and turnover, together with the business scope and quarterly reports of the listed companies. All investment operation strategies for a two-month period were generated within 30 minutes, which is sufficient for the real-time requirement. The simulated investment returns are shown in Figure 6.

Figure 6 shows the investment results for February 2012. The blue line (IND%) is the change in the Shanghai A-share index (the overall market trend), the red line (NP%) is the return without portfolio pre-selection by network analysis, and the green line (P%) is the return with portfolio pre-selection by network analysis. Operating according to the investment strategy provided by the system may yield excess profit, and the pre-selection by network analysis may contribute to the excess return.

Figure 7 shows the investment results for March 2012 and illustrates how network analysis reduces risk in a downtrend. The final return of the portfolio with pre-selection is higher than without it (left panel). More importantly, in periods of negative growth (falling periods, right panel), the growth rate with pre-selection (green line) is clearly better than without network-analysis pre-selection (red line). Diversity-based pre-selection can therefore curb losses during declines to some extent, which reduces investment risk.

Claims

1. A stock market investment decision-making method based on network analysis and multi-model fusion, characterized in that it comprises the following operations:

First, fundamental information is captured from the network; on this basis network nodes and network connections are constructed and a complex social network model is built. An investment portfolio is selected by network analysis, and the data related to the portfolio are then input into a multi-model fusion framework;

The multi-model fusion framework comprises a plurality of sub-models. Each sub-model performs market trend prediction of a different character on technical information with different features captured from the network and generates its own predicted value. The predicted values are combined by weighted summation into a comprehensive market trend prediction, from which the corresponding investment strategy is generated;

The feature selectors that supply information to the sub-models, the parameters of the sub-models, and the weights of the sub-model predicted values are all trained by wrapping them with the univariate distribution estimation algorithm.

2. The stock market investment decision-making method based on network analysis and multi-model fusion according to claim 1, characterized in that constructing the complex social network model comprises the following operations:

1.1) Network nodes

In the vector space model, the fundamental information text captured from the network is represented as a bag of words in the form of a vector of (feature, weight) pairs, as follows:

inf_i = (<t₁, w_i1>, <t₂, w_i2>, ..., <t_M, w_iM>)

where M is the number of features and w_ik is the weight of text feature t_k, computed by the tf*idf method. For a fixed set of features this is abbreviated to inf_i = (w_i1, w_i2, ..., w_iM);

The text in the fundamental information obtained from the network by data mining is processed as follows:

1) Filtering: remove the useless parts of the information;

2) Word segmentation: split the filtered information into words, store the segmentation result in a lexicon, and tag the part of speech of each word;

3) Stop-word processing: further process the words in the lexicon, including removing function words and binding negation words;

After the text features of the fundamental information are obtained, their weights are computed and the fundamental information is organized into a vector space model. For time-varying information, this vector space model becomes a time-varying vector:

inf_i(t) = (w_i1(t), w_i2(t), ..., w_iM(t)), where t is the time variable;

1.2) Network connections

For a network G(t) = (V(t), E(t)) whose nodes are modeled by the fundamental information of listed companies, V(t) = {inf_i(t)} and E(t) = {(i, j, edg_ij(t)) | i, j ∈ V(t)};

V(t) is the set of network nodes constructed from the fundamental information, and E(t) is the set of node pairs i, j together with the connection strength edg_ij(t) between them;

The connection strength is computed with the cosine similarity

$\cos(\mathrm{inf}_i(t), \mathrm{inf}_j(t)) = \frac{\sum_{t_n \in T_M} w_{in}(t)\, w_{jn}(t)}{\sqrt{\sum_{t_n \in T_M} w_{in}(t)^2 \sum_{t_n \in T_M} w_{jn}(t)^2}}$

where T_M is the full set of text features of the fundamental information; filtering with a threshold θ gives

$edg_{ij}(t) = \begin{cases} 0, & \cos(\mathrm{inf}_i(t), \mathrm{inf}_j(t)) < \theta \\ \cos(\mathrm{inf}_i(t), \mathrm{inf}_j(t)), & \cos(\mathrm{inf}_i(t), \mathrm{inf}_j(t)) \ge \theta \end{cases},$

θ is taken as cos 45°;

The network nodes are described by the vector space model inf_i(t) = (w_i1(t), w_i2(t), ..., w_iM(t)) and constructed accordingly;

The network nodes are then connected according to V(t) = {inf_i(t)}, E(t) = {(i, j, edg_ij(t)) | i, j ∈ V(t)}, forming the complex social network model, which is a dynamic network model.
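As an illustration of operation 1.2), the sketch below builds the thresholded connections for one time instant from given weight vectors. It is a sketch only: the tf*idf weights are assumed to be already computed, and `cosine`, `build_edges`, and the input layout (a dictionary from company name to weight vector) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_edges(vectors, theta=math.cos(math.radians(45))):
    """Return {(i, j): edg_ij} for pairs whose similarity is at least theta."""
    names = sorted(vectors)
    edges = {}
    for a, i in enumerate(names):
        for j in names[a + 1:]:
            sim = cosine(vectors[i], vectors[j])
            if sim >= theta:  # pairs below the threshold get no connection
                edges[(i, j)] = sim
    return edges
```

Pairs whose similarity falls below θ are simply left out of the dictionary, which corresponds to edg_ij(t) = 0.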

3. The stock market investment decision-making method based on network analysis and multi-model fusion according to claim 1, characterized in that selecting the investment portfolio by network analysis means selecting the stocks that are most independent of one another to form the portfolio, and comprises the following diversity partitioning method based on community-detection clustering:

The network is partitioned by community detection, using the Girvan-Newman clustering method, whose evaluation index is the modularity

$Q(t) = \sum_{i} \left( e_{ii}(t) - a_i^2(t) \right),$

where t is the time variable;

Here e_ij(t) is the fraction of the total connection weight carried by connections between nodes in community i and nodes in community j, and a_i(t) is the fraction of the total connection weight carried by all connections attached to nodes in community i, covering both the case where both ends of a connection lie inside the community and the case where only one end does;

Modularity describes how much the partitioned complex social network model differs from a random network model; the community partition with the largest modularity is taken as the optimal partition;

The basic procedure of the Girvan-Newman clustering method is as follows:

Step 1: compute the betweenness of all network connections;

Step 2: remove the network connection with the highest betweenness;

Step 3: recompute the betweenness of the network connections affected by Step 2;

Step 4: if no network connections remain, end the algorithm; otherwise return to Step 2;

Finally, one representative is selected from each community com(t) of the optimal partition to form the required diversified portfolio;

The strategy model by which fundamental analysis selects the portfolio is then expressed as:

$IFA(t) = \{\, n \mid \forall\, com(t),\ n = \arg\max^{N}_{i \in com(t)} (rep_i(t)) \,\}.$
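The modularity index used in claim 3 to compare candidate partitions can be computed as sketched below. This is an illustration under assumptions: connections are given as an undirected weighted dictionary `{(u, v): weight}`, each connection end contributes half of the connection weight to the a_i of its community (the usual convention for this index), and the function name is hypothetical. The Girvan-Newman removal loop itself is not shown.

```python
def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for edges {(u, v): weight} and {node: label}."""
    total = sum(edges.values())
    if total == 0:
        return 0.0
    inside, attached = {}, {}
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        # Each end of a connection adds half of its weight to that community.
        attached[cu] = attached.get(cu, 0.0) + w / 2
        attached[cv] = attached.get(cv, 0.0) + w / 2
        if cu == cv:
            inside[cu] = inside.get(cu, 0.0) + w
    return sum(inside.get(c, 0.0) / total - (a / total) ** 2
               for c, a in attached.items())
```

For two triangles joined by a single connection of equal weight, splitting the nodes into the two triangles gives Q = 5/14, while putting all nodes in one community gives Q = 0.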

4. The stock market investment decision-making method based on network analysis and multi-model fusion according to claim 1, characterized in that selecting the investment portfolio by network analysis means selecting the stocks that are most independent of one another to form the portfolio, and comprises the following diversity partitioning method based on the maximum fully connected uncorrelated subnetwork:

The Bron-Kerbosch algorithm is used to extract the maximum fully connected subnetwork in the complement network; its basic form is a recursive backtracking search, as follows:

Bron-Kerbosch algorithm:

Step 1: given three sets (R, P, X), initialize R and X as empty and P as the set of all network nodes;

Step 2: if P and X are both empty, output R as a clique;

Step 3: for each network node {v} taken from P, do the following:

1) add node {v} to R, intersect P and X each with the neighbor set N{v} of node {v}, and then recurse on the resulting R, P, X (go to Step 2);

2) delete node {v} from P and add node {v} to X;

The strategy model by which fundamental analysis selects the portfolio is then expressed as:

the maximum fully connected subnetwork in the complement network generated by the Bron-Kerbosch algorithm.
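The basic (non-pivoting) Bron-Kerbosch recursion of claim 4 can be sketched as follows. The sketch assumes the network is given as a dictionary from node to the set of its neighbors, builds the complement network, and returns the largest clique found there. The helper names are hypothetical, and ties between equally large cliques are broken arbitrarily.

```python
def bron_kerbosch(r, p, x, neighbors, out):
    """Append every maximal clique of the graph to out (Steps 2 and 3)."""
    if not p and not x:
        out.append(set(r))  # Step 2: R is a maximal clique
        return
    for v in list(p):
        # Step 3.1: extend R with v and restrict P and X to neighbors of v.
        bron_kerbosch(r | {v}, p & neighbors[v], x & neighbors[v],
                      neighbors, out)
        # Step 3.2: move v from P to X.
        p = p - {v}
        x = x | {v}


def complement(neighbors):
    """Connect exactly those node pairs that are not connected originally."""
    nodes = set(neighbors)
    return {v: nodes - neighbors[v] - {v} for v in nodes}


def most_independent_group(neighbors):
    """Largest set of mutually unconnected nodes (largest complement clique)."""
    comp = complement(neighbors)
    cliques = []
    bron_kerbosch(set(), set(comp), set(), comp, cliques)  # Step 1
    return max(cliques, key=len)
```

In a star-shaped network, for example, the leaves are pairwise unconnected, so they form the largest clique of the complement network.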

5. The stock market investment decision-making method based on network analysis and multi-model fusion according to claim 1, characterized in that the multi-model fusion framework decomposes a complex system into several subsystems, each corresponding to one sub-model, and then combines the sub-models of these subsystems to describe the same model jointly so as to improve the goodness of fit;

The sub-models are connected by weighted summation: the outputs of the sub-models are summed with given weights to obtain the final output;

The sub-models can be added, removed, or replaced; after training on the basic data, the overall model adaptively selects among the prediction models by adjusting the weights of its sub-models;

The sub-models include the following:

1) Trend prediction method based on vectorized symbol sequences

First, historical stock price data are vectorized by least-squares fitting: with y_i the price on day x_i, minimize the error over n days

$S(a, b) = \sum_{i=0}^{n} (y_i - f(x_i))^2, \quad f(x_i) = a x_i + b,$

which gives the trend characterized by the slope, defined as:

The continuous trend vectors are then discretized to extract macroscopic trend information: an unsupervised clustering method is adopted and, given the particularities of the vectorized data, a re-clustering algorithm based on k-means is applied, yielding the clustering result and the center vector of each class;

2) Trend prediction algorithm based on turning-point extraction from stock time series

On the basis of the formal definition of time series fluctuation feature points: given a time series X = {x₁, x₂, ..., x_N}, the fluctuation feature point x of X within its time interval is {x | VD(x) = max(VD(x_i)), i = 1, 2, ..., N}, where

$VD(x_i) = \left| x_i - \left( x_1 + (x_N - x_1) - \frac{i - 1}{N - 1} \right) \right|;$

3) Investment recommendation algorithm based on word sentiment orientation

For any word w_i ∈ W, define:

$T(w_i) \,\hat{=}\, \begin{cases} 1, & \text{if } w_i \text{ is positive}, \\ -1, & \text{if } w_i \text{ is negative}. \end{cases}$

T(w_i) is called the sentiment orientation value of w_i. Given a sentiment orientation relation network SORN = (W, C, Q), the sequence of word indices along a path from w_i to w_j in SORN is written (p₁, p₂, ..., p_s), where 2 ≤ s ≤ n, p₁ = i, p_s = j;

If an element (h₁, h₂, ..., h_s) of the set {(p₁, p₂, ..., p_s)} satisfies:

$q_{h_1 h_2} + q_{h_2 h_3} + \cdots + q_{h_{s-1} h_s} = \min \{ q_{p_1 p_2} + q_{p_2 p_3} + \cdots + q_{p_{s-1} p_s} \},$

then (h₁, h₂, ..., h_s) is called the sequence of word indices along a least-noise path from w_i to w_j in SORN, denoted D(i, j); the sequence (D(i,1), ..., D(i,i-1), D(i,i+1), ..., D(i,n)) is denoted D(i); D(i) is computed by Dijkstra's algorithm implemented with adjacency lists and a Fibonacci heap;

The path with the least noise between two words is selected to determine the sentiment orientation relation of that pair. When the corpus is sufficiently large, D(i, j) can be obtained from Q, and the sentiment orientation relation between w_i and w_j can then be computed from C; in a corpus, positive words always outnumber negative words.
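The least-squares vectorization in item 1) of claim 5 can be illustrated with a closed-form line fit. The sketch assumes the days are indexed 0, 1, 2, ... so that x_i = i, and `fit_trend` is a hypothetical name. It returns the slope a, which characterizes the trend, together with the intercept b.

```python
def fit_trend(prices):
    """Least-squares line y = a*x + b through (0, y_0), (1, y_1), ...

    Returns (a, b); the slope a characterizes the trend of the window.
    """
    n = len(prices)
    if n < 2:
        raise ValueError("need at least two prices")
    mean_x = (n - 1) / 2
    mean_y = sum(prices) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(prices))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    a = sxy / sxx
    return a, mean_y - a * mean_x
```

For the prices 1, 3, 5, 7 the fit returns slope 2 and intercept 1.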

6. The stock market investment decision-making method based on network analysis and multi-model fusion according to claim 5, characterized in that, in the clustering algorithm, each trend vector is replaced by the center vector of the cluster it belongs to; this vector characterizes the type of trend and, when represented by a symbol, turns the continuously varying trend vectors into a discrete symbol sequence. The trading day, opening price, highest price, lowest price, closing price, trading volume, and turnover of a stock are used as the basic data of technical analysis, and the historical information of stock S is expressed as the symbol sequence {S_i},

$S_i = \{ s_i^d, s_i^o, s_i^h, s_i^l, s_i^c, s_i^v, s_i^t \};$

where $s_i^d, s_i^o, s_i^h, s_i^l, s_i^c, s_i^v, s_i^t$ correspond respectively to the trading day, opening price, highest price, lowest price, closing price, trading volume, and turnover of the stock on day i;

A support vector machine with a radial basis kernel function is used to learn the symbolic model of historical stock price trends. SVM theory seeks, through learning, a model with strong generalization ability. After the support vectors SV are obtained, the optimal classification hyperplane OHP must be found; this solution process is again a quadratic programming problem;

Let the SVs be the sample points nearest to the OHP: (l · x) + b = 0; SVs of the same class are all at the same distance from the OHP, whereas SVs of different classes are not necessarily at the same distance from it;

For m training samples (x_1, y_1), (x_2, y_2), ..., (x_m, y_m), the separating hyperplane is sought; it must satisfy the conditions of the optimal separating hyperplane:

To find the optimal separating hyperplane, the original problem is converted, by means of the Lagrange function, into a quadratic programming problem in standard form:

\max W(α) = Σ_{i = 1}^{m} α_{i} - \frac{1}{2} Σ_{i, j = 1}^{m} α_{i} α_{j} y_{i} y_{j} K(x_{i}, x_{j})

s.t. Σ_{i = 1}^{m} α_{i} y_{i} = 0, α_{i} ≥ 0, (i = 1, 2, ..., m)

The key to finding the optimal hyperplane is to obtain the α_{i} satisfying α_{i} > 0 and b_{0};

and the optimal separating hyperplane is:

f(x) = sign\{ \underset{α_{i} > 0}{Σ} α_{i} y_{i} K(x_{i}, x) - b_{0} \}

where the sample points corresponding to α_{i} > 0 are the support vectors; the quadratic programming problem can be solved with an active-set method, a dual method or an interior-point algorithm; the trained support vector machine automatically classifies new data and outputs the symbol of the classification result, and the slope of the original cluster-centre vector corresponding to this symbol is the final trend prediction value.
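The decision function f(x) above can be evaluated directly once the quadratic program has been solved. The following is a minimal sketch, assuming the multipliers α_i, labels y_i, support vectors x_i and offset b_0 are already available from some solver; the kernel width `gamma`, the function names and the toy values are hypothetical and hand-picked for illustration, not the result of training.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq)

def svm_decide(x, support_vectors, alphas, labels, b0, gamma=1.0):
    """Evaluate f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) - b0).

    Only support vectors (alpha_i > 0) are passed in, so the sum runs
    over exactly the terms that appear in the decision function.
    A score of exactly zero is mapped to -1 by convention here.
    """
    score = sum(a * y * rbf_kernel(sv, x, gamma)
                for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if score - b0 > 0 else -1

# Hand-picked toy model: one support vector per class, which satisfies
# the constraint sum_i alpha_i * y_i = 0.
svs = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [1, -1]
near_first = svm_decide((0.1, 0.1), svs, alphas, labels, b0=0.0)
near_second = svm_decide((1.9, 2.0), svs, alphas, labels, b0=0.0)
```

In the claimed method the two output classes would be replaced by the trend symbols obtained from clustering; the binary case is shown only because it matches the formula as written.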

7. The stock market investment decision-making method based on network analysis and multi-model fusion according to claim 5, characterized in that the steps of the time-series fluctuation feature point extraction algorithm are as follows:

Step 1: input the start coordinate start and the end coordinate end of the sequence to be extracted, and judge whether the distance between start and end is less than the minimum subsequence interval length; if so, go to step 3; if not, find the point with the maximum VD value between the start and the end according to the definition of a fluctuation feature point; if that VD exceeds the amplitude termination condition of the algorithm, add the point to the fluctuation feature point result sequence as a fluctuation feature point and denote it fp;

Step 2: split the original sequence at fp into two sub-segments, start to fp and fp to end, and apply step 1 to each of them;

Step 3: sort the fluctuation feature point result sequence by time, then save and output it;

Step 4: based on the minimum interval threshold for judging turning points, extract a turning point subset from the fluctuation feature point result sequence by the maximum/minimum value principle;

Step 5: based on the turning point subset, between any two consecutive turning points, find a turning point in the fluctuation feature point sequence by the principle of maximum reverse amplitude and add it to the turning point subset; repeat this operation until no new turning point can be found under the set turning point extraction parameters;

Step 6: sort the turning point subset by time, then save and output it;

The turning point subset extracted above characterizes changes in the stock price trend at a larger time scale; an optimal classifier is obtained by training with machine learning methods such as neural networks or support vector machines; this classifier automatically classifies new data and outputs whether the corresponding date is a turning point, that is, the prediction of the future trend; the trend prediction value corresponding to the most recent turning point is taken as the final trend prediction value.
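Steps 1 to 3 of the extraction algorithm can be sketched as a recursive split. This is a sketch under stated assumptions: the definition of VD is not given in this claim, so it is assumed here to be the vertical distance between a point and the straight line joining the two ends of the current segment; the parameter names `min_len` and `min_amp` and the function name are hypothetical. Steps 4 to 6 (turning point selection) are not covered.

```python
def fluctuation_points(series, min_len=2, min_amp=0.0):
    """Return the sorted indices of fluctuation feature points in `series`."""
    result = []

    def vd(start, end, k):
        # Assumed VD: vertical distance from point k to the chord start-end.
        y0, y1 = series[start], series[end]
        chord = y0 + (y1 - y0) * (k - start) / (end - start)
        return abs(series[k] - chord)

    def extract(start, end):
        # Step 1: stop when the segment is shorter than the minimum length
        # (a segment needs at least one interior point to be split).
        if end - start < max(min_len, 2):
            return
        fp = max(range(start + 1, end), key=lambda k: vd(start, end, k))
        if vd(start, end, fp) > min_amp:
            result.append(fp)
            # Step 2: recurse on start..fp and fp..end.
            extract(start, fp)
            extract(fp, end)

    extract(0, len(series) - 1)
    # Step 3: output the feature points in time order.
    return sorted(result)

# Hypothetical closing prices; an amplitude threshold of 1.0 keeps
# only the pronounced swings.
prices = [1, 2, 5, 3, 2, 4, 1]
points = fluctuation_points(prices, min_len=2, min_amp=1.0)
```

The recursion has the same shape as polyline simplification by repeated farthest-point splitting, so deep recursion on very long, highly volatile series may need an explicit stack instead.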

8. The stock market investment decision-making method based on network analysis and multi-model fusion according to claim 5, characterized in that the investment recommendation algorithm based on word sentiment orientation judgment adopts the SWSOA algorithm, which, starting from any node of the largest connected subgraph of the SORN, classifies the sentiment orientation of all nodes in that subgraph; the input of the SWSOA algorithm is a concrete SORN, and the algorithm is expressed as the function SWSOA(SORN), where SORN denotes a SORN variable;

The concrete steps of this algorithm are as follows:

Step 1. Use breadth-first traversal to obtain the largest connected subgraph Gs of the SORN; the set of word nodes contained in Gs is denoted W_{Gs};

Step 2. Select any node w_i in W_{Gs}, and let W_{Gs} = W_{Gs} - {w_i}, U = {w_i} and V = ∅;

Step 3. Compute D(i) to obtain the minimum-noise path from w_i to every node in W_{Gs};

Step 4. For each word node w_j in W_{Gs}, perform a) to c):

a) from D(i, j) in D(i), compute the number e of adversative-relation edges traversed on the minimum-noise path from w_i to w_j,

e = Σ_{l = 1}^{l = s - 1} δ_{h_{l} h_{l + 1}} c_{h_{l} h_{l + 1}};

b) compute the orientation relation between w_j and w_i: T(w_j) = (-1)^e * T(w_i);

c) if T(w_j) = T(w_i), then U = U ∪ {w_j}; otherwise V = V ∪ {w_j};

Step 5. The word sentiment orientation (WSO) classification is complete, and the results are stored in U and V respectively; if |U| > |V|, then U holds the positive words and V the negative words, and vice versa;

News reports and commentary relevant to the financial market are extracted, and their sentiment orientation is determined with the SWSOA algorithm; the listed companies judged positive, and their corresponding stocks, are the candidate stocks recommended for attention;

For a listed company judged positive and its corresponding stock, the final trend prediction value is 1;

For a listed company judged negative and its corresponding stock, the final trend prediction value is -1.
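Steps 1 to 5 of SWSOA can be sketched end to end. This is a minimal sketch under several assumptions that go beyond the claim text: the SORN is passed as a non-empty list of undirected edges `(u, v, noise, is_adversative)`, path noise is additive, the count e is tracked only through its parity (which is all that (-1)^e needs), and a binary heap replaces the Fibonacci heap. The function name `swsoa` and the edge layout are hypothetical. The claim does not say what happens when |U| = |V|; this sketch then treats V as positive.

```python
import heapq
from collections import deque

def swsoa(edges):
    """Split the largest connected subgraph into (positive, negative) words."""
    adj = {}
    for u, v, noise, adv in edges:
        adj.setdefault(u, []).append((v, noise, adv))
        adj.setdefault(v, []).append((u, noise, adv))

    # Step 1: largest connected subgraph by breadth-first traversal.
    seen, largest = set(), []
    for root in adj:
        if root in seen:
            continue
        seen.add(root)
        comp, queue = [root], deque([root])
        while queue:
            u = queue.popleft()
            for v, _, _ in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.append(v)
                    queue.append(v)
        if len(comp) > len(largest):
            largest = comp

    # Steps 2-3: pick a start node w_i and run Dijkstra from it, carrying
    # the parity of adversative edges along each minimum-noise path.
    start = largest[0]
    best = {start: (0.0, 0)}           # node -> (noise, parity of e)
    done = set()
    heap = [(0.0, start, 0)]
    while heap:
        d, u, parity = heapq.heappop(heap)
        if u in done:
            continue                   # stale heap entry
        done.add(u)
        for v, noise, adv in adj[u]:
            nd = d + noise
            if v not in best or nd < best[v][0]:
                best[v] = (nd, parity ^ int(adv))
                heapq.heappush(heap, (nd, v, parity ^ int(adv)))

    # Step 4: T(w_j) = (-1)^e * T(w_i); even parity joins U, odd joins V.
    U = {w for w in largest if best[w][1] == 0}
    V = set(largest) - U
    # Step 5: the larger set is taken to hold the positive words.
    return (U, V) if len(U) > len(V) else (V, U)

# Hypothetical toy network: one adversative edge separates two groups,
# and the pair x-y forms a smaller component that is ignored.
edges = [
    ("good", "great", 0.1, False),
    ("great", "fine", 0.1, False),
    ("good", "bad", 0.2, True),
    ("bad", "awful", 0.1, False),
    ("x", "y", 0.1, False),
]
positive, negative = swsoa(edges)
```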

9. The stock market investment decision-making method based on network analysis and multi-model fusion according to claim 5, characterized in that, considering the computational load and complexity that multi-model fusion adds to the overall model, the method further comprises a training, optimization and feature selection method based on a univariate estimation of distribution wrapper method:

In the univariate estimation of distribution algorithm, the data are encoded with a chromosome of dimension n (n = N + Dim); the chromosome is divided into two parts, the first being the coding of the wrapped algorithm's parameters, N binary bits in total; an elite retention process is enabled, and the historical optimal value of each stock is stored separately and used for initialization, which improves search efficiency and ensures the real-time performance of the technical analysis process; let the population q(t), where t is the time variable, be an m × n matrix recording the chromosomes sorted from high to low by fitness value; the first r rows [q_1 q_2 ... q_r]^T are retained as elites, and the estimated distribution probability matrix is P = [P_1 P_2 ... P_n], where r is the number of chromosomes participating in variation; the steps are as follows:

Step 1: initialize the population q(0), read the historical optimal values and add them to the initial population, and set the iteration count t = 0;

Step 2: sort the chromosomes of population q(t) in descending order of fitness value, and retain the first r chromosomes as elites;

Step 3: for the first R chromosomes (R is generally greater than r), compute the probability P_i that each column takes the value 1, forming the probability matrix;

Step 4: with probability P_i, randomly generate a digit following the 0-1 distribution as bit i of a chromosome; generate (m - r) chromosomes in the same way, set t = t + 1, and combine these chromosomes with the r retained elite chromosomes into the new population q(t);

Step 5: judge whether t > N; if true, end the algorithm; if false, go to step 2;
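The five steps above can be sketched as a small loop. This is a minimal sketch, assuming a caller-supplied fitness function over bit lists; the parameter names (`max_iter` for the iteration limit written N in step 5, `seed_pop` for the stored historical optimal values) and the default sizes are hypothetical. It requires r ≤ m and R ≥ 1.

```python
import random

def umda(fitness, n_bits, m=20, r=2, R=10, max_iter=30, seed_pop=(), rng=None):
    """Univariate estimation of distribution search with elite retention."""
    rng = rng or random.Random(0)
    # Step 1: initial population, seeded with historical optimal values.
    pop = [list(c) for c in seed_pop][:m]
    while len(pop) < m:
        pop.append([rng.randint(0, 1) for _ in range(n_bits)])
    t = 0
    while True:
        # Step 2: sort by decreasing fitness.
        pop.sort(key=fitness, reverse=True)
        # Step 5 (checked after sorting so the best individual is pop[0]).
        if t > max_iter:
            return pop[0]
        elites = [c[:] for c in pop[:r]]
        # Step 3: per-bit probability of a 1 among the top R chromosomes.
        top = pop[:R]
        P = [sum(c[i] for c in top) / len(top) for i in range(n_bits)]
        # Step 4: sample (m - r) new chromosomes bit by bit from P.
        children = [[1 if rng.random() < P[i] else 0 for i in range(n_bits)]
                    for _ in range(m - r)]
        pop = elites + children
        t += 1

# Usage with a toy fitness (number of 1 bits); the all-ones chromosome is
# supplied as a stored historical optimum and survives through elitism.
best = umda(sum, n_bits=8, seed_pop=[[1] * 8])
```

Because the elites are copied unchanged into every new population, the best fitness never decreases between iterations, which is the property that makes seeding with historical optima worthwhile.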

In the parameter optimization part, applying the wrapper requires the parameters to be encoded and decoded; the binary encoding and decoding method for the weight of each sub-model in the multi-model fusion is as follows:

1) Weight encoding

The initial weights are a coded sequence of 0s and 1s of length n × m, where m is the number of weights to be optimized and n is the number of coding bits per weight; every weight has the same code length;

2) Weight decoding

The range of weight w_i is w_i ∈ [0, 1]; suppose the code of the i-th weight is [q_1 q_2 ... q_n], where q_i ∈ {0, 1}, i ∈ {1, 2, ..., n};

w'_{i} = q_{1} \times 2^{0} + q_{2} \times 2^{1} + ... + q_{n} \times 2^{n - 1}

Then w'_i is mapped to the interval [0, 1]:

w_{i} = \frac{w_{i}^{'}}{2^{n}}
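The decoding rule can be checked with a few lines of code. This is a minimal sketch; the function name `decode_weights` and the flat bit-list layout (the m codes of n bits stored one after another) are assumptions made for illustration.

```python
def decode_weights(bits, n):
    """Decode a flat 0/1 list into weights, n bits per weight.

    Within each code, q_1 is the least significant bit, so
    w' = q_1 * 2**0 + q_2 * 2**1 + ... + q_n * 2**(n - 1),
    and the weight is w = w' / 2**n.
    """
    if len(bits) % n != 0:
        raise ValueError("bit string length must be a multiple of n")
    weights = []
    for k in range(0, len(bits), n):
        code = bits[k:k + n]
        raw = sum(q << j for j, q in enumerate(code))
        weights.append(raw / 2 ** n)
    return weights

# Two weights with n = 4 bits each.
w = decode_weights([1, 0, 0, 0, 1, 1, 1, 1], 4)
```

With this mapping the largest code gives (2^n - 1) / 2^n, so the decoded weight approaches but never reaches 1; in the example the two weights are 1/16 and 15/16.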

In the wrapper method, the data set formed from technical information is randomly divided into a training set and a test set; the population is decoded; the feature part of the gene phenotype is used to select a feature subset for training, yielding a sub-model with its parameters set; the same feature part of the gene phenotype is used to perform feature selection on the test set data; the feature-selected test set is input to the trained sub-model for testing, and fitness is evaluated according to the classification performance on the test set; if the termination condition of the univariate estimation of distribution algorithm is met, the optimal sub-model and feature subset are output; otherwise the probability distribution vector is updated by the univariate estimation of distribution algorithm, and a new population is generated by re-encoding.

10. The stock market investment decision-making method based on network analysis and multi-model fusion according to claim 1, characterized in that the design of the comprehensive investment strategy using the comprehensive market trend prediction value is based on the following correction factor:

r(t) = \begin{cases} 1, & \text{if } P_{s}(t) \times (1 + Q_{s}(t)) > 1 \\ P_{s}(t) \times (1 + Q_{s}(t)), & \text{otherwise} \end{cases},

where P and Q correspond respectively to the accuracy and the evaluation value on the test set during training; the correction factor r(t) reflects the investor's confidence in the prediction: the higher the value of r(t), the more confident the investor is in this prediction, and conversely the investor tends to doubt the prediction and keep the previous strategy; let the number of shares held by the investor at time t be N(t) and the cash held be c(t); the stock price data cp(t) is used as the price to represent VAL(t), and the signal function is Sig(t); for a single stock, the investment strategy can be expressed mathematically as:

Sig(t) = \begin{cases} Buy, & \text{if } Y_{s}(t) > 0 \\ Sell, & \text{if } Y_{s}(t) < 0 \\ Sig(t - 1), & \text{if } Y_{s}(t) = 0 \end{cases};

The closing price is used above as the operating standard, so the final total assets are a(t) = c(t) + N(t) * cp(t);
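The single-stock rules translate directly into code. This is a minimal sketch of the three formulas as written; the function names are hypothetical, and how many shares a Buy or Sell signal actually trades is not specified at this point in the claim, so no trade execution is modelled.

```python
def correction_factor(p, q):
    """r(t): P * (1 + Q), capped at 1."""
    v = p * (1 + q)
    return 1.0 if v > 1 else v

def signal(y, previous):
    """Sig(t): Buy on a positive prediction, Sell on a negative one,
    otherwise repeat the previous signal Sig(t - 1)."""
    if y > 0:
        return "Buy"
    if y < 0:
        return "Sell"
    return previous

def total_assets(cash, shares, close_price):
    """a(t) = c(t) + N(t) * cp(t), valued at the closing price."""
    return cash + shares * close_price

# Illustrative values only.
r_high = correction_factor(0.8, 0.5)       # 0.8 * 1.5 = 1.2, capped to 1.0
r_low = correction_factor(0.5, 0.5)        # 0.75
held = signal(0.0, previous="Buy")         # keeps the earlier signal
a_t = total_assets(1000.0, 10, 12.5)       # 1125.0
```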

For an investment portfolio, the stock preselection IFA based on network analysis is further considered; the expectation of the investment analysis is defined as E(ITA(t, s)) = Y_s(t) * r_s(t), where Y_s is the comprehensive market trend prediction value and r_s is the confidence in the result; the trading funds for stock s can be allocated according to it, and the corresponding general investment strategy is expressed as:

∀ s ∈ IFA(t),

Sig(t, s) = \begin{cases} Buy(s, ITA(t, s)), & \text{if } ITA(t, s) > θ_{+} \\ Sell(s, |ITA(t, s)|), & \text{if } ITA(t, s) < θ_{-} \\ Sig(t - 1, s), & \text{otherwise} \end{cases},

①: Sig(t, s) = Buy(s, r); ②: Sig(t, s) = Sell(s, r),

where r = ITA(t, s).
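The portfolio rule can be sketched as a loop over the preselected stocks. This is a minimal sketch with several assumptions: ITA(t, s) is taken to be the product Y_s(t) * r_s(t) that the claim gives as its expectation, the thresholds θ+ and θ- are passed in as `theta_pos` and `theta_neg`, and a signal is represented as an (action, amount) pair. All names and sample values are hypothetical.

```python
def portfolio_signals(ifa, y, r, previous, theta_pos, theta_neg):
    """Return {stock: signal} for every stock s in the preselection IFA(t).

    y[s] is the comprehensive trend prediction Y_s(t), r[s] the confidence
    r_s(t), and previous[s] the signal Sig(t - 1, s), if any.
    """
    signals = {}
    for s in ifa:
        ita = y[s] * r[s]                       # assumed ITA(t, s)
        if ita > theta_pos:
            signals[s] = ("Buy", ita)           # Buy(s, ITA(t, s))
        elif ita < theta_neg:
            signals[s] = ("Sell", abs(ita))     # Sell(s, |ITA(t, s)|)
        else:
            signals[s] = previous.get(s)        # keep Sig(t - 1, s)
    return signals

# Illustrative inputs for three preselected stocks.
ifa = ["A", "B", "C"]
y = {"A": 1.0, "B": -1.0, "C": 0.5}
r = {"A": 0.75, "B": 0.5, "C": 0.25}
previous = {"C": ("Buy", 0.5)}
sig = portfolio_signals(ifa, y, r, previous, theta_pos=0.25, theta_neg=-0.25)
```

Stock C falls between the two thresholds (ITA = 0.125), so it keeps its earlier signal, which is the "otherwise" branch of the formula.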