Open Access. Published by De Gruyter, July 12, 2018, under a CC BY 4.0 license.

Ensembles of Text and Time-Series Models for Automatic Generation of Financial Trading Signals from Social Media Content

  • Omar A. Bari and Arvin Agah

Abstract

Event studies in finance have focused on traditional news headlines to assess the impact an event has on a traded company. The increased proliferation of news and information produced by social media content has disrupted this trend. Although researchers have begun to identify trading opportunities from social media platforms, such as Twitter, almost all techniques use a general sentiment from large collections of tweets. Though useful, general sentiment cannot point to the specific events that move stock prices. This work presents an event clustering algorithm, utilizing natural language processing techniques to generate newsworthy events from Twitter, which have the potential to influence stock prices in the same manner as traditional news headlines. The event clustering method addresses the effects of pre-news and lagged news, two peculiarities that appear when connecting trading and news, regardless of the medium. Pre-news signifies a finding where stock prices move in advance of a news release. Lagged news refers to follow-up or late-arriving news, adding redundancy in making trading decisions. For events generated by the proposed clustering algorithm, we incorporate event studies and machine learning to produce an actionable system that can guide trading decisions. The recommended prediction algorithms provide investing strategies with profitable risk-adjusted returns. The suggested language models attain annualized Sharpe ratios (risk-adjusted returns) in the 5–11 range, while time-series models produce ratios in the 2–3 range (without transaction costs). The distribution of returns confirms the encouraging Sharpe ratios by identifying most outliers as positive gains. Additionally, machine learning metrics of precision, recall, and accuracy are discussed alongside financial metrics in hopes of bridging the gap between academia and industry in the field of computational finance.

1 Introduction

Event studies research focuses on the statistical impact that an event has on a traded company [11]. In finance, a financial press release announcing company earnings is an example of an event. Unlike earnings announcements, media events may arise unexpectedly. Using the framework of an event study, this paper explores unexpected events in modern media – particularly Twitter. Measuring statistical impact is not the central goal; instead, the implementation objectives are as follows. By utilizing natural language processing, the goal is to identify events on Twitter that influence stock prices of firms. Text and time-series models are generated by applying machine learning techniques in order to classify events. Quantitative trading decisions are developed by treating prediction outputs as trading signals. The implementation objectives combine event studies and machine learning to produce an actionable system that can guide trading decisions.

1.1 Motivation

With advancements in computation power, market investors adapt by relying heavily on empirical findings as opposed to classical economic theories. The fields of behavioral and mathematical finance emphasize a shift towards interpreting empirical findings and market realities.

Technological innovations have changed the way that news disseminates. Therefore, an emphasis is placed in this work on modern news media. An example of this transformation is Twitter, a popular social media platform that provides a means for posting unverified bits of information. To financial markets, any medium that breaks news first is relevant.

The focus of this paper is to use predictive analytics to gain insights on reactions to financial events. As a contribution, we address procedures to reduce the challenges of modern media.

1.2 Overview

Event studies in finance investigate the impact that events have on the value of traded firms. The impact is quantified by the change in stock prices immediately following an event. A principal study from economics provides the guideline on how best to structure an event study [11], with event analysis research considering the efficient-market hypothesis [5].

Figure 1: Event Horizon.

1.3 Event Analysis

1.3.1 Event Study

In order to measure the effect that breaking news has on prices, it is best to align the time stamp of each news story with its market data. The x-axis of the event horizon in Figure 1 marks the time components in an event study, plotted against cumulative stock returns. This is a hypothetical chart, and the cumulative return can vary significantly, i.e. it is not to scale. The horizon consists of a fixed event time surrounded by the pre/post-event regions, which makes it easier to collect and average the price impact of all positive/negative news. Selecting the amount of time needed for pre/post-event analysis is a matter of choice; it depends on whether the trading frequency is in seconds, minutes, days, months, or years. Figure 1 also highlights key insights that appear in many published papers [7; 9; 11; 22]. First, cumulative returns show that investors respond well to good stories and poorly to bad ones. The second finding, not as obvious, establishes some presence of a pre-news effect: prior to a press announcement, stock prices prematurely move in the expected direction. Financial institutions and traders ingest a variety of sources similar to journalists. For this reason, traditional media has a systematic form of information leak. The pre-news effect is an essential concept discussed in detail later. Regardless of the pre-news effect, it is apparent that prices continue in their respective direction once the story is public, strengthening the view of behavioral finance that investor perception guides investment decisions.

Without the availability of a standardized dataset, researchers frequently select news from a set of traditional media outlets. Major sources studied are the Wall Street Journal and Dow Jones News [2; 22], Yahoo Finance [18], Financial Times [20], Forbes [16], and Thomson Reuters News Analytics [9]. More recently, social media are deemed an important area of research, the most influential being Twitter [1; 12; 17; 21].

1.3.2 “New news” and Twitter

There exists evidence showing a dramatic evolution in the way news disseminates [9]. Around late 2007, there emerged a significant increase in news count, news depth (news stories per stock), and news breadth (coverage for lesser-known companies). The cause was a set of enhancements that effectively provide journalists with electronic chunks of informative pre-news material, making it easier to write stories. The main result was a higher correlation in returns for event-driven trading strategies post 2007 [9]. The intensification of the dissemination process of news was called “new news”, a concept that also encapsulates a technological innovation in how institutions handle releases. Given this purview, Twitter is an interesting and highly sought dimension. By using the Twitter API, investors are granted instant access to electronic messages that carry up to 140 characters of unstructured text. Twitter has developed into an instrumental source material for traders; for this reason, Twitter has systematized the use of the special character “$” to reference a stock. For instance, a tweet referencing Apple Inc. would carry “$AAPL”. Investors and industry are increasingly investigating Twitter-based strategies because tweets come under the heading of both “new news” and “pre-news”.

Pre-news consists of informative material that originates from a direct source. An example would be a US Securities and Exchange Commission filing located on its site. Later, traditional media would use information within that document to tell a coherent story. It should be noted that markets begin to move prior to the release of traditional news. In contrast, Twitter gives its sources (companies, investors, traders, and the general population) the ability to reach a broad audience immediately. Thus, it has two advantages. First, source material, if predicted, can lead to an earlier entry point into a directional trade. Second, unlike traditional media, Twitter is not confined in the number of news stories or journalists it can have. Without such limitations, Twitter produces information on a larger breadth of companies. Breadth of coverage is important because studies have found a greater price impact from news on stocks with a small total dollar market value (small-cap) as compared to larger corporations [9].

The difficulty of using Twitter is the inability to differentiate news from no-news. By definition, traditional media presents newsworthy material. By contrast, Twitter has no guidelines for its correspondents. Up to this point, the solution in the literature has been to examine collective sentiment from tweets [1; 12].

Leveraging the Twitter-standardized “$” is a common way to filter tweets about a particular company. By collecting and analyzing tweets mentioning companies from the Standard & Poor’s 500 (S&P 500), results show that stock returns relate significantly to the tweets’ emotional valence [21]. Furthermore, tweets from users with a greater number of followers had an impact on same-day returns, while users with fewer followers had an effect on 10-day returns [21].

1.4 Natural Language Processing

A review paper in text mining for market prediction highlights that nearly all studies use the bag-of-words model [14]. The model captures word frequencies in a document-term matrix scheme. Once in this scheme, any linear or non-linear classifier may be used for prediction. The frequent choices for bag-of-words are linear classifiers, namely, support vector machines (SVM) and naive Bayes. Since the task considered is document classification (more specifically, sentiment classification), one must generate a vector of positive/negative labels for the samples.

There are many variants of SVMs and naive Bayes. Thus, we look for answers in the field of sentiment classification. For sentiment in larger documents, SVMs outperform naive Bayes, whereas the reverse holds true for smaller documents [23]. Finally, a combination of naive Bayes and SVM called Naive Bayes Support Vector Machine (NBSVM) achieves strong and robust performance across standard datasets (both large and small documents) [23].

Simplicity and accuracy are the reasons why computational finance researchers select the bag-of-words model. Words in the model are treated as purely symbolic: there is no notion of word relationships, ordering, or context. For example, in trading, considering the terms “buy” and “sell”, the words “stocks”, “shares”, “options”, and “bonds” have a higher probability of being surrounding words than “laugh” or “cry”. Word frequencies in bag-of-words will not find these relationships. Newer trends in textual analysis, using distributed representations of words, may capture such similarities and associations.
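The document-term matrix scheme can be illustrated with a minimal sketch; the naive whitespace tokenization here is only for illustration, not the preprocessing used in the cited studies:

```python
from collections import Counter

def document_term_matrix(docs):
    """Bag-of-words: rows are documents, columns are vocabulary counts.
    Word order and context are discarded; only frequencies remain."""
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts[w] for w in vocab])
    return vocab, rows

docs = ["$AAPL shares up on earnings", "shares down on weak earnings"]
vocab, matrix = document_term_matrix(docs)
# vocab: ['$aapl', 'down', 'earnings', 'on', 'shares', 'up', 'weak']
```

Any linear or non-linear classifier can then be trained on the rows of `matrix` against positive/negative labels.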

Word embeddings capture relational similarities by mapping words to dense numerical vectors in which related words lie close together. In the current state of financial event studies, no language model has used word embeddings. This work investigates this further, as recent trends in sentiment classification move towards language models that use the full textual representation. Similarity over distributed representations can help find close word matches numerically, and word embeddings can be trained on all tweets.
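Numerical closeness between embedding vectors is typically measured with cosine similarity; the toy 2-D vectors below are illustrative stand-ins for real embeddings, which have hundreds of dimensions:

```python
import math

def cosine_similarity(u, v):
    """Similarity between two word vectors; related words (e.g. "stocks"
    and "shares") should score near 1, unrelated words near 0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```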

1.5 Time-Series Analysis

For prediction tasks, modeling prices directly is not always necessary; stock quotes have simply been used as features. However, in time-series classification it may be better to use the full series of current and past prices. One method is to use regression schemes and convert the real-valued outputs into a class label. This conversion grants access to many regression algorithms for time-series prediction. As an example, consider a regression model that predicts stock price movements in cents. What is needed for conversion to classification is a buy criterion: if the prediction is greater than x cents, then buy; otherwise, do nothing. The downside is that the training optimizer will only be concerned with minimizing prediction error on real-valued outputs, not buying decisions. The upside is that it allows usage of conventional methods such as linear regression, ridge regression, support vector regression (SVR), and neural networks. The strategy of leveraging regression systems for a classification task is considered in this work.

1.6 Objectives

1.6.1 Identifying Events on Twitter

As mentioned, filtering irrelevant tweets from newsworthy tweets is a difficult task. This is the reason why most researchers have filtered noise by averaging sentiment over a large group of tweets. Aside from using the “$” symbol to signify stocks, individual news events are commonly ignored. With the sheer number of users and tweets, related messages arriving in a small time interval can aid in forming events. To illustrate the intuition, consider an oil spill in Mexico: it is likely that clusters of tweets will pour in for a short duration mentioning “oil”, “spill”, and “Mexico”. A procedure for event identification is presented later. To move beyond general group sentiment on Twitter, one needs to determine how to use tweets similar to news headlines. Thus, for tweets, we present keyword-based similarity clusters to generate Twitter events resembling headlines.

1.6.2 New News to Fix Pre-News and Lagged News Effect

One recurring result in prior research has been stock prices moving prior to the release of news. In comparison, Twitter often discovers new stories faster because individuals post eyewitness accounts. Of course, one caveat is that statements are not verified. Still, as markets move on rumors and truths, it is possible that using Twitter will reduce the amount prices move before articles. Secondly, nearly all studies ignore how to deal with lagged news. Returning to the oil spill example, breaking the story may at first negatively affect stock prices. Many times, there will be follow-up articles explaining how prices fell after news of an oil spill. These lagged articles will not have the same impact on markets as the original story. However, a learning algorithm will most likely be trained to classify them as new negative events. Thus, an emphasis is placed in this work on both identifying news and reducing lagged news.

1.6.3 Language and Time-Series Ensemble Models

For language models, many researchers have not gone beyond the bag-of-words model. Furthermore, when using stock prices, event study models reduce the dimensionality of the input space by extracting a limited number of price features. Recent trends in machine learning aim to reduce the time spent on feature engineering. Representation learning and deep learning groups are leading this trend. We use prices in their full representation. We implement modern methods on text and prices (in their original representation) and study the performance of a trading strategy – on Twitter-generated events – that predicts stock price jumps.

2 Methodology

2.1 Models

All systems are either language models or time-series models. Models predict using input from two representations, namely, text and financial time-series. Language models try to classify stock movement given the text within tweets; they are trained and tested on tweets, enable the extraction of information from textual rather than numerical data, and are constructed for general natural language processing. For time-series, the task is not to compute long-term forecasts on a single stock. Thus, we do not consider traditional time-series forecasting designs such as autoregressive moving averages or stochastic models. Instead, the task compares machine learning techniques that mine the data for patterns in events. Traditional models are applied: linear regression, kernel ridge regression (KRR), and SVR. Furthermore, following recent trends in deep learning, we create systems using recurrent neural network (RNN) architectures.

2.2 Model Ensemble

Regression predicts real-valued outputs. Thus, a natural solution applies weighted averages between models. The approach scales with an increasing number of predictors. Weights are computed with the goal of raising annualized Sharpe ratios. The validation set is used to approximate the ideal weights.

Models that individually perform better are given a higher weight in the ensemble prediction. Therefore, a weighted arithmetic mean, with Sharpe ratios as weights, directly determines the influence of individual models in a combined model. As previously mentioned, the metrics of interest are accuracy, precision, recall, and financial statistics. These must all be calculated on trading decisions, not on regression outputs. Thus, even though regression models output a numerical value, the value is then converted to a labeled class (BUY/NEUTRAL/SHORT). Classification models already predict those categories. Since classification tasks predict labels, it is simplest to take the most common predicted class when building an ensemble. This is the strategy used on language models. If the learning algorithms produce a tie, then it is safest to select NEUTRAL.
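The two combination rules can be sketched as follows; the Sharpe-ratio weights and label names are illustrative placeholders, and the sketch assumes all validation Sharpe ratios are positive:

```python
from collections import Counter

def weighted_ensemble(outputs, sharpe_weights):
    """Regression ensemble: weighted arithmetic mean of model outputs,
    with each model's validation Sharpe ratio (assumed positive) as its
    weight."""
    total = sum(sharpe_weights)
    return sum(o * w for o, w in zip(outputs, sharpe_weights)) / total

def vote_ensemble(labels):
    """Classification ensemble: most common predicted class; ties fall
    back to the safe NEUTRAL label."""
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "NEUTRAL"
    return ranked[0][0]
```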

2.3 The Process

Assigning the financial security associated with a tweet is straightforward when the symbol is in the text. As such, we store tweets that single out companies using Twitter’s “$” character. The list of steps required to study price influence of events for companies mentioned in tweets include the following:

  1. Track and store all tweets mentioning company-centric stock symbols.

  2. Provide a method to define “new event” from a cluster of similar tweets.

  3. Develop a procedure to calculate a numerical event matrix that aligns events with financial time-series.

  4. Construct language models from tweets that best classify price spikes.

  5. Construct a temporal model that classifies large price movements using share prices.

  6. Measure the performance of language, temporal, and ensemble models on the validation and testing data.

  7. Use classified outputs from prediction model as buy/sell trading signals.

  8. Evaluate trading metrics (returns, Sharpe ratio, and profit per share).

A significant concern for predictive financial models is overfitting, i.e. the inability to generalize. We mitigate this concern to some extent by holding out large sections of our dataset for validation and testing. Supervised learning algorithms learn by seeing examples from the training set, then predict on unseen data. The dataset partition consists of a training, validation, and testing set. Choosing two longer-term hold-out periods is necessary to avoid bias from short-term market conditions. The validation set, while being unseen by the learning model, allows parameter tuning. Intraday trading ensures a considerable volume of data, enabling higher hold-out percentages. Unlike other applications, finance does not permit shuffling of data due to its sequential nature. Therefore, the sets are arranged in a preserved order of trading days. In addition, partitioning on trading-day boundaries avoids having data from the same intraday period in two sets. The dates used for partitions are described next.
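A chronological split can be sketched as below. The sample layout (ISO date string, payload) and cut-off dates are illustrative; ISO strings compare lexicographically in chronological order, so no parsing is needed:

```python
def chronological_split(samples, train_end, valid_end):
    """Partition time-ordered (date, features) samples into train,
    validation, and test sets without shuffling, so no intraday period
    spans two sets."""
    train = [s for s in samples if s[0] <= train_end]
    valid = [s for s in samples if train_end < s[0] <= valid_end]
    test = [s for s in samples if s[0] > valid_end]
    return train, valid, test

samples = [("2016-01-15", "a"), ("2016-03-10", "b"), ("2016-04-20", "c")]
train, valid, test = chronological_split(samples, "2016-02-28", "2016-03-31")
```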

2.4 Data

There are two forms of collected data, namely, tweets and stock quotes. The choice of trading frequency determines the level of detail in constructing an event study. Tweet time stamps are measured to the second. Therefore, we may choose a trading rate of either minutes or days. Since intraday trading provides a larger volume of data, we elect minutes. The Twitter API is used to retrieve tweets. MongoDB is then used to store tweets for analysis. This section provides a summary of the curated data.

2.4.1 Tweets Collection

The collection only stores a tweet given the presence of the “$” character. For example, any text containing “$FB” (the symbol for Facebook, Inc.) is immediately saved into the MongoDB database. Twitter limits the number of tracked terms to 400. As such, using a static snapshot of the S&P 500, we generate a list of 396 symbols to track over the tracking period of 12/21/2015 to 05/01/2016. Table 1 includes the training and evaluation periods for the dataset.

Table 1:

Company-Centric Evaluation Period.

Period Dates
Training period 12/21/2015 to 02/28/2016
Validation period 03/01/2016 to 03/31/2016
Testing period 04/01/2016 to 05/01/2016

2.5 Event Study Horizon

The horizon is established to measure the effect Twitter events have on prices. The relevant definitions are the event time, pre-event region, and post-event region. As suggested, intraday impacts are of interest. Because of this, the event time can be any minute during market hours that produces a complete post-event region. To compute the financial metrics, the entry and exit of trades must be clear. A simple exit strategy of closing a position after a set amount of time works well with event studies. The approach is to examine exit strategies of 60 and 240 min. As such, given event time T, the post-event regions will be (T, T+60] and (T, T+240]. These intraday regions help differentiate how long the influence remains on stock prices.

Time-series models use the pre-event region for predicting movements. A larger time span grants more input data for the forecasts. However, it reduces the number of events possible due to missing data. Since the markets open at 9:30 am EST, if an input requires 60 min of financial quotes, then it will not allow a trade before 10:30 am EST. Accordingly, a 30-min selection defines an area of [T−30, T).

2.6 Evaluation Metrics

The selected evaluation metrics are price spike accuracy (with a focus on precision/recall), and trading metrics. Directional accuracy, representing a binary classification with UP/DOWN as classes, is a consistent choice amongst researchers because it displays the algorithms’ ability to predict price direction. It is less important by industry standards since trading only on price direction can lead to financial losses. If incurred transaction costs are higher than stock gains, then the overall result is negative. Incurring high transaction costs is a common pitfall in designing trading strategies. The binary classification of UP/DOWN does not follow trading logic – mostly it is best to make no trade at all.

Predicting price spikes can have considerable benefits. Let us define a price spike as stock movement greater than |x|% within time t. We can empirically find the value for x. It need not be symmetric; in such a case one can use two positive values x and y. The values must be large enough to offset transaction costs but small enough to find enough events. To view this as a classification problem, we define the three classes of POSITIVE: movement greater than +x% in time t; NEUTRAL: movement within [−y, x]% in time t; and NEGATIVE: movement below −y% in time t. Given this setup, we can train a prediction classifier that works well with classical buy/sell trading logic: purchase the stock when the classifier predicts POSITIVE; short the stock when the prediction is NEGATIVE; and do nothing for a NEUTRAL prediction. Lastly, the simplest exit strategy is to close the trading position after time t.
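The three-class labeling can be sketched as below; the default thresholds are illustrative placeholders, not the empirically tuned values, and need not be symmetric:

```python
def spike_label(pct_move, x=0.5, y=0.5):
    """Label a percentage price move over window t as a spike class.
    x and y are positive percentage thresholds (illustrative defaults)."""
    if pct_move > x:
        return "POSITIVE"
    if pct_move < -y:
        return "NEGATIVE"
    return "NEUTRAL"
```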

Evaluation of the results uses the precision and recall [4] metrics. For the price spike classification task, we can consider a one versus all strategy. When measuring precision for the POSITIVE spike class, both NEUTRAL and NEGATIVE would be the incorrect categories. For this work, the reported values for precision and recall are computed using weighted averages from the POSITIVE and NEGATIVE classes. This stays consistent with providing market-neutral trading decisions. The emphasis on precision and recall stems from trading logic. Trading does not have a huge penalty for missing out on buying opportunities. Profit and loss are only affected by precision. It is best to tune model parameters to maximize precision. The only consideration needed is to increase recall if there are not enough trading opportunities. We expect to have low levels of recall since we do not want to trade on all news events.
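The one-vs-all precision and recall computation can be sketched directly from the definitions; the example labels are hypothetical:

```python
def precision_recall(y_true, y_pred, cls):
    """One-vs-all precision and recall for class `cls`. When measuring
    the POSITIVE spike class, both NEUTRAL and NEGATIVE count as the
    incorrect category."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

The reported figures would then be support-weighted averages of these per-class values over POSITIVE and NEGATIVE, as described above.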

The final evaluation criteria are achieved by computing trading metrics. The following industry standards are used as metrics: returns, Sharpe ratio, and profits per share. Sharpe ratio [19] is defined as Sharpe ratio = (rp−Rf)/σp, where rp is return of the portfolio, Rf is return of risk-free rate, and σp is standard deviation of portfolio. Sharpe ratio is a better metric than returns because it measures returns with respect to risk. As such, a strategy that produces 8% returns with very low volatility is much better than one with high fluctuations. The goal is to attain positive Sharpe ratios. In general, a strategy with a Sharpe ratio above 3 will be profitable daily [3]. Due to the low-risk levels, strategies with lower average returns but larger Sharpe ratios can be leveraged to maximize profit. An important task is to ensure that all values of risk and returns are annualized. With current economic conditions, the risk-free rate can be ignored.
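Annualizing the Sharpe ratio from per-period returns can be sketched as below, using the common convention of 252 trading days per year and a zero risk-free rate per the assumption above:

```python
import math
import statistics

def annualized_sharpe(period_returns, risk_free=0.0, periods_per_year=252):
    """Sharpe ratio = (r_p - R_f) / sigma_p, annualized by the square
    root of the number of trading periods per year."""
    excess = statistics.mean(period_returns) - risk_free
    sigma = statistics.stdev(period_returns)
    return (excess / sigma) * math.sqrt(periods_per_year)
```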

Commission costs are often calculated on a per share basis. For example, a brokerage can charge a fee of 0.0075 cents per share on a trade. For this reason, we highlight the average profit per share in cents for our prediction algorithm, to give portfolio managers a better view of how their fee structure applies. An emphasis on precision and recall can aid in optimizing prediction algorithms for finance.

Figure 2: Flow Chart of System Overview.

2.7 Overview Diagram

Figure 2 presents the system design, visualizing the procedure for constructing the proposed event study. The overview flows through the full procedure, starting from storing tweets and ending at the evaluation of trading decisions. A later section describes in detail all components from the system overview. The system overview highlights the following components: data retrieval, data storage, cluster formation, model building, and trading metrics. Prior to training, the data needed for retrieval are determined. The two components of data are financial time-series and tweets. Tweets are retrieved using the Twitter API, while financial data are gathered using QuantQuote (a financial data provider). The retrieved data are stored in MongoDB databases. Once all data are stored, clusters are formed as described. Clustered events and market data are converted to a format suitable for classification tasks. The outputs from learning algorithms are ensembled and used as trading decisions.

3 Experimental Design

3.1 Model Type: Language Models

The purpose of the language model is to predict a price spike given a tweet. Since features are textual, there are no issues with missing values. Two approaches are attempted for model selection. First, staying consistent with past studies, a bag-of-words style approach where words are symbolic is used for features. Second, more recent models are tested using words converted to numerical vectors. For a symbolic word feature set, NBSVM is applied [23]. For distributed representations, the following models are considered: the sentence/paragraph vector model [8] and an RNN language model (RNNLM) [13]. These model implementations were adapted from available code [6]. Most of the core implementation, such as preprocessing techniques to normalize text, was left exactly as is. The main design change was to move from a binary classification task to classification with three labels. Two solutions are available: convert to a multi-class classification task or use a one-vs-all classification strategy.

A multi-class solution reduces the number of samples of each category in the training set. The one-vs-all method instead creates two models rather than one, namely, a positive model for buy decisions and a negative model for shorting decisions. In both models, the opposing class is a neutral (do nothing) label. Table 2 lists all potential cases for combining predictions of the two models. The risk on returns is reduced by diffusing conflicting “buy” and “short” predictions to neutral. Text producing such a conflict likely reflects high-volatility conditions and is therefore avoided.

Table 2:

One-vs-All Approach for Language Model Predictions.

Model Sample 0 Sample 1 Sample 2 Sample 3
Positive model Buy Buy Neutral Neutral
Negative model Neutral Short Neutral Short
Final prediction Buy Neutral Neutral Short
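The combination rule in Table 2 can be sketched as a small function:

```python
def combine_predictions(positive_model, negative_model):
    """Combine the two one-vs-all models per Table 2: a conflicting
    Buy/Short pair diffuses to Neutral to reduce risk."""
    if positive_model == "Buy" and negative_model == "Neutral":
        return "Buy"
    if positive_model == "Neutral" and negative_model == "Short":
        return "Short"
    return "Neutral"
```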

3.2 Event Clusters

One tweet may not satisfy the requirements of being noteworthy. However, a burst of tweets on the same topic raises the likelihood of newsworthiness. The following algorithm is put forward to cluster events.

  1. For incoming tweets, tag text using part-of-speech (POS) tagger, where words are separated using a highly developed POS tagger [15].

  2. Compare tagged common nouns, URLs, proper nouns, and hashtags with prior events. If no such event has occurred in the past T minutes, create event ID and map POS tags to a new event. If POS tags match with past events’ tags, then append tweet to closest old event. Do not use stock symbol in matching criteria.

  3. Close an event if it does not have a new tweet appended to it within T minutes.

  4. Save an event as noteworthy if the required number of tweets cluster within T minutes.

  5. Once an event is created, save it for reference in merging future indistinguishable events.

  6. Mark an event as lagged if it shares high similarity with an older saved event.

  7. Convert clustered event time to a tradeable market time.

  8. For each clustered event, send the time stamp to the market data database to connect financial data. Use the stock symbol of the company mentioned via “$”. Use the connected financial data to label price spikes as BUY/NEUTRAL/SHORT.
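The core of the clustering loop (steps 1–4) can be sketched as below. This is a minimal sketch: hand-supplied keyword sets stand in for the POS tagger output, time is given in integer minutes, and the merging of lagged events (steps 5–6) is omitted:

```python
class EventClusterer:
    """Sketch of keyword-overlap event clustering. A tweet whose tagged
    keywords share at least `match_words` terms with a recent event is
    appended to it; otherwise it starts a new event."""

    def __init__(self, match_words=2, window_minutes=60, cluster_size=3):
        self.match_words = match_words
        self.window = window_minutes
        self.cluster_size = cluster_size
        self.events = []  # each event: {"tags", "minutes", "last"}

    def add(self, minute, keywords):
        keywords = set(keywords)
        for event in self.events:
            recent = minute - event["last"] <= self.window
            if recent and len(keywords & event["tags"]) >= self.match_words:
                event["tags"] |= keywords
                event["minutes"].append(minute)
                event["last"] = minute
                return event
        event = {"tags": keywords, "minutes": [minute], "last": minute}
        self.events.append(event)
        return event

    def noteworthy(self):
        """Events that reached the required cluster size."""
        return [e for e in self.events
                if len(e["minutes"]) >= self.cluster_size]
```

For example, three tweets within an hour sharing two of the keywords "oil", "spill", and "mexico" form one noteworthy event, while an unrelated tweet starts its own (non-noteworthy) event.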

The suggested algorithm for clustering events has parameters. During the training period, the parameters may be empirically adjusted to improve performance. However, the event clustering algorithm is the most computationally expensive algorithm in this work. An extensive amount of parameter tuning could further improve results. The clustering method needs to identify events early enough to counter the pre-news effect. In addition, merging handles lagged news by determining whether late tweets have an additional impact. Thus, the current parameter set is an initial solution for event clustering. Table 3 specifies crucial parameters.

Table 3:

Parameter Set for Clustering Events.

Parameter Value
Cluster size 3
Time limit for clusters 60 min
Time limit for lagged events 4 h
Matching criteria 2 words
Merging criteria 3 words

Once the cluster reaches the accepted size, the time stamp of the last tweet is used. We convert this time stamp to an acceptable market time for trading. Since Twitter time stamps appear at a second frequency, we round up to the nearest minute. It is important not to round down, as doing so would introduce a look-ahead bias of seconds. Rounding up to the next minute does not solve the scenario where the exchange is not open. To deal with this, any event that arrives outside market hours takes the next available market time. This strategy puts to use all newsworthy events leading up to the market opening at 9:30 am EST. The lagged news effect is dealt with by storing any noteworthy event for 4 h and purging any incoming event with a strong resemblance. A primary benefit of the enumerated method is the ability to label tweets automatically, which is essential for supervised learning.
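The time stamp conversion can be sketched as below; handling of weekends and after-close events (which roll to the next trading day's open) is omitted from this sketch:

```python
from datetime import datetime, timedelta

def tradeable_minute(ts, open_hour=9, open_minute=30):
    """Round a tweet time stamp UP to the next whole minute (rounding
    down would peek seconds into the future, a look-ahead bias), then
    push pre-open events to the 9:30 am market open."""
    if ts.second or ts.microsecond:
        ts = ts.replace(second=0, microsecond=0) + timedelta(minutes=1)
    market_open = ts.replace(hour=open_hour, minute=open_minute)
    return max(ts, market_open)
```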

3.3 Event Matrix

An event matrix is designed to align the time stamp of an event with its stock symbol. Event matrices are utilized both by the time-series model and for trading metrics. Table 4 uses a hypothetical example to describe an event matrix for a time-series model. The “E” inside the matrix identifies a clustered tweet event; the time-series model uses this information to know when, and for which stock, to make a prediction. One tweak is needed for trading metrics: the values inside the matrix now correspond to output labels, where 1 is BUY, 0 is NEUTRAL, and −1 is SHORT. The event matrix in Table 5 can then be used to make buy/sell trading decisions. The hypothetical trades in the example are as follows: buy CVX at 11:31, short XOM at 11:32, and short SLB at 11:39.

Table 4:

Event Matrix as Input to Time-Series Model-Minute Frequency.

Time stamp $CVX $BP $XOM $SLB
08-03-2015-11:30
08-03-2015-11:31 E
08-03-2015-11:32 E
08-03-2015-11:33
08-03-2015-11:34
08-03-2015-11:35 E
08-03-2015-11:36 E
08-03-2015-11:37 E
08-03-2015-11:38
08-03-2015-11:39 E
Table 5:

Event Matrix as Input for Trading-Minute Frequency.

Time stamp $CVX $BP $XOM $SLB
08-03-2015-11:30
08-03-2015-11:31 1
08-03-2015-11:32 −1
08-03-2015-11:33
08-03-2015-11:34
08-03-2015-11:35 0
08-03-2015-11:36 0
08-03-2015-11:37 0
08-03-2015-11:38
08-03-2015-11:39 −1
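An event matrix like the one in Table 5 can be represented sparsely; a minimal sketch, using the hypothetical trades from the example (1 = BUY, 0 = NEUTRAL, −1 = SHORT):

```python
# Hypothetical clustered events from Table 5: (time stamp, cashtag, label).
events = [
    ("08-03-2015-11:31", "$CVX", 1),
    ("08-03-2015-11:32", "$XOM", -1),
    ("08-03-2015-11:39", "$SLB", -1),
]

def build_event_matrix(events):
    """Sparse event matrix keyed by (time stamp, stock symbol)."""
    return {(ts, symbol): label for ts, symbol, label in events}

matrix = build_event_matrix(events)
# Trading logic acts only on cells carrying a BUY (1) or SHORT (-1) label.
trades = [(ts, sym, lbl) for (ts, sym), lbl in matrix.items() if lbl != 0]
```

The sparse keying avoids materializing the mostly empty minute-by-symbol grid shown in the tables.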

3.4 Model Type: Time-Series Models

Regression systems use the past 30 min of price movements to predict future shifts. We convert the price change prediction to trading logic. For example, if the buy criterion is any move larger than 30 cents and a linear regression system predicts a price increase of 50 cents, then the system buys. Essentially, while the learning algorithm is designed for regression, the output is classification.
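The regression-to-classification conversion can be sketched as follows; the 30-cent threshold is the illustrative value from the text above.

```python
BUY_THRESHOLD = 0.30    # dollars; illustrative criterion from the text
SHORT_THRESHOLD = -0.30

def to_trade_label(predicted_shift):
    """Map a predicted price shift (in dollars) to a trade label:
    1 = BUY, 0 = NEUTRAL, -1 = SHORT."""
    if predicted_shift > BUY_THRESHOLD:
        return 1
    if predicted_shift < SHORT_THRESHOLD:
        return -1
    return 0
```

A predicted 50-cent rise thus maps to BUY, while any shift inside the ±30-cent band stays NEUTRAL.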

Classical techniques such as linear regression, SVR, and KRR use default parameters. Neural networks, however, admit countless possible architectures. To stay within reason, we focus on one feedforward (FWD) neural network and three recurrent neural networks. The reason for multiple structures is to compare the gated recurrent unit (GRU) against the long short-term memory (LSTM) unit. Neural network model names are labeled to highlight architecture. The FWD neural network, FWD 0 RNN 6 Dense, has zero RNN layers and six fully connected hidden layers. RNN 1 GRU 1 Dense has one GRU layer and one fully connected hidden layer. RNN 4 LSTM 1 Dense has four LSTM layers and one fully connected hidden layer. Lastly, RNN 7 LSTM 0 Dense has seven LSTM layers and zero fully connected hidden layers. Every neural network has a single neuron as its output layer. The output layer does not have an activation function, since the networks learn to predict values for price shifts. Training refines these predictions by minimizing the mean squared error. Once training is complete, the projection is converted to class labels using the BUY/SHORT criteria. Based on the overall strategy, the system attempts to answer the question “Is the stock primed for a large movement?”

4 Experimental Results

4.1 Analysis of Twitter Dataset

Before presenting profit results, we analyze the stored collection and highlight some techniques that help reduce useless events. A filter is necessary because Twitter is full of casual conversations, while the event clustering algorithm aims to use Twitter in a manner similar to that of news headlines. We begin by excluding retweets, so that all saved tweets are unique and independently produced. The number of retweets is unknown because the storing routine rejects them immediately. The total number of original tweets stored for the company-centric collection is 1,557,811. Since each language has its own grammatical peculiarities, the focus is only on English tweets. Removing retweets aids event clustering by avoiding the arrival of identical texts. Unfortunately, a similar concern arises given the pervasiveness of automated bots on Twitter. Luckily, the Twitter API provides the source of origin for each tweet: a source tag registers whether a tweet was formed using an iPhone app, a Blackberry app, any third-party app, or the Twitter website itself. By examining sources, a filter is built to block automated tweets.

In viewing preliminary events, a few sources appear to dominate as candidates for presenting repetitive texts. A blacklist containing sources that exhibit such behavior is designed to mitigate the problem. The list is by no means exhaustive; however, it is an essential step for the event clustering algorithm. Either blacklisting sources or accepting only trusted sources is our recommendation for moving towards identifying noteworthy news on Twitter. Table 6 states figures for filtered documents, i.e. the English filter and the blacklisted source filter. Approximately half of the tweets are filtered out before grouping into events. This high percentage shows the influence that bots have on Twitter.
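A sketch of the filtering pipeline, assuming Twitter API v1.1 tweet fields (`lang`, `source`, `retweeted_status`); the blacklisted source names below are hypothetical, since the paper's actual list is not reproduced here:

```python
# Hypothetical blacklist entries; the real list targets sources that
# repeatedly emit near-identical, automated texts.
BLACKLISTED_SOURCES = {"ExampleStockBot", "AutoTickerFeed"}

def keep_tweet(tweet: dict) -> bool:
    """Retain only original, English tweets from non-blacklisted sources."""
    if tweet.get("retweeted_status") is not None:
        return False                      # reject retweets immediately
    if tweet.get("lang") != "en":
        return False                      # English-only collection
    return tweet.get("source") not in BLACKLISTED_SOURCES
```

Applied to the stored collection, this two-stage filter accounts for the drop from 1,557,811 tweets to the 877,054 reported in Table 6.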

Table 6:

Statistics on Stored Tweets for Company-Centric Collection.

Tweet collection Tweet count % of collection
Total 1,557,811 100%
English 1,398,981 89.80%
Not blacklisted 877,054 56.30%

4.2 Analysis of Event Collections

Presented here are the results of applying the event clustering algorithm to the filtered documents. The numbers change as the exit strategy parameter is modified; currently, a 60-min exit strategy is used, and an examination of which exit strategy works best is presented later. The collection comprises 35,453 total events, with a training size of 19,788, a validation size of 7352, and a testing size of 8313. As mentioned, a large percentage of events is reserved for the validation and testing sets, which stays consistent with addressing the concern of overfitting in finance.

If a cluster is identified outside of trading hours, we can either ignore it or trade it at market open. A large proportion of clustered events, 69%, falls at market open; thus, the decision is to use events at market open. There are no events after 3:00 pm EST because we are using a 60-min holding period, and all events that would hold the stock for a longer period form a subset of either collection. Events at market open are not available for time-series models, since those models require a 30-min pre-event region.

Similar logic applies to trading days. The strategy is to ignore all Friday evening events while keeping all weekend (Saturday and Sunday) tweets. Including weekends increases the number of events on Mondays. The decisions described here are the same choices traders must consider when incorporating morning and weekend news. The effect of news from outside trading hours may explain why the largest price movements in a trading day happen at market open. A significant number of events have been merged and classified as lagged events. Each event is saved for 4 h, and if any new event has a high similarity to a past event, it is considered lagged. A total of 32,258 events are merged, meaning that 47.6% of all company-centric events are merged.
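The merging rule can be sketched as follows, using the 4-h window and the 3-word merging criterion from Table 3; representing each event by its keyword set is a simplifying assumption.

```python
from datetime import datetime, timedelta

LAG_WINDOW = timedelta(hours=4)
MERGE_CRITERIA = 3   # shared keywords required to merge, per Table 3

def is_lagged(new_keywords, new_time, recent_events):
    """Flag an event as lagged if it shares enough keywords with any
    event stored within the previous 4 h."""
    for past_keywords, past_time in recent_events:
        overlap = len(set(new_keywords) & set(past_keywords))
        if new_time - past_time <= LAG_WINDOW and overlap >= MERGE_CRITERIA:
            return True
    return False
```

Lagged events are merged into their predecessor rather than generating a second, redundant trade on the same story.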

With additional information provided by the Twitter API, analysis can go beyond just the text of tweets. Hashtags are used as topical indicators, as they give a quick look at top categories clustered by the algorithm. The top hashtags are considered relevant.

4.3 Performance of Event Collections

Annualizing Sharpe ratios is a matter of scaling. If we have monthly returns, then we would multiply the Sharpe ratio by the square root of 12. Daily returns would scale by the square root of 252 and not 365, because there are 252 trading days in a year. In using a 1-h holding period, we may be tempted to multiply 252 by the number of hours in a trading day. However, this should not be the case because the strategies do not provide opportunities to trade every single hour. Instead, the learning algorithms do present enough events to achieve these returns at least once a day. Thus, we scale returns daily to annual rather than hourly to annual. To allow scaling of Sharpe ratios, an underlying assumption is that the returns are independent and identically distributed. A study suggests an overestimation of Sharpe ratios when the assumption does not hold [10]. Overstated Sharpe ratios stem from the presence of serial correlation in returns [10].
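The daily-to-annual scaling just described can be sketched as follows; the risk-free rate is omitted for simplicity, and the scaling is valid only under the i.i.d. assumption discussed above.

```python
import math
import statistics

TRADING_DAYS = 252  # trading days in a year, not 365

def annualized_sharpe(daily_returns):
    """Annualize the Sharpe ratio of daily returns by sqrt(252).
    Overstates the true ratio if returns are serially correlated [10]."""
    mean = statistics.mean(daily_returns)
    risk = statistics.stdev(daily_returns)
    return (mean / risk) * math.sqrt(TRADING_DAYS)
```

For monthly returns the same function would scale by sqrt(12) instead of sqrt(252).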

If at 9:30 EST there are five events for a company, where four are buys and one is a short, then the majority decision (buy) is selected. Doing so removes the serial correlation that would arise from repeated identical returns and discards the one differing trading decision. With less likelihood of serial correlation in returns, Sharpe ratios can be annualized.
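A minimal sketch of this conflation step; the tie-to-NEUTRAL fallback is our assumption, not stated in the text.

```python
from collections import Counter

def conflate(labels):
    """Collapse simultaneous events on one company into a single
    decision by majority vote (1 = BUY, 0 = NEUTRAL, -1 = SHORT)."""
    if not labels:
        return 0
    counts = Counter(labels).most_common()
    # Tie between the two most frequent labels: stay NEUTRAL (assumption).
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return 0
    return counts[0][0]
```

In the 9:30 example above, four buys and one short conflate to a single BUY.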

An issue occurs when multiple stock symbols appear in a tweet. If “$AAPL” and “$FB” are both listed, we must choose to accept both symbols, only one, or none. Using all symbols would generate multiple events with identical text. As an initial solution, we choose only the first symbol alphabetically.

4.4 Exit Strategy for Language Models

A difficult decision for traders is figuring out when to exit a trade. The experiments measure performance on intraday holding periods of 60 and 240 min. Figures 3 and 4 show Sharpe ratios and the number of trades for each language model. The figures include ensembles formed by taking the most common prediction from the individual models. With the concern of serial correlation in the company-centric collection addressed, a conversion is made to an annualized Sharpe ratio.

Figure 3: Performance of Language Models on 60-Min Exit Strategy.

Figure 4: Performance of Language Models on 240-Min Exit Strategy.

The company-centric collection does well and is mostly consistent. Two of the most promising models are the language models NBSVM and PARAGRAPH. The decisions from these systems would present meaningful gains for day traders over the 2-month period. Sharpe ratios give a general sense of a trading strategy’s profitability.

The PARAGRAPH model is based on unsupervised learning of distributed vector representations for texts, varying from a sentence to an entire paragraph. The model’s vector representation is trained to predict the surrounding words in a paragraph. Based on empirical results, the paragraph model is a good technique for text representation [8].

With monthly or yearly holding periods, a Sharpe ratio greater than 3 is considered excellent. Shorter-term and high-frequency traders prefer higher Sharpe ratios. Due to smaller per-trade returns, a larger amount of capital is needed to invest in high-frequency positions. Since a significant capital investment is required, lower-risk opportunities are adequate.

Trade counts greater than or close to 500 boost confidence in the results, as they ensure that no single outcome dominates the averages. The only ensembles with a small number of trades are pairs that include RNNLM, showing that there is limited agreement between RNNLM and the other language models. The displayed results fit the requirements of the short-term traders noted above.

Figure 5: Performance of Regression Models on 60-Min Exit Strategy.

Figure 6: Performance of Regression Models on 240-Min Exit Strategy.

4.5 Exit Strategy for Regression Models

Regression performance across exit strategies is displayed in Figures 5 and 6, including KRR. Sharpe ratios ranging from −2 to 3 fail to impress when compared to language models. The lower range may be partly due to evaluation on a small subset of the data that language models trade on. Regression models forecast price shifts based on the past 30 min of movements, and a significant portion of events appears at market open; therefore, less than 30% have the pre-event region required for projection. Even with a smaller training set, time-series models mostly learn to be profitable.

Time-series predictors, unlike language models, show clear differences among exit strategies. The longer 240-min period shows negative Sharpe ratios and inconsistencies between the validation and testing sets. The 60-min holding period appears to be the best choice for regression models. It is reasonable to expect shorter-term forecasts to be more accurate, although for good decisions, allowing a trade more time lets greater returns accumulate; since Sharpe ratios are a function of both risk and returns, a good balance helps. When comparing models, no single one outperforms all others across the given exit strategies, so a closer examination of regression model performance is required. Both language models and regression models perform well with a 60-min holding period. For this reason, we continue beyond Sharpe ratios and later highlight financial metrics using the 60-min exit strategy.

Figure 7: Percentage Returns Compared Using a 60-Min Exit Strategy on Language Models.

4.6 Evaluation Metrics

A Sharpe ratio acts as a sound indication to accept or reject a trading strategy; still, a closer inspection of additional metrics is advised. This section considers the distribution of stock returns, profit per share, and the machine learning metrics of precision, recall, and accuracy. Figure 7 helps in understanding the distribution of percentage returns; presented is a box plot of returns for the language models. A box plot highlights the mean, quantiles, and outliers from all trades. The mean and the majority of predictions are above 0% for NBSVM and PARAGRAPH. Furthermore, most outliers are large positive gains; beyond the high Sharpe ratios, it should be noted that most of these outliers are positive. Sharpe ratios penalize risk in either direction, whereas investors want notable positive gains. For this reason, some traders prefer Sortino ratios, which only adjust for downside risk. The distribution of returns further persuades investors toward the event clustering algorithm and language models. Comparing the box plot of returns with the Sharpe ratios yields more meaningful interpretations.
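A sketch of the Sortino ratio mentioned above, with a target return of 0% and no annualization (a simplified formulation):

```python
import math
import statistics

def sortino(returns, target=0.0):
    """Sortino ratio: like Sharpe, but only deviations below the
    target return count as risk."""
    downside = [min(0.0, r - target) for r in returns]
    downside_dev = math.sqrt(sum(d * d for d in downside) / len(returns))
    return (statistics.mean(returns) - target) / downside_dev
```

For a right-skewed return series the Sortino ratio exceeds the Sharpe ratio, since large positive outliers no longer inflate the risk term.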

By using a histogram in the range of −5% to 5%, we can capture the distribution of where most movements appear. Figure 8 shows how a profitable model should look, while Figure 9 displays a model that would lead to losses. There is a natural skew to the right for both NBSVM and PARAGRAPH, along with a large number of profitable trades. It is also clear that the returns have a mean close to 0.6%. The average may seem small, but these returns are accumulated in just 60 min and are not annualized. The advantage of a shorter holding period is the ability to free up capital: a 0.6% return can be absorbed and reinvested to give compounding effects. Tables 7–10 report evaluation metrics for language and regression models on the validation and testing sets. On the evaluation metrics alone, text models undoubtedly outperform time-series models. Still, it is important to compare them separately, since the two are trading on different subsets of the data.
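The compounding effect of reinvesting a 0.6% average return can be illustrated as follows (the trade count is hypothetical):

```python
def compounded(return_per_trade, n_trades):
    """Total growth of capital when each per-trade return is fully
    reinvested, rather than withdrawn."""
    return (1 + return_per_trade) ** n_trades - 1

# A 0.6% return reinvested over 100 trades grows capital by roughly 82%,
# noticeably more than the 60% a non-compounded sum would suggest.
growth = compounded(0.006, 100)
```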

Figure 8: PARAGRAPH Histogram of Percentage Returns Focused on Range (−5%, 5%).

Figure 9: RNNLM Histogram of Percentage Returns Focused on Range (−5%, 5%).

Table 7:

60-Min Language Model Metrics on Validation Set.

Model type Profit per share ($) Returns (%) Risk Trades Sharpe ratio Precision Recall Accuracy
NBSVM 0.761 0.686 1.432 761 7.602 0.612 0.057 0.556
RNNLM −0.051 −0.125 1.552 1068 −1.282 0.591 0.038 0.544
PARAGRAPH 0.698 0.565 1.784 865 5.025 0.621 0.060 0.557
NBSVM RNNLM 0.970 0.798 0.788 42 16.080 0.582 0.003 0.537
NBSVM PARAGRAPH 0.832 0.717 1.712 389 6.648 0.599 0.029 0.546
RNNLM PARAGRAPH 0.988 0.628 1.013 84 9.838 0.611 0.006 0.538
NBSVM RNNLM PARAGRAPH 0.867 0.698 1.596 479 6.946 0.607 0.036 0.549
RANDOM-AVG 0.001 0.001 1.299 2150 0.011 0.290 0.050 0.505
Table 8:

60-Min Language Model Metrics on Testing Set.

Model type Profit per share ($) Returns (%) Risk Trades Sharpe ratio Precision Recall Accuracy
NBSVM 0.752 0.636 0.918 820 10.990 0.558 0.057 0.563
RNNLM −0.086 −0.037 1.228 1109 −0.481 0.539 0.036 0.549
PARAGRAPH 0.561 0.500 1.125 818 7.058 0.577 0.054 0.562
NBSVM RNNLM 0.824 0.690 1.213 72 9.028 0.477 0.004 0.547
NBSVM PARAGRAPH 0.770 0.654 0.892 430 11.653 0.553 0.030 0.555
RNNLM PARAGRAPH 0.514 0.352 1.986 89 2.815 0.574 0.006 0.548
NBSVM RNNLM PARAGRAPH 0.752 0.602 1.185 521 8.067 0.553 0.036 0.556
RANDOM-AVG −0.002 0.003 1.013 2237 0.042 0.288 0.050 0.514
Table 9:

60-Min Regression Model Metrics on Validation Set.

Model type Profit per share ($) Returns (%) Risk Trades Sharpe ratio Precision Recall Accuracy
LR 0.038 0.330 2.805 442 1.870 0.438 0.060 0.598
SVR −0.044 0.192 3.585 251 0.852 0.455 0.032 0.593
KRR 0.041 0.261 3.053 349 1.356 0.421 0.046 0.596
FWD 0 RNN 6 Dense −0.040 0.104 3.028 373 0.548 0.362 0.043 0.591
RNN 1 GRU 1 Dense 0.181 0.077 2.724 482 0.448 0.399 0.059 0.593
RNN 4 LSTM 1 Dense 0.158 0.904 4.431 128 3.237 0.432 0.018 0.591
RNN 7 LSTM 0 Dense 0.120 0.240 4.102 176 0.928 0.413 0.023 0.590
RANDOM-AVG 0.005 −0.002 0.839 768 −0.027 0.226 0.050 0.633
Table 10:

60-Min Regression Model Metrics on Testing Set.

Model type Profit per share ($) Returns (%) Risk Trades Sharpe ratio Precision Recall Accuracy
LR 0.032 0.024 1.311 332 0.291 0.432 0.047 0.602
SVR 0.079 0.250 1.555 155 2.553 0.508 0.024 0.601
KRR 0.004 −0.022 1.381 273 −0.249 0.417 0.038 0.602
FWD 0 RNN 6 Dense −0.014 0.031 1.263 287 0.384 0.419 0.038 0.599
RNN 1 GRU 1 Dense 0.150 0.078 1.235 408 1.002 0.435 0.052 0.599
RNN 4 LSTM 1 Dense 0.176 0.243 1.402 78 2.746 0.447 0.012 0.598
RNN 7 LSTM 0 Dense −0.313 0.090 1.892 111 0.755 0.241 0.015 0.597
RANDOM-AVG −0.003 0.000 0.596 740 0.011 0.217 0.050 0.648

The variance in the price at which an asset trades explains why some strategies have a negative profit per share but positive returns. For example, on the validation set, SVR loses an average of about 4 cents per share but has an average return of 0.19%. SVR made a few bad predictions on stocks that trade at higher prices; this can be a large decline in cents but a small ratio in percentage-return terms. Nonetheless, the current regression techniques learn to be profitable with the selected representation. Only the RNN 4 LSTM 1 Dense model achieves a Sharpe ratio over 2 on both the validation and testing sets.

4.7 Ensembles of Time-Series Models

Model combinations and weights are selected using the method described earlier. Listed in Table 11 are the top 10 performers based on the Sharpe ratio on the testing set. As shown in Tables 12 and 13, all the recorded ensembles have Sharpe ratios greater than 2. Of the 10, seven use RNN 4 LSTM 1 Dense. Three of the seven (Ensembles 1, 2, and 4) perform better as a combination than the original model.
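One way such weighted combinations can turn per-model labels into a single decision is sketched below; the weighted-sum rule and the 0.5 threshold are illustrative assumptions, not the paper's exact combination method.

```python
def weighted_vote(predictions, weights):
    """Combine model labels (1 = BUY, 0 = NEUTRAL, -1 = SHORT) into one
    decision via a thresholded, weighted signed sum."""
    score = sum(w * p for p, w in zip(predictions, weights))
    if score > 0.5:
        return 1
    if score < -0.5:
        return -1
    return 0
```

With near-uniform weights this reduces to a majority vote; skewed weights let a strong model such as RNN 4 LSTM 1 Dense dominate the decision.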

Table 11:

Top Ensembles.

Ensemble 1 FWD 0 RNN 6 Dense and RNN 4 LSTM 1 Dense
Ensemble 2 RNN 4 LSTM 1 Dense and RNN 1 GRU 1 Dense and RNN 7 LSTM 0 Dense
Ensemble 3 RNN 1 GRU 1 Dense and SVR
Ensemble 4 FWD 0 RNN 6 Dense and RNN 4 LSTM 1 Dense and RNN 1 GRU 1 Dense and SVR
Ensemble 5 FWD 0 RNN 6 Dense and RNN 4 LSTM 1 Dense and KRR and RNN 1 GRU 1 Dense and SVR
Ensemble 6 FWD 0 RNN 6 Dense and RNN 4 LSTM 1 Dense and RNN 1 GRU 1 Dense and LR and RNN 7 LSTM 0 Dense and SVR
Ensemble 7 RNN 4 LSTM 1 Dense and RNN 1 GRU 1 Dense
Ensemble 8 FWD 0 RNN 6 Dense and RNN 1 GRU 1 Dense and SVR
Ensemble 9 FWD 0 RNN 6 Dense and RNN 4 LSTM 1 Dense and KRR and SVR
Ensemble 10 FWD 0 RNN 6 Dense and RNN 1 GRU 1 Dense and RNN 7 LSTM 0 Dense and SVR
Table 12:

60-Min Ensemble of Regression Models Metrics on Validation Set.

Model type Profit per share ($) Returns (%) Risk Trades Sharpe ratio Precision Recall Accuracy
Ensemble 1 0.064 0.679 4.719 123 2.285 0.410 0.016 0.590
Ensemble 2 0.100 0.693 4.775 119 2.305 0.441 0.016 0.590
Ensemble 3 −0.001 0.591 4.399 145 2.132 0.423 0.019 0.591
Ensemble 4 0.060 0.616 4.771 116 2.051 0.409 0.015 0.591
Ensemble 5 0.226 0.900 4.773 120 2.994 0.460 0.018 0.592
Ensemble 6 0.212 0.992 5.055 104 3.115 0.424 0.014 0.590
Ensemble 7 0.069 0.653 4.273 142 2.425 0.398 0.018 0.589
Ensemble 8 −0.036 0.608 4.613 137 2.092 0.370 0.017 0.591
Ensemble 9 0.147 0.813 4.713 123 2.737 0.436 0.017 0.591
Ensemble 10 0.182 0.857 5.001 106 2.721 0.396 0.016 0.591
Table 13:

60-Min Ensemble of Regression Models Metrics on Testing Set.

Model type Profit per share ($) Returns (%) Risk Trades Sharpe ratio Precision Recall Accuracy
Ensemble 1 0.151 0.316 1.449 75 3.464 0.473 0.012 0.599
Ensemble 2 0.047 0.269 1.332 68 3.210 0.468 0.011 0.598
Ensemble 3 0.276 0.327 1.677 105 3.098 0.531 0.019 0.601
Ensemble 4 0.184 0.295 1.546 72 3.031 0.484 0.012 0.599
Ensemble 5 0.077 0.238 1.427 83 2.653 0.509 0.015 0.600
Ensemble 6 0.069 0.248 1.513 68 2.605 0.515 0.013 0.600
Ensemble 7 0.194 0.210 1.372 89 2.427 0.428 0.013 0.598
Ensemble 8 0.112 0.244 1.625 96 2.382 0.539 0.018 0.601
Ensemble 9 0.052 0.218 1.465 80 2.366 0.513 0.014 0.599
Ensemble 10 −0.165 0.238 1.773 66 2.135 0.537 0.013 0.599

4.8 Summary

The figures and tables provided in the paper enable further investigation of the proposed models, as the performance of regression models is evaluated against that of language models. The metrics of Sharpe ratio and trade count – typically used in this area – are utilized for the assessments. In addition to the models, the exit strategy is an important factor to consider. The time-series predictors are dependent on exit strategies, and the data presented in the figures and tables compare the 60-min and 240-min exit strategies, with the 60-min exit strategy performing better for the regression models. As models are ensembled, the performance of selected ensembles is included for further analysis.

Table 14:

Recommended Models.

Model Type Sharpe valid Sharpe test
NBSVM Language 7.602 10.990
PARAGRAPH Language 5.025 7.058
NBSVM PARAGRAPH Language 6.648 11.653
NBSVM RNNLM PARAGRAPH Language 6.946 8.067
FWD 0 RNN 6 Dense and RNN 4 LSTM 1 Dense Regression 2.285 3.464
RNN 4 LSTM 1 Dense and RNN 1 GRU 1 Dense and RNN 7 LSTM 0 Dense Regression 2.305 3.210
RNN 1 GRU 1 Dense and SVR Regression 2.132 3.098

The event clustering algorithm does well in generating events. Both language and regression models achieve positive Sharpe ratios. Language models are better predictors and more profitable than regression models; they also have a larger training set available, since they do not require financial data for predictions. Regression model performance is strengthened by taking advantage of ensembles. Table 14 lists the recommended models along with their reported Sharpe ratios. Recommendations reflect the evaluation metrics and the distribution of returns. Lastly, language models are favored, not only due to greater Sharpe ratios but also because their performance is reported over a larger sample of trading decisions. Reporting performance over many trades strengthens confidence in the model’s consistency.

5 Conclusions

5.1 Discussion

Journalists scour dealings and stories to find noteworthy news headlines. The implemented event clustering algorithm sequentially assembles tweets with similar text, creating topical events as clusters. Gathering only related tweets under specific time restrictions helps identify developing stories. Furthermore, filters reduce junk by removing retweets and tweets from some automated bots. A keyword-based similarity clustering technique applied to a total of 1,557,811 tweets forms 35,453 events. As such, the clustering technique is useful in generating Twitter events that resemble news headlines.

The majority of generated events have an associated time stamp of 9:30 am EST when financial markets open. Therefore, prices have not had a chance to incorporate any information found in events. By making decisions before trading hours, we avoid pre-news concerns.

To address lagged news, we use an approach of merging events, purging any matching cluster that arrives within a 4-h time frame of a saved event. For instance, in this study, 32,258 cases are merged, which is substantial considering that only 35,453 events are created. A key benefit of merging is avoiding repeated purchases on a single story; multiple transactions would have caused serial correlation in returns, which was previously shown to overstate Sharpe ratios. Furthermore, if a merged event arrives a few hours late, the price influence may already have been exhausted.

In total, seven models are recommended, as listed in Table 14. The advised models have Sharpe ratios ranging from 2.1 to 7.6 on the validation set and 3.1 to 11.7 on the testing set. Text models are preferred over price-based models.

5.2 Contributions

In summary, the research generates events from social content, predicts trading decisions on those events, and evaluates learning algorithms to produce recommendations for trading strategies. New media for delivering information to the public continue to emerge, and in order to adapt, event analysis and computational finance need to promote the generation of events from unstructured data. Most prior work looked at structured data for events, such as earnings announcements. Now, techniques to identify entities and cluster related events from casual conversation can move the research forward to adjust to current and future social media content. Furthermore, machine-generated actionable trading decisions may improve as natural language processing and time-series classification techniques advance.

Bibliography

[1] J. Bollen, H. Mao and X. Zeng, Twitter mood predicts the stock market, J. Comput. Sci. 2 (2011), 1–8. doi:10.1016/j.jocs.2010.12.007.

[2] W. S. Chan, Stock price reaction to news and no-news: drift and reversal after headlines, J. Financ. Econ. 70 (2003), 223–260. doi:10.1016/S0304-405X(03)00146-6.

[3] E. Chan, Quantitative Trading: How to Build Your Own Algorithmic Trading Business, John Wiley & Sons, Hoboken, New Jersey, 2008.

[4] J. Davis and M. Goadrich, The relationship between precision-recall and ROC curves, In: Proceedings of the 23rd International Conference on Machine Learning (ACM), pp. 233–240, 2006. doi:10.1145/1143844.1143874.

[5] E. F. Fama, The behavior of stock-market prices, J. Bus. 38 (1965), 34–105. doi:10.1086/294743.

[6] GitHub. (2017). https://github.com.

[7] A. Groß-Klußmann and N. Hautsch, When machines read the news: using automated text analytics to quantify high frequency news-implied market reactions, J. Empir. Financ. 18 (2011), 321–340. doi:10.1016/j.jempfin.2010.11.009.

[8] Q. V. Le and T. Mikolov, Distributed representations of sentences and documents, In: Proceedings of the International Conference on Machine Learning (ICML), JMLR, pp. 1188–1196, 2014.

[9] D. Leinweber and J. Sisk, Event driven trading and the ‘new news’, J. Portfolio Manage. 38 (2011), 110–124. doi:10.3905/jpm.2011.38.1.110.

[10] A. W. Lo, The statistics of Sharpe ratios, Financ. Anal. J. 58 (2002), 36–52. doi:10.2469/faj.v58.n4.2453.

[11] A. C. MacKinlay, Event studies in economics and finance, J. Econ. Lit. 35 (1997), 13–39.

[12] M. Makrehchi, S. Shah and W. Liao, Stock prediction using event-based sentiment analysis, In: Web Intelligence and Intelligent Agent Technologies, IEEE, pp. 337–342, 2013. doi:10.1109/WI-IAT.2013.48.

[13] T. Mikolov, M. Karafiát, L. Burget, J. Cernocky and S. Khudanpur, Recurrent neural network based language model, In: INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Chiba, Japan, pp. 1045–1048, September 2010. doi:10.21437/Interspeech.2010-343.

[14] A. K. Nassirtoussi, S. Aghabozorgi, T. Y. Wah and D. C. L. Ngo, Text mining for market prediction: a systematic review, Expert Syst. Appl. 41 (2014), 7653–7670. doi:10.1016/j.eswa.2014.06.009.

[15] O. Owoputi, B. O’Connor, C. Dyer, K. Gimpel, N. Schneider and N. A. Smith, Improved part-of-speech tagging for online conversational text with word clusters, In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2013.

[16] G. Rachlin, M. Last, D. Alberg and A. Kandel, ADMIRAL: a data mining based financial trading system, In: Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), IEEE, pp. 720–725, 2007. doi:10.1109/CIDM.2007.368947.

[17] E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis and A. Jaimes, Correlating financial time series with micro-blogging activity, In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (ACM), pp. 513–522, 2012. doi:10.1145/2124295.2124358.

[18] R. P. Schumaker and H. Chen, Textual analysis of stock market prediction using breaking financial news: the AZFin Text system, ACM Trans. Inform. Syst. 27 (2009), 12. doi:10.1145/1462198.1462204.

[19] W. F. Sharpe, The Sharpe ratio, J. Portfolio Manage. 21 (1994), 49–58. doi:10.3905/jpm.1994.409501.

[20] A. Soni, N. J. van Eck and U. Kaymak, Prediction of stock price movements based on concept map information, In: IEEE Symposium on Computational Intelligence in Multicriteria Decision Making (IEEE), pp. 205–211, 2007. doi:10.1109/MCDM.2007.369438.

[21] H. Sul, A. R. Dennis and L. I. Yuan, Trading on Twitter: the financial information content of emotion in social media, In: 2014 47th Hawaii International Conference on System Sciences (HICSS), IEEE, pp. 806–815, 2014.

[22] P. C. Tetlock, M. Saar-Tsechansky and S. Macskassy, More than words: quantifying language to measure firms’ fundamentals, J. Financ. 63 (2008), 1437–1467. doi:10.1111/j.1540-6261.2008.01362.x.

[23] S. Wang and C. D. Manning, Baselines and bigrams: simple, good sentiment and topic classification, In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pp. 90–94, 2012.

Received: 2017-11-01
Accepted: 2018-06-17
Published Online: 2018-07-12

©2020 Walter de Gruyter GmbH, Berlin/Boston

This work is licensed under the Creative Commons Attribution 4.0 Public License.
