Equity Market Sentiment Signal - NLP on News Text Data

juliamtw20
May 21, 2024

Updated: Jun 5, 2024

Project Objective and Overview

The objective of this project is to predict, from news content available on a Friday, how the market will move by the following Monday. Simply put, we want to predict whether the target index will go up or down by analyzing the sentiment of the news.

Our approach starts from the news text data, on which we run both sentiment analysis and text classification to obtain a sentiment and a set of labels for each item. We then build factors from these outputs and feed them to our NN model, which captures their features. Finally, we use cosine similarity to find similar feature patterns in the past, and these past cases serve as the reference for our final prediction of the market movement.
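The similarity lookup in the last step can be sketched as follows. This is a minimal illustration, not the project's code: the feature vectors, the week labels, the `most_similar_weeks` and `vote_direction` helpers, and the majority-vote rule are all assumptions made for the example, since the post does not specify how the similar past cases are combined into a prediction.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def most_similar_weeks(current, history, top_k=3):
    """Return the top_k past entries ranked by similarity to `current`.

    `history` is a list of (label, feature_vector, next_move) tuples.
    """
    scored = [
        (cosine_similarity(current, vec), label, move)
        for label, vec, move in history
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


def vote_direction(neighbors):
    """Majority vote over the neighbors' next moves (an assumed rule)."""
    ups = sum(1 for _, _, move in neighbors if move == "up")
    return "up" if ups * 2 > len(neighbors) else "down"


# Hypothetical feature vectors (e.g. NN outputs) with the move that followed.
history = [
    ("2023-W01", [0.9, 0.1, 0.3], "up"),
    ("2023-W02", [0.1, 0.8, 0.2], "down"),
    ("2023-W03", [0.8, 0.2, 0.4], "up"),
    ("2023-W04", [0.2, 0.9, 0.1], "down"),
]
current = [0.85, 0.15, 0.35]

neighbors = most_similar_weeks(current, history, top_k=3)
prediction = vote_direction(neighbors)
print(prediction, [label for _, label, _ in neighbors])
```

Cosine similarity compares the direction of two vectors and ignores their length, so two weeks with the same mix of features count as similar even if one had far more news.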

Data Overview

We extracted our data from TDM Studio. We collected around 30,000 news items, and the raw data includes the date, the news content, and the publisher. The data runs from 2013-01-01 to the end of March 2024, a little over 11 years in total. The number of news items declines over the years, which we guess might be caused by the transformation of the news industry.

The news comes from two publishers, WSJ and The New York Times, with around 10k items from WSJ and 20k from The New York Times. The chart below shows the number of news items per month over the period covered.

Sentiment

We fed our raw data into a pretrained sentiment analysis model to obtain a sentiment for each item. The output indicates whether a news item takes a positive, negative, or neutral perspective. As we see here, 54% of the news is neutral, and positive items outnumber negative ones.

Sentiment Model

Before we look at the next labels, let’s take a look at our sentiment model. We took this DistilRoBERTa-based model from Hugging Face. It adds several layers on top of the DistilRoBERTa model to classify sentiment. Its authors trained it on around 5,000 text samples and report an accuracy of 0.98.

Classification Data and Model

Our next labels are region, asset class, and industry. As we can see here, some of the labels were not assigned to any news text. However, at this stage, I can’t validate whether the assigned labels are accurate. I’ll discuss some possible solutions in the Further Implementation section. Regretfully, because of the limited amount of data and the limits of our classification, our models for predicting market movement focus only on industry.

Our approach is to compute a centroid for each news text and for each keyword list, then use K nearest neighbors to assign each news item to the corresponding labels. To get the centroid of a news text, we calculate TF-IDF scores and keep the top 100 keywords of each item, then use spaCy to embed those keywords as vectors and take their centroid. The keyword lists are built manually for now. We vectorize them in the same way and take the centroid of each list.
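The pipeline described above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: `TOY_VECTORS` is a hypothetical stand-in for spaCy word vectors, the keyword lists and label names are made up, the TF-IDF formula is one common smoothed variant, and the classifier picks the single nearest label centroid, which is the K = 1 case of nearest neighbors.

```python
import math
from collections import Counter

# Hypothetical 2-d word vectors standing in for spaCy embeddings.
TOY_VECTORS = {
    "oil": [1.0, 0.0], "drilling": [0.9, 0.1], "crude": [0.95, 0.05],
    "bank": [0.0, 1.0], "loan": [0.1, 0.9], "credit": [0.05, 0.95],
}


def top_tfidf_keywords(doc_tokens, corpus_tokens, top_n=100):
    """Rank the terms of one document by TF-IDF and keep the top_n."""
    n_docs = len(corpus_tokens)
    term_counts = Counter(doc_tokens)
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for doc in corpus_tokens if term in doc)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed IDF
        scores[term] = (count / len(doc_tokens)) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


def centroid(words):
    """Mean vector of the words that have an embedding, or None."""
    vectors = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def nearest_label(doc_centroid, label_centroids):
    """Label whose centroid is closest in Euclidean distance (K = 1)."""
    return min(
        label_centroids,
        key=lambda label: math.dist(doc_centroid, label_centroids[label]),
    )


# Hypothetical manually built keyword lists, one per industry label.
keyword_lists = {
    "Energy": ["oil", "crude", "drilling"],
    "Financials": ["bank", "loan", "credit"],
}
label_centroids = {label: centroid(ws) for label, ws in keyword_lists.items()}

corpus = [
    ["crude", "oil", "prices", "rise", "crude", "drilling"],
    ["bank", "loan", "growth", "slows", "credit", "bank"],
]
keywords = top_tfidf_keywords(corpus[0], corpus, top_n=3)
predicted = nearest_label(centroid(keywords), label_centroids)
print(keywords, predicted)
```

Words without an embedding are skipped when averaging, so a document whose top keywords are all out of vocabulary yields no centroid and would need separate handling.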

Exploratory Data Analysis (EDA)

The sentiment distribution by source is shown here. The New York Times publishes a large share of neutral news, and both sources have more positive news than negative news. This leads to an overall mean sentiment score of 63, where 50 means neutral.

Next is the correlation between the sentiment score and the index return. The indices we use here are all from the MSCI World Sector Indices.

We calculated the correlation by week and by month. We observed that industries with more data keep the same direction of correlation at both frequencies, and that the monthly figures show more negative correlations. We assume that the correlation between these two factors should be positive, so this suggests that the market reacts to our sentiment score over a horizon shorter than a month.
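A comparison of this kind can be sketched as below. The records are made-up toy values, and the aggregation rule (mean sentiment score and summed daily return per period) is an assumption chosen for the example; the post does not state how scores and returns were aggregated.

```python
import math
from collections import defaultdict
from datetime import date


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one series is constant
    return cov / math.sqrt(var_x * var_y)


def aggregate(records, key_fn):
    """Group (day, score, return) records by period.

    Returns the mean sentiment score and the summed return per period,
    in chronological order of the period keys.
    """
    buckets = defaultdict(lambda: ([], []))
    for day, score, ret in records:
        scores, rets = buckets[key_fn(day)]
        scores.append(score)
        rets.append(ret)
    keys = sorted(buckets)
    mean_scores = [sum(buckets[k][0]) / len(buckets[k][0]) for k in keys]
    total_rets = [sum(buckets[k][1]) for k in keys]
    return mean_scores, total_rets


def week_key(day):
    iso = day.isocalendar()
    return (iso[0], iso[1])  # ISO year and ISO week number


def month_key(day):
    return (day.year, day.month)


# Toy records: (date, sentiment score on a 0-100 scale, daily index return).
records = [
    (date(2024, 1, 2), 60, 0.01),
    (date(2024, 1, 3), 70, 0.02),
    (date(2024, 1, 9), 40, -0.01),
    (date(2024, 2, 6), 55, 0.00),
    (date(2024, 2, 7), 65, 0.01),
]

weekly_scores, weekly_rets = aggregate(records, week_key)
monthly_scores, monthly_rets = aggregate(records, month_key)
weekly_corr = pearson(weekly_scores, weekly_rets)
monthly_corr = pearson(monthly_scores, monthly_rets)
print(weekly_corr, monthly_corr)
```

Even in this toy data the sign differs between the weekly and monthly aggregates, which illustrates why the choice of horizon matters when reading these correlations.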

The correlation is also not high enough on its own, so we decided to put more factors into our predictive model to reach higher accuracy.

Predictive Model Design

Further Implementation

Because we lack data labeled with industries, regions, and asset classes, we couldn’t validate our classification outcomes. To train a model for these custom needs, I suggest gathering data with the specified labels. With such data, we can train the model to achieve higher accuracy.

I suggest two approaches. The first is to focus on feature engineering to better capture the keywords in the news, then feed them into a simpler model, such as a logistic regression model, for classification and optimize it on the training data. This approach costs less time and fewer computing units.

The second is to fine-tune a RoBERTa model on the news text, which could capture the contextual semantics of long articles. This could potentially achieve higher accuracy, but it might consume more time and computing units.

As for the predictive model, the research currently covers only industries because of time and data limitations. With more data gathered, I think the same logic could easily be applied to different regions and asset classes by using the desired index as the target variable.

This is our sentiment-based predictive model for market movement. Thank you for your time, attention, and participation today. Please feel free to ask if you have any questions about our project.