Financial market forecasting is difficult for machine learning

Source: Internet
Author: User
Tags: data distribution, deep learning, machine learning, financial forecasting, reinforcement learning

Financial markets were among the earliest adopters of machine learning (ML). Since the 1980s, people have been using ML to look for patterns in the market. Although ML has had great success in other prediction tasks, recent advances in deep learning have not helped much with financial market forecasting. While deep learning and other ML techniques have finally made Alexa, Google Assistant and Google Photos possible, there has been little progress in the stock market.

I apply machine learning to real-world financial forecasting problems. Although many papers claim to have applied deep learning models successfully, I view these results with suspicion. Some models do achieve better accuracy. However, the size of the improvement is often too small to matter in practice.

Improvements in NLP have made quantitative strategies that rely on document analysis more effective. This is one of the rare benefits deep learning models have brought to financial markets.

All of this supports the view that financial markets are inherently unpredictable. There are many reasons for this, and I want to highlight some of the main ones:

Data Distribution

The problem of data distribution is critical, yet almost all research papers on financial forecasting ignore it.

We can compare financial data sets with image classification data sets to better understand this. Let us consider the CIFAR-10 data set. It includes 10 classes. There are 5000 images in the training set for each class, and 1000 images in the test set for each class.

We expect the distribution of pixel values for the dog class in the training set to be similar to the distribution for the dog class in the test set. In other words, images of dogs look like dogs in both the training set and the test set. This sounds trivially obvious: an image of a dog must contain a dog.

For most financial data sets, this obvious property does not hold. What you see in the future may be completely different from the data you see today. In fact, this is a fairly common problem when applying machine learning in the real world. In addition to ensuring that the test and training sets have similar distributions, you must ensure that a trained model is used in production only while future data follows the training/validation distribution.
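One way to make this check concrete is to compare the distribution of a feature in the training data with its distribution in recent live data. The sketch below uses the two-sample Kolmogorov-Smirnov statistic for that purpose. The function names, the 0.3 threshold and the sample returns are all assumptions made for illustration, not something taken from a real system.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every point that appears in either sample.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def looks_shifted(train_values, live_values, threshold=0.3):
    """Hypothetical guard: flag live data whose distribution has moved
    away from the training data. The threshold is an arbitrary choice."""
    return ks_statistic(train_values, live_values) > threshold


# Made-up daily returns: a training period, a similar calm period,
# and a turbulent period with much larger moves.
train_returns = [-0.01, 0.00, 0.01, 0.02, -0.02, 0.01, 0.00, -0.01]
calm_returns = [0.01, -0.01, 0.00, 0.02, -0.02, 0.00, 0.01, -0.01]
turbulent_returns = [-0.08, 0.07, -0.06, 0.09, -0.10, 0.05, -0.07, 0.08]

print(looks_shifted(train_returns, calm_returns))       # False
print(looks_shifted(train_returns, turbulent_returns))  # True
```

A check like this only tells you that the inputs have drifted. It does not tell you whether the model is still useful, so it is best treated as a warning signal rather than a decision rule.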

Although most researchers take care not to introduce look-ahead bias into their studies, almost none of them acknowledge the data distribution issue.

Walk-forward optimization is one possible way to address this problem. It is well known among practitioners, but researchers often fail to mention it. However, even walk-forward optimization is not a panacea: it still makes assumptions about what the future data distribution will look like. This is why walk-forward optimization does not really give you high accuracy. It is merely practical.
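The forward (walk-forward) optimization idea above can be sketched as a generator of rolling train/test windows. This is a minimal illustration that assumes observations are ordered by time and indexed from 0. The function name and window sizes are hypothetical.

```python
def walk_forward_splits(n_samples, train_size, test_size):
    """Yield (train_indices, test_indices) pairs that roll forward in time.

    Each model is fit on a window of past observations and evaluated only
    on the observations that immediately follow it.
    """
    if train_size <= 0 or test_size <= 0:
        raise ValueError("train_size and test_size must be positive")
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        train_idx = list(range(start, train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx
        start += test_size  # slide the window forward by one test block


for train_idx, test_idx in walk_forward_splits(n_samples=10, train_size=4, test_size=2):
    print(train_idx, test_idx)
```

With these toy sizes the loop produces three splits, and in every split the test indices come strictly after the training indices. The implicit assumption is still that the next test block resembles the most recent training window.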

Small Sample Sizes

We often need to make predictions from small data sets. One example is labor statistics, such as the unemployment rate and non-farm payrolls. You get one data point per month, and there is not enough history. An extreme example is the financial crisis: there is only one data point for us to learn from.

This makes it very difficult to apply automated learning methods. One approach many people end up taking is to combine low-frequency statistics with relatively high-frequency data. For example, you can combine non-farm payrolls with daily stock returns and feed the combined data set to the model. However, a lot of supervision is often required to remove doubts about the quality of the model.
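Combining a monthly statistic with daily returns usually means attaching, to each day, the latest figure that had been published by then. The sketch below shows one way to do this with an "as of" lookup. The dates and values are made up, the function names are hypothetical, and it assumes a figure is usable on its publication date.

```python
from bisect import bisect_right
from datetime import date


def latest_release_as_of(release_dates, release_values, day):
    """Return the most recent value published on or before `day`,
    or None if nothing had been published yet."""
    position = bisect_right(release_dates, day)
    return release_values[position - 1] if position else None


def combine(daily_returns, releases):
    """Attach the latest known monthly figure to every daily return.

    daily_returns: list of (date, return) pairs
    releases: list of (publication_date, value) pairs
    """
    releases = sorted(releases)
    release_dates = [d for d, _ in releases]
    release_values = [v for _, v in releases]
    return [
        (day, ret, latest_release_as_of(release_dates, release_values, day))
        for day, ret in daily_returns
    ]


# Made-up numbers for illustration only.
daily_returns = [
    (date(2020, 1, 9), 0.004),
    (date(2020, 1, 10), -0.002),
    (date(2020, 2, 6), 0.003),
    (date(2020, 2, 7), 0.001),
]
monthly_releases = [(date(2020, 1, 10), 150.0), (date(2020, 2, 7), 210.0)]

for row in combine(daily_returns, monthly_releases):
    print(row)
```

Keying the lookup on the publication date rather than the month the figure describes keeps future information out of past rows. Note that the combined data set still contains only one genuinely new low-frequency observation per month, so the small-sample problem does not go away.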

Unquantifiable Data

Some might argue that our financial history goes back as far as human history itself. Unfortunately, it is difficult to convert that history into quantified data in a form an algorithm can understand. For example, even though we have a good general understanding of what happened during the Great Depression of the 1930s, it is difficult to translate it into a form that an automated learning process can use.

It's Quite Complex

A variety of factors drive prices at different scales:

  • High-frequency trading and algorithmic trading are the main drivers of price in the short term (less than 1 day);

  • The open and the close each have their own patterns, in both stocks and futures, the two asset classes I work with;

  • News and rumors are the driving force over multi-day horizons. Company-specific news can arrive at any time without prior notice. However, the timing of certain events is known in advance, such as scheduled company reports and economic data releases;

  • Value investing and economic cycles matter most for price changes over multiple years.

Ensembles of experts can be used to combine models that operate at different time scales, but this is a hard problem in its own right. (Note that ensembling is a very common technique for combining models at the same scale, and it is used by almost all quantitative asset management firms.)
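For models at the same scale, combining them can be as simple as a weighted average of their forecasts. The sketch below is only an illustration of that idea. The model names, forecasts and weights are invented, and real systems typically choose the weights far more carefully.

```python
def combine_forecasts(forecasts, weights=None):
    """Weighted average of per-model forecasts for the same target.

    forecasts: dict mapping model name -> forecast value
    weights: optional dict mapping model name -> non-negative weight;
             defaults to equal weights
    """
    if not forecasts:
        raise ValueError("need at least one forecast")
    if weights is None:
        weights = {name: 1.0 for name in forecasts}
    total = sum(weights[name] for name in forecasts)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(forecasts[name] * weights[name] for name in forecasts) / total


# Hypothetical next-day return forecasts from three models.
forecasts = {"momentum": 0.010, "mean_reversion": -0.004, "news_sentiment": 0.006}

print(combine_forecasts(forecasts))
print(combine_forecasts(forecasts, {"momentum": 2.0, "mean_reversion": 1.0, "news_sentiment": 1.0}))
```

This works because all three forecasts refer to the same horizon. Once the models operate at different scales, such as intraday versus multi-year, their outputs are no longer directly comparable, which is what makes that case hard.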

Partially Observable Markov Decision Process

I like to think of a price time series as a partially observable Markov decision process (POMDP). No one has the complete picture at any point in time. You don't know what will happen tomorrow, but you still have to make a trading decision. You have very little information to go on. At the same time, the distribution of the data is constantly changing.

I have tried to apply reinforcement learning (RL) methods to financial problems. Even after simplifying the problem (i.e., the state and action spaces), I could not learn anything useful. I spent a few weeks debugging why it didn't work. It turned out that RL algorithms need the environment to be sufficiently predictable.

Similarities to Recommender Systems

ML can be applied to a very wide range of fields. Of all of these, I find recommender systems (Recsys) to be the closest to financial forecasting. Comparing the two helps to show how difficult the underlying problem is.

  • Both have relatively low accuracy. Consider Netflix as an example. Netflix shows at least 20 movie options on its home page. So for each suggestion, the average probability that the user chooses to watch that movie is less than 1/20. It is "less than" because the user may simply leave without watching anything. Similarly, the accuracy on most binary classification problems in financial time series hovers around 50%.

  • Both kinds of data are very noisy. In both cases, the signal-to-noise ratio is low. Financial time series are noisy because many different factors affect the price. Recsys data sets are noisy because users often browse aimlessly: a user may visit a particular Amazon product page with no intention of ever buying that kind of product, and this ends up adding noise.

  • Both data sets exhibit seasonality. Amazon's purchase patterns (i.e., the distribution of products sold) differ from one time of year to another. The same applies to other Recsys problems, such as interest in movies and the choice of YouTube videos, which depend on the time of year. Financial data is also seasonal, and the most common form of seasonality is the economic cycle.

  • Both must deal with unseen events or items. Amazon keeps adding new items to its catalog, Netflix keeps adding new titles, and new videos are uploaded to YouTube every minute. A recommender system must solve this problem: how to recommend items that were not part of the training set. As mentioned in the data distribution section, financial data can contain events that are completely different from anything available during model training.

  • Both must model different types of data together. YouTube has discrete features, such as the list of the last N videos watched, as well as continuous features, such as the watch duration of the last video. Similarly, financial data sets can consist of higher-frequency prices and lower-frequency economic figures.
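The low-accuracy point in the list above is easy to put into numbers. The sketch below computes the 1/20 baseline from the Netflix example and, as an illustrative assumption of my own, uses a normal approximation to ask how far an observed hit rate is from a 50% coin flip. The sample sizes are made up.

```python
import math


def random_pick_precision(options_shown):
    """Upper bound on per-suggestion precision if the user picks at most
    one of the displayed options uniformly at random."""
    return 1.0 / options_shown


def z_score_vs_coin_flip(correct, total):
    """How many standard errors the observed hit rate is above 50%,
    using the normal approximation to the binomial distribution."""
    standard_error = math.sqrt(0.25 / total)
    return (correct / total - 0.5) / standard_error


print(random_pick_precision(20))          # 0.05
print(z_score_vs_coin_flip(130, 250))     # 52% on 250 predictions
print(z_score_vs_coin_flip(5200, 10000))  # 52% on 10,000 predictions
```

A 52% hit rate on 250 predictions is only about 0.63 standard errors above chance, while the same hit rate on 10,000 predictions is 4 standard errors above it. This is one reason a small edge is hard to verify on short histories.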


Closing Thoughts

If there is one thing you take away from this post, it should be this: a financial time series is a partial-information game (POMDP) that is difficult even for humans. We should not expect machines and algorithms to suddenly surpass human capabilities.

What these algorithms are good at is finding a hard-coded pattern and applying it. This is a double-edged sword: sometimes it helps and sometimes it does not. Most of the simple pattern-recognition cases have already been explored in detail. The next stage, identifying patterns in financial time series through unsupervised learning, remains a distant dream.
