Last week was the long weekend for National Day. I didn’t have any outdoor activities because of the final exams of my Master’s program and the typhoon. However, I started deploying my project to the Google Cloud Platform in my spare time.
Previously, I wrote the article Why Is A Portfolio Website Important For A Data Scientist to explain why I think side projects are important for data scientists. The largest of my side projects is stock price analysis and prediction using Python and machine learning models. It started at the end of 2020, went through preliminary planning and many rounds of technical trial and error, and finally launched in June this year, during the worst period of COVID-19. Since then it has been running through an ongoing PDCA cycle. This project has allowed me to grow substantially as a professional, and combining my professional skills with my investing has been one of the smartest life decisions I have made recently.
Side Projects Can Go Further With Monetization Potential
The sense of accomplishment from finishing a side project is important, and so are the experience and expertise gained along the way. Still, a project can go even further if it yields additional value, such as money. A common way to monetize a side project is to offer free web tools to attract traffic and then run advertisements for profit. But getting to that level means putting effort into areas outside data science, such as running a website and social media. I’m bold enough to call stock forecasting the best kind of side project because it can be monetized while building experience in both data analytics and investing. Since investing and programming have both become mainstream, combining the two learning paths seems natural and logical. Of course, the learning cost is high, but like a climb over the mountains, the difficulty is part of the appeal.
Two features of the stock price prediction project are noteworthy. First, from a forecasting viewpoint, predictions for the near future should be more accurate than those for the distant future. For example, if we predict product sales for tomorrow and for seven days from now at the same time, the former should theoretically be more accurate. When applied to stock investment, this pushes us toward a short-term trading style.
Secondly, the interaction between data and financial markets results in a paradox: machine learning works better for near-term predictions, but in investment markets, if the time horizon is too short, the predictions are easily disturbed by short-term volatility, which affects the judgment of the long-term trend.
The two seem to contradict each other, and I’m not sure what kind of adjustment I should make for them. Perhaps I should just regard the contradiction philosophically and do nothing about it. In practice, it does not affect the planning and implementation of the project.
At work, before starting each data analysis project, we used to keep asking ourselves, “What is the final action of this project?” Most of the time, a project that does not lead to action is worthless. Stock market analysis, I think, is a particularly suitable place for data scientists to practice their skills because its results can be acted on, optimized, and turned into profit.
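The first point, that nearer forecasts tend to be more accurate, can be illustrated with a small simulation. This is only a sketch under strong assumptions: the price series is a synthetic random walk rather than real market data, the "model" is the naive forecast that the future price equals today's price, and the function names are made up for this example.

```python
import math
import random


def simulate_random_walk(n_days, daily_sigma=1.0, start=100.0, seed=42):
    """Synthetic price series: each day adds independent Gaussian noise."""
    rng = random.Random(seed)
    prices = [start]
    for _ in range(n_days):
        prices.append(prices[-1] + rng.gauss(0.0, daily_sigma))
    return prices


def naive_forecast_rmse(prices, horizon):
    """RMSE of the naive forecast 'price in `horizon` days equals today's price'."""
    errors = [prices[t + horizon] - prices[t] for t in range(len(prices) - horizon)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


prices = simulate_random_walk(5000)
rmse_1 = naive_forecast_rmse(prices, horizon=1)
rmse_7 = naive_forecast_rmse(prices, horizon=7)
print(f"1-day RMSE: {rmse_1:.2f}, 7-day RMSE: {rmse_7:.2f}")
```

For a random walk, the error of this naive forecast grows roughly with the square root of the horizon, so the seven-day error comes out clearly larger than the one-day error. Real prices and real models behave differently, but the direction of the effect is the same.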
Practice Allocating Resources Like A Business Operator
The biggest difference between business operators and employees is that operators have the right to allocate the largest amount of resources, including human resources, financial resources, and hardware resources. Therefore, when a side project has profit potential, we can switch to the operator’s perspective to evaluate whether to increase investment, how to allocate it across projects, and even how to conduct a risk assessment and ROI calculation. I believe this is very helpful for strategic thinking.
For the stock prediction project, I have made two investments: a subscription to the Taiwan Economic Journal (TEJ) database and the use of Google Cloud Platform.
Learning to leverage professional partners was one of the most rewarding aspects of my recent growth. There is a lot of free data on the web, but most of it must be collected manually or through a web crawler, and both require a large investment of time. Web crawlers may seem convenient, but I think crawling is still a highly labor-intensive task because the website’s structure may change and the server may block the crawler. Dealing with these occasional mishaps takes a lot of effort, and that’s just the first step in the project. Time is the most important of all resources, and buying the data directly from someone else solves the problem once and for all. In addition, some open data, such as financial reports, is released as web pages or PDFs, and working out by yourself how to turn those tables into analyzable data can be a long and desperate road. I’ve tried.
My computer is a Mac Pro, which normally operates at around 65°C. The picture below shows the temperature of my computer while running models: CPU usage was at 100%, and the temperature was close to an alarming 100°C. Of course, overheating can burn out the motherboard or make the battery swell. Both are dangerous, and neither is a situation we want to see. Still, it is an obstacle we have to get past on the road to personal practice.
There is an added benefit to using a paid cloud platform. We become more demanding about the quality and performance of our code because these platforms usually charge for traffic or computation time. The longer the computation time, the higher the cost. At work, when time is short, we usually don’t optimize performance very often. For example, if a program needs to run for eight hours, we simply start it the day before. But when time is money, and the money comes out of our own pocket, we will find new ways to optimize even if we have to rack our brains.
Big data analysis requires hardware resources and computing power that a home computer can’t provide. Even on a desktop computer with good heat dissipation, the computing time may still be disastrously long, especially for some modern deep learning algorithms, which become practically impossible to run. Fortunately, cloud platforms such as GCP, AWS, and Azure are now convenient, full-featured, and affordable for the average person, so computing power has become almost commonplace. As for CPUs, the common specification on the market now is 4-8 cores, but on a cloud platform, we can switch to more powerful machines at any time if needed. At my current usage, the monthly cost is around 3,000 NTD, but if a side project is not profitable, that 3,000 NTD is an expense rather than an investment. So I always think it over again and again before I swipe my credit card.
If you want to use online opinion data for predictive models, many companies offer this kind of web crawler service, such as Octoparse and Parsehub. Just pay a fee, and you can leverage professional partners to collect the data and load it into your database so that you can focus on more worthwhile work, such as public opinion analysis, sentiment analysis, and natural language processing. But natural language is also a deep problem, especially in Chinese, where segmenting text into words is more complex than in English. If you don’t want to plunge into this bottomless abyss, you may want to leverage professional partners again, such as Google and AWS, which both provide natural language analysis services that are easy to use with just a few lines of code. However, all of the above are paid services. That brings us back to the main point of this article: a side project with monetization potential can go further and gives us more chances to get in touch with more diverse fields of analysis.
Whether it is reasonable to use media content and social media data as a basis for investment is beyond the scope of this article. I can only say that it is technically feasible.
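To make the "expense or investment" question concrete, here is a minimal break-even sketch. The 3,000 NTD monthly cloud cost comes from my usage above; the invested capital and the monthly profit are purely hypothetical numbers chosen for illustration, and the function names are made up for this example.

```python
def breakeven_monthly_return(monthly_cost_ntd, capital_ntd):
    """Monthly return on capital needed just to cover the fixed monthly cost."""
    if capital_ntd <= 0:
        raise ValueError("capital_ntd must be positive")
    return monthly_cost_ntd / capital_ntd


def simple_roi(monthly_profit_ntd, monthly_cost_ntd):
    """Net gain per NTD spent: (profit - cost) / cost."""
    if monthly_cost_ntd <= 0:
        raise ValueError("monthly_cost_ntd must be positive")
    return (monthly_profit_ntd - monthly_cost_ntd) / monthly_cost_ntd


CLOUD_COST_NTD = 3_000          # monthly cloud cost mentioned above
HYPOTHETICAL_CAPITAL = 500_000  # assumption: invested capital, not a real figure
HYPOTHETICAL_PROFIT = 4_500     # assumption: monthly trading profit, not a real figure

required = breakeven_monthly_return(CLOUD_COST_NTD, HYPOTHETICAL_CAPITAL)
roi = simple_roi(HYPOTHETICAL_PROFIT, CLOUD_COST_NTD)
print(f"Break-even monthly return: {required:.2%}")
print(f"ROI on cloud spending: {roi:.0%}")
```

With these made-up numbers, the project has to earn 0.6% on capital every month just to pay for the cloud bill. Other costs, such as a data subscription, would raise that bar further.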
Expanding Software Development Skills
At work, data analysts, data scientists, data engineers, and so on usually stay within their own roles because of the division of labor in the organization, and sometimes they can’t see the wood for the trees. However, many new challenges appear when a side project becomes large enough, such as cloud platform management, databases, and connectivity issues. These tasks may be beyond a data scientist’s remit in the narrow sense, but for a person who wants to see the whole wood, I can’t think of a more suitable project for a data scientist to practice on.
Some think that because the stock market is made up of people, and human behavior is unpredictable, the stock market is also unpredictable. As a result, they do not believe quantitative trading works. The main difference between the two sides lies in what is expected of a prediction: a prediction is not a holy grail that can forecast prices without error. Errors are common, and complete misjudgments happen often. It is like reading the daily weather forecast: people do not demand that the forecast highs and lows be perfectly accurate, because that would be unreasonable and impractical. Therefore, beyond the technical side, the first problem we must face is how to use the prediction results, within their possible error range, to find the most suitable entry and exit timing.
The human brain is probably more accurate than the computer when major, unexpected, hard-to-quantify financial events or black swan events occur, such as the worst of the COVID-19 epidemic, when U.S. stocks triggered four circuit breakers in 10 days. We can fully expect the model to fail in such a case, and it is indeed more appropriate to let human judgment override the model, but this is an extreme case. The second problem we must face is whether we should continue to trust the models’ predictions when other, smaller events occur.
Prediction models do not always work, but in general, as long as a model’s predictions are more accurate than human predictions or significantly reduce the time spent on manual data research, it is a viable method. Apart from trading bots, data analytics is only an assistive tool, and people still make the final decision. Human-machine synergy is the trend of the AI era. The healthiest mindset is to think of the prediction model as a colleague with great ability, occasional problems, and poor communication skills. Learning how to work with this colleague takes some experience. Even though his work performance is good, he may still get things wrong for lack of domain knowledge. We can discuss this in a new post.
Finally, I would like to recommend TEJ’s database service. This is not an advertisement, and TEJ does not seem to have referral codes yet, so this is purely a recommendation from a user. Many international companies provide data services for U.S. stocks or cryptocurrencies, but the market for Taiwan stocks is much smaller. Out of a hundred people, there may not be even one who needs this service, and it is not easy to survive in such a small, niche market. I hope I can recommend this good service to everyone, and I hope TEJ can operate for a long time, so I don’t have to check the health of my web crawler every day.
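One simple way to picture "using the prediction within its error range" is to act only when the predicted move is larger than the error we expect from the model. The sketch below is just an illustration of that idea, not the rule I actually trade with; the function name, the inputs, and the idea of a single fixed error band are all assumptions.

```python
def trade_signal(current_price, predicted_price, expected_error):
    """Return 'buy', 'sell', or 'hold' by comparing the predicted move
    with the error we expect from the model (same unit as the price)."""
    if expected_error < 0:
        raise ValueError("expected_error must be non-negative")
    predicted_change = predicted_price - current_price
    if predicted_change > expected_error:
        return "buy"   # predicted rise is larger than the typical error
    if predicted_change < -expected_error:
        return "sell"  # predicted fall is larger than the typical error
    return "hold"      # predicted move is within the noise of the model


# Hypothetical prices, with an expected model error of 3
signals = [
    trade_signal(100, 105, 3),
    trade_signal(100, 102, 3),
    trade_signal(100, 95, 3),
]
print(signals)
```

In practice the expected error could come from the model’s historical backtest errors, and the threshold would also need to account for transaction costs.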