Kajol Kuldeep Sarve, Student, Manthan Ashok Khandelwal, Student, Dasari Anantha Reddy, Assistant Professor
Department of Information Technology, KITS, Ramtek, Nagpur-441106.
Abstract:
Stock trading is one of the most important activities in the world of finance. Movements in the stock exchange are driven by capital gains and losses, and most people consider the stock market erratic and unpredictable. However, patterns that allow some movements to be predicted do exist, and stock market analysis is the study of these patterns. It can be regarded as the intelligent treatment of past and present financial data in order to predict the future behavior of the stock market. To build the proposed model we use big data analytics and machine learning. Big data analytics is used in many sectors for the analysis of, and prediction from, large data sets, while machine learning uses the knowledge acquired from data to make predictions. In this project we build a data pipeline that performs this analysis for data of any type and scale. Our approach integrates multiple open-source components of the Apache Hadoop ecosystem into a pipeline that takes in real-time data and processes it to produce information that supports decision making. We implement the project with the machine learning tool Apache Mahout: the data is divided into training and test sets, and a linear regression model learns from the training data and then predicts the correlation between stock prices and the behavior of the stock.
KEYWORDS: big data analytics, machine learning, Hadoop, Apache Mahout, linear regression, stock market analysis.
Introduction:
Big data refers to the generation, storage and processing of large amounts of data or information. It has become important to the growth of many different sectors. Business organizations use it extensively to derive business insights and intelligence, and it holds significant importance for the information technology and cloud computing sectors. Recently, the finance and banking sectors have used big data to track financial market activity. Big data analytics and network analytics have been used to catch illegal trading in the financial markets. Similarly, traders, big banks, financial institutions and companies have used big data to generate the trade analytics used in high-frequency trading. Big data analytics has also helped in the detection of illegal activities such as money laundering and financial fraud.
Hadoop Framework
Apache Hadoop is an open-source big data framework that provides a platform for handling large data sets through distributed storage and processing. The framework is based on the assumption that hardware failures are common, and it is therefore designed to handle such failures automatically. The ecosystem includes the Hadoop Distributed File System (HDFS), which is used to store large data sets reliably and to stream them at high bandwidth to user applications. Apache Flume, a system for moving massive quantities of streaming data into HDFS, is used to load the data into HDFS.
Machine Learning
Machine learning is a branch of science that deals with programming systems so that they automatically learn and improve with experience. Apache Mahout is a machine learning tool that gives developers access to optimized algorithms. It implements popular machine learning techniques such as recommendation, classification, and clustering, and it lets applications analyze large data sets effectively and quickly. Mahout provides a package for linear regression. This package allows the user to generate a model from training data and then apply the generated model to test data in order to calculate its accuracy and obtain other related results. A prediction model is built from information about a stock’s price in order to achieve higher accuracy.
Linear Regression
Linear regression is a machine learning algorithm based on supervised learning. It performs a regression task: it models a target prediction value based on independent variables. It is mostly used for finding the relationship between variables and for forecasting. Regression models differ in the kind of relationship they assume between the dependent and independent variables and in the number of independent variables used.
Simple linear regression is a type of regression analysis in which there is a single independent variable and a linear relationship between the independent variable (x) and the dependent variable (y).
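The definition above can be made concrete with a minimal sketch of ordinary least squares for one independent variable. It is written in plain Python for illustration and does not use Mahout; the function names and the sample values are assumptions, not data from this study.

```python
def fit_simple_linear_regression(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Evaluate the fitted line at x."""
    return intercept + slope * x


# Illustrative values only (hypothetical day index and price).
days = [1, 2, 3, 4, 5]
prices = [10.0, 12.0, 14.0, 16.0, 18.0]
intercept, slope = fit_simple_linear_regression(days, prices)
next_price = predict(intercept, slope, 6)
```

Because the sample points lie exactly on a line, the fit recovers that line; real price data would leave a residual error for every point.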
The main purpose of this paper is to help investors and companies decide which stocks to invest in, based on a set of factors. The aim of the work is to create a tool that analyses companies' stock data and uses these values to estimate, with suitable accuracy, the value a particular stock will have in the near future. Individuals or companies can inspect the predicted and analysed data to learn the financial status of companies. Companies and industries can use it to overcome their limitations and enhance their stock value. Predictions will help market dealers and investors to maximize their profit.
The stock market is full of uncertainty and is affected by many factors, so stock market prediction is one of the most important undertakings in business and finance. However, no existing application both provides accurate predictions and supports prediction over large amounts of data. This motivates us to build a tool that helps to overcome these problems.
Review of Literature
Zhihao PENG (2019) [1] proposes a robust Cloudera-Hadoop based data pipeline to perform analysis for any scale and type of data, in which selected US stocks are analyzed to predict daily gains based on real-time data from Yahoo Finance.
V Kranthi Sai Reddy et al. (2018) [2] explain the prediction of stocks using machine learning. Most stockbrokers use technical, fundamental, or time series analysis when making stock predictions. The programming language used to predict the stock market with machine learning is Python.
M.D. Jaweed et al. (2018) [3] apply Hadoop big data processing to financial analysis in order to identify the top companies whose traded volume was highest in past years. The research uses historical data from the NSE, and the data is analyzed in QlikView.
Vivek Kanade et al. (2017) [4] consider both fundamental and technical analysis. Fundamental analysis is done by applying sentiment analysis to social media data. Social media data has a higher impact today than ever, and it can be helpful in predicting the trend of the stock market. Technical analysis is done by applying machine learning algorithms to historical stock price data.
Arkilic (2017) [5] is the documentation of a project on stock prediction using Mahout and Pydoop. The project uses open-source machine learning techniques and high-performance computing tools (Hadoop with Mahout, Pydoop and scikit-learn) to predict movements of stocks (specifically Home Depot stock) over various periods of time (10, 20 and 30 years).
V. Sandhiya et al. (2017) [6] propose a novel forecasting system that combines MapReduce and a genetic algorithm to predict the stock market. In their system the genetic algorithm is used to find a forecasting function which, when given the year of prediction, generates the forecasted results. The prediction method used is a neural network based on the genetic algorithm. A neural network for prediction is a learning-based algorithm that trains itself on a given training data set.
Varunesh Nichante (2016) [7] discusses the strategies that are used for analyzing both types of information. The author proposes to build a framework that uses text mining techniques to model the response of the stock market to news articles and to predict these responses.
Aparna Nayak et al. (2016) [8] attempt to predict the stock market trend. Two models were built, one for daily prediction and the other for monthly prediction. Supervised machine learning algorithms were used to build the models. In the daily prediction model, historical prices were combined with sentiments.
Approach: The stock market is an important part of a country's economy and plays a vital role in the growth of its industry and commerce. Both investors and industries are involved in the stock market and want to know whether a stock will rise or fall over a certain period of time. The market is based on the concept of demand and supply: if the demand for a company’s stock is high, its share price increases, and if the demand is low, its share price decreases. Our project will therefore help to predict stock prices and help industries to grow their funds for business expansion.
We build a system that analyses stocks to predict daily gains in the stock market based on real-time data from Yahoo Finance. To illustrate the process, we select stocks at random. The daily stock prices are available on Yahoo Finance and can be retrieved to generate various meaningful insights.
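As a minimal sketch of what a daily gain could look like in code, the example below assumes that the gain is the percentage change from a day's opening price to its closing price. The text does not define the quantity, so this definition, the field names and the sample rows are all assumptions made for illustration.

```python
def daily_gain_percent(open_price, close_price):
    """Percentage change from open to close (an assumed definition of daily gain)."""
    if open_price <= 0:
        raise ValueError("open price must be positive")
    return (close_price - open_price) / open_price * 100.0


# Hypothetical rows; real rows would come from the Yahoo Finance download.
rows = [
    {"date": "day-1", "open": 100.0, "close": 102.0},
    {"date": "day-2", "open": 102.0, "close": 99.0},
    {"date": "day-3", "open": 100.0, "close": 100.0},
]

gains = [daily_gain_percent(r["open"], r["close"]) for r in rows]
# Days on which the stock closed above its opening price.
positive_days = [r["date"] for r, g in zip(rows, gains) if g > 0]
```

A close-to-close return would be an equally reasonable definition; only the two arguments passed to the function would change.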
Our approach is to create a pipeline consisting of five phases:
- Data Acquisition
- Data Injection
- Storage
- Pre-processing
- Machine learning

In the proposed approach, the stock data used for prediction is taken from Yahoo Finance. Our stock data consists of 144 rows and 7 columns for randomly selected stocks, recorded on a daily basis. After the stock data is collected, it is injected into HDFS using Flume. Three Flume components, the Source, the Channel and the Sink, work in concert to move the data through Flume.
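The division of labour between Source, Channel and Sink can be pictured with a toy in-memory model. This sketch does not use Flume or its API: the class, the function names and the sample lines are hypothetical, and a Python list stands in for HDFS. It only shows the flow in which a source produces events, a bounded channel buffers them, and a sink drains them to storage.

```python
from collections import deque


class Channel:
    """Bounded in-memory buffer that sits between the source and the sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        """Store an event; return False if the channel is full."""
        if len(self._events) >= self.capacity:
            return False
        self._events.append(event)
        return True

    def take(self):
        """Return the oldest event, or None if the channel is empty."""
        return self._events.popleft() if self._events else None


def run_source(lines, channel):
    """Source: turn each input line into an event and offer it to the channel."""
    accepted = 0
    for line in lines:
        if channel.put(line.strip()):
            accepted += 1
    return accepted


def run_sink(channel, store):
    """Sink: drain the channel in arrival order into the store."""
    written = 0
    while True:
        event = channel.take()
        if event is None:
            break
        store.append(event)
        written += 1
    return written


# Hypothetical input lines; the list `store` stands in for HDFS.
lines = ["day-1,100.0,102.0\n", "day-2,102.0,99.0\n"]
channel = Channel(capacity=10)
store = []
accepted = run_source(lines, channel)
written = run_sink(channel, store)
```

The channel is what decouples the two ends: the source can keep accepting data while the sink is slow, up to the channel's capacity.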
Results: To implement the proposed approach, there are generally five steps: Data Acquisition and Characterization, Data Injection, Storage, Pre-processing and Machine Learning.

The data used for prediction is acquired from Yahoo Finance. Daily data for a single company, ONGC.NS, is collected for the last five years, i.e. till 2015. The data set is also available in CSV format for local analytics.


The ONGC.NS data from Yahoo Finance is loaded into HDFS, and the linear regression machine learning algorithm is then applied to it using Apache Mahout.
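The train-and-test procedure can be sketched as follows. This is plain Python and not the Mahout workflow used in the study. It assumes a chronological 80/20 split, the opening price as the single independent variable, the closing price as the target, and root mean squared error (RMSE) as the accuracy measure; none of these choices is stated in the text, and the sample values are hypothetical rather than ONGC.NS data.

```python
import math


def chronological_split(records, train_fraction=0.8):
    """Split date-ordered records so the test set follows the training set in time."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def fit_line(points):
    """Least-squares fit of y = intercept + slope * x over (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(points, intercept, slope):
    """Root mean squared error of the fitted line on the given points."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in points]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical (open, close) pairs in date order.
records = [
    (100.0, 101.0), (101.0, 100.5), (100.5, 102.0), (102.0, 103.0),
    (103.0, 102.5), (102.5, 104.0), (104.0, 105.0), (105.0, 104.5),
    (104.5, 106.0), (106.0, 107.0),
]

train, test = chronological_split(records)
intercept, slope = fit_line(train)
test_error = rmse(test, intercept, slope)
```

Splitting by position rather than at random keeps future prices out of the training set, which matters for time-ordered data.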
Conclusion:
In this paper, big data analytics is used for efficient stock market analysis and prediction. The stock market is a domain in which uncertainty and the inability to predict stock values accurately may result in huge financial losses. Through our work we were able to propose an approach that helps to identify stocks with positive daily return margins, which can be suggested as potential stocks for enhanced trading. The approach acts as a Hadoop-based pipeline that learns from past data and decides, based on streaming updates, which stocks are profitable to trade in. We also used Apache Mahout for machine learning. The linear regression algorithm is used to learn from training data and then predict the correlation between stock prices and the behavior of the stock. The major advantage of forecasting is that future trends in the stock exchanges can be predicted, so that investors know where in the market to invest their money for profitable trades.
Future work:
We intend to extend our study by automating the analysis process with a scheduling module, so as to obtain periodic recommendations for trading stocks. We also plan to test neural network based learning models, rather than linear regression, with the aim of predicting US stock prices more accurately. Future research can pursue further improvements such as more refined data and more accurate algorithms.
References
- Zhihao PENG (2019). “Stock analysis and prediction using big data analytics”, 2019 International Conference on Intelligent Transportation, Big Data & Smart City (ICITBS).
- V Kranthi Reddy (2018). “Stock Market Prediction using Machine Learning”, International Research Journal of Engineering and Technology (IRJET), Vol 5 Issue 10, Oct 2018.
- M.D. Jaweed (2018). “Analysis of Stock Market by using Big Data Processing Environment”, International Journal of Pure and Applied Mathematics, Vol 119 No. 10.
- Vivek Kanade (2017). “Stock Market Prediction: Using Historical Data Analysis”, International Journal of Advanced Research in Computer Science and Software Engineering, Vol 7 Issue 1, 2017.
- Arkilic (2017). “Stock Price Movement Prediction Using Mahout and Pydoop Documentation”, Release 1, Oct 06, 2017.
- V. Sandhiya (2017). “Stock Market Prediction on Bigdata Using Machine Learning Algorithm”, International Journal of Engineering Science and Computing, Vol 7 Issue 4, April 2017.
- Varunesh Nichante (2016). “A Review: Analysis of Stock Market by using Big Data Analytic Technology”, International Journal on Recent and Innovation Trends in Computing and Communication (IJIRITCC), Vol 4 Issue 1, Jan 2016.
- Aparna Nayak (2016). “Prediction Model for Indian Stock Market”, Twelfth International Multi-Conference on Information Processing (IMCIP).
- Mahantesh C. Angadi (2015). “Time Series Data Analysis for Stock Market Prediction using Data Mining Techniques with R”, International Journal of Advanced Research in Computer Science, Volume 6 No 6, July-August 2015.
- C. Ugwu (2014). “Machine Learning Application for Stock Market Prices Prediction”, IOSR Journal of Computer Engineering (IOSR-JCE), Volume 16 Issue 5.