Can We Reliably Predict the Price Movement of Bitcoin?

Bitcoin is a speculative instrument (I hesitate to call it an asset outright, as the financial community has not settled on calling it one) that was created in 2009. It belongs to a family of speculative instruments called cryptocurrencies, which share many properties: transactions are recorded on a blockchain database, the instrument has no physical counterpart, etc.

In its early days, Bitcoin (henceforth BTC) was an obscure name, rarely mentioned in the mainstream. Since 2015, however, BTC has become far more popular as an investment vehicle, many became rich off of it, and many more have since tried to predict BTC's movements in hopes of doing the same. Yet after the pre-pandemic peak on Dec 17, 2017, BTC became less attractive to the common investor, as it smelled like a speculative trap. Because BTC does not produce anything in the real economy, it is not backed by any economic phenomenon, and in the pre-pandemic era it seemed to follow a trajectory similar to the Tulip Mania of 1630s Netherlands.

Since the pandemic began, however, we have seen massive BTC price inflation, and more people are once again interested in using BTC as an investment vehicle to become rich. In this notebook, we explore the properties of BTC (namely, whether we can predict prices or their movement well enough to generate trading profits) and eventually try to build a predictive network that can assist us in becoming rich as well.

This notebook also serves as a fantastic introduction to the data science pipeline, and particularly to its applications in a finance setting, as defined by:

  1. Data Curation
  2. Data Management and Representation
  3. Exploratory Data Analysis (and Hypothesis Testing)
  4. Machine Learning
  5. Generating Insights

It is therefore up to you, the reader, to decide what to take away from this notebook. It serves a dual purpose: exploring the nature of BTC and introducing you to the overall data science pipeline.

Set Up

In this notebook, we will be using the following libraries. Because TensorFlow is typically not pre-installed on most machines, I have included code below that, when executed, will install TensorFlow into the current Jupyter kernel.
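A minimal sketch of that install cell, using the `%pip` magic (which installs into the environment backing the current kernel):

```python
# Install TensorFlow into the environment backing the current Jupyter kernel.
# %pip is preferred over a bare !pip, which may target a different Python.
%pip install tensorflow
```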

Importing the Libraries

pandas Pandas is a data storage and manipulation library. Data is stored in 2-D dataframes, which are similar to Excel sheets. For us, pandas dataframes store the Date, Open Price, High Price, Low Price, Close Price, and Volume data.

numpy Numpy is a scientific computing library. Numpy stores data in n-dimensional arrays, which are very similar to MATLAB's matrices/vectors. As a scientific computing library, Numpy optimizes computation speed. For us, the numpy scientific computing stack is ideal for training any ML model, since a lot of computation is bound to occur.

matplotlib Matplotlib is a plotting library. For us, we use it to plot any data that we need to observe visually.

sklearn Scikit-learn is a popular machine learning library that contains, for us, a large number of pre-designed models whose hyperparameters we can tune and deploy with ease.

tensorflow TensorFlow is a popular deep learning library that contains various functions and modules we can use to design deep networks with relative ease.

datetime Datetime is a library that allows us to convert string dates into date objects, which have greater functionality (e.g., we can directly compare two datetime objects to determine which date is earlier).

statsmodels Statsmodels is a popular statistical library whose OLS functionality we use to generate linear regressions.
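Below is a sketch of the corresponding import cell, assuming the conventional aliases (pd, np, plt, tf, sm):

```python
import pandas as pd                  # dataframes for the price/volume data
import numpy as np                   # n-dimensional arrays and fast math
import matplotlib.pyplot as plt      # plotting
import sklearn                       # pre-designed ML models
import tensorflow as tf              # deep learning
from datetime import datetime        # string dates -> date objects
import statsmodels.api as sm         # OLS linear regressions
```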

Data Collection and Curation

Nature of Our Datasets

Bitcoin data is relatively sparse. Some datasets cover 2014-2017, some 2020-2021, but generally speaking there isn't one dataset that encompasses all of the Bitcoin price/volume data we need. Hence, I used 2 overlapping datasets to build a richer aggregate dataset going all the way back to 2013.

  1. Dataset 1: BTC-USD.csv Our first dataset (dataset1), stored in BTC-USD.csv, contains the price and volume data of Bitcoin between April 10, 2020 and May 15, 2021.

BTC-USD.csv can be retrieved from https://finance.yahoo.com/quote/BTC-USD/history?period1=1410825600&period2=1621123200&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true

  2. Dataset 2: BTCUSD_day.csv Our second dataset (dataset2), stored in BTCUSD_day.csv, contains price and volume data of Bitcoin between April 28, 2013 and April 10, 2020.

BTCUSD_day.csv can be retrieved from https://www.kaggle.com/prasoonkottarathil/btcinusd?select=BTCUSD_day.csv

Loading the Data
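A sketch of the loading step, assuming both CSV files sit next to the notebook:

```python
# Load both CSVs into pandas DataFrames.
dataset1 = pd.read_csv("BTC-USD.csv")      # Apr 10, 2020 - May 15, 2021
dataset2 = pd.read_csv("BTCUSD_day.csv")   # Apr 28, 2013 - Apr 10, 2020

# Peek at the first few rows of each to confirm the load worked.
print(dataset1.head())
print(dataset2.head())
```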

Cleaning the Data

Although our data is relatively well curated with Open, High, Low, and Close prices along with volume data, there are still lingering issues with our datasets.

In this section, we clean up the data such that we can then explore it with ease and eventually train a model to predict prices. This section is broken down into multiple components, one per issue, covered in the subsections below.

Issue 1: Volume Mismatch

As we can see above,

dataset1 consists of the following columns: Date, Open, High, Low, Close, Adjusted Close, and Volume

while,

dataset2 consists of the following columns: Date, Symbol, Open, High, Low, Close, Volume BTC, Volume USD

There are 2 volume columns present in dataset2, while only 1 is present in dataset1. This is naturally a problem, as any predictive model we build will need a singular definition of volume. Hence, we need to "standardize" which volume is the correct volume.

If we look closer, we can see that Volume as defined in dataset1 corresponds to Volume USD as defined in dataset2. Hence, we can clean up this data by removing the BTC volume in dataset2, as sketched below.
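A sketch of that cleanup, assuming the column names listed above; we also rename Volume USD to Volume so the two datasets agree:

```python
# Drop the BTC-denominated volume and align dataset2's volume column
# name with dataset1's.
dataset2 = dataset2.drop(columns=["Volume BTC"])
dataset2 = dataset2.rename(columns={"Volume USD": "Volume"})
```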

Issue 2: Extraneous Columns and Reverse Order

There are 2 primary issues with dataset2: it contains an extraneous Symbol column, and its rows run in reverse chronological order (latest to oldest).

The only issue with dataset1 is an extraneous Adj Close column, which we need to remove. We address the dataset2 issues first:
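A sketch of the dataset2 fixes:

```python
# Drop the extraneous Symbol column, then reverse the rows so they run
# oldest-to-latest, resetting the index to 0..N-1 afterwards.
dataset2 = dataset2.drop(columns=["Symbol"])
dataset2 = dataset2.iloc[::-1].reset_index(drop=True)
```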

As we can now see, the extraneous Symbol column has been removed and we have reset dataset2's order to oldest-to-latest.

Below, we now drop the Adj Close column from dataset1
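A sketch of that drop, assuming the Yahoo Finance column is named "Adj Close":

```python
# Remove the extraneous adjusted-close column from dataset1.
dataset1 = dataset1.drop(columns=["Adj Close"])
```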

Issue 3: Last Row Dataset 2

As we can see below, we have 2 copies of the data from April 10th, 2020. Because dataset2's April 10th, 2020 row has Volume = 0, we will remove that row, as a volume of 0 there is illogical.

NOTE: Volume = 0 is possible for any trading instrument; however, it is highly unlikely here, given that at the time people were regularly trading BTC. The Volume = 0 in dataset2 is therefore most likely corrupted data, so we go forward with only dataset1's April 10th, 2020 row.
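A sketch of the removal, using the row index reported in the text (1646):

```python
# Drop dataset2's corrupted April 10, 2020 row (Volume = 0); we keep
# dataset1's copy of that date instead.
dataset2 = dataset2.drop(index=1646)
```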

As you can see above, we have dropped index 1646, which corresponded to the April 10th, 2020 row of dataset2.

Issue 4: Merging the Two Datasets

As we want to eventually train a full-fledged model, we need to merge our datasets such that we have a single dataset which we can pass into a model.

Having only 1 dataset is also useful for initial observations of the data's distribution and will help us greatly in our Exploratory Data Analysis.
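A sketch of the merge; the combined frame is called df here, an assumed name used in the rest of the sketches:

```python
# Stack dataset2 (2013-2020) on top of dataset1 (2020-2021) so the rows
# run oldest-to-latest, re-labelling the index from 0 upwards.
df = pd.concat([dataset2, dataset1], ignore_index=True)
```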

Issue 5: Ensuring Type Compatibility

We want to ensure that our Open, High, Low, Close prices and Volume are all numerical values. As shown below, they are all of type numpy.float64.
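A sketch of the check:

```python
# Inspect each column's dtype; prices and Volume should be float64.
print(df.dtypes)
```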

A key fact above is that the Date column currently holds object-typed data, most likely strings. We confirm this via:
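```python
# Confirm the underlying Python type of a Date entry.
print(type(df["Date"].iloc[0]))  # expected: <class 'str'>
```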

Given this, we want to convert these string values into datetime objects. Datetime objects let us access the year, month, and day as individual fields, which will save us a lot of headache later: in the current form, we would have to parse the string every time we want one of those fields, whereas if we preprocess now, the day, month, or year is directly available down the line. Hence, we will convert these string-based dates to datetime objects.
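A sketch of the conversion:

```python
# Convert the string dates to datetime objects; year/month/day then
# become directly accessible, e.g. via df["Date"].dt.month.
df["Date"] = pd.to_datetime(df["Date"])
```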

Issue 6: Missing Data

Arguably the most important of all the issues.

We know that there is bound to be missing data in our data set. Rarely in data science do we get perfectly well-curated data, as shown by the last 5 issues that we corrected. Hence, we need to explore the missing data, how it is represented, and ask critically why the data is missing.

The key questions we need to ask regarding the missing data are:

  1. What form does the missing data take?
  2. How much of the data is missing?
  3. Which columns are affected by the missing data?
  4. Are there any underlying correlations in the missing data? (Classifying our missing data as MAR, MCAR, or MNAR)

Querying the Missing Data

Let's answer the first 3 questions.

To answer any of these questions, we need to query this missing data, and to query it, we need to know what form it takes. A hint we got in earlier sections is that our missing data may have taken 2 forms:

  1. Entries recorded as NaN (across all columns of a row)
  2. Volume entries recorded as 0

Given these properties, we can check where the data is NaN and where the data is 0:
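A sketch of both queries:

```python
# Rows where any column is NaN.
print(df[df.isna().any(axis=1)])

# Rows where Volume is recorded as 0.
print(df[df["Volume"] == 0])
```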

We can see above that the data is NaN for 5 dates across all columns, and that Volume is 0 for 3 dates. We also see that wherever 0 stands in for missing data, only the volume is missing.

Hence, we can answer our first 3 questions: the missing data takes 2 forms (NaN and 0), 5 dates are missing entirely (NaN across all columns), and 3 dates are missing only their Volume (recorded as 0).

Hypothesizing Why the Data is Missing

Let's answer our last question now: are there any underlying correlations in the missing data? (Classifying our missing data as MAR, MCAR, or MNAR.)

The NaN Missing Data

When speculating why this trading data is missing, the immediate suspect is that the missing data is linked to specific holidays.

We can observe this via the string of missing trading data, replaced by NaN, between October 9th, 2020 and October 13th, 2020. October 12th, 2020, one of the missing entries, was Columbus Day, and many markets treated the entire stretch from the preceding Friday through the holiday Monday as Columbus Day weekend. Hence, there seems to be a strong correlation, and in fact a plausible causation: the missing data in that interval is due to Columbus Day weekend leading most traders to take a vacation, hence the lack of trading data.

This reasoning is not uncommon for trading data, and often not uncommon for time series data in general. An unrelated example: the New York Stock Exchange has operating hours of 9:30 am to 4:00 pm. Hence, if we were analysing minute-by-minute price data of a specific stock over a full day, we would get quite a few rows with missing data, as the stock market is not actively trading at, for example, 3:00 am. This missingness is unrelated to the actual observed variables; it is driven by unobserved variables.

Going off this hypothesis, it seems as though the NaN missing data is uncorrelated with the observed trading data, as price and volume are not causing the missingness. Hence, we classify this missing data as Missing Completely at Random (MCAR), since the probability that data is missing is uncorrelated with the values in the data set. We therefore cannot model this missingness and can mostly ignore it.

The 0 Missing Data

It is unclear exactly why this data is missing. Cross-referencing other data sources shows that trading volume was in fact present on the 3 missing days: 08/26/2017, 08/12/2018, and 10/16/2018.

We can hypothesize that this is simply corrupt data, as there are only 3 occurrences of Volume = 0. Although this is a simplistic hypothesis for why this data is missing, it may be impossible to determine whether the missingness follows a non-random distribution, as there are very few data points with missing volume values.

Hence, we can classify this volume data as Missing Completely at Random (MCAR) as well.

Dealing with the Missing Data

Now that we know we can classify all the missing data as MCAR, we can determine what to do with it. There are 3 typical approaches to dealing with missing data:

  1. Remove the missing data
  2. Impute the missing data (replace the missing data)
  3. Encode the missing data (tell our model to ignore certain components of the missing data)

Dealing with the NaN Data

Because the NaN rows are missing every column, we can directly remove them: imputation or encoding is likely impossible, since no variable in those rows has an observed value from which we could impute. The sketch below shows the removal.
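```python
# Drop the 5 all-NaN rows and re-label the index so it stays contiguous.
df = df.dropna().reset_index(drop=True)
```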

Dealing with the 0 Data

We can recover the missing volume data from online databases. As such, we will use Yahoo Finance and manually impute the data from a manual online query, rather than downloading another file just to fill in 3 values.
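A sketch of the manual imputation. The volume figures below are placeholders, not real values: substitute the numbers read off the Yahoo Finance query before running.

```python
# Hypothetical placeholder figures; replace each with the volume
# reported by Yahoo Finance for that date.
manual_volumes = {
    "2017-08-26": 1.0,  # placeholder, NOT the real volume
    "2018-08-12": 1.0,  # placeholder, NOT the real volume
    "2018-10-16": 1.0,  # placeholder, NOT the real volume
}
for date_str, vol in manual_volumes.items():
    df.loc[df["Date"] == pd.Timestamp(date_str), "Volume"] = vol
```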

Full Definition of the Data

Since we have cleaned up the data, lets now give an aggregate definition of the data which we can refer back to moving forward.

  1. Date: Represents the date on which this data is recorded
  2. Open: Represents the starting/opening price when trading began on a specific day
  3. High: Represents the highest price recorded during trading on a specific trading day
  4. Low: Represents the lowest price recorded during trading on a specific trading day
  5. Close: Represents the ending/closing price when trading ended on a specific day
  6. Volume: Represents the aggregate value of BTC (measured in USD) that was traded on a specific day

Exploratory Data Analysis

Let's now try to observe the underlying data distribution. This will provide us insights into what type of ML system to use, the potential biases, and how to approach prediction overall.

Visualizing Evolution of BTC Prices

Let's first look at the general progression of BTC prices over time. Since there are 4 price measurements (Open, High, Low, Close), we will plot all 4 as individual lines.
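A sketch of that plot, assuming the merged frame df from earlier:

```python
# Plot all four price series against Date on one set of axes.
fig, ax = plt.subplots(figsize=(12, 6))
for col in ["Open", "High", "Low", "Close"]:
    ax.plot(df["Date"], df[col], label=col)
ax.set_xlabel("Date")
ax.set_ylabel("Price (USD)")
ax.set_title("BTC prices over time")
ax.legend()
plt.show()
```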

A key fact we see here, which might seem obvious to the average trader, is that these prices are heavily correlated with each other. This makes sense: the Open price builds on the prior day's Close price, and any High and Low prices achieved on a given day depend on that day's Open price, as we are not going to see massive stochasticity directly in the price data.

Although this may seem like an obvious fact, it is always wise to ensure that any correlations you expect in your data actually manifest in EDA.

Exploring Both Volume and Price

As we can see above, all 4 price metrics roughly match each other. Hence, we can choose a single representative price metric for all 4. We will choose the Close price as our representative metric, although you could choose any of them.

Hence, let's now look at a graph with both volume and price plotted, to get a fuller view of the underlying trends.
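A sketch of such a plot; since price and volume live on very different scales, volume gets its own y-axis via twinx:

```python
# Close price on the left axis, Volume on a second right-hand axis.
fig, ax_price = plt.subplots(figsize=(12, 6))
ax_price.plot(df["Date"], df["Close"], color="tab:blue")
ax_price.set_xlabel("Date")
ax_price.set_ylabel("Close price (USD)", color="tab:blue")

ax_vol = ax_price.twinx()  # second y-axis sharing the same x-axis
ax_vol.plot(df["Date"], df["Volume"], color="tab:orange")
ax_vol.set_ylabel("Volume (USD)", color="tab:orange")
plt.show()
```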

The key thing to note here is that although there was a price spike in 2018, a lot of the upward price movement in BTC is strongly correlated with the increased trading activity seen in BTC over the last year (since early 2020). As we are aware, part of this is likely due to the COVID-19 pandemic, as the volume trend began around March-May of 2020 per the graph.

Let's explore this relationship between Volume and Price by splitting the data set into 2: one part before March 11th, 2020 (the day the WHO declared COVID-19 a pandemic) and one after. We do this because trading activity clearly picked up during COVID quarantine, which is an exogenous variable within our data. By splitting the data set, we can analyze the underlying distribution of volume and price without the COVID-19 exogenous variable confounding our statistical exploration.
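A sketch of the split (the names pre_pandemic and post_pandemic are assumptions used in the later sketches):

```python
# Split at March 11, 2020, the WHO pandemic declaration date.
pandemic_start = pd.Timestamp("2020-03-11")
pre_pandemic = df[df["Date"] < pandemic_start]
post_pandemic = df[df["Date"] >= pandemic_start]
```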

Observing Volume vs. Close Price BEFORE WHO Declaration of COVID-19 to be a Pandemic

The easiest way to observe a relationship between 2 variables is to estimate a linear regression between them. Hence, we will try that, as sketched below.
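A sketch of the regression with statsmodels OLS, overlaying the fitted line on the raw scatter:

```python
import statsmodels.api as sm

# Regress Close on Volume over the pre-pandemic data.
X = sm.add_constant(pre_pandemic["Volume"])  # adds the intercept term
model = sm.OLS(pre_pandemic["Close"], X).fit()

# Scatter the raw points and draw the fitted line on top.
plt.scatter(pre_pandemic["Volume"], pre_pandemic["Close"], s=5)
xs = pre_pandemic["Volume"].sort_values()
plt.plot(xs, model.params["const"] + model.params["Volume"] * xs, color="red")
plt.xlabel("Volume (USD)")
plt.ylabel("Close price (USD)")
plt.show()
```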

NOTE: The linear regression shown above is clearly deviating from a best fit line due to a few outliers.

The common instinct among many statisticians seeing this would be to remove the outliers and re-estimate the linear regression. However, we will not do this, because these outliers are important to the price and volume movement of BTC and warrant further study. We know from the volume data that BTC was relatively illiquid (meaning it was not easily bought and sold) up until the COVID-19 pandemic, and hence these volume spikes have certain causal factors we need to study.

Let's instead try to plot this as a time series graph.

NOTE: the volume spike between 2018-07 and 2019-01

It seems as though that volume spike is correlated with a drop-off in BTC price. Here we can see that there seems to be some underlying correlation between Volume and Price.

Generating the Volume vs. Close Price Regression AFTER WHO Declaration of COVID-19 to be a Pandemic

As earlier, let's look at the relationship between Volume and Price through the lens of a regression first.

NOTE: This regression above is clearly deviating from a best fit line due to a few outliers.

As earlier, we will not re-estimate the linear regression by removing the outliers, since these deviations warrant study due to their magnitude. As before, we will instead plot a time series graph of volume and price.

NOTE: The large volume spike near the 2021-03 timestamp

This graph seems to suggest that the pre-pandemic correlation between volume and price continues into the pandemic world. The large spike in trading activity was immediately followed by a rally in BTC price. This further suggests that the existence of a volume-price correlation is time-invariant: throughout all the time periods examined, a correlation between volume and price exists.

Hypothesis Testing of Price and Volume Relationship

Let's conduct hypothesis testing to cement the relationship between Price and Volume and to measure its statistical significance.

Our null hypothesis is that volume has no effect on price; our alternative hypothesis is that volume does have an effect on price.
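A sketch of the test for both phases, reading the Volume coefficient's p-value off the fitted OLS model:

```python
import statsmodels.api as sm

# Fit Close ~ Volume per phase; the p-value is the probability of a
# coefficient at least this extreme if Volume truly had no effect.
for name, data in [("pre-pandemic", pre_pandemic),
                   ("pandemic", post_pandemic)]:
    X = sm.add_constant(data["Volume"])
    result = sm.OLS(data["Close"], X).fit()
    print(name, "Volume p-value:", result.pvalues["Volume"])
```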

As shown above, since the p-value for the Volume parameter is (effectively) 0, it is nigh impossible for this sample of Volume-Price pairs to arise under the assumption that Volume has no effect on Price. Hence, we conclude that in the pre-pandemic phase, volume had an apparent, statistically significant effect on price, and we reject the null hypothesis in favour of the alternative.

Similarly, as shown above, the p-value for the Volume parameter is again (effectively) 0, so the same reasoning applies: in the pandemic phase, volume had an apparent, statistically significant effect on price, and we again reject the null hypothesis in favour of the alternative.

Observing the Relationship Between Price Change and Volume

Another key variable we want to observe is the daily percentage change in price from open to close. This variable has an important statistical implication: if it is stationary about a 0 mean, then BTC evolves in a cyclic fashion; in contrast, if it is non-stationary, then BTC is a stochastic variable that cannot be predicted by determining where we are in the cycle.

Hence, let's visualize the daily %Δ-in-price data.
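A sketch of that computation and plot; the column name PctChange is an assumption:

```python
# Daily percentage change from Open to Close.
df["PctChange"] = (df["Close"] - df["Open"]) / df["Open"] * 100

plt.figure(figsize=(12, 6))
plt.plot(df["Date"], df["PctChange"])
plt.axhline(0, color="black", linewidth=0.8)  # zero-mean reference line
plt.xlabel("Date")
plt.ylabel("Daily % change (Open to Close)")
plt.show()
```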