Bitcoin is a speculative instrument (I hesitate to call it an asset outright, as the financial community has not settled on that label) that was created in 2009. It belongs to a family of speculative instruments called cryptocurrencies, which share many properties: transactions are recorded on a blockchain database, there is no physical counterpart to the instrument, and so on.
In its early days, Bitcoin (henceforth BTC) was an obscure name and was rarely mentioned in the mainstream. Since around 2015, however, BTC has become far more popular as an investment vehicle, and many became rich off of it. Since then, many more have tried to predict BTC's movements in the hope of getting rich as well. After the pre-pandemic peak on Dec 17, 2017, though, BTC became less appealing to the common investor, as it smelled like a speculative trap. Since BTC does not produce anything in the real economy, it is not backed by any economic fundamentals, and in the pre-pandemic era it seemed to follow a trajectory much like the Tulip Mania of the 1630s in the Netherlands.
Since the pandemic began, however, we have seen massive BTC price inflation, and more people are once again interested in using BTC as an investment vehicle to get rich. In this notebook, we explore the properties of BTC (in particular, whether we can predict prices or their movement well enough to generate trading profits) and eventually try to build a predictive network that could assist us in getting rich as well.
This notebook also serves as a fantastic introduction to the data science pipeline, and particularly to its applications in a finance setting: data curation, data management and cleaning, exploratory data analysis, hypothesis testing, and machine learning.
It is therefore up to you, the reader, to decide what to take away from this notebook. It serves a dual purpose: exploring the nature of BTC and introducing you to the overall data science pipeline.
In this notebook, we will be using the following libraries. Because Tensorflow is typically not pre-installed on most machines, I have included code below that, when executed, will install Tensorflow into the current Jupyter kernel.
!pip install tensorflow
Collecting tensorflow
  Downloading tensorflow-2.5.0-cp38-cp38-manylinux2010_x86_64.whl (454.4 MB)
...
Successfully built termcolor wrapt
Installing collected packages: pyasn1, rsa, pyasn1-modules, cachetools, requests-oauthlib, google-auth, werkzeug, tensorboard-plugin-wit, tensorboard-data-server, markdown, grpcio, google-auth-oauthlib, absl-py, wrapt, termcolor, tensorflow-estimator, tensorboard, opt-einsum, keras-preprocessing, keras-nightly, google-pasta, gast, flatbuffers, astunparse, tensorflow
Successfully installed absl-py-0.12.0 astunparse-1.6.3 cachetools-4.2.2 flatbuffers-1.12 gast-0.4.0 google-auth-1.30.0 google-auth-oauthlib-0.4.4 google-pasta-0.2.0 grpcio-1.34.1 keras-nightly-2.5.0.dev2021032900 keras-preprocessing-1.1.2 markdown-3.3.4 opt-einsum-3.3.0 pyasn1-0.4.8 pyasn1-modules-0.2.8 requests-oauthlib-1.3.0 rsa-4.7.2 tensorboard-2.5.0 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.0 tensorflow-2.5.0 tensorflow-estimator-2.5.0 termcolor-1.1.0 werkzeug-2.0.0 wrapt-1.12.1
pandas Pandas is a data storage and manipulation library. Data is stored in 2-D dataframes, which are similar to Excel sheets. For us, pandas dataframes store the Date, Open Price, High Price, Low Price, Close Price, and Volume data.
numpy Numpy is a scientific computing library. Numpy stores data in n-dimensional arrays, which are very similar to Matlab's matrices/vectors. Being a scientific computing library, Numpy is optimized for computation speed. For us, the numpy scientific computing stack is ideal for training any ML model, since a lot of computation is bound to occur.
matplotlib Matplotlib is a plotting library. For us, we use it to plot any data that we need to observe visually.
sklearn Scikit-learn is a popular machine learning library that contains, for our purposes, a large number of pre-designed models whose hyperparameters we can tune and deploy with ease.
tensorflow Tensorflow is a popular Deep learning library that contains various functions and modules we can use to design deep networks with relative ease.
datetime Datetime is a library that allows us to convert string dates into date objects, which have greater functionality (for example, we can compare two datetime objects directly to find the earlier date).
statsmodels Statsmodels is a popular statistical library whose OLS functionality we use to generate linear regressions.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
import tensorflow as tf
import datetime
from datetime import datetime
from statsmodels.formula.api import ols
Bitcoin data is relatively sparse. Some datasets cover 2014-2017, some 2020-2021, but generally speaking there isn't one dataset that encompasses all of the BTC price/volume data we need. Hence, I used 2 overlapping datasets to build a richer aggregate dataset that runs continuously back to late 2015.
BTC-USD.csv can be retrieved from https://finance.yahoo.com/quote/BTC-USD/history?period1=1410825600&period2=1621123200&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true
BTCUSD_day.csv can be retrieved from https://www.kaggle.com/prasoonkottarathil/btcinusd?select=BTCUSD_day.csv
dataset1 = pd.read_csv('BTC-USD.csv')
dataset1
Date | Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|---|
0 | 2020-04-10 | 7303.815430 | 7303.815430 | 6802.475098 | 6865.493164 | 6865.493164 | 4.362284e+10 |
1 | 2020-04-11 | 6867.440430 | 6926.069824 | 6789.920898 | 6859.083008 | 6859.083008 | 3.122209e+10 |
2 | 2020-04-12 | 6858.067871 | 7119.947266 | 6811.078125 | 6971.091797 | 6971.091797 | 3.575957e+10 |
3 | 2020-04-13 | 6965.616699 | 6965.616699 | 6668.259766 | 6845.037598 | 6845.037598 | 3.861931e+10 |
4 | 2020-04-14 | 6843.281738 | 6958.557129 | 6793.821289 | 6842.427734 | 6842.427734 | 3.411043e+10 |
... | ... | ... | ... | ... | ... | ... | ... |
396 | 2021-05-11 | 55847.242188 | 56872.542969 | 54608.652344 | 56704.574219 | 56704.574219 | 6.130840e+10 |
397 | 2021-05-12 | 56714.531250 | 57939.363281 | 49150.535156 | 49150.535156 | 49150.535156 | 7.521540e+10 |
398 | 2021-05-13 | 49735.433594 | 51330.843750 | 46980.019531 | 49716.191406 | 49716.191406 | 9.672115e+10 |
399 | 2021-05-14 | 49682.980469 | 51438.117188 | 48868.578125 | 49880.535156 | 49880.535156 | 5.573750e+10 |
400 | 2021-05-15 | NaN | NaN | NaN | NaN | NaN | NaN |
401 rows × 7 columns
dataset2 = pd.read_csv('BTCUSD_day.csv')
dataset2
Date | Symbol | Open | High | Low | Close | Volume BTC | Volume USD | |
---|---|---|---|---|---|---|---|---|
0 | 2020-04-10 | BTCUSD | 7315.25 | 7315.25 | 7315.25 | 7315.25 | 0.00 | 0.00 |
1 | 2020-04-09 | BTCUSD | 7369.60 | 7378.85 | 7115.04 | 7315.25 | 2237.13 | 16310014.53 |
2 | 2020-04-08 | BTCUSD | 7201.81 | 7432.23 | 7152.80 | 7369.60 | 2483.60 | 18138080.27 |
3 | 2020-04-07 | BTCUSD | 7336.96 | 7468.42 | 7078.00 | 7201.81 | 2333.34 | 17047120.32 |
4 | 2020-04-06 | BTCUSD | 6775.21 | 7369.76 | 6771.01 | 7336.96 | 3727.47 | 26533750.17 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1642 | 2015-10-12 | BTCUSD | 249.50 | 249.50 | 247.60 | 247.60 | 38.28 | 9493.89 |
1643 | 2015-10-11 | BTCUSD | 246.30 | 249.50 | 246.30 | 249.50 | 12.22 | 3021.12 |
1644 | 2015-10-10 | BTCUSD | 245.39 | 246.30 | 244.60 | 246.30 | 12.17 | 2984.44 |
1645 | 2015-10-09 | BTCUSD | 243.95 | 249.97 | 243.60 | 245.39 | 30.99 | 7651.63 |
1646 | 2015-10-08 | BTCUSD | 242.50 | 245.00 | 242.50 | 243.95 | 18.80 | 4595.84 |
1647 rows × 8 columns
Although our data is relatively well curated with Open, High, Low, and Close prices along with volume data, there are still lingering issues with our datasets.
In this section, we clean up the data so that we can then explore it with ease and eventually train a model to predict prices. This section is broken down into multiple components: standardizing the volume definition, dropping extraneous columns and re-ordering rows, removing the duplicate row, merging the two datasets, converting dates to datetime objects, and handling missing data.
As we can see above,
dataset1 consists of the following columns: Date, Open, High, Low, Close, Adjusted Close, and Volume
while,
dataset2 consists of the following columns: Date, Symbol, Open, High, Low, Close, Volume BTC, Volume USD
There are 2 volume columns in dataset2 but only 1 in dataset1. This is naturally a problem, as any predictive model we build will need a single definition of volume. Hence, we need to "standardize" which volume is the correct one.
If we look closer, we can see that Volume as defined in dataset1 corresponds to Volume USD as defined in dataset2. Hence, we can clean up this data by removing the BTC volume column from dataset2.
# Drops the Volume BTC Column
dataset2 = dataset2.drop('Volume BTC', axis=1)
# renames the Volume USD column to Volume
dataset2.rename(columns={"Volume USD" : "Volume"}, inplace=True)
# displays the first 5 rows to see how the data looks now
dataset2.head()
Date | Symbol | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|---|
0 | 2020-04-10 | BTCUSD | 7315.25 | 7315.25 | 7315.25 | 7315.25 | 0.00 |
1 | 2020-04-09 | BTCUSD | 7369.60 | 7378.85 | 7115.04 | 7315.25 | 16310014.53 |
2 | 2020-04-08 | BTCUSD | 7201.81 | 7432.23 | 7152.80 | 7369.60 | 18138080.27 |
3 | 2020-04-07 | BTCUSD | 7336.96 | 7468.42 | 7078.00 | 7201.81 | 17047120.32 |
4 | 2020-04-06 | BTCUSD | 6775.21 | 7369.76 | 6771.01 | 7336.96 | 26533750.17 |
There are 2 primary issues with dataset2: it contains an extraneous Symbol column, and its rows are ordered newest-to-oldest, the reverse of dataset1.
The only issue with dataset1 is that it has an extraneous Adj Close column, which we need to remove.
# drops the Symbol column
dataset2.drop('Symbol', axis=1, inplace=True)
# reverses the order of the rows s.t. dataset2 is also in oldest-latest order
dataset2 = dataset2.iloc[::-1].reset_index(drop=True)
dataset2
Date | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|
0 | 2015-10-08 | 242.50 | 245.00 | 242.50 | 243.95 | 4595.84 |
1 | 2015-10-09 | 243.95 | 249.97 | 243.60 | 245.39 | 7651.63 |
2 | 2015-10-10 | 245.39 | 246.30 | 244.60 | 246.30 | 2984.44 |
3 | 2015-10-11 | 246.30 | 249.50 | 246.30 | 249.50 | 3021.12 |
4 | 2015-10-12 | 249.50 | 249.50 | 247.60 | 247.60 | 9493.89 |
... | ... | ... | ... | ... | ... | ... |
1642 | 2020-04-06 | 6775.21 | 7369.76 | 6771.01 | 7336.96 | 26533750.17 |
1643 | 2020-04-07 | 7336.96 | 7468.42 | 7078.00 | 7201.81 | 17047120.32 |
1644 | 2020-04-08 | 7201.81 | 7432.23 | 7152.80 | 7369.60 | 18138080.27 |
1645 | 2020-04-09 | 7369.60 | 7378.85 | 7115.04 | 7315.25 | 16310014.53 |
1646 | 2020-04-10 | 7315.25 | 7315.25 | 7315.25 | 7315.25 | 0.00 |
1647 rows × 6 columns
As we can now see, the extraneous column of Symbol has been removed and we have reset the order to oldest-latest for dataset2
Below, we now drop the Adj Close column from dataset1
# removes the Adj. Close column
dataset1.drop('Adj Close', axis=1, inplace=True)
dataset1.head()
Date | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|
0 | 2020-04-10 | 7303.815430 | 7303.815430 | 6802.475098 | 6865.493164 | 4.362284e+10 |
1 | 2020-04-11 | 6867.440430 | 6926.069824 | 6789.920898 | 6859.083008 | 3.122209e+10 |
2 | 2020-04-12 | 6858.067871 | 7119.947266 | 6811.078125 | 6971.091797 | 3.575957e+10 |
3 | 2020-04-13 | 6965.616699 | 6965.616699 | 6668.259766 | 6845.037598 | 3.861931e+10 |
4 | 2020-04-14 | 6843.281738 | 6958.557129 | 6793.821289 | 6842.427734 | 3.411043e+10 |
As we can see below, we have 2 copies of the data for April 10th, 2020. Because dataset2's April 10th, 2020 row has volume = 0, we will remove that row, as a zero volume on that day is implausible.
NOTE: Volume = 0 is possible for any trading instrument; however, it is highly unlikely here, given that people were actively trading BTC at the time. The zero-volume row in dataset2 is therefore most likely corrupted data, so we go forward with only dataset1's April 10th, 2020 row.
dataset1
Date | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|
0 | 2020-04-10 | 7303.815430 | 7303.815430 | 6802.475098 | 6865.493164 | 4.362284e+10 |
1 | 2020-04-11 | 6867.440430 | 6926.069824 | 6789.920898 | 6859.083008 | 3.122209e+10 |
2 | 2020-04-12 | 6858.067871 | 7119.947266 | 6811.078125 | 6971.091797 | 3.575957e+10 |
3 | 2020-04-13 | 6965.616699 | 6965.616699 | 6668.259766 | 6845.037598 | 3.861931e+10 |
4 | 2020-04-14 | 6843.281738 | 6958.557129 | 6793.821289 | 6842.427734 | 3.411043e+10 |
... | ... | ... | ... | ... | ... | ... |
396 | 2021-05-11 | 55847.242188 | 56872.542969 | 54608.652344 | 56704.574219 | 6.130840e+10 |
397 | 2021-05-12 | 56714.531250 | 57939.363281 | 49150.535156 | 49150.535156 | 7.521540e+10 |
398 | 2021-05-13 | 49735.433594 | 51330.843750 | 46980.019531 | 49716.191406 | 9.672115e+10 |
399 | 2021-05-14 | 49682.980469 | 51438.117188 | 48868.578125 | 49880.535156 | 5.573750e+10 |
400 | 2021-05-15 | NaN | NaN | NaN | NaN | NaN |
401 rows × 6 columns
dataset2
Date | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|
0 | 2015-10-08 | 242.50 | 245.00 | 242.50 | 243.95 | 4595.84 |
1 | 2015-10-09 | 243.95 | 249.97 | 243.60 | 245.39 | 7651.63 |
2 | 2015-10-10 | 245.39 | 246.30 | 244.60 | 246.30 | 2984.44 |
3 | 2015-10-11 | 246.30 | 249.50 | 246.30 | 249.50 | 3021.12 |
4 | 2015-10-12 | 249.50 | 249.50 | 247.60 | 247.60 | 9493.89 |
... | ... | ... | ... | ... | ... | ... |
1642 | 2020-04-06 | 6775.21 | 7369.76 | 6771.01 | 7336.96 | 26533750.17 |
1643 | 2020-04-07 | 7336.96 | 7468.42 | 7078.00 | 7201.81 | 17047120.32 |
1644 | 2020-04-08 | 7201.81 | 7432.23 | 7152.80 | 7369.60 | 18138080.27 |
1645 | 2020-04-09 | 7369.60 | 7378.85 | 7115.04 | 7315.25 | 16310014.53 |
1646 | 2020-04-10 | 7315.25 | 7315.25 | 7315.25 | 7315.25 | 0.00 |
1647 rows × 6 columns
dataset2.drop(1646, axis=0, inplace=True)
dataset2.tail()
# Observe the 1646 row has been dropped
Date | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|
1641 | 2020-04-05 | 6870.20 | 6907.90 | 6678.60 | 6775.21 | 8662210.80 |
1642 | 2020-04-06 | 6775.21 | 7369.76 | 6771.01 | 7336.96 | 26533750.17 |
1643 | 2020-04-07 | 7336.96 | 7468.42 | 7078.00 | 7201.81 | 17047120.32 |
1644 | 2020-04-08 | 7201.81 | 7432.23 | 7152.80 | 7369.60 | 18138080.27 |
1645 | 2020-04-09 | 7369.60 | 7378.85 | 7115.04 | 7315.25 | 16310014.53 |
As you can see above, we have dropped index 1646, which corresponded to the April 10th, 2020 row of dataset2.
As we want to eventually train a full-fledged model, we need to merge our datasets such that we have a single dataset which we can pass into a model.
Having only 1 dataset is also useful for initial observations of the data's distribution and will help us greatly in our Exploratory Data Analysis.
frames = [dataset2, dataset1]
# merging the two data sets
data = pd.concat(frames).reset_index(drop=True)
data
Date | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|
0 | 2015-10-08 | 242.500000 | 245.000000 | 242.500000 | 243.950000 | 4.595840e+03 |
1 | 2015-10-09 | 243.950000 | 249.970000 | 243.600000 | 245.390000 | 7.651630e+03 |
2 | 2015-10-10 | 245.390000 | 246.300000 | 244.600000 | 246.300000 | 2.984440e+03 |
3 | 2015-10-11 | 246.300000 | 249.500000 | 246.300000 | 249.500000 | 3.021120e+03 |
4 | 2015-10-12 | 249.500000 | 249.500000 | 247.600000 | 247.600000 | 9.493890e+03 |
... | ... | ... | ... | ... | ... | ... |
2042 | 2021-05-11 | 55847.242188 | 56872.542969 | 54608.652344 | 56704.574219 | 6.130840e+10 |
2043 | 2021-05-12 | 56714.531250 | 57939.363281 | 49150.535156 | 49150.535156 | 7.521540e+10 |
2044 | 2021-05-13 | 49735.433594 | 51330.843750 | 46980.019531 | 49716.191406 | 9.672115e+10 |
2045 | 2021-05-14 | 49682.980469 | 51438.117188 | 48868.578125 | 49880.535156 | 5.573750e+10 |
2046 | 2021-05-15 | NaN | NaN | NaN | NaN | NaN |
2047 rows × 6 columns
We want to ensure that our Open, High, Low, Close prices and Volume are all numerical values. As shown below, we can see that they are all of numpy.float64 types.
data.dtypes
Date object Open float64 High float64 Low float64 Close float64 Volume float64 dtype: object
A key fact above is that currently, the Date column is filled with some object type of data, most likely string. We confirm this via:
print(type(data['Date'][0]))
print(type(data['Date'][2040]))
<class 'str'> <class 'str'>
Given this, we want to convert these string values into datetime objects. Datetime objects expose year, month, and day as individual fields, which will save us a lot of headache later: in the current form, we would have to parse the string every time we wanted one of those fields. Hence, we will convert these string-based dates to datetime objects now.
# converts Date column from a Series of strings to a Series of datetime objects
data['Date'] = data['Date'].apply(lambda x: datetime.strptime(x, "%Y-%m-%d"))
type(data['Date'][0])
pandas._libs.tslibs.timestamps.Timestamp
Arguably the most important of all the issues.
We know that there is bound to be data missing in our data set. Rarely in data science do we get perfectly well-curated data, as shown by the last 5 issues that we corrected. Hence, we need to explore the missing data, how it is represented, and ask critically why the data is missing.
The key questions we need to ask regarding the missing data are:

  * What data is missing?
  * How much data is missing?
  * For what variables is there missing data?
  * Are there any underlying correlations in the missing data?

Let's answer the first 3 questions.
To answer any of these questions, we need to query the missing data, and to query it we need to know what form it takes. A hint from the earlier sections is that our missing data may have taken 2 forms: NaN entries (entire rows missing) and 0 values (in the Volume column).
Given these properties, we can check where the data is NaN and where the data is 0
# prints locations where data is NaN
data[data.isna().any(axis=1)]
Date | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|
1653 | 2020-04-17 | NaN | NaN | NaN | NaN | NaN |
1828 | 2020-10-09 | NaN | NaN | NaN | NaN | NaN |
1831 | 2020-10-12 | NaN | NaN | NaN | NaN | NaN |
1832 | 2020-10-13 | NaN | NaN | NaN | NaN | NaN |
2046 | 2021-05-15 | NaN | NaN | NaN | NaN | NaN |
# prints locations where data is 0 for Volume
data[data['Volume'] == 0]
Date | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|
688 | 2017-08-26 | 4371.25 | 4371.25 | 4371.25 | 4371.25 | 0.0 |
1039 | 2018-08-12 | 6091.87 | 6091.87 | 6091.87 | 6091.87 | 0.0 |
1104 | 2018-10-16 | 6437.66 | 6437.66 | 6428.43 | 6428.43 | 0.0 |
# prints locations where data is 0 for High
data[data['High'] == 0]
Date | Open | High | Low | Close | Volume |
---|
# prints locations where data is 0 for Low
data[data['Low'] == 0]
Date | Open | High | Low | Close | Volume |
---|
# prints locations where data is 0 for Open
data[data['Open'] == 0]
Date | Open | High | Low | Close | Volume |
---|
# prints locations where data is 0 for Close
data[data['Close'] == 0]
Date | Open | High | Low | Close | Volume |
---|
We can see above that the data is NaN across all columns for 5 dates, and that Volume is 0 for 3 dates. We also see that wherever 0 appears as the missing-data marker, it is only the volume data that is missing.
Hence we can answer all of our pertinent questions for this section:
What data is missing? The missing data is listed above. We are missing 5 rows of data where every column has data missing represented by NaN. We also have 3 rows of data where only the volume column has data missing and is represented by 0.
How much data is missing? As mentioned above, we are missing a total of 8 rows of data where 5 rows are fully missing and 3 rows are partially missing
For what variables is there missing data? As mentioned above, we are missing, for 5 rows, all of the price and volume data, while for 3 rows, we are missing just the volume data.
Let's answer our last question now: are there any underlying correlations in the missing data? (That is, classifying our missing data as MAR, MCAR, or MNAR.)
When speculating about why this trading data is missing, the immediate suspect is that the missing data is linked to specific holidays.
We can observe this via the string of missing trading data (replaced by NaN) between October 9th, 2020 and October 13th, 2020. October 12th, 2020, another missing entry in our data, was Columbus Day, and many markets considered the entire weekend, going back to that Friday, to be Columbus Day weekend. Hence, there seems to be a strong correlation, and in fact a plausible causation: the missing data in that interval is due to Columbus Day weekend leading most traders to take a vacation, hence the lack of trading data.
This reasoning is not uncommon for trading data, and indeed for time series data in general. An unrelated example: the New York Stock Exchange operates from 9:30 am to 4:00 pm, so if we were analysing minute-by-minute price data of a specific stock over a day, we would get quite a few rows with missing data, since the market is not actively trading at, say, 3:00 am. This missingness is unrelated to the actual observed variables and is instead driven by unobserved ones.
Going off this hypothesis, it seems as though the missing NaN data is uncorrelated with the observed trading data, as price and volume are not causing the missingness. Hence, we classify this missing data as Missing Completely at Random (MCAR), since the probability that data is missing is uncorrelated with the values in our data set. We therefore cannot model this missingness and can mostly ignore it.
It is unclear exactly why this data is missing. Cross-referencing other data sources shows that there was trading volume on the 3 missing days: 08/26/2017, 08/12/2018, and 10/16/2018.
We can hypothesize that this is simply corrupt data, as there are only 3 occurrences of volume = 0. Although this is a simplistic hypothesis, it may be impossible to determine whether there is a non-random pattern behind this missingness, given how few data points have missing volume values.
Hence, we can classify this volume data as Missing Completely at Random (MCAR) as well.
Now that we know we can classify all the missing data as MCAR, we can determine what to do with it. There are 3 typical approaches to dealing with missing data: removing the affected rows, imputing the missing values, or encoding the missingness as its own value.
Because the NaN rows have every column missing, we can directly remove them: it is effectively impossible to encode or impute those rows, since there is no observed value left in them from which to impute the rest.
# drops any row with NaN values; axis = 0 -> drops rows
data = data.dropna(axis=0)
# resets indices
data = data.reset_index(drop=True)
data
Date | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|
0 | 2015-10-08 | 242.500000 | 245.000000 | 242.500000 | 243.950000 | 4.595840e+03 |
1 | 2015-10-09 | 243.950000 | 249.970000 | 243.600000 | 245.390000 | 7.651630e+03 |
2 | 2015-10-10 | 245.390000 | 246.300000 | 244.600000 | 246.300000 | 2.984440e+03 |
3 | 2015-10-11 | 246.300000 | 249.500000 | 246.300000 | 249.500000 | 3.021120e+03 |
4 | 2015-10-12 | 249.500000 | 249.500000 | 247.600000 | 247.600000 | 9.493890e+03 |
... | ... | ... | ... | ... | ... | ... |
2037 | 2021-05-10 | 58250.871094 | 59519.355469 | 54071.457031 | 55859.796875 | 7.177655e+10 |
2038 | 2021-05-11 | 55847.242188 | 56872.542969 | 54608.652344 | 56704.574219 | 6.130840e+10 |
2039 | 2021-05-12 | 56714.531250 | 57939.363281 | 49150.535156 | 49150.535156 | 7.521540e+10 |
2040 | 2021-05-13 | 49735.433594 | 51330.843750 | 46980.019531 | 49716.191406 | 9.672115e+10 |
2041 | 2021-05-14 | 49682.980469 | 51438.117188 | 48868.578125 | 49880.535156 | 5.573750e+10 |
2042 rows × 6 columns
We can get the missing volume data from online databases. Here, we will use Yahoo Finance and manually impute the values from an online query rather than downloading another file to fill in the data.
# acquires the new indices where we need to update the 0 volume data, since we re-indexed in an earlier block
# Turns out they are the same, but it is still good practice to ensure you are updating the correct indices
data[data['Volume'] == 0]
Date | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|
688 | 2017-08-26 | 4371.25 | 4371.25 | 4371.25 | 4371.25 | 0.0 |
1039 | 2018-08-12 | 6091.87 | 6091.87 | 6091.87 | 6091.87 | 0.0 |
1104 | 2018-10-16 | 6437.66 | 6437.66 | 6428.43 | 6428.43 | 0.0 |
# Replaces the 0 volume values with values searched up on yahoo
data.at[688, 'Volume'] = 1511609984
data.at[1039, 'Volume'] = 5665250000
data.at[1104, 'Volume'] = 4074800000
# displays whether we still have 0 volume values to ensure we did change the values
data[data['Volume'] == 0]
Date | Open | High | Low | Close | Volume |
---|
Since we have cleaned up the data, let's now take an aggregate view of it, which we can refer back to moving forward.
data
Date | Open | High | Low | Close | Volume | |
---|---|---|---|---|---|---|
0 | 2015-10-08 | 242.500000 | 245.000000 | 242.500000 | 243.950000 | 4.595840e+03 |
1 | 2015-10-09 | 243.950000 | 249.970000 | 243.600000 | 245.390000 | 7.651630e+03 |
2 | 2015-10-10 | 245.390000 | 246.300000 | 244.600000 | 246.300000 | 2.984440e+03 |
3 | 2015-10-11 | 246.300000 | 249.500000 | 246.300000 | 249.500000 | 3.021120e+03 |
4 | 2015-10-12 | 249.500000 | 249.500000 | 247.600000 | 247.600000 | 9.493890e+03 |
... | ... | ... | ... | ... | ... | ... |
2037 | 2021-05-10 | 58250.871094 | 59519.355469 | 54071.457031 | 55859.796875 | 7.177655e+10 |
2038 | 2021-05-11 | 55847.242188 | 56872.542969 | 54608.652344 | 56704.574219 | 6.130840e+10 |
2039 | 2021-05-12 | 56714.531250 | 57939.363281 | 49150.535156 | 49150.535156 | 7.521540e+10 |
2040 | 2021-05-13 | 49735.433594 | 51330.843750 | 46980.019531 | 49716.191406 | 9.672115e+10 |
2041 | 2021-05-14 | 49682.980469 | 51438.117188 | 48868.578125 | 49880.535156 | 5.573750e+10 |
2042 rows × 6 columns
Let's now try to observe the underlying data distribution. This will give us insight into what type of ML system to use, potential biases, and how to approach prediction overall.
Let's first look at the general progression of BTC prices over time. Since there are 4 measurements of price (Open, High, Low, Close), we will plot all 4 as individual lines.
plot_cols = ['Open', 'High', 'Low', 'Close']
# setting size of the plot
plt.figure(figsize=(20,10))
# titling the graph and setting the labels
plt.title('Historical BTC Price Data in USD')
plt.ylabel('Price')
#plotting each price metric
for col in plot_cols:
y = data[col]
x = data['Date']
plt.plot(x, y, label=col)
plt.legend(bbox_to_anchor=(1.1,1), loc=0, borderaxespad=0)
A key fact to see here, which might seem obvious to the average trader, is that these prices are heavily correlated with each other. This makes sense: the Open price builds on the prior day's Close price, and any High and Low prices achieved that day depend on the Open price, as we are not going to see massive stochasticity directly in the price data.
Although this may seem like an obvious fact, it is always wise to ensure that any correlations you expect in your data actually manifest in EDA.
As we can see above, all 4 price metrics roughly match each other. Hence we can choose a single representative price metric for all 4. We will choose the Close price as our representative metric, although you may choose any of them.
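As a quick numeric check of this visual impression, we can print the pairwise correlations between the four price columns. This is a minimal sketch that simply reuses the data frame and plot_cols list defined above.
# Pairwise correlations between the four price columns.
# Values close to 1 confirm what the plot suggests visually.
print(data[plot_cols].corr())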
Hence, let's now look at a graph with both volume and price plotted, to get a fuller view of the underlying trends.
# plotting the top portion which is the Close price
top_plt = plt.subplot2grid((5,4), (0, 0), rowspan=3, colspan=4)
top_plt.plot(data['Date'], data['Close'])
# titling the top half
plt.title('Historical BTC Price Data in USD')
# plotting the bottom portion which is Volume
bot_plt = plt.subplot2grid((5,4), (3,0), rowspan=3, colspan=4)
bot_plt.bar(data['Date'], data['Volume'])
# titling the bottom half
plt.title('Historical BTC Trading Volume')
plt.gcf().set_size_inches(12,8)
The key thing to note here is that although there was a price spike in 2018, much of the upward price movement in BTC is strongly correlated with the increased trading activity seen in BTC over the last year (since early 2020). Part of this is likely due to the COVID-19 pandemic, as the graph shows this volume trend beginning around March-May of 2020.
Let's explore this relationship between volume and price by splitting the data set in 2: one part before March 11th, 2020 (the day the WHO declared COVID-19 a pandemic) and one after. We do this because the pickup in trading activity clearly happened during COVID quarantine, which is an exogenous variable with respect to our data. Thereby, we should analyze the underlying distribution of volume and price data on each split separately, so that the COVID-19 exogenous variable does not dominate our statistical exploration.
The easiest way to observe a relationship between variables is to estimate a linear regression between the 2 variables. Hence, we will try that.
# generating the data up until March 11th 2020, the day the WHO declared COVID-19 to be a pandemic
# we also pass in the data sorted by volume
uptill_2020 = data[data['Date'] < datetime(2020, 3, 11)].sort_values('Volume')
x_uptill = uptill_2020['Volume']
y_uptill = uptill_2020['Close']
# modelling a linear regression for the data up till March 11th, 2020
# generates estimates -> m = slope parameter, b = intercept
m_b4, b_b4 = np.polyfit(x=x_uptill,y=y_uptill,deg=1)
# plotting the actual data
plt.figure(figsize=(20,10))
plt.title('Volume vs Close Price BEFORE March 11 2020')
plt.xlabel('Volume')
plt.ylabel('Price')
plt.plot(x_uptill, y_uptill, '.', x_uptill, m_b4 * x_uptill + b_b4)
[<matplotlib.lines.Line2D at 0x7f87445a5df0>, <matplotlib.lines.Line2D at 0x7f87445a5f10>]
NOTE: The fitted line above is clearly being pulled away from the bulk of the data by a few outliers.
The common instinct among many statisticians seeing this would be to remove the outliers and re-estimate the linear regression. However, we will not do this, because these outliers are important to the price and volume movement of BTC and warrant further study. Looking at the volume data, we know BTC was relatively illiquid (meaning it was not easily bought and sold) up until the COVID-19 pandemic, so these volume spikes have causal factors we need to study.
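Before moving on, a simple way to flag those outliers for further study (rather than removing them) is to list the highest-volume days in the pre-pandemic slice. This is a small sketch reusing the uptill_2020 frame built above.
# List the five highest-volume days before the pandemic for closer inspection,
# instead of discarding them as regression outliers.
print(uptill_2020.nlargest(5, 'Volume')[['Date', 'Close', 'Volume']])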
Instead of re-estimating the regression, let's plot the same data as a time series.
# resorting values back to by date
uptill_2020 = uptill_2020.sort_values('Date')
# plotting the top portion which is the Close price
top_plt = plt.subplot2grid((5,4), (0, 0), rowspan=3, colspan=4)
top_plt.plot(uptill_2020['Date'], uptill_2020['Close'])
# titling the top half
plt.title('Historical BTC Price Data in USD UP TILL COVID-19')
# plotting the bottom portion which is Volume
bot_plt = plt.subplot2grid((5,4), (3,0), rowspan=3, colspan=4)
bot_plt.bar(uptill_2020['Date'], uptill_2020['Volume'])
# titling the bottom half
plt.title('Historical BTC Trading Volume UP TILL COVID-19')
plt.gcf().set_size_inches(12,8)
NOTE: the volume spike between 2018-07 and 2019-01.
That volume spike appears to be correlated with a drop-off in BTC price, so there does seem to be some underlying correlation between volume and price.
As earlier, let's look at the relationship between volume and price through the lens of a regression first.
# generating data after March 11th as that is when we knew we were in a pandemic
# we also pass in the data sorted by volume
after = data[data['Date'] >= datetime(2020,3,11)].sort_values('Volume')
x_after = after['Volume']
y_after = after['Close']
# modelling a linear regression for the data after March 11th, 2020
m_af,b_af = np.polyfit(x=x_after,y=y_after,deg=1)
# plotting the actual data
plt.figure(figsize=(20,10))
plt.title('Volume vs Close Price AFTER March 11 2020')
plt.xlabel('Volume')
plt.ylabel('Price')
plt.plot(x_after, y_after, '.', x_after, m_af * x_after + b_af)
[<matplotlib.lines.Line2D at 0x7f8736403040>, <matplotlib.lines.Line2D at 0x7f8736403c70>]
NOTE: This fitted line is again clearly being pulled away from the bulk of the data by a few outliers.
As before, we will not re-estimate the linear regression with the outliers removed, since these deviations warrant study due to their magnitude. Instead, we will again plot a time series graph of volume and price.
# resorting values back to by date
after = after.sort_values('Date')
# plotting the top portion which is the Close price
top_plt = plt.subplot2grid((5,4), (0, 0), rowspan=3, colspan=4)
top_plt.plot(after['Date'], after['Close'])
# titling the top half
plt.title('Historical BTC Price Data in USD AFTER COVID-19 DECLARATION')
# plotting the bottom portion which is Volume
bot_plt = plt.subplot2grid((5,4), (3,0), rowspan=3, colspan=4)
bot_plt.bar(after['Date'], after['Volume'])
# titling the bottom half
plt.title('Historical BTC Trading Volume AFTER COVID-19 DECLARATION')
plt.gcf().set_size_inches(12,8)
NOTE: the large volume spike near the 2021-03 mark.
This graph suggests that the pre-pandemic correlation between volume and price continues into the pandemic period. The large spike in trading activity was immediately followed by a rally in BTC price. This further indicates that the existence of a correlation between volume and price is time-invariant: throughout all time periods, there appears to be a correlation between volume and price.
Let's conduct hypothesis testing to cement the relationship between price and volume and to measure its statistical significance.
Our null hypothesis will be that volume has no effect on price; our alternative hypothesis will be that volume has an effect on price.
uptill_2020 = uptill_2020.sort_values('Volume')
model_uptill = ols(formula='Close~Volume', data=uptill_2020).fit()
print(model_uptill.summary())
OLS Regression Results ============================================================================== Dep. Variable: Close R-squared: 0.016 Model: OLS Adj. R-squared: 0.016 Method: Least Squares F-statistic: 26.57 Date: Mon, 17 May 2021 Prob (F-statistic): 2.86e-07 Time: 14:47:09 Log-Likelihood: -15669. No. Observations: 1616 AIC: 3.134e+04 Df Residuals: 1614 BIC: 3.135e+04 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 4719.4076 99.084 47.630 0.000 4525.061 4913.754 Volume 2.796e-06 5.42e-07 5.155 0.000 1.73e-06 3.86e-06 ============================================================================== Omnibus: 79.708 Durbin-Watson: 0.835 Prob(Omnibus): 0.000 Jarque-Bera (JB): 83.339 Skew: 0.525 Prob(JB): 8.00e-19 Kurtosis: 2.634 Cond. No. 1.85e+08 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.85e+08. This might indicate that there are strong multicollinearity or other numerical problems.
As shown above, the p-value for the Volume parameter is effectively 0, so it is extremely unlikely that we would observe this sample of Volume-Price pairs if Volume had no effect on Price. Hence, we conclude that in the pre-pandemic phase Volume had a statistically significant effect on Price, and we reject the null hypothesis in favour of the alternative. (Note, though, that the R-squared is only 0.016, so Volume explains very little of the variation in Price even though the relationship is statistically significant.)
after = after.sort_values('Volume')
model_after = ols(formula='Close~Volume', data=after).fit()
print(model_after.summary())
OLS Regression Results ============================================================================== Dep. Variable: Close R-squared: 0.417 Model: OLS Adj. R-squared: 0.415 Method: Least Squares F-statistic: 302.7 Date: Mon, 17 May 2021 Prob (F-statistic): 1.48e-51 Time: 14:47:57 Log-Likelihood: -4677.3 No. Observations: 426 AIC: 9359. Df Residuals: 424 BIC: 9367. Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 5969.4087 1216.329 4.908 0.000 3578.623 8360.194 Volume 4.432e-07 2.55e-08 17.398 0.000 3.93e-07 4.93e-07 ============================================================================== Omnibus: 99.554 Durbin-Watson: 1.225 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1618.068 Skew: -0.481 Prob(JB): 0.00 Kurtosis: 12.499 Cond. No. 8.43e+10 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 8.43e+10. This might indicate that there are strong multicollinearity or other numerical problems.
Similarly, the p-value for the Volume parameter is effectively 0 here as well, so it is extremely unlikely that we would observe this sample of Volume-Price pairs if Volume had no effect on Price. Hence, we conclude that in the pandemic phase Volume also had a statistically significant effect on Price (with a considerably higher R-squared of 0.417), and again we reject the null hypothesis in favour of the alternative.
Another key variable we want to observe is the daily percentage change in price from open to close. This variable has an important statistical implication: if it is stationary about a 0 mean, then we know BTC evolves in a cyclic fashion, whereas if it is non-stationary, then BTC is a stochastic variable that cannot be predicted by determining where we are in the cycle.
Hence, let's visualize the daily percentage change in price.
# saving the daily price changes in percentage format in a new column
data['DailyChange'] = ((data['Close'] - data['Open'])/data['Open']) * 100
data['DailyChange']
0 0.597938 1 0.590285 2 0.370838 3 1.299229 4 -0.761523 ... 2037 -4.104787 2038 1.535138 2039 -13.336963 2040 -0.038689 2041 0.397631 Name: DailyChange, Length: 2042, dtype: float64
# displaying the daily price change
# plotting the top portion which is the daily price change
top_plt = plt.subplot2grid((5,4), (0, 0), rowspan=3, colspan=4)
# creates an x-axis line
horiz_line_data = np.array([0 for i in range(len(data['Date']))])
top_plt.plot(data['Date'], data['DailyChange'])
#plots the x-axis line
top_plt.plot(data['Date'], horiz_line_data, 'orange')
# titling the top half
plt.title('Historical BTC Daily Price Change (%)')
# plotting the bottom portion which is Volume
bot_plt = plt.subplot2grid((5,4), (3,0), rowspan=3, colspan=4)
bot_plt.bar(data['Date'], data['Volume'])
# titling the bottom half
plt.title('Historical BTC Trading Volume')
plt.gcf().set_size_inches(12,8)
As we can see above, there does seem to be a degree of stationarity present in the percentage change data up until the pandemic, after which there seems to be a large degree of non-stationarity present.
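To put a rough number on this impression, one option is an Augmented Dickey-Fuller test on the daily changes before and after the pandemic declaration. The sketch below uses statsmodels' adfuller with the same March 11th, 2020 cutoff used earlier; it is a quick check, not a full stationarity analysis.
from statsmodels.tsa.stattools import adfuller

# ADF test on DailyChange for each sub-period.
# Null hypothesis: the series has a unit root (i.e. it is non-stationary).
for label, subset in [('pre-pandemic', data[data['Date'] < datetime(2020, 3, 11)]),
                      ('pandemic', data[data['Date'] >= datetime(2020, 3, 11)])]:
    stat, pvalue, *_ = adfuller(subset['DailyChange'].dropna())
    print(f'{label}: ADF statistic = {stat:.3f}, p-value = {pvalue:.4f}')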
Let's now try to measure the relationship between volume and daily price change. We will use volume as our explanatory variable for the sake of consistency, though you could use either one. The important facet of this statistical analysis is observing whether a relationship exists between the 2 variables at all, not the predictive power or exact accuracy of the regression.
clone = data.sort_values('Volume')
x = clone['Volume']
y = clone['DailyChange']
# modelling a linear regression for the full dataset
m,b = np.polyfit(x=x,y=y,deg=1)
# plotting the actual data
plt.figure(figsize=(20,10))
plt.title('Volume vs Price Change')
plt.xlabel('Volume')
plt.ylabel('Price Change')
plt.plot(x, y, '.', x, m * x + b)
plt.show()
NOTE: Interestingly, to the naked eye there seems to be no relationship between daily volume and price change. The fitted line is nearly horizontal, suggesting the OLS regression essentially defaulted to a mean line because it explains almost none of the variation (very low R-squared).
Following from the above, the graph does suggest that DailyChange and Volume are not correlated. Let's confirm this via hypothesis testing.
Our null hypothesis is that there exists no correlation between DailyChange and Volume; our alternative hypothesis is that there exists a correlation between DailyChange and Volume.
model = ols(formula='DailyChange~Volume', data=data).fit()
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: DailyChange R-squared: 0.000 Model: OLS Adj. R-squared: -0.000 Method: Least Squares F-statistic: 0.4736 Date: Mon, 17 May 2021 Prob (F-statistic): 0.491 Time: 18:23:15 Log-Likelihood: -5747.7 No. Observations: 2042 AIC: 1.150e+04 Df Residuals: 2040 BIC: 1.151e+04 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 0.3118 0.097 3.229 0.001 0.122 0.501 Volume 3.047e-12 4.43e-12 0.688 0.491 -5.63e-12 1.17e-11 ============================================================================== Omnibus: 332.396 Durbin-Watson: 2.100 Prob(Omnibus): 0.000 Jarque-Bera (JB): 5722.980 Skew: -0.135 Prob(JB): 0.00 Kurtosis: 11.197 Cond. No. 2.36e+10 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 2.36e+10. This might indicate that there are strong multicollinearity or other numerical problems.
As we can see above, the p-value for the Volume parameter is 0.491, implying that there is almost a 50% chance of obtaining a Volume estimate of 3.047e-12 in a sample like this even if the true Volume parameter is 0 (i.e. no relationship between Volume and DailyChange). Since this p-value is far greater than alpha = 10%, we fail to reject the null hypothesis: we have no evidence that Volume and DailyChange are correlated.
Moving forward, we can now conduct a degree of Machine Learning on the underlying data distribution.
Using 0 time lags, we saw that there was a correlation between Closing Price and Volume, and no apparent correlation between Daily Price Change and Volume. However, since this is time series data, there might be underlying time-lagged correlations between variables. In our EDA we did not explore these, as we are unsure which lags are actually correlated, and lag selection via AIC/BIC is outside the scope of a typical EDA section.
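As a quick, informal probe of such lags (not a substitute for proper lag selection), we can correlate DailyChange against shifted copies of Volume:
# Correlation between today's DailyChange and Volume from k days earlier.
# This is only an informal probe; formal lag selection is out of scope here.
for lag in range(8):
    corr = data['DailyChange'].corr(data['Volume'].shift(lag))
    print(f'lag {lag}: correlation = {corr:.4f}')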
Typically when building an ML model, we need to figure out what we are trying to optimize. Since the purpose of this notebook is to make a model that can generate profits, we can frame the problem with a greedy algorithmic rule: if we optimize daily trading profits, we will optimize profits in the long run. That is to say, we will build an ML model that predicts whether BTC will close lower than it opens or close higher than it opens.
When BTC closes lower than it opens, the key trading move we would make is to short BTC, that is, to bet against the instrument by borrowing BTC from a broker, selling it, and then buying it back at a future date at (hopefully) a lower price.
When BTC closes higher than it opens, the key trading move we would make is to go long on BTC, that is, to bet on the instrument by buying BTC, holding it, and then selling it for a profit.
Thereby, if we can predict whether the daily shift in prices is going lower or higher (a binary classification problem, measured by a positive or negative difference between the opening and closing price), we can use this model to decide whether to buy or short at the opening price every day. A minimal sketch of this framing follows below.
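As a concrete starting point for that binary classification, we can build the up/down label directly from the existing columns and fit a simple logistic-regression baseline. This is only an illustrative sketch: the lag-1 features (PrevChange, PrevVolume), the ml_data name, and the 80/20 chronological split are my own placeholder choices, not a tuned setup.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Binary target: 1 if the day closed above its open, 0 otherwise.
ml_data = data.copy()
ml_data['Up'] = (ml_data['Close'] > ml_data['Open']).astype(int)

# Illustrative features: the previous day's percentage change and volume.
ml_data['PrevChange'] = ml_data['DailyChange'].shift(1)
ml_data['PrevVolume'] = ml_data['Volume'].shift(1)
ml_data = ml_data.dropna()

# Chronological split -- never shuffle time series data.
split = int(len(ml_data) * 0.8)
features = ['PrevChange', 'PrevVolume']
X_train, y_train = ml_data[features].iloc[:split], ml_data['Up'].iloc[:split]
X_test, y_test = ml_data[features].iloc[split:], ml_data['Up'].iloc[split:]

baseline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
print('Directional accuracy on the held-out period:',
      accuracy_score(y_test, baseline.predict(X_test)))
This baseline is mostly a sanity check: any proper network we build later should beat it.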
Due to time constraints (I did this entire project myself), I was unable to construct a full neural network to execute this binary classification. Although I dabbled in other ML models, I realised that, ideally, we would use a neural network for precise probability estimates. I have included my game plan for actually building out a full network, with a rough sketch below.
For either time frame, the main metric by which I would measure success is validation accuracy on the binary classification, followed by deploying the model into the "wild" for a month to see how it performs against incoming data (which we would similarly treat as a test set). If accuracy in price movement predictions is above 70%, I would consider either network to have performed quite well.
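To make that game plan slightly more concrete, below is a minimal sketch of the kind of feed-forward classifier such a plan could start from. The lag window, layer sizes, and other hyperparameters are placeholder assumptions, not tuned values.
from tensorflow import keras

LAGS = 7  # number of past days used as input features; a placeholder choice

# Small feed-forward classifier over lagged DailyChange and Volume features.
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(LAGS * 2,)),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),  # probability that the day closes up
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

# Training would use lagged feature matrices built along the same lines as the
# baseline above (with LAGS lagged columns) and a chronological train/validation split:
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50, batch_size=32)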
BTC aggregate price movements do seem to have a large degree of correlation with Volume. This is apparent from our candidate price metric (Close) and Volume when we measured their correlation and conducted hypothesis testing on these variables.
In measuring whether daily price movement is related to volume, we found that daily price changes are, in fact, not very correlated with volume. It also seems that price changes were somewhat stationary prior to the pandemic, but since the pandemic started, daily price changes have entered a non-stationary epoch.
NOTE: both of these analyses should carry an asterisk, as we did not measure lagged correlation. There is a real possibility that lagged variables in our data set are correlated, which would tell us more about the distribution. In building an actual ML model, using lagged variables would be essential, as we are in many respects trying to predict future price movements from past price movements.
I hope that by reading this notebook you have learned a bit more about the underlying statistical distribution of BTC, and that it helps you in your own data science projects in finance. I also hope that seeing a full data science pipeline, consisting of data curation, data management and cleaning, exploratory data analysis, hypothesis testing, and machine learning, has been useful to you.
As always, thank you so much for reading this!
The used data sets can be found here:
BTC-USD.csv can be retrieved from https://finance.yahoo.com/quote/BTC-USD/history?period1=1410825600&period2=1621123200&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true
BTCUSD_day.csv can be retrieved from https://www.kaggle.com/prasoonkottarathil/btcinusd?select=BTCUSD_day.csv