Predicting the direction of volatile assets like Bitcoin is a central challenge in quantitative finance. While daily noise can make short-term predictions resemble random walks, analyzing trends over slightly longer horizons, like a week, might offer more traction. This article details a Python-based approach using a Random Forest classifier and a rolling forecast methodology to predict whether Bitcoin’s price will be higher or lower seven days from the present, leveraging a pre-selected set of technical indicators. We’ll cover the theory, the implementation with code snippets, and how to interpret the results.
1. Theoretical Background
Before diving into the code, let’s understand the core concepts:
a) Random Forest Classifier
b) Feature Selection (Context)
This script assumes that a preliminary analysis has been performed to identify potentially predictive features. In our development process, Mutual Information scores were used to rank ~30 technical indicators based on their statistical relationship with the 1-day price direction. We will use the top 15 features identified in that analysis as inputs to our Random Forest model, assuming they might also hold relevance for the 7-day horizon.
c) Rolling Forecast Evaluation
d) Classification Metrics
Since we’re predicting direction (Up=1, Down=0), we use classification metrics:
Predicted Down (0) | Predicted Up (1) | |
Actual Down (0) | True Negative (TN) | False Positive(FP) |
Actual Up (1) | False Negative(FN) | True Positive (TP) |
2. Python Implementation Details
Let’s walk through the key parts of the Python script.
a) Setup and Configuration
Import libraries and set up parameters. Critically, set
PREDICTION_HORIZON = 7
and define the
TRAINING_WINDOW_DAYS
and the list of
TOP_FEATURES
derived from previous analysis.
Python
# ==============================================================================
# Imports
# ==============================================================================
import pandas as pd
import numpy as np
import yfinance as yf
import talib # Make sure TA-Lib is installed
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, confusion_matrix, ConfusionMatrixDisplay,
roc_auc_score)
import warnings
# ... (warnings configuration) ...
# ==============================================================================
# Configuration
# ==============================================================================
TICKER = 'BTC-USD'
START_DATE = '2021-01-01' # Needs enough data for rolling
END_DATE = None
INTERVAL = '1d' # Daily data
# --- Rolling Window Parameters ---
TRAINING_WINDOW_DAYS = 30 # Approx 1 month training window
PREDICTION_HORIZON = 7 # Predict direction 7 days ahead
# --- Feature Selection ---
# Using Top 15 features identified previously from MI analysis
TOP_FEATURES = [
'ROC_10', 'STOCHRSI_d', 'ADX_14', 'STOCHRSI_k', 'RSI_14',
'STOCH_k', 'ATR_14', 'EMA_20', 'STOCH_d', 'MACD',
'ULTOSC', 'BB_upper', 'SAR', 'Open_Close', 'MACD_hist'
]
# --- Random Forest Model Parameters ---
N_ESTIMATORS = 150
MAX_DEPTH = 8
MIN_SAMPLES_LEAF = 5
CLASS_WEIGHT = 'balanced'
RANDOM_STATE = 42
b) Data Loading and Indicator Calculation
Standard functions using yfinance
and talib
are used to fetch OHLCV data and compute the full set of ~30 technical
indicators.
Python
# Function definitions for load_data and calculate_indicators
# (Use the full function definitions from the previous script response)
# In main execution block:
data = load_data(TICKER, START_DATE, END_DATE, INTERVAL)
if data is not None:
data_indicators = calculate_indicators(data.copy())
c) Target Variable and Feature Preparation
The 7-day target variable (1 if price is higher 7 days later, 0
otherwise) is created. The data is cleaned of NaNs, and only the
TOP_FEATURES
columns are selected into the
X_all_features
DataFrame, while the Target
column becomes Y_all
.
Python
# Function definition for create_target (horizon=PREDICTION_HORIZON)
# (Use the function definition from the previous script response)
# In main execution block:
data_target = create_target(data_indicators, horizon=PREDICTION_HORIZON)
data_processed = data_target.dropna()
available_features = [f for f in TOP_FEATURES if f in data_processed.columns]
# ... (Error handling if features are missing) ...
X_all_features = data_processed[available_features]
Y_all = data_processed['Target']
Dates_all = data_processed.index # Keep dates for plotting results
d) The Rolling Forecast Loop
This is the core logic change from a simple train/test split.
Python
# --- Rolling Forecast Loop ---
all_predictions = []
all_actuals = []
all_predict_dates = []
all_probabilities = []
start_index = TRAINING_WINDOW_DAYS
end_index = len(X_all_features) - PREDICTION_HORIZON
print(f"\nStarting rolling forecast from index {start_index} to {end_index-1}...")
for i in range(start_index, end_index):
# 1. Define window boundaries
train_start_idx = i - TRAINING_WINDOW_DAYS
train_end_idx = i
predict_feature_idx = i
actual_target_idx = i
# 2. Extract current window data
X_train_window = X_all_features.iloc[train_start_idx:train_end_idx]
Y_train_window = Y_all.iloc[train_start_idx:train_end_idx]
X_predict_point = X_all_features.iloc[[predict_feature_idx]]
Y_actual_point = Y_all.iloc[actual_target_idx]
# 3. Scale features WITHIN the loop
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_window)
X_predict_scaled = scaler.transform(X_predict_point)
# 4. Build and Train Model WITHIN the loop
rf_model = RandomForestClassifier(
n_estimators=N_ESTIMATORS,
max_depth=MAX_DEPTH,
min_samples_leaf=MIN_SAMPLES_LEAF,
random_state=RANDOM_STATE,
n_jobs=-1,
class_weight=CLASS_WEIGHT
)
rf_model.fit(X_train_scaled, Y_train_window)
# 5. Predict and Store Results
prediction = rf_model.predict(X_predict_scaled)[0]
probability = rf_model.predict_proba(X_predict_scaled)[0, 1] # Robust extraction might be needed here too
all_predictions.append(prediction)
all_actuals.append(Y_actual_point)
all_probabilities.append(probability)
all_predict_dates.append(Dates_all[actual_target_idx])
# ... (Optional progress print) ...
print("Rolling forecast complete.")
Crucially, the StandardScaler
and
RandomForestClassifier
are initialized and fitted inside
the loop on each window’s data.
e) Aggregated Evaluation
After the loop completes, the collected predictions and actual values are used to calculate the overall performance metrics.
Python
# --- Evaluate Aggregated Results ---
if not all_actuals:
print("No predictions were made.")
else:
print("\n--- Aggregated Rolling Forecast Metrics ---")
accuracy = accuracy_score(all_actuals, all_predictions)
precision = precision_score(all_actuals, all_predictions, zero_division=0)
recall = recall_score(all_actuals, all_predictions, zero_division=0)
f1 = f1_score(all_actuals, all_predictions, zero_division=0)
try:
roc_auc = roc_auc_score(all_actuals, all_probabilities)
except ValueError:
roc_auc = float('nan')
# ... (print warning) ...
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision (for 1):{precision:.4f}")
# ... (print other metrics) ...
# Baseline comparison
majority_class_overall = Y_all.value_counts().idxmax()
baseline_accuracy = accuracy_score(all_actuals, np.full(len(all_actuals), majority_class_overall))
print(f"\nBaseline Accuracy (...): {baseline_accuracy:.4f}")
# Confusion Matrix Plotting
print("\n--- Confusion Matrix (Aggregated Rolling Forecast) ---")
cm = confusion_matrix(all_actuals, all_predictions)
print(cm)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
# ... (Plotting code for CM) ...
plt.show()
# Optional: Plot actual vs predicted directions over time
# ... (Plotting code for results_df) ...
plt.show()
3. Results and Interpretation (Based on Your Last Run)
Your last run with this rolling Random Forest approach yielded:
[[122 57] / [ 61 128]]
Interpretation:
These results show a clear improvement over random chance and the baseline of simply predicting the majority class. The model achieved ~68% accuracy in predicting the 7-day direction over the rolling test period. Precision and Recall are reasonably balanced (around 68-69%), indicating the model identifies ‘Up’ moves moderately well without excessively predicting ‘Up’ incorrectly. The AUC of ~0.75 suggests a decent discriminatory ability. While not perfect, these results indicate that the combination of selected features, the Random Forest model, and the rolling approach captured a statistically significant predictive signal in the historical data tested.
4. How to Use the Code
pandas
, numpy
, yfinance
,
matplotlib
, seaborn
,
scikit-learn
, and crucially, TA-Lib
(C library
+ Python wrapper) are installed.rolling_rf_btc.py
).TICKER
, START_DATE
,
TRAINING_WINDOW_DAYS
, PREDICTION_HORIZON
, or
Random Forest parameters if desired.python rolling_rf_btc.py
. It will take some time as the
model retrains repeatedly.5. Limitations and Conclusion
TOP_FEATURES
might lose predictive power over time.In conclusion, this script provides a robust framework for evaluating the predictive power of technical indicators for Bitcoin’s weekly direction using a Random Forest model and a realistic rolling forecast method. The results achieved (~68% accuracy, ~0.75 AUC historically) demonstrate a potential edge worthy of further investigation, but require critical interpretation and significant further development before any practical trading application.