Can Google’s TabFM Predict Polymarket Crypto Markets? A Zero-Shot Backtest

Can Google’s TabFM Predict Polymarket Crypto Markets? A Zero-Shot Backtest Markets tested: Polymarket 5-minute BTC/ETH/XRP directional contracts | Sample: 1,320 walk-forward predictions across 55 folds | Models: TabFM (zero-shot), XGBoost, Random Forest, Logistic Regression, SVM Executive Summary Polymarket’s 5-minute crypto contracts ask a deceptively simple question — will price be higher or lower five minutes…

Jul 3, 2026

Desk: Quant Research, Research

Tags: foundation models, machine learning, polymarket, prediction markets, quantitative research

Can Google’s TabFM Predict Polymarket Crypto Markets? A Zero-Shot Backtest

Markets tested: Polymarket 5-minute BTC/ETH/XRP directional contracts | Sample: 1,320 walk-forward predictions across 55 folds | Models: TabFM (zero-shot), XGBoost, Random Forest, Logistic Regression, SVM

Executive Summary

Polymarket’s 5-minute crypto contracts ask a deceptively simple question — will price be higher or lower five minutes from now — and settle it fast enough that there is nowhere to hide a bad model. This report tests whether Google Research’s TabFM (Tabular Foundation Model) can compete with trained classifiers on that question, using a strict, leakage-free walk-forward framework across 1,320 prediction points on BTC, ETH, and XRP.

The headline result: TabFM’s zero-shot predictions — no training, no gradient updates, no hyperparameter search — hit 53.86% accuracy on XRP, edging out a fully trained XGBoost baseline (51.97%) on that asset and landing within a point of the best model on every symbol tested. A companion analysis of Logistic Regression coefficients, Random Forest importances, and SHAP values across all three assets converges on the same three drivers of these markets: short-term momentum, medium-term trend exhaustion, and intra-session mean reversion around VWAP.

1. The Prediction Challenge & Data Alignment

Polymarket’s 5-minute prediction markets for major cryptocurrencies (BTC, ETH, and XRP) resolve UP if price rises over the interval, and DOWN if it falls or stays flat.

Predicting these markets requires a strict, leakage-free data pipeline. For a market spanning 16:00 to 16:05:

Predictions must be made at or before the 16:00 start time.
Features must use technical indicators calculated strictly from historical data before 16:00.
This setup matches the operational constraint of entering a live trade right as the market opens.

This analysis uses a 2-week historical dataset (early April 2026), yielding approximately 1,445 closed markets per asset. Features are extracted across three timescales:

Micro-scale (1-min): 5-minute rolling window statistics (mean, std, delta, last) of prices, session VWAP, and short-term RSI.
Medium-scale (15-min and 1-hour): Standard indicators such as RSI, MACD, ADX, and Bollinger Bands.
Macro-scale (1-day): Long-term trends such as the daily 200-day Simple Moving Average.

2. Foundation Models for Tabular Data: TabFM vs. Text LLMs

Traditional tabular machine learning models require explicit training by updating weights on a specific dataset. In contrast, TabFM is a Tabular Foundation Model that performs zero-shot classification.

Architecture and Inference Model

Specialized Architecture: Unlike text-based LLMs (e.g., GPT-4 or Llama) that process character tokens and output natural language, TabFM is designed specifically for numerical and categorical tabular data. It maps columns to vector embeddings and outputs class probabilities directly.
In-Context Learning (ICL): At inference time, TabFM’s weights remain frozen. Instead of parameter training, it reads a historical training set as "context" and makes predictions on test samples in a single forward pass, leveraging row and column attention mechanisms.
Local Execution: The model runs entirely locally using PyTorch (google/tabfm-1.0.0-pytorch) without external API calls or external text-generation servers.

3. Walk-Forward Backtesting: Folds & Runtime

Models are evaluated using sequential walk-forward validation, which models a real-world trading setup by expanding the training history over time.

Understanding Walk-Forward Folds

A "fold" represents a single backtesting step. For each fold:

The model is presented with a historical training context (120 observations, representing 10 hours of market history).
The model predicts the outcomes of the next test block (24 observations, representing a 2-hour window).
The training window then expands by 24 observations, and the process repeats for the next fold.
Across the 2-week dataset, this yields 55 sequential folds.

TabFM Computational Overhead

TabFM requires 7.5 to 8.5 seconds per fold, totaling approximately 7.5 minutes of compute time per asset (22 minutes total for BTC, ETH, and XRP combined).

Unlike traditional models that evaluate simple tree splits or linear equations in milliseconds, TabFM must pass the entire historical context window (120 rows) and the test queries together through a 24-block causal transformer network. On each step, the transformer dynamically calculates row-to-row and column-to-column attention weights, resulting in higher computational latency.

4. Performance Comparison

The table below outlines the backtesting accuracies and F1-scores across 1,320 prediction points:

Symbol	Metric	Logistic Regression	Random Forest	Support Vector Machine (SVM)	XGBoost Baseline	TabFM (Zero-Shot)
BTC	Accuracy	52.58%	50.98%	49.24%	53.86%	50.00%
	F1-Score	48.77%	50.87%	50.59%	51.78%	46.34%
ETH	Accuracy	50.98%	51.82%	53.86%	53.11%	51.06%
	F1-Score	43.59%	47.18%	45.87%	49.47%	41.27%
XRP	Accuracy	53.41%	54.09%	51.21%	51.97%	53.86%
	F1-Score	44.34%	44.10%	43.51%	46.90%	45.18%
Overall	Accuracy	52.32%	52.30%	51.44%	52.98%	51.64%
	F1-Score	45.57%	47.38%	46.66%	49.43%	44.35%

TabFM’s zero-shot inference is highly competitive: it outperforms the trained XGBoost baseline on XRP (53.86% vs. 51.97%) and matches the top SVM accuracy on ETH — without ever seeing a labeled training example specific to this dataset.

5. Economic & Quantitative Interpretation of the Models

Analyzing the inner workings of Logistic Regression and Random Forest models reveals the structural dynamics of prediction market behavior.

5.1 Logistic Regression Coefficients: A Quantitative Rationale

The coefficients from the Logistic Regression models represent the change in log-odds of a market resolving UP for a one standard deviation increase in the corresponding feature.

BTC Logistic Regression coefficients for Polymarket UP/DOWN prediction

ETH Logistic Regression coefficients for Polymarket UP/DOWN prediction

These charts show a highly consistent structure across all three assets:

Momentum (tf_15min_rsi_14) — Strongly Positive (e.g., +0.550 for XRP): Strong price momentum leading up to the market opening is the single strongest indicator of an UP resolution. Economically, this reflects short-term trend persistence — buying interest on a 15-minute horizon tends to carry over into the subsequent 5-minute window.
Trend Exhaustion (tf_1hour_macd_hist) — Strongly Negative (e.g., -0.432 for XRP): The MACD histogram measures the acceleration of price trends. A highly elevated MACD histogram indicates the trend is moving at maximum velocity, which historically represents late-stage exhaustion. On a 5-minute horizon, these overextended trends are highly prone to reverse or consolidate.
Mean Reversion (tf_1min_mean_5_vwap_session) — Strongly Negative (e.g., -0.367 for XRP): When price trades significantly above the session’s Volume-Weighted Average Price (VWAP) in the final 5 minutes, the probability of an UP resolution falls. This is intra-session mean reversion — liquidity providers and arbitrageurs fade extreme short-term price deviations back toward the average session value.

5.2 Random Forest Feature Importances

The Gini importances of the Random Forest model confirm the importance of these same variables.

BTC Random Forest Gini feature importances

ETH Random Forest Gini feature importances

Short-term momentum (tf_15min_rsi_14) and session VWAP deviation (tf_1min_mean_5_vwap_session) dominate the importance scores, each accounting for 12% to 15% of the total predictive power. Conversely, long-term trends like the daily 200-day moving average (tf_1day_sma_200) account for less than 1% of the model’s split decisions — macro structural trends have essentially no statistical influence on 5-minute prediction horizons.

5.3 SHAP Summary (Beeswarm Analysis)

We computed Shapley Additive exPlanations (SHAP) values for the Random Forest model to analyze the non-linear relationship and direction of feature impact on individual predictions.

BTC SHAP summary plot for Random Forest UP class

ETH SHAP summary plot for Random Forest UP class

On a SHAP summary plot, horizontal position represents a feature’s impact on the model output (positive values increase the probability of an UP prediction), and color represents the feature value (red is high, blue is low):

tf_15min_rsi_14: Red dots cluster on the positive side of the SHAP axis, blue dots on the negative side — high RSI values consistently push predictions toward UP, a clean momentum effect.
tf_1min_mean_5_vwap_session: Red dots cluster on the negative side, blue dots on the positive side — high deviations above the session VWAP push predictions toward DOWN, validating the mean-reversion driver.
tf_1min_delta_5_ema_9: This variable measures trend acceleration over the final minutes. High values (red) associate with positive SHAP values, showing short-term trend acceleration acts as a reinforcing signal.

The alignment across Logistic Regression coefficients, Random Forest importances, and SHAP distributions confirms that both linear and non-linear models capture the same underlying market structure: following medium-term momentum, but fading extreme short-term deviations from the session average.

6. Correlation Analysis & Decoupling

The heatmap visualizes the correlation matrix of the key indicators.

BTC technical indicator correlation heatmap

ETH technical indicator correlation heatmap

Indicators of similar frequencies show moderate correlation (e.g. r ≈ 0.65 between 15-minute and 1-hour RSI), but short-term 1-minute indicators are decoupled from hourly and daily trends. This structure lets the models combine microstructural mean-reversion signals with medium-term directional momentum without suffering from multicollinearity.

7. Computational Hardware Environment

To ensure complete transparency, all experiments were executed on a local workstation containing an Intel(R) Core(TM) Ultra 9 275HX CPU (24 Cores) and an NVIDIA GeForce RTX 5080 Laptop GPU.

8. Conclusion

This study demonstrates that Google Research’s TabFM zero-shot tabular foundation model is capable of predicting short-horizon prediction markets. By leveraging pre-trained representations and in-context learning with a 120-row historical window, TabFM achieved a competitive 53.86% accuracy on XRP, outperforming the fully trained XGBoost baseline on that asset.

Logistic Regression coefficients and SHAP values show that Polymarket outcomes on these 5-minute crypto contracts are driven by a combination of trend momentum (RSI), trend exhaustion (MACD), and mean reversion (session VWAP). Tabular foundation models carry higher computational latency from the transformer’s forward pass, but their zero-shot capability — skipping training entirely — is a genuinely useful alternative to standing up a full model-training pipeline for short-horizon financial forecasting.

None of the accuracies above clear a large enough margin over a 50% coin flip to be treated as a standalone edge before transaction costs, market impact, and Polymarket’s fee structure are accounted for. This is a research screen into what drives these markets, not a validated trading signal.

9. Replication Code

Replication code is provided below in collapsible dropdowns.

Click to view run_tabfm_prediction.py (TabFM Backtest Script)

"""
Walk-forward TabFM evaluation for Polymarket modeling datasets.

Outputs:
  polymarket_blogposts/results/<RUN_LABEL>/<SYMBOL>/historical_predictions.parquet
  polymarket_blogposts/results/<RUN_LABEL>/<SYMBOL>/confusion_matrix.csv
  polymarket_blogposts/results/<RUN_LABEL>/<SYMBOL>/summary.json
  polymarket_blogposts/results/<RUN_LABEL>/run_summary.json
"""

from __future__ import annotations

import argparse
import json
import logging
import math
import sys
import time
from pathlib import Path
from typing import Any

import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score, roc_auc_score

# Ensure the local tabfm package is on path
BLOGPOSTS_DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(BLOGPOSTS_DIR / "tabfm"))

from tabfm import TabFMClassifier
from tabfm.src.pytorch import tabfm_v1_0_0 as tabfm_v1_0_0_pytorch

# Paths to datasets
TRADING_PROJECT_DIR = Path("C:/Users/MattO/Documents/projects_V2/2026/trading_polymarket_crypto_prediction")
DATA_ROOT = TRADING_PROJECT_DIR / "xgboost" / "data"
OUTPUT_ROOT = BLOGPOSTS_DIR / "results"
SYMBOLS = ["BTC", "ETH", "XRP"]

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s  %(levelname)-8s  %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log = logging.getLogger(__name__)


def _write_json(path: Path, payload: dict[str, Any]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2, ensure_ascii=True)
        handle.write("\n")


def _parse_symbols(raw: str) -> list[str]:
    requested = [part.strip().upper() for part in raw.split(",") if part.strip()]
    invalid = [symbol for symbol in requested if symbol not in SYMBOLS]
    if invalid:
        raise ValueError(f"Unsupported symbols: {', '.join(invalid)}")
    return requested or SYMBOLS


def _load_dataset(symbol: str) -> pd.DataFrame:
    path = DATA_ROOT / symbol / "combined" / "modeling.parquet"
    if not path.exists():
        raise FileNotFoundError(f"Missing modeling dataset: {path}")

    df = pd.read_parquet(path)
    if df.empty:
        return df

    df["time"] = pd.to_datetime(df["time"], errors="coerce", utc=True)
    df["target_up_binary"] = pd.to_numeric(df["target_up_binary"], errors="coerce")
    df = df.dropna(subset=["time", "target_up_binary"]).sort_values("time").drop_duplicates(subset=["time"], keep="last").reset_index(drop=True)

    # Fill any missing indicator values before feeding to TabFM
    df = df.ffill().bfill()
    return df


def _select_features(df: pd.DataFrame, mode: str) -> list[str]:
    all_features = [col for col in df.columns if col not in ["time", "target_up_binary"]]

    if mode == "all":
        return all_features
    elif mode == "key":
        # Curate 12 key indicators representing different periods and categories
        key_patterns = [
            "tf_15min_rsi_14",
            "tf_15min_macd_hist",
            "tf_15min_adx_14",
            "tf_1hour_rsi_14",
            "tf_1hour_macd_hist",
            "tf_1hour_adx_14",
            "tf_1day_rsi_14",
            "tf_1day_macd_hist",
            "tf_1day_sma_200",
            "tf_1min_last_rsi_7",
            "tf_1min_mean_5_vwap_session",
            "tf_1min_delta_5_ema_9"
        ]
        selected = [col for col in key_patterns if col in df.columns]
        if not selected:
            log.warning("No key features matched, falling back to first 10 columns.")
            return all_features[:10]
        return selected
    else:
        raise ValueError(f"Unknown feature mode: {mode}")


def _build_folds(n_rows: int, min_train_size: int, test_size: int, step_size: int, min_last_test_size: int) -> list[tuple[int, int, int]]:
    folds: list[tuple[int, int, int]] = []
    fold_id = 1
    train_end = min_train_size

    while train_end < n_rows:
        test_end = min(train_end + test_size, n_rows)
        if test_end - train_end < min_last_test_size:
            break
        folds.append((fold_id, train_end, test_end))
        fold_id += 1
        train_end += step_size

    return folds


def _fit_and_predict(
    X_train: pd.DataFrame,
    y_train: pd.Series,
    X_test: pd.DataFrame,
    model: Any,
    n_estimators: int,
    max_num_rows: int | None,
) -> tuple[pd.Series, pd.Series, str]:
    if y_train.nunique(dropna=True) < 2:
        constant_class = int(y_train.iloc[0])
        prob = pd.Series([float(constant_class)] * len(X_test), index=X_test.index, dtype=float)
        pred = pd.Series([constant_class] * len(X_test), index=X_test.index, dtype=int)
        return pred, prob, "constant_baseline"

    # Initialize TabFMClassifier
    clf = TabFMClassifier(
        model=model,
        n_estimators=n_estimators,
        max_num_rows=max_num_rows,
        random_state=42,
        verbose=False,
    )
    clf.fit(X_train.to_numpy(), y_train.to_numpy())

    probs = clf.predict_proba(X_test.to_numpy())
    prob = pd.Series(probs[:, 1], index=X_test.index, dtype=float)
    pred = pd.Series(clf.predict(X_test.to_numpy()), index=X_test.index, dtype=int)
    return pred, prob, "tabfm"


def _metrics_payload(y_true: pd.Series, y_pred: pd.Series, y_prob: pd.Series) -> dict[str, Any]:
    payload: dict[str, Any] = {
        "rows": int(len(y_true)),
        "accuracy": float(accuracy_score(y_true, y_pred)),
        "precision_up": float(precision_score(y_true, y_pred, zero_division=0)),
        "recall_up": float(recall_score(y_true, y_pred, zero_division=0)),
        "f1_up": float(f1_score(y_true, y_pred, zero_division=0)),
    }
    try:
        payload["roc_auc"] = float(roc_auc_score(y_true, y_prob))
    except ValueError:
        payload["roc_auc"] = None
    return payload


def _confusion_frame(y_true: pd.Series, y_pred: pd.Series) -> pd.DataFrame:
    matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return pd.DataFrame(
        matrix,
        index=["actual_down_0", "actual_up_1"],
        columns=["pred_down_0", "pred_up_1"],
    )


def _evaluate_symbol(
    symbol: str,
    model: Any,
    run_dir: Path,
    min_train_size: int,
    test_size: int,
    step_size: int,
    min_last_test_size: int,
    feature_mode: str,
    n_estimators: int,
    max_num_rows: int | None,
    fast_test: bool,
) -> tuple[pd.DataFrame, dict[str, Any]]:
    df = _load_dataset(symbol)
    feature_columns = _select_features(df, feature_mode)
    feature_df = df[feature_columns].copy()
    target = df["target_up_binary"].astype(int)

    folds = _build_folds(
        n_rows=len(df),
        min_train_size=min_train_size,
        test_size=test_size,
        step_size=step_size,
        min_last_test_size=min_last_test_size,
    )
    if not folds:
        raise RuntimeError(f"Not enough rows to create walk-forward folds for {symbol}")

    if fast_test:
        log.info("Fast test mode enabled: running only the first 2 folds")
        folds = folds[:2]

    prediction_parts: list[pd.DataFrame] = []
    fold_rows: list[dict[str, Any]] = []

    for fold_idx, (fold_id, train_end, test_end) in enumerate(folds):
        t0 = time.time()

        train_index = feature_df.index[:train_end]
        test_index = feature_df.index[train_end:test_end]

        X_train = feature_df.loc[train_index]
        y_train = target.loc[train_index]
        X_test = feature_df.loc[test_index]
        y_test = target.loc[test_index]

        y_pred, y_prob, model_type = _fit_and_predict(
            X_train, y_train, X_test, model, n_estimators=n_estimators, max_num_rows=max_num_rows
        )

        dt = time.time() - t0
        log.info(
            "Symbol=%s Fold=%d/%d (%d train rows, %d test rows) completed in %.2fs",
            symbol,
            fold_id,
            len(folds),
            len(train_index),
            len(test_index),
            dt,
        )

        part = pd.DataFrame(
            {
                "time": df.loc[test_index, "time"].values,
                "symbol": symbol,
                "fold_id": fold_id,
                "model_type": model_type,
                "y_true": y_test.values,
                "y_pred": y_pred.values,
                "y_prob_up": y_prob.values,
                "train_rows": len(train_index),
                "test_rows": len(test_index),
            }
        )
        prediction_parts.append(part)

        fold_metrics = _metrics_payload(y_test, y_pred, y_prob)
        fold_metrics.update(
            {
                "fold_id": fold_id,
                "model_type": model_type,
                "train_rows": len(train_index),
                "test_rows": len(test_index),
            }
        )
        fold_rows.append(fold_metrics)

    predictions = pd.concat(prediction_parts, ignore_index=True).sort_values("time").reset_index(drop=True)
    predictions["correct"] = (predictions["y_true"] == predictions["y_pred"]).astype(int)

    fold_metrics_df = pd.DataFrame(fold_rows)
    confusion_df = _confusion_frame(predictions["y_true"], predictions["y_pred"])

    summary = {
        "symbol": symbol,
        "dataset_rows": int(len(df)),
        "feature_columns": int(len(feature_columns)),
        "predicted_rows": int(len(predictions)),
        "folds": int(len(folds)),
        "min_train_size": min_train_size,
        "test_size": test_size,
        "step_size": step_size,
        "n_estimators": n_estimators,
        "max_num_rows": max_num_rows,
        "feature_mode": feature_mode,
    }
    summary.update(_metrics_payload(predictions["y_true"], predictions["y_pred"], predictions["y_prob_up"]))

    symbol_dir = run_dir / symbol
    symbol_dir.mkdir(parents=True, exist_ok=True)
    predictions.to_parquet(symbol_dir / "historical_predictions.parquet", index=False)
    fold_metrics_df.to_csv(symbol_dir / "fold_metrics.csv", index=False)
    confusion_df.to_csv(symbol_dir / "confusion_matrix.csv")
    _write_json(symbol_dir / "summary.json", summary)

    _write_json(
        symbol_dir / "confusion_matrix.json",
        {
            "actual_down_0": {
                "pred_down_0": int(confusion_df.loc["actual_down_0", "pred_down_0"]),
                "pred_up_1": int(confusion_df.loc["actual_down_0", "pred_up_1"]),
            },
            "actual_up_1": {
                "pred_down_0": int(confusion_df.loc["actual_up_1", "pred_down_0"]),
                "pred_up_1": int(confusion_df.loc["actual_up_1", "pred_up_1"]),
            },
        },
    )

    return predictions, summary


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Walk-forward TabFM evaluation.")
    parser.add_argument("--symbols", default="BTC,ETH,XRP", help="Comma-separated symbols to evaluate.")
    parser.add_argument("--run-label", default="tabfm_key_features", help="Output subfolder label.")
    parser.add_argument("--min-train-size", type=int, default=120, help="Initial training rows per symbol.")
    parser.add_argument("--test-size", type=int, default=24, help="Rows per test fold.")
    parser.add_argument("--step-size", type=int, default=24, help="How far to advance the expanding window per fold.")
    parser.add_argument("--min-last-test-size", type=int, default=12, help="Minimum size for the last partial fold.")
    parser.add_argument("--feature-mode", default="key", choices=["key", "all"], help="Feature selection mode.")
    parser.add_argument("--n-estimators", type=int, default=8, help="Number of TabFM ensemble estimators.")
    parser.add_argument("--max-num-rows", type=int, default=120, help="Max in-context training rows (None for all).")
    parser.add_argument("--fast-test", action="store_true", help="Only run 2 folds for testing setup.")
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    symbols = _parse_symbols(args.symbols)
    run_dir = OUTPUT_ROOT / args.run_label
    run_dir.mkdir(parents=True, exist_ok=True)

    log.info("Loading PyTorch TabFM model...")
    t_start_load = time.time()
    model = tabfm_v1_0_0_pytorch.load(model_type='classification', device='cpu')
    log.info("TabFM model loaded in %.2fs", time.time() - t_start_load)

    combined_predictions: list[pd.DataFrame] = []
    run_summary: dict[str, Any] = {
        "run_label": args.run_label,
        "symbols": {},
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S")
    }

    for symbol in symbols:
        log.info("Evaluating symbol %s...", symbol)
        predictions, summary = _evaluate_symbol(
            symbol=symbol,
            model=model,
            run_dir=run_dir,
            min_train_size=args.min_train_size,
            test_size=args.test_size,
            step_size=args.step_size,
            min_last_test_size=args.min_last_test_size,
            feature_mode=args.feature_mode,
            n_estimators=args.n_estimators,
            max_num_rows=args.max_num_rows,
            fast_test=args.fast_test,
        )
        combined_predictions.append(predictions)
        run_summary["symbols"][symbol] = summary
        log.info(
            "%s walk-forward complete: predicted_rows=%d accuracy=%.4f",
            symbol,
            summary["predicted_rows"],
            summary["accuracy"],
        )

    combined_df = pd.concat(combined_predictions, ignore_index=True).sort_values(["symbol", "time"]).reset_index(drop=True)
    combined_df.to_parquet(run_dir / "combined_historical_predictions.parquet", index=False)

    combined_confusion = _confusion_frame(combined_df["y_true"], combined_df["y_pred"])
    combined_confusion.to_csv(run_dir / "combined_confusion_matrix.csv")

    overall = {
        "run_label": args.run_label,
        "symbols": symbols,
        "predicted_rows": int(len(combined_df)),
    }
    overall.update(_metrics_payload(combined_df["y_true"], combined_df["y_pred"], combined_df["y_prob_up"]))
    run_summary["overall"] = overall
    _write_json(run_dir / "run_summary.json", run_summary)

    log.info("Combined walk-forward summary written to %s", run_dir)
    log.info("Overall Accuracy: %.4f | Precision (Up): %.4f | Recall (Up): %.4f",
             overall["accuracy"], overall["precision_up"], overall["recall_up"])


if __name__ == "__main__":
    main()

Click to view run_baseline_models.py (Traditional Baselines Script)

"""
Walk-forward evaluation of standard baseline models:
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)

Outputs:
  polymarket_blogposts/results/baselines/<SYMBOL>/historical_predictions_<MODEL>.parquet
  polymarket_blogposts/results/baselines/<SYMBOL>/summary_<MODEL>.json
  polymarket_blogposts/results/baselines/<SYMBOL>/feature_importances_rf.csv
  polymarket_blogposts/results/baselines/<SYMBOL>/coefficients_lr.csv
  polymarket_blogposts/results/baselines/run_summary.json
"""

import argparse
import json
import logging
import math
import sys
import time
from pathlib import Path
from typing import Any

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score, roc_auc_score

# Paths to datasets
TRADING_PROJECT_DIR = Path("C:/Users/MattO/Documents/projects_V2/2026/trading_polymarket_crypto_prediction")
DATA_ROOT = TRADING_PROJECT_DIR / "xgboost" / "data"
OUTPUT_ROOT = Path(__file__).resolve().parent / "results" / "baselines"
SYMBOLS = ["BTC", "ETH", "XRP"]

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s  %(levelname)-8s  %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log = logging.getLogger(__name__)


def _write_json(path: Path, payload: dict[str, Any]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2, ensure_ascii=True)
        handle.write("\n")


def _parse_symbols(raw: str) -> list[str]:
    requested = [part.strip().upper() for part in raw.split(",") if part.strip()]
    invalid = [symbol for symbol in requested if symbol not in SYMBOLS]
    if invalid:
        raise ValueError(f"Unsupported symbols: {', '.join(invalid)}")
    return requested or SYMBOLS


def _load_dataset(symbol: str) -> pd.DataFrame:
    path = DATA_ROOT / symbol / "combined" / "modeling.parquet"
    if not path.exists():
        raise FileNotFoundError(f"Missing modeling dataset: {path}")

    df = pd.read_parquet(path)
    if df.empty:
        return df

    df["time"] = pd.to_datetime(df["time"], errors="coerce", utc=True)
    df["target_up_binary"] = pd.to_numeric(df["target_up_binary"], errors="coerce")
    df = df.dropna(subset=["time", "target_up_binary"]).sort_values("time").drop_duplicates(subset=["time"], keep="last").reset_index(drop=True)

    # Fill any missing indicator values
    df = df.ffill().bfill()
    return df


def _select_features(df: pd.DataFrame, mode: str) -> list[str]:
    all_features = [col for col in df.columns if col not in ["time", "target_up_binary"]]

    if mode == "all":
        return all_features
    elif mode == "key":
        key_patterns = [
            "tf_15min_rsi_14",
            "tf_15min_macd_hist",
            "tf_15min_adx_14",
            "tf_1hour_rsi_14",
            "tf_1hour_macd_hist",
            "tf_1hour_adx_14",
            "tf_1day_rsi_14",
            "tf_1day_macd_hist",
            "tf_1day_sma_200",
            "tf_1min_last_rsi_7",
            "tf_1min_mean_5_vwap_session",
            "tf_1min_delta_5_ema_9"
        ]
        selected = [col for col in key_patterns if col in df.columns]
        if not selected:
            return all_features[:10]
        return selected
    else:
        raise ValueError(f"Unknown feature mode: {mode}")


def _build_folds(n_rows: int, min_train_size: int, test_size: int, step_size: int, min_last_test_size: int) -> list[tuple[int, int, int]]:
    folds: list[tuple[int, int, int]] = []
    fold_id = 1
    train_end = min_train_size

    while train_end < n_rows:
        test_end = min(train_end + test_size, n_rows)
        if test_end - train_end < min_last_test_size:
            break
        folds.append((fold_id, train_end, test_end))
        fold_id += 1
        train_end += step_size

    return folds


def _metrics_payload(y_true: pd.Series, y_pred: pd.Series, y_prob: pd.Series) -> dict[str, Any]:
    payload: dict[str, Any] = {
        "rows": int(len(y_true)),
        "accuracy": float(accuracy_score(y_true, y_pred)),
        "precision_up": float(precision_score(y_true, y_pred, zero_division=0)),
        "recall_up": float(recall_score(y_true, y_pred, zero_division=0)),
        "f1_up": float(f1_score(y_true, y_pred, zero_division=0)),
    }
    try:
        payload["roc_auc"] = float(roc_auc_score(y_true, y_prob))
    except ValueError:
        payload["roc_auc"] = None
    return payload


def _confusion_frame(y_true: pd.Series, y_pred: pd.Series) -> pd.DataFrame:
    matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return pd.DataFrame(
        matrix,
        index=["actual_down_0", "actual_up_1"],
        columns=["pred_down_0", "pred_up_1"],
    )


def _get_model(name: str) -> Any:
    if name == "lr":
        return LogisticRegression(C=0.1, max_iter=1000, random_state=42)
    elif name == "rf":
        return RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=2, random_state=42)
    elif name == "svm":
        return SVC(C=1.0, kernel="rbf", probability=True, random_state=42)
    else:
        raise ValueError(f"Unknown model name: {name}")


def _evaluate_model_symbol(
    symbol: str,
    model_name: str,
    df: pd.DataFrame,
    feature_df: pd.DataFrame,
    target: pd.Series,
    folds: list[tuple[int, int, int]],
    symbol_dir: Path,
) -> tuple[pd.DataFrame, dict[str, Any]]:
    prediction_parts = []

    for fold_id, train_end, test_end in folds:
        train_index = feature_df.index[:train_end]
        test_index = feature_df.index[train_end:test_end]

        X_train = feature_df.loc[train_index]
        y_train = target.loc[train_index]
        X_test = feature_df.loc[test_index]
        y_test = target.loc[test_index]

        # Standardize for models that need it (LR, SVM)
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        model = _get_model(model_name)

        # Fit models
        if model_name in ["lr", "svm"]:
            model.fit(X_train_scaled, y_train)
            probs = model.predict_proba(X_test_scaled)[:, 1]
            preds = model.predict(X_test_scaled)
        else:
            model.fit(X_train, y_train)
            probs = model.predict_proba(X_test)[:, 1]
            preds = model.predict(X_test)

        part = pd.DataFrame(
            {
                "time": df.loc[test_index, "time"].values,
                "symbol": symbol,
                "fold_id": fold_id,
                "y_true": y_test.values,
                "y_pred": preds,
                "y_prob_up": probs,
            }
        )
        prediction_parts.append(part)

    predictions = pd.concat(prediction_parts, ignore_index=True).sort_values("time").reset_index(drop=True)
    metrics = _metrics_payload(predictions["y_true"], predictions["y_pred"], predictions["y_prob_up"])

    # Save predictions & metrics
    predictions.to_parquet(symbol_dir / f"historical_predictions_{model_name}.parquet", index=False)

    summary = {
        "symbol": symbol,
        "model": model_name,
        "predicted_rows": int(len(predictions)),
        **metrics
    }
    _write_json(symbol_dir / f"summary_{model_name}.json", summary)

    return predictions, summary


def _analyze_importances(
    symbol: str,
    feature_df: pd.DataFrame,
    target: pd.Series,
    symbol_dir: Path,
) -> None:
    """Train models on full dataset to analyze feature importances / coefficients."""
    log.info("Analyzing feature importances on full dataset for %s...", symbol)

    # Scale data for Logistic Regression
    scaler = StandardScaler()
    feature_scaled = scaler.fit_transform(feature_df)

    # Logistic Regression Analysis
    lr = LogisticRegression(C=0.1, max_iter=1000, random_state=42)
    lr.fit(feature_scaled, target)

    coef_df = pd.DataFrame({
        "feature": feature_df.columns,
        "coefficient": lr.coef_[0],
        "abs_coefficient": np.abs(lr.coef_[0])
    }).sort_values("abs_coefficient", ascending=False).reset_index(drop=True)

    coef_df.to_csv(symbol_dir / "coefficients_lr.csv", index=False)
    log.info("Logistic Regression top variables for %s:\n%s", symbol, coef_df.head(5).to_string())

    # Random Forest Analysis
    rf = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=2, random_state=42)
    rf.fit(feature_df, target)

    rf_df = pd.DataFrame({
        "feature": feature_df.columns,
        "importance": rf.feature_importances_
    }).sort_values("importance", ascending=False).reset_index(drop=True)

    rf_df.to_csv(symbol_dir / "feature_importances_rf.csv", index=False)
    log.info("Random Forest top variables for %s:\n%s", symbol, rf_df.head(5).to_string())


def main() -> None:
    parser = argparse.ArgumentParser(description="Evaluate baseline models (LR, RF, SVM).")
    parser.add_argument("--symbols", default="BTC,ETH,XRP", help="Comma-separated symbols.")
    parser.add_argument("--min-train-size", type=int, default=120, help="Initial training size.")
    parser.add_argument("--test-size", type=int, default=24, help="Test fold size.")
    parser.add_argument("--step-size", type=int, default=24, help="Step size.")
    parser.add_argument("--min-last-test-size", type=int, default=12, help="Min last fold size.")
    parser.add_argument("--feature-mode", default="key", choices=["key", "all"], help="Feature selection mode.")
    args = parser.parse_args()

    symbols = _parse_symbols(args.symbols)
    OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)

    run_summary: dict[str, Any] = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "symbols": {}
    }

    for symbol in symbols:
        log.info("Starting baseline evaluations for %s...", symbol)
        df = _load_dataset(symbol)
        feature_columns = _select_features(df, args.feature_mode)
        feature_df = df[feature_columns].copy()
        target = df["target_up_binary"].astype(int)

        folds = _build_folds(
            n_rows=len(df),
            min_train_size=args.min_train_size,
            test_size=args.test_size,
            step_size=args.step_size,
            min_last_test_size=args.min_last_test_size,
        )

        symbol_dir = OUTPUT_ROOT / symbol
        symbol_dir.mkdir(parents=True, exist_ok=True)

        run_summary["symbols"][symbol] = {}

        # Run evaluations
        for model_name in ["lr", "rf", "svm"]:
            log.info("Running walk-forward for %s model on %s...", model_name.upper(), symbol)
            _, summary = _evaluate_model_symbol(
                symbol=symbol,
                model_name=model_name,
                df=df,
                feature_df=feature_df,
                target=target,
                folds=folds,
                symbol_dir=symbol_dir,
            )
            run_summary["symbols"][symbol][model_name] = summary
            log.info(
                "%s model on %s complete: Accuracy = %.4f",
                model_name.upper(), symbol, summary["accuracy"]
            )

        # Run feature importance analysis
        _analyze_importances(symbol, feature_df, target, symbol_dir)

    _write_json(OUTPUT_ROOT / "run_summary.json", run_summary)
    log.info("Baseline evaluations complete. Outputs saved in %s", OUTPUT_ROOT)


if __name__ == "__main__":
    main()

Click to view generate_analysis.py (Plotting & Analysis Script)

"""
Generate feature analysis visualizations:
- Logistic Regression Coefficients Bar Plot
- Random Forest Feature Importances Bar Plot
- SHAP Summary (Beeswarm) Plot
- Correlation Heatmap

Outputs:
  polymarket_blogposts/results/analysis/<SYMBOL>/coefficients_lr.png
  polymarket_blogposts/results/analysis/<SYMBOL>/feature_importances_rf.png
  polymarket_blogposts/results/analysis/<SYMBOL>/shap_summary_rf.png
  polymarket_blogposts/results/analysis/<SYMBOL>/feature_correlation.png
"""

import logging
import sys
from pathlib import Path

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import shap

# Ensure clean plotting on headless systems
matplotlib_logger = logging.getLogger("matplotlib")
matplotlib_logger.setLevel(logging.WARNING)

# Paths to datasets
TRADING_PROJECT_DIR = Path("C:/Users/MattO/Documents/projects_V2/2026/trading_polymarket_crypto_prediction")
DATA_ROOT = TRADING_PROJECT_DIR / "xgboost" / "data"
OUTPUT_ROOT = Path(__file__).resolve().parent / "results" / "analysis"
SYMBOLS = ["BTC", "ETH", "XRP"]

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s  %(levelname)-8s  %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
log = logging.getLogger(__name__)


def _load_dataset(symbol: str) -> pd.DataFrame:
    path = DATA_ROOT / symbol / "combined" / "modeling.parquet"
    if not path.exists():
        raise FileNotFoundError(f"Missing modeling dataset: {path}")

    df = pd.read_parquet(path)
    if df.empty:
        return df

    df["time"] = pd.to_datetime(df["time"], errors="coerce", utc=True)
    df["target_up_binary"] = pd.to_numeric(df["target_up_binary"], errors="coerce")
    df = df.dropna(subset=["time", "target_up_binary"]).sort_values("time").drop_duplicates(subset=["time"], keep="last").reset_index(drop=True)
    df = df.ffill().bfill()
    return df


def _select_features(df: pd.DataFrame) -> list[str]:
    key_patterns = [
        "tf_15min_rsi_14",
        "tf_15min_macd_hist",
        "tf_15min_adx_14",
        "tf_1hour_rsi_14",
        "tf_1hour_macd_hist",
        "tf_1hour_adx_14",
        "tf_1day_rsi_14",
        "tf_1day_macd_hist",
        "tf_1day_sma_200",
        "tf_1min_last_rsi_7",
        "tf_1min_mean_5_vwap_session",
        "tf_1min_delta_5_ema_9"
    ]
    selected = [col for col in key_patterns if col in df.columns]
    if not selected:
        all_features = [col for col in df.columns if col not in ["time", "target_up_binary"]]
        return all_features[:12]
    return selected


def _plot_lr_coefficients(coef_df: pd.DataFrame, out_path: Path, symbol: str) -> None:
    plt.figure(figsize=(10, 6))
    # Sort by coefficient value
    sorted_df = coef_df.sort_values("coefficient", ascending=True)

    # Custom color palette: red for negative, green for positive
    colors = ["#e74c3c" if c < 0 else "#2ecc71" for c in sorted_df["coefficient"]]

    sns.barplot(
        x="coefficient",
        y="feature",
        data=sorted_df,
        palette=colors,
        hue="feature",
        legend=False
    )
    plt.axvline(x=0, color="#2c3e50", linestyle="--", linewidth=1)
    plt.title(f"Logistic Regression Coefficients - {symbol}\n(Predicting Market Resolving UP)", fontsize=14, pad=15)
    plt.xlabel("Coefficient Value (Feature Weight)", fontsize=12)
    plt.ylabel("Technical Indicator Feature", fontsize=12)
    plt.grid(axis="x", linestyle=":", alpha=0.6)
    plt.tight_layout()
    plt.savefig(out_path, dpi=150)
    plt.close()


def _plot_rf_importances(importance_df: pd.DataFrame, out_path: Path, symbol: str) -> None:
    plt.figure(figsize=(10, 6))
    sorted_df = importance_df.sort_values("importance", ascending=True)

    sns.barplot(
        x="importance",
        y="feature",
        data=sorted_df,
        palette="viridis",
        hue="feature",
        legend=False
    )
    plt.title(f"Random Forest Gini Feature Importances - {symbol}", fontsize=14, pad=15)
    plt.xlabel("Gini Importance Score", fontsize=12)
    plt.ylabel("Technical Indicator Feature", fontsize=12)
    plt.grid(axis="x", linestyle=":", alpha=0.6)
    plt.tight_layout()
    plt.savefig(out_path, dpi=150)
    plt.close()


def _plot_shap_summary(rf: RandomForestClassifier, X: pd.DataFrame, out_path: Path, symbol: str) -> None:
    try:
        explainer = shap.TreeExplainer(rf)
        shap_values = explainer.shap_values(X)

        # Determine correct indexing for binary classification SHAP values
        if isinstance(shap_values, list):
            # Old SHAP format (list of two arrays, one per class)
            # Use class 1 (UP)
            shap_vals_to_use = shap_values[1]
        elif isinstance(shap_values, np.ndarray) and len(shap_values.shape) == 3:
            # Shape is (samples, features, classes)
            shap_vals_to_use = shap_values[:, :, 1]
        else:
            shap_vals_to_use = shap_values

        plt.figure(figsize=(10, 6))
        # shap.summary_plot generates its own figure elements
        shap.summary_plot(shap_vals_to_use, X, show=False)
        plt.title(f"SHAP Summary (Random Forest UP Class) - {symbol}", fontsize=14, pad=15)
        plt.tight_layout()
        plt.savefig(out_path, dpi=150)
        plt.close()
    except Exception as e:
        log.error("Failed to generate SHAP summary for %s: %s", symbol, e)


def _plot_correlation_matrix(feature_df: pd.DataFrame, out_path: Path, symbol: str) -> None:
    plt.figure(figsize=(12, 10))
    corr = feature_df.corr()

    # Generate a mask for the upper triangle
    mask = np.triu(np.ones_like(corr, dtype=bool))

    sns.heatmap(
        corr,
        mask=mask,
        cmap="coolwarm",
        vmax=1.0,
        vmin=-1.0,
        center=0,
        square=True,
        linewidths=0.5,
        cbar_kws={"shrink": 0.8},
        annot=True,
        fmt=".2f",
        annot_kws={"size": 8}
    )
    plt.title(f"Technical Indicators Correlation Heatmap - {symbol}", fontsize=14, pad=15)
    plt.tight_layout()
    plt.savefig(out_path, dpi=150)
    plt.close()


def main() -> None:
    OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)
    sns.set_theme(style="whitegrid")

    for symbol in SYMBOLS:
        log.info("Processing visualization analysis for %s...", symbol)
        df = _load_dataset(symbol)
        feature_columns = _select_features(df)
        feature_df = df[feature_columns].copy()
        target = df["target_up_binary"].astype(int)

        symbol_dir = OUTPUT_ROOT / symbol
        symbol_dir.mkdir(parents=True, exist_ok=True)

        # Scale data for Logistic Regression
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(feature_df)

        # 1. Fit Logistic Regression
        lr = LogisticRegression(C=0.1, max_iter=1000, random_state=42)
        lr.fit(X_scaled, target)

        coef_df = pd.DataFrame({
            "feature": feature_df.columns,
            "coefficient": lr.coef_[0]
        })
        _plot_lr_coefficients(coef_df, symbol_dir / "coefficients_lr.png", symbol)

        # 2. Fit Random Forest
        rf = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=2, random_state=42)
        rf.fit(feature_df, target)

        importance_df = pd.DataFrame({
            "feature": feature_df.columns,
            "importance": rf.feature_importances_
        })
        _plot_rf_importances(importance_df, symbol_dir / "feature_importances_rf.png", symbol)

        # 3. Plot SHAP Summary
        _plot_shap_summary(rf, feature_df, symbol_dir / "shap_summary_rf.png", symbol)

        # 4. Plot Correlation Matrix
        _plot_correlation_matrix(feature_df, symbol_dir / "feature_correlation.png", symbol)

        log.info("Visualizations for %s generated successfully!", symbol)


if __name__ == "__main__":
    main()

Limitations

This is a research screen based on a 2-week sample window and a single historical run, not a validated trading strategy. Accuracies in the low-to-mid 50s are directionally interesting but have not been tested against Polymarket’s fee structure, execution slippage, or a longer out-of-sample period. Nothing in this article is investment advice.

Research disclaimer

This material is provided for research and educational purposes only. It is not investment advice, a recommendation, or an offer to buy or sell any security or strategy.

Filed under: foundation models, machine learning, polymarket, prediction markets, quantitative research

Can Google’s TabFM Predict Polymarket Crypto Markets? A Zero-Shot Backtest

Can Google’s TabFM Predict Polymarket Crypto Markets? A Zero-Shot Backtest

Executive Summary

1. The Prediction Challenge & Data Alignment

2. Foundation Models for Tabular Data: TabFM vs. Text LLMs

Architecture and Inference Model

3. Walk-Forward Backtesting: Folds & Runtime

Understanding Walk-Forward Folds

TabFM Computational Overhead

4. Performance Comparison

5. Economic & Quantitative Interpretation of the Models

5.1 Logistic Regression Coefficients: A Quantitative Rationale

5.2 Random Forest Feature Importances

5.3 SHAP Summary (Beeswarm Analysis)

6. Correlation Analysis & Decoupling

7. Computational Hardware Environment

8. Conclusion

New stock analysis in your inbox.

9. Replication Code

Limitations

New stock analysis in your inbox.

Leave a Comment Cancel reply