Data in. Decisions out. End-to-end reinforcement learning.
Scroll to explore
Traditional quantitative approaches follow a two-step pipeline: predict future returns, then optimize a portfolio. Each step introduces error. The prediction model doesn't know about transaction costs. The optimizer doesn't know about model uncertainty.
What if the allocation decision itself could be learned end-to-end? An agent that ingests raw data and directly outputs portfolio weights, trained not on prediction accuracy but on actual portfolio performance.
Raw data comes in. Portfolio weights come out. In between: regime detection figures out what kind of market we're in, nine ML models independently rank the equity universe, an RL agent decides how much capital goes where, and multiple LLMs review the final allocation before anything executes.
Click any node below to see what it does.
System Architecture
The RL agent uses a Transformer policy network with self-attention across assets. This means the model learns which sectors move together and adapts its correlations as market conditions shift.
The heatmap below shows synthetic attention weights between sector ETFs, illustrating the kind of cross-asset relationships the model captures. High values indicate the model attends strongly to one sector when making decisions about another.
Cross-Asset Attention Weights (Synthetic)
Synthetic illustration. Sector ETFs as proxy assets. Brighter = stronger attention weight. Hover for values.
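The mechanism behind a heatmap like this can be sketched in a few lines. This is a minimal illustration of scaled dot-product self-attention across assets, not the production policy network; the embeddings here are random stand-ins for learned per-asset features.

```python
import numpy as np

def cross_asset_attention(embeddings: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention weights across assets.

    embeddings: (n_assets, d) matrix of per-asset feature vectors.
    Returns an (n_assets, n_assets) row-stochastic attention matrix:
    row i shows how strongly asset i attends to every other asset.
    """
    d = embeddings.shape[1]
    scores = embeddings @ embeddings.T / np.sqrt(d)   # pairwise similarity
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Illustrative only: 4 "sector ETFs" with random 8-dim embeddings
rng = np.random.default_rng(0)
A = cross_asset_attention(rng.normal(size=(4, 8)))
```

Each row sums to one, so a bright cell means that sector dominates the model's view when deciding about the row's asset.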
In reinforcement learning, what you optimize for determines what you get. The reward function balances competing objectives: maximizing returns, limiting drawdowns, and controlling turnover (transaction costs).
Adjust the weights below to see how different reward configurations produce different agent behaviors.
Reward Shaping Explorer
Synthetic simulation. Outputs illustrate directional behavior, not actual model performance.
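The trade-off the explorer exposes can be written down directly. Below is one plausible shaping of the per-step reward; the weight values and the linear form are illustrative assumptions, not the system's actual reward function.

```python
def shaped_reward(portfolio_return: float,
                  drawdown: float,
                  turnover: float,
                  w_return: float = 1.0,
                  w_drawdown: float = 0.5,
                  w_turnover: float = 0.1) -> float:
    """Illustrative per-step reward: pay for returns, penalize
    drawdown and turnover (turnover proxies transaction costs)."""
    return (w_return * portfolio_return
            - w_drawdown * max(drawdown, 0.0)
            - w_turnover * turnover)

# Same market step, different weightings -> opposite incentives
aggressive = shaped_reward(0.02, drawdown=0.05, turnover=0.30,
                           w_drawdown=0.1, w_turnover=0.01)
cautious   = shaped_reward(0.02, drawdown=0.05, turnover=0.30,
                           w_drawdown=2.0, w_turnover=0.5)
# aggressive is positive; cautious is negative for the same step
```

An agent trained on the first weighting learns to chase returns and trade freely; the second weighting teaches it that the same 2% gain wasn't worth the drawdown and churn.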
What works in a bull market will get you killed in a crash. The system detects four regimes — expansion, contraction, high volatility, recovery — using VIX levels, trend indicators, and yield curve signals. The RL agent knows which regime it's in and adjusts accordingly. Different market, different policy.
S&P 500 with Regime Overlays (2020–2025)
Public market data. Regime classifications are illustrative. The agent shifts allocation strategy as regimes change.
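In skeleton form, a regime classifier built on those three signals might look like this. The thresholds and branching order here are toy values for illustration; the production detector is not specified in this description.

```python
def classify_regime(vix: float, trend: float, yield_spread: float) -> str:
    """Toy rule-based regime classifier (illustrative thresholds only).

    vix:          implied-volatility index level
    trend:        price vs. long-term moving average, in percent
    yield_spread: long-minus-short Treasury spread, in percent
    """
    if vix >= 30:                           # stress overrides everything
        return "high_volatility"
    if trend < 0 and yield_spread < 0:      # downtrend + inverted curve
        return "contraction"
    if trend > 0 and vix < 20:              # calm uptrend
        return "expansion"
    return "recovery"                       # mixed signals, vol subsiding

regime = classify_regime(vix=14, trend=5.0, yield_spread=1.2)
# calm uptrend with a normal curve -> "expansion"
```

The point is the interface, not the rules: the agent conditions its policy on the label, so the same price action yields different allocations in different regimes.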
No single model sees the whole picture. Gradient-boosted trees are good at nonlinear feature interactions. Transformers learn which assets move together across the universe. Regularized linear models keep things grounded and resist overfitting. Nine architectures, nine different ways of looking at the same data.
Each model ranks the investable universe independently. When they converge on the same names, that agreement means something. When they disagree, the disagreement itself is a signal — a measure of how uncertain the system should be, which feeds directly into how much capital it's willing to commit.
Ensemble Model Diversity
Nine heterogeneous architectures contribute independent rankings. Animated pulses represent each model's signal flowing into the combined ensemble output.
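The combine-and-disagree step can be sketched as rank aggregation plus a dispersion measure. This is a minimal illustration with three toy models and four assets; the real system's aggregation scheme may differ.

```python
import numpy as np

def ensemble_signal(rankings: np.ndarray):
    """Combine per-model rankings and measure their disagreement.

    rankings: (n_models, n_assets), each row ranks assets (0 = best).
    Returns (mean_rank, disagreement): disagreement is the per-asset
    standard deviation of ranks across models.
    """
    mean_rank = rankings.mean(axis=0)
    disagreement = rankings.std(axis=0)    # high = models conflict
    return mean_rank, disagreement

# Three toy "models" ranking four assets
R = np.array([[0, 1, 2, 3],
              [0, 2, 1, 3],
              [0, 3, 1, 2]])
mean_rank, dis = ensemble_signal(R)

# One way to let disagreement throttle conviction (illustrative):
conviction = 1.0 / (1.0 + dis)   # unanimous picks get full weight
```

Asset 0 is a unanimous top pick, so its disagreement is zero and its conviction is maximal; the assets the models argue over get sized down automatically.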
Before any capital moves, the portfolio agent consults. Proposed allocations are sent to multiple large language models, each analyzing the portfolio through a different lens. One flags macroeconomic headwinds that quantitative models might miss. Another questions sector concentration or identifies crowded trades. A third stress-tests the reasoning against recent market context.
Every consultation is logged with full context: the proposed portfolio, each analyst's response, the final decision rationale. This creates an auditable chain of reasoning — a decision journal that the system can reference in future sessions, building institutional memory one trade at a time.
Multi-LLM Consultation Flow
The portfolio agent orchestrates consultations with multiple LLMs before executing any rebalance. All exchanges are logged to a persistent decision journal.
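The consultation-and-journal loop reduces to a simple structure. The analyst functions below are stand-ins (a real system would call LLM APIs at those points), and the journal schema is an assumption for illustration.

```python
import datetime
from typing import Callable, Dict, List

def consult_and_log(proposal: Dict[str, float],
                    analysts: Dict[str, Callable[[Dict], str]],
                    journal: List[dict]) -> dict:
    """Send a proposed allocation to every analyst, collect all
    responses, and append one auditable entry to the journal."""
    responses = {name: fn(proposal) for name, fn in analysts.items()}
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "proposal": proposal,
        "responses": responses,
    }
    journal.append(entry)           # persistent decision journal
    return entry

# Stand-in analysts, one per lens described above
analysts = {
    "macro":         lambda p: "flag: rate-sensitive tilt",
    "concentration": lambda p: "ok: max single weight under 30%",
    "stress":        lambda p: "ok: survives 2020-style drawdown",
}
journal: List[dict] = []
entry = consult_and_log({"XLK": 0.25, "XLE": 0.15, "CASH": 0.60},
                        analysts, journal)
```

Because every entry carries the proposal, the responses, and a timestamp, the journal doubles as the institutional memory referenced in future sessions.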
Experienced portfolio managers develop intuition over years: a layered understanding where yesterday's market action exists in vivid detail while last quarter collapses into a narrative of key themes. This system replicates that process through hierarchical episodic memory.
Recent observations are stored at full resolution: every data point, every trade, every market condition. As entries age, an LLM compresses them into progressively higher-level summaries. The result is a memory whose resolution tracks recency rather than the calendar. The system doesn't just remember that a quarter was volatile; it remembers why, at exactly the level of abstraction that matters for future decisions.
Hierarchical Memory Compression
Memory entries are progressively compressed by an LLM. Recent events retain full detail; older events are preserved as summaries of summaries.
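Stripped to its skeleton, the compression step looks like this: keep recent entries verbatim, collapse older ones into a summary. This sketch handles a single tier with a stand-in summarizer; the real system recurses over multiple tiers with an LLM doing the summarizing.

```python
from typing import Callable, List

def compress_memory(entries: List[str],
                    ages: List[int],
                    summarize: Callable[[List[str]], str],
                    full_detail_days: int = 7) -> List[str]:
    """Keep entries newer than the cutoff verbatim; collapse older
    ones into a single summary entry (one tier of the hierarchy)."""
    recent = [e for e, a in zip(entries, ages) if a <= full_detail_days]
    old    = [e for e, a in zip(entries, ages) if a > full_detail_days]
    compressed = [summarize(old)] if old else []
    return compressed + recent

# Stand-in summarizer (an LLM call in the real system)
summarize = lambda chunk: f"summary of {len(chunk)} older entries"
memory = compress_memory(
    ["fed meeting", "vix spike", "rebalance", "earnings beat"],
    ages=[30, 12, 3, 1],
    summarize=summarize,
)
# -> ["summary of 2 older entries", "rebalance", "earnings beat"]
```

Running the same pass again at a coarser cutoff turns summaries into summaries of summaries, which is where the "layered intuition" structure comes from.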
Nine models screen the universe. An RL agent allocates capital across eight independent policy seeds. Three LLM analysts review every trade before execution. A hierarchical memory compresses experience into something that looks a lot like intuition. Every decision logged. Every rationale preserved. The system gets sharper with every market day it survives.