How It Works
Same model. Same tools. Different interaction pattern. See the difference.
Default vs Deepthink
“How can I leverage AI/LLMs to begin trading?”
Clean room · no project context · Claude Opus 4.6 · all tools available to both conditions
The Compounding Test
“What specific LLM prompting strategies produce the most reliable sentiment signals?”
Follow-up generated by Deepthink Session 1 · run through both conditions
What the experiment found
Raw Comparison — Session 1
| Metric | Default | Deepthink |
|---|---|---|
| Duration | ~3 min | ~2.5 min |
| Web searches | 0 | 4 |
| Cited sources | 0 | 11 |
| Specific tools named | 5 (from training data) | 7 (from live research) |
| Stakeholder perspectives | 0 (single voice) | 4 (Quant, Crypto Native, Skeptic, Builder) |
| Self-checks | 0 | 2 |
| Core conclusion | LLMs are accelerants, discipline is the edge | LLM edge is real but narrow; build custom; paper trade first |
Raw Comparison — Session 2 (Follow-Up)
| Metric | Default | Deepthink |
|---|---|---|
| Web searches | 0 | 4 (+ deep-dive on 3 papers) |
| Cited sources | 0 | 11 new sources |
| Counterintuitive finding | "Prompting may not matter much" (from reasoning) | "CoT hurts sentiment accuracy" (with F1 scores) |
| Prompt templates | 3 (reasoning-grounded) | 3 (research-grounded) |
| Architecture | None | Hybrid FinBERT + LLM with cost table |
| Built on Session 1? | No (standalone essay) | Yes (narrowed to buildable spec) |
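The counterintuitive-finding row is easier to parse with the two prompt shapes side by side. A minimal sketch with assumed wording (the cited papers' exact prompts are not reproduced here); Session 2's research found the direct form scoring higher F1 than the step-by-step form:

```python
# Illustrative contrast behind the "CoT hurts sentiment accuracy" finding.
# Both templates are assumptions, not the prompts from the cited papers.

DIRECT_PROMPT = (
    "Classify the sentiment of this financial headline as positive, "
    "negative, or neutral. Reply with exactly one word.\n\n{headline}"
)

COT_PROMPT = (
    "Think step by step about this financial headline, then classify its "
    "sentiment as positive, negative, or neutral.\n\n{headline}"
)
```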
Key Finding
Both had the same tools. Only one used them.
Both conditions had web search, file read, everything. The default chose not to use any of them. It just talked: three minutes of training-data recall and RLHF hedging. The protocol made the same model, with the same tools, in the same environment, run 4 web searches, cite 11 sources, find a specific GitHub framework, discover academic papers with specific F1 scores, and design a hybrid architecture with real cost math.
Where They Diverge
Research vs Reasoning
**Default:** Reasoned from training-data priors for 3 minutes. Named 5 tools (QuantConnect, Alpaca, TradingView, 3Commas, Pionex), all from training data, none verified as current. Made claims equivalent to Deepthink's, but without verification.

**Deepthink:** Ran 4 web searches and cited 11 sources. Found TradingAgents (a multi-agent LLM framework on GitHub) with architecture diagrams. Found academic papers on LLM sentiment alpha with specific results. Named 7 tools with pricing and links.
The Obligation Mechanism
**Default:** Had the same web search tools available. Never used them. Without an obligation to verify claims, the model defaults to reasoning from training data. It doesn't occur to the model to research, because nothing forces it to admit what it doesn't know.

**Deepthink:** The ORIENT phase asked "what am I uncertain about?", which created an implicit obligation to verify. The model self-generated a research agenda. That's why the mechanism generalizes across domains: it doesn't tell the model what to research; it makes the model realize it should.
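A minimal sketch of that mechanism, assuming a generic chat-completion setup; ORIENT_PROMPT's wording, ask_model, and run_search are illustrative names, not the protocol's actual text:

```python
# Hypothetical sketch of the obligation mechanism: a constraint that makes
# the model enumerate its own uncertainties, then treats that list as a
# research agenda. ask_model and run_search are caller-supplied callables.

ORIENT_PROMPT = (
    "Before answering, list every claim in your draft answer that you "
    "cannot verify from this conversation alone. For each, write one web "
    "search query that would confirm or refute it. One query per line."
)

def orient_then_research(question: str, ask_model, run_search) -> str:
    # Step 1: the constraint. Force the model to admit what it doesn't know.
    queries = ask_model(f"{question}\n\n{ORIENT_PROMPT}")

    # Step 2: the self-generated research agenda, one search per admitted gap.
    evidence = [run_search(q) for q in queries.splitlines() if q.strip()]

    # Step 3: answer under an obligation to cite the gathered evidence.
    return ask_model(
        f"{question}\n\nAnswer using only these verified sources, citing each:\n"
        + "\n".join(evidence)
    )
```

Note that the protocol text never names a topic: the research agenda comes entirely from the model's own admitted gaps.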
RLHF Hedging
**Default:** Finance triggers heavy hedging, and the default produced diffuse disclaimers: "whether retail traders can find meaningful alpha... the honest answer might be 'not much.'" That hedge is unfalsifiable and unactionable.

**Deepthink:** Also hedged, but with precision: "LLM sentiment shows measurable but small alpha," backed by an academic citation, is a calibrated hedge. It is specific enough to act on, and specific enough to be proven wrong.
Compounding (Session 2)
**Default:** Session 2 stands alone. It doesn't reference Session 1's findings because there were no findings to reference; Session 1 was reasoning from priors. The result is another independent essay on a new question.

**Deepthink:** Session 2 built on Session 1. Sentiment analysis identified as primary edge → investigated optimal prompting → found F1 scores → designed a hybrid architecture. The research converged to a buildable spec; each round narrowed the solution space.
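A minimal sketch of what that hybrid routing could look like, assuming the ProsusAI/finbert checkpoint from Hugging Face for the cheap first pass; the 0.90 threshold and llm_classify() are illustrative assumptions, and the experiment's actual cost table is not reproduced here:

```python
# Hypothetical hybrid sentiment router: FinBERT scores every headline for
# roughly free, and only low-confidence cases pay the per-token LLM cost.
from transformers import pipeline

finbert = pipeline("text-classification", model="ProsusAI/finbert")

def sentiment(headline: str, llm_classify, threshold: float = 0.90) -> str:
    result = finbert(headline)[0]  # e.g. {"label": "positive", "score": 0.97}

    # Cheap path: FinBERT handles the high-confidence bulk locally.
    if result["score"] >= threshold:
        return result["label"]

    # Expensive path: escalate only ambiguous headlines to the LLM.
    return llm_classify(headline)
```

The cost math falls out of the routing: the LLM bill scales with the fraction of headlines under the threshold, not with total volume.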
What Default Did Better
- **Power law insight:** "Top 1% capture almost all alpha, other 99% provide liquidity." A genuine structural observation about market dynamics that Deepthink didn't produce.
- **Meta-sentiment concept:** "Sentiment about the sustainability of sentiment," a genuinely novel framing from free association that structured analysis didn't generate.
- **The contrarian detector:** A prompt template asking LLMs to assess whether consensus is "sustainable or fragile" (sketched below). It may be the highest-alpha application. Born from narrative exploration, not research.
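A minimal sketch of that template; only the "sustainable or fragile" framing comes from the experiment, while the structure and field names here are assumptions:

```python
# Hypothetical contrarian-detector prompt. The asset and consensus_summary
# values below are placeholder examples, not the experiment's inputs.

CONTRARIAN_DETECTOR = """\
Here is the current consensus narrative about {asset}:

{consensus_summary}

Assess whether this consensus is sustainable or fragile:
1. What must stay true for the narrative to hold?
2. What single piece of news would break it fastest?
3. Verdict: SUSTAINABLE or FRAGILE, with one sentence of reasoning.
"""

prompt = CONTRARIAN_DETECTOR.format(
    asset="BTC",
    consensus_summary="ETF inflows make new all-time highs inevitable.",
)
```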
Thesis Validations
The mechanism is domain-general
The constraint chain produces higher-quality output in finance, not just in programming. The specific advantage shifted from structured artifacts (Exp 2) to evidence gathering (Exp 3), but the core pattern holds.
Research as a structural advantage
Across all experiments, protocol conditions consistently do web research while defaults never do — despite having identical tools. The constraint chain bridges the gap between "having tools" and "using tools."
Constraints as obligation generators
The protocol doesn't tell the model what to research. It creates a constraint ("what am I uncertain about?") that makes the model realize it should research. The model self-generates the research agenda.
Compounding across sessions
Deepthink compounds: Session 1 findings fed Session 2, which narrowed toward a buildable spec. Default doesn't compound: each session is an independent essay from training data.
RLHF hedging is the default's worst enemy
Finance has heavy RLHF hedging, making the default actively worse — it hedges and disclaims instead of researching. The protocol cuts through that by creating an obligation to verify rather than disclaim.