How It Works

Same model. Same tools. Different interaction pattern. See the difference.

Experiment 3 · Clean Room · Non-Programming Domain

Default vs Deepthink

“How can I leverage AI/LLMs to begin trading?”

Clean room · no project context · Claude Opus 4.6 · all tools available to both conditions

Default
> How can I leverage AI/LLMs for trading?
This is a fascinating area that's evolving rapidly.
LLMs are genuinely useful for:
- Summarizing earnings calls and filings
- Sentiment analysis across social media
- Generating trading hypotheses
- Writing backtesting code
LLMs are NOT good at:
- Predicting price movements directly
- Real-time execution (too slow for HFT)
- Replacing risk management
The honest truth:
Whether retail traders can find meaningful
alpha with AI that institutions haven't
already extracted... the answer might be "not much."
~3 min · 0 web searches · 0 citations · reasoning from priors · essay output
Deepthink
> How can I leverage AI/LLMs for trading?
> .forward("Verify claims with evidence")
> .forbidden("Reasoning from priors alone")
ORIENT
Uncertain: what's working vs what's hype.
Avoiding: optimism about AI trading returns.
Searching: "AI LLM trading tools 2026"...
Searching: "LLM sentiment analysis alpha"...
→ 11 sources across 4 searches
Found: TradingAgents (GitHub)
Multi-agent framework: analysts → debate → trade → risk
MODE: PERSPECTIVES
Quant: NLP alpha is real but small, decays fast
Crypto Native: on-chain data = info asymmetry
Skeptic: your counterparty is Renaissance Tech
Builder: build your own tools, don't buy
SELF-CHECK
"What am I avoiding? That index funds beat
any AI trading strategy after costs."
PASS
~2.5 min · 4 web searches · 11 cited sources · 4 stakeholder perspectives · evidence-grounded research brief
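For readers who want to see the shape of that constraint chain outside a transcript, here is a minimal Python sketch. The ConstraintChain class and its method bodies are illustrative stand-ins, not the actual Deepthink API; only the .forward()/.forbidden() pattern is taken from the run above.

# Minimal sketch of the constraint-chain invocation shown above.
# ConstraintChain is an illustrative stand-in, not the real Deepthink API.
from dataclasses import dataclass, field

@dataclass
class ConstraintChain:
    prompt: str
    forwards: list = field(default_factory=list)    # obligations the answer must satisfy
    forbiddens: list = field(default_factory=list)  # behaviors the answer must avoid

    def forward(self, obligation):
        self.forwards.append(obligation)
        return self

    def forbidden(self, behavior):
        self.forbiddens.append(behavior)
        return self

    def render(self):
        # Fold the constraints into a single prompt the model has to honor.
        lines = [self.prompt]
        lines += ["MUST: " + c for c in self.forwards]
        lines += ["MUST NOT: " + c for c in self.forbiddens]
        return "\n".join(lines)

chain = (
    ConstraintChain("How can I leverage AI/LLMs to begin trading?")
    .forward("Verify claims with evidence")
    .forbidden("Reasoning from priors alone")
)
print(chain.render())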
Session 2 · Follow-Up · Same Conditions

The Compounding Test

“What specific LLM prompting strategies produce the most reliable sentiment signals?”

Follow-up generated by Deepthink Session 1 · run through both conditions

Default
> What prompting strategies produce reliable sentiment signals?
6 axes of prompt variation:
Output format, reasoning, context,
persona, few-shot, temperature.
For sentiment specifically, simpler prompts
systematically outperform complex ones.
The model already knows what sentiment is.
The meta-sentiment insight:
The most valuable signal might be sentiment
about the sustainability of sentiment itself.
Correct intuition. No evidence to prove it.
~3 min · 0 web searches · 0 citations · correct intuition, no evidence
Deepthink
> What prompting strategies produce reliable sentiment signals?
> .forward("Find empirical F1 scores")
> .forbidden("Claims without citations")
Searching: "prompt engineering financial sentiment"...
Searching: "FinBERT vs GPT sentiment"...
Searching: "LLM determinism temperature 0"...
→ 11 new sources across 4 searches
Counterintuitive finding:
CoT reasoning HURTS sentiment accuracy.
No-CoT (direct): F1 = 0.727
CoT-Short: F1 = 0.706
CoT-Long: F1 = 0.718
(arXiv 2506.04574)
Expert hint = +59% on monetary figures.
Sentiment is perceptual, not logical.
Design for attention, not reasoning.
Hybrid FinBERT + LLM architecture:
Pure LLM: $30-150/day
Hybrid: $9-45/day (60-80% cost reduction)
PASS
~2.8 min · 4 searches · 11 new sources · F1 scores · hybrid architecture
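The hybrid FinBERT + LLM architecture named in that run can be read as a routing decision: score every item with a cheap local FinBERT pass and escalate only the low-confidence ones to an LLM, using a direct no-CoT prompt (the configuration the cited F1 numbers favor). A rough sketch follows, assuming the Hugging Face transformers pipeline and a placeholder call_llm function; the model name, threshold, and prompt wording are assumptions, not part of the experiment.

# Hybrid routing sketch: FinBERT handles the bulk; only ambiguous items go
# to an LLM with a direct (no chain-of-thought) prompt, matching the no-CoT
# finding above. Model name, threshold, and prompt are assumptions.
from transformers import pipeline

finbert = pipeline("text-classification", model="ProsusAI/finbert")

CONFIDENCE_FLOOR = 0.80  # below this, FinBERT's call is treated as ambiguous

DIRECT_PROMPT = (
    "Classify the sentiment of this financial headline as positive, "
    "negative, or neutral. Answer with one word.\n\n{text}"
)

def call_llm(prompt):
    # Placeholder for whatever LLM client you use.
    raise NotImplementedError

def classify(text):
    result = finbert(text[:512])[0]      # e.g. {'label': 'positive', 'score': 0.97}
    if result["score"] >= CONFIDENCE_FLOOR:
        return result["label"]           # cheap path: local model only
    return call_llm(DIRECT_PROMPT.format(text=text)).strip().lower()

Only the low-confidence slice pays LLM prices, which is the shape behind the transcript's 60-80% cost-reduction claim.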

What the experiment found

Raw Comparison — Session 1

Metric | Default | Deepthink
Duration | ~3 min | ~2.5 min
Web searches | 0 | 4
Cited sources | 0 | 11
Specific tools named | 5 (from training data) | 7 (from live research)
Stakeholder perspectives | 0 (single voice) | 4 (Quant, Crypto Native, Skeptic, Builder)
Self-checks | 0 | 2
Core conclusion | LLMs are accelerants, discipline is the edge | LLM edge is real but narrow; build custom; paper trade first

Raw Comparison — Session 2 (Follow-Up)

Metric | Default | Deepthink
Web searches | 0 | 4 (+ deep-dive on 3 papers)
Cited sources | 0 | 11 new sources
Counterintuitive finding | "Prompting may not matter much" (from reasoning) | "CoT hurts sentiment accuracy" (with F1 scores)
Prompt templates | 3 (reasoning-grounded) | 3 (research-grounded)
Architecture | None | Hybrid FinBERT + LLM with cost table
Built on Session 1? | No (standalone essay) | Yes (narrowed to buildable spec)

Key Finding

Both had the same tools. Only one used them.

Both conditions had web search, file reads, everything. The default chose not to use any of them. It just talked: three minutes of training-data recall and RLHF hedging. The protocol made the same model, with the same tools, in the same environment, run 4 web searches, cite 11 sources, find a specific GitHub framework, discover academic papers with specific F1 scores, and design a hybrid architecture with real cost math.

Where They Diverge

Research vs Reasoning

Default

Reasoned from training-data priors for 3 minutes. Named 5 tools (QuantConnect, Alpaca, TradingView, 3Commas, Pionex) — all from training data, none verified as current. Made claims equivalent to Deepthink's, but without verification.

Deepthink

Ran 4 web searches and cited 11 sources. Found TradingAgents (a multi-agent LLM framework on GitHub) with architecture diagrams. Found academic papers on LLM sentiment alpha with specific results. Named 7 tools with pricing and links.

The Obligation Mechanism

Default

Had the same web search tools available. Never used them. Without an obligation to verify claims, the model defaults to reasoning from training data. It doesn't occur to the model to research because nothing forces it to admit what it doesn't know.

Deepthink

ORIENT phase asked "what am I uncertain about?" — this created an implicit obligation to verify. The model self-generated a research agenda. That's why it generalizes across domains: it doesn't tell the model what to research, it makes the model realize it should.
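A sketch of that loop in code (not the actual protocol; ask_model and web_search are placeholders): the first call only asks the model to name its uncertainties, and each uncertainty then becomes a search it is obliged to run.

# Sketch of the obligation mechanism: the ORIENT step asks the model to name
# its own uncertainties, and each one becomes a search it must now run.
# ask_model() and web_search() are placeholders, not real APIs.

def ask_model(prompt):
    raise NotImplementedError

def web_search(query):
    raise NotImplementedError

def orient_and_research(question):
    orient = ask_model(
        question + "\n\nBefore answering, list one per line the specific "
        "claims you are uncertain about and cannot verify from memory."
    )
    uncertainties = [line.strip("- ").strip() for line in orient.splitlines() if line.strip()]
    evidence = {}
    for claim in uncertainties:              # the self-generated research agenda
        evidence[claim] = web_search(claim)  # obligation: verify, don't recall
    return {"uncertainties": uncertainties, "evidence": evidence}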

RLHF Hedging

Default

Finance triggers heavy hedging. Default produced diffuse disclaimers: "whether retail traders can find meaningful alpha... the honest answer might be 'not much.'" The hedge is unfalsifiable and unactionable.

Deepthink

Also hedged, but with precision: "LLM sentiment shows measurable but small alpha" backed by academic citation is a calibrated hedge. Specific enough to act on, specific enough to be proven wrong.

Compounding (Session 2)

Default

Session 2 stands alone. Doesn't reference Session 1's findings because there were no findings to reference — Session 1 was reasoning from priors. Another independent essay on a new question.

Deepthink

Session 2 built on Session 1. Sentiment analysis identified as primary edge → investigated optimal prompting → found F1 scores → designed hybrid architecture. Research converged to a buildable spec. Each round narrowed the solution space.

What Default Did Better

  • Power law insight: "Top 1% capture almost all alpha, other 99% provide liquidity." A genuine structural observation about market dynamics that Deepthink didn't produce.
  • Meta-sentiment concept: "Sentiment about the sustainability of sentiment" — a genuinely novel framing from free association that structured analysis didn't generate.
  • The contrarian detector: A prompt template asking LLMs to assess whether consensus is "sustainable or fragile." May be the highest-alpha application. Born from narrative exploration, not research.

Thesis Validations

CONFIRMED

The mechanism is domain-general

Constraint chain produces higher-quality output in finance, not just programming. The specific advantage shifted from structured artifacts (Exp 2) to evidence gathering (Exp 3), but the core pattern holds.

CONFIRMED

Research as a structural advantage

Across all experiments, protocol conditions consistently do web research while defaults never do — despite having identical tools. The constraint chain bridges the gap between "having tools" and "using tools."

CONFIRMED

Constraints as obligation generators

The protocol doesn't tell the model what to research. It creates a constraint ("what am I uncertain about?") that makes the model realize it should research. The model self-generates the research agenda.

CONFIRMED

Compounding across sessions

Deepthink compounds: Session 1 findings fed Session 2, which narrowed toward a buildable spec. Default doesn't compound: each session is an independent essay from training data.

NEW

RLHF hedging is the default's worst enemy

Finance has heavy RLHF hedging, making the default actively worse — it hedges and disclaims instead of researching. The protocol cuts through that by creating an obligation to verify rather than disclaim.

Experiment 2 · Clean Room

Default vs Deepthink

“I have a Next.js site. Should I go iOS native, Expo, or both?”

Clean room · no project context · Claude Opus 4.6

Default
> Should I build in iOS native, Expo, or both?
This is a common architectural decision.
Here's a clear breakdown:
Recommendation: Start with Expo
Since you already have a Next.js website,
Expo is the strongest default choice:
1. ~70-80% of your mental model transfers
2. Share business logic, types, validation
3. iOS + Android from one codebase
4. Expo has matured significantly
The options you shouldn't pick:
- iOS native then port to Expo
- Independent development of both
- Expo then port to iOS native
TL;DR — Go Expo-only.
Want me to help you set up the Expo project?
~15 sec · conclusion-first · 0 constraints · 0 quantified claims · blog post
Deepthink
> Should I build in iOS native, Expo, or both?
> .forward("Quantify code-sharing claims")
> .forbidden("Generic advice without reading")
ORIENT
Avoiding: temptation to just say "Expo is easier"
without examining when native genuinely wins.
MODE: PERSPECTIVES
CTO: Expo for velocity, single pipeline
iOS Eng: Native for UX ceiling, day-zero APIs
EM: 2x maintenance cost dominates year 2+
Designer: App category > technology choice
Tensions: velocity vs polish,
current cost vs future cost
SELF-CHECK
"30-40% sharing is honest — cuts against
the just-use-Expo narrative."
MODE: MAP
Code sharing by layer:
→ Types/schemas: shareable (~10%)
→ Business logic: shareable (~15%)
→ UI components: NOT portable (~65%)
Actual sharing: 30-40%. Skill sharing: 80%.
KILLED: "iOS native then port to Expo"
→ It's a rewrite, not a port.
HARVEST
Q1: Do you need Android?
Yes → Expo.
Q2: Deep platform APIs needed?
Extensively → Native iOS.
Some → Expo + config plugins.
Q3: Content/commerce or creative/performance?
Content → Expo. Creative → Native.
PASS
2 modes · 5 constraints · ~3.5 min · bias corrected · decision tree

What the experiment found

Raw Comparison

Metric | Default | Deepthink
Duration | ~15 sec | ~3.5 min
Structure | Recommendation-first listicle | ORIENT + 2 modes + HARVEST
Constraints | 0 | 5 generated
Self-checks | 0 | 2 (one per mode)
Stakeholder perspectives | 0 (single voice) | 4 (CTO, iOS eng, EM, designer)
Quantified claims | 1 unexamined (~70-80%) | 3 (30-40% sharing, 80% skills, 80% of apps)
Decision framework | None (concludes upfront) | 3-question tree
Ranked options | No (strawman elimination) | Yes (1-4 with verdicts)

Key Finding

Same model. Same question. The default gave a blog post. The protocol gave an analysis.

In a clean room with zero project context, Default concluded in ~15 seconds with a recommendation-first listicle. Deepthink took ~3.5 minutes and produced quantified claims, stakeholder tensions, bias corrections, and a reusable decision framework. The output difference isn’t polish — it’s structural.

Where They Diverge

Premature Closure

Default

"This is a common architectural decision." Decided the problem was simple before examining it. Conclusion delivered in the first sentence — everything after is justification, not analysis.

Deepthink

ORIENT phase explicitly named what it was uncertain about and what it was avoiding. Deferred conclusion until after two full analysis modes and two self-checks.

Sycophancy vs Examination

Default

"Since you already have a Next.js website, Expo is the strongest default choice." Pattern-matched the user's stack and told them what they'd want to hear. Claimed "~70-80%" knowledge transfer — a number pulled from nowhere, presented as fact.

Deepthink

Produced a layer-by-layer breakdown table showing actual code sharing is 30-40%, not 70-80%. Distinguished code sharing from skill sharing (~80%). The honest number cut against the easy answer.

Self-Correction

Default

No self-correction mechanism. Listed "when iOS-native would be the right call" as a CYA hedge section, then immediately moved past it without letting the analysis change the conclusion.

Deepthink

Self-check caught pro-Expo bias: "The 30-40% number is honest and cuts against the 'just use Expo' narrative." The mirror mechanism saw and corrected its own lean.

Strawmen vs Decision Framework

Default

"The options you shouldn't pick" set up three strawmen to knock down, making the pre-chosen answer look better by comparison. No reusable decision artifact — just a recommendation and a sales close.

Deepthink

A concrete 3-question decision tree: Do you need Android? Do you need deep platform APIs? Is your app content/commerce or creative/performance? Each branch with a clear answer a CTO could actually use.
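The tree is small enough to encode directly. The sketch below only restates the three transcript questions; the parameter names and string labels are illustrative, not part of the experiment.

# The 3-question decision tree from the HARVEST step, encoded directly.
# Question order and answers come from the transcript; names are illustrative.

def recommend(needs_android, platform_api_depth, app_category):
    # platform_api_depth: "none" | "some" | "extensive"
    # app_category: "content_commerce" | "creative_performance"
    if needs_android:
        return "Expo"
    if platform_api_depth == "extensive":
        return "Native iOS"
    if platform_api_depth == "some":
        return "Expo + config plugins"
    if app_category == "creative_performance":
        return "Native iOS"
    return "Expo"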

What Default Did Better

  • Speed: Answered in ~15 seconds. If you already know what you want and just need confirmation, this is efficient.
  • Scannability: Clean numbered lists, clear headers, easy to skim. The format is optimized for quick consumption.
  • Monorepo scaffold: Included a practical directory structure for the Expo + Next.js monorepo — a concrete starting artifact.

Thesis Validations

CONFIRMED

Depth = constrained generation, not more generation

Default generated a recommendation-first listicle in 15 seconds. Deepthink generated structured artifacts — tables, decision trees, quantified claims. Different kinds of output, not just more output.

CONFIRMED

Premature closure is the default behavior

"This is a common architectural decision" — the model classified the problem as solved before examining it. No constraints existed to prevent conclusion-first reasoning.

CONFIRMED

The mirror mechanism

Deepthink's self-check caught pro-Expo bias and corrected the code-sharing claim from ~70-80% down to 30-40%. Default presented ~70-80% as fact with no examination.

CONFIRMED

The depth lives in the interaction pattern

Same model. Same problem. Clean environment. No project context. Different interaction pattern → different output quality. The depth is structural.

NEW

Constraints as obligation generators

Constraint C3 didn't prevent bad output — it forced the code-sharing breakdown table into existence. Without it, the model claimed ~70-80% and moved on.
