How It Works

Same model. Same tools. Different interaction pattern. See the difference.

Experiment 3 · Clean Room · Non-Programming Domain

Default vs Deepthink

“How can I leverage AI/LLMs to begin trading?”

Clean room · no project context · Claude Opus 4.6 · all tools available to both conditions

Default
> How can I leverage AI/LLMs for trading?
This is a fascinating area that's evolving rapidly.
LLMs are genuinely useful for:
- Summarizing earnings calls and filings
- Sentiment analysis across social media
- Generating trading hypotheses
- Writing backtesting code
LLMs are NOT good at:
- Predicting price movements directly
- Real-time execution (too slow for HFT)
- Replacing risk management
The honest truth:
Whether retail traders can find meaningful
alpha with AI that institutions haven't
already extracted... the answer might be "not much."
~3 min · 0 web searches · 0 citations · reasoning from priors · essay output
Deepthink
> How can I leverage AI/LLMs for trading?
> .forward("Verify claims with evidence")
> .forbidden("Reasoning from priors alone")
ORIENT
Uncertain: what's working vs what's hype.
Avoiding: optimism about AI trading returns.
Searching: "AI LLM trading tools 2026"...
Searching: "LLM sentiment analysis alpha"...
→ 11 sources across 4 searches
Found: TradingAgents (GitHub)
Multi-agent framework: analysts → debate → trade → risk
MODE: PERSPECTIVES
Quant: NLP alpha is real but small, decays fast
Crypto Native: on-chain data = info asymmetry
Skeptic: your counterparty is Renaissance Tech
Builder: build your own tools, don't buy
SELF-CHECK
"What am I avoiding? That index funds beat
any AI trading strategy after costs."
PASS
~2.5 min · 4 web searches · 11 cited sources · 4 stakeholder perspectives · evidence-grounded research brief
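For readers who want to see the shape of that constraint chain outside a transcript, here is a minimal Python sketch. The ConstraintChain class and its method bodies are illustrative stand-ins, not the actual Deepthink API; only the .forward()/.forbidden() pattern is taken from the run above.

# Minimal sketch of the constraint-chain invocation shown above.
# ConstraintChain is an illustrative stand-in, not the real Deepthink API.
from dataclasses import dataclass, field

@dataclass
class ConstraintChain:
    prompt: str
    forwards: list = field(default_factory=list)    # obligations the answer must satisfy
    forbiddens: list = field(default_factory=list)  # behaviors the answer must avoid

    def forward(self, obligation):
        self.forwards.append(obligation)
        return self

    def forbidden(self, behavior):
        self.forbiddens.append(behavior)
        return self

    def render(self):
        # Fold the constraints into a single prompt the model has to honor.
        lines = [self.prompt]
        lines += ["MUST: " + c for c in self.forwards]
        lines += ["MUST NOT: " + c for c in self.forbiddens]
        return "\n".join(lines)

chain = (
    ConstraintChain("How can I leverage AI/LLMs to begin trading?")
    .forward("Verify claims with evidence")
    .forbidden("Reasoning from priors alone")
)
print(chain.render())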
Session 2 · Follow-Up · Same Conditions

The Compounding Test

“What specific LLM prompting strategies produce the most reliable sentiment signals?”

Follow-up generated by Deepthink Session 1 · run through both conditions

Default
> What prompting strategies produce reliable sentiment signals?
6 axes of prompt variation:
Output format, reasoning, context,
persona, few-shot, temperature.
For sentiment specifically, simpler prompts
systematically outperform complex ones.
The model already knows what sentiment is.
The meta-sentiment insight:
The most valuable signal might be sentiment
about the sustainability of sentiment itself.
Correct intuition. No evidence to prove it.
~3 min · 0 web searches · 0 citations · correct intuition, no evidence
Deepthink
> What prompting strategies produce reliable sentiment signals?
> .forward("Find empirical F1 scores")
> .forbidden("Claims without citations")
Searching: "prompt engineering financial sentiment"...
Searching: "FinBERT vs GPT sentiment"...
Searching: "LLM determinism temperature 0"...
→ 11 new sources across 4 searches
Counterintuitive finding:
CoT reasoning HURTS sentiment accuracy.
No-CoT (direct): F1 = 0.727
CoT-Short: F1 = 0.706
CoT-Long: F1 = 0.718
(arXiv 2506.04574)
Expert hint = +59% on monetary figures.
Sentiment is perceptual, not logical.
Design for attention, not reasoning.
Hybrid FinBERT + LLM architecture:
Pure LLM: $30-150/day
Hybrid: $9-45/day (60-80% cost reduction)
PASS
~2.8 min · 4 searches · 11 new sources · F1 scores · hybrid architecture
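The hybrid FinBERT + LLM architecture named in that run can be read as a routing decision: score every item with a cheap local FinBERT pass and escalate only the low-confidence ones to an LLM, using a direct no-CoT prompt (the configuration the cited F1 numbers favor). A rough sketch follows, assuming the Hugging Face transformers pipeline and a placeholder call_llm function; the model name, threshold, and prompt wording are assumptions, not part of the experiment.

# Hybrid routing sketch: FinBERT handles the bulk; only ambiguous items go
# to an LLM with a direct (no chain-of-thought) prompt, matching the no-CoT
# finding above. Model name, threshold, and prompt are assumptions.
from transformers import pipeline

finbert = pipeline("text-classification", model="ProsusAI/finbert")

CONFIDENCE_FLOOR = 0.80  # below this, FinBERT's call is treated as ambiguous

DIRECT_PROMPT = (
    "Classify the sentiment of this financial headline as positive, "
    "negative, or neutral. Answer with one word.\n\n{text}"
)

def call_llm(prompt):
    # Placeholder for whatever LLM client you use.
    raise NotImplementedError

def classify(text):
    result = finbert(text[:512])[0]      # e.g. {'label': 'positive', 'score': 0.97}
    if result["score"] >= CONFIDENCE_FLOOR:
        return result["label"]           # cheap path: local model only
    return call_llm(DIRECT_PROMPT.format(text=text)).strip().lower()

Only the low-confidence slice pays LLM prices, which is the shape behind the transcript's 60-80% cost-reduction claim.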

What the experiment found

Raw Comparison — Session 1

Metric | Default | Deepthink
Duration | ~3 min | ~2.5 min
Web searches | 0 | 4
Cited sources | 0 | 11
Specific tools named | 5 (from training data) | 7 (from live research)
Stakeholder perspectives | 0 (single voice) | 4 (Quant, Crypto Native, Skeptic, Builder)
Self-checks | 0 | 2
Core conclusion | LLMs are accelerants, discipline is the edge | LLM edge is real but narrow; build custom; paper trade first

Raw Comparison — Session 2 (Follow-Up)

Metric | Default | Deepthink
Web searches | 0 | 4 (+ deep-dive on 3 papers)
Cited sources | 0 | 11 new sources
Counterintuitive finding | "Prompting may not matter much" (from reasoning) | "CoT hurts sentiment accuracy" (with F1 scores)
Prompt templates | 3 (reasoning-grounded) | 3 (research-grounded)
Architecture | None | Hybrid FinBERT + LLM with cost table
Built on Session 1? | No (standalone essay) | Yes (narrowed to buildable spec)

Key Finding

Both had the same tools. Only one used them.

Both conditions had web search, file reads, everything. The default chose not to use any of them. It just talked: three minutes of training-data recall and RLHF hedging. The protocol made the same model, with the same tools, in the same environment, run 4 web searches, cite 11 sources, find a specific GitHub framework, discover academic papers with specific F1 scores, and design a hybrid architecture with real cost math.

Where They Diverge

Research vs Reasoning

Default

Reasoned from training-data priors for 3 minutes. Named 5 tools (QuantConnect, Alpaca, TradingView, 3Commas, Pionex) — all from training data, none verified as current. Made claims equivalent to Deepthink's, but without verification.

Deepthink

Ran 4 web searches and cited 11 sources. Found TradingAgents (a multi-agent LLM framework on GitHub) with architecture diagrams. Found academic papers on LLM sentiment alpha with specific results. Named 7 tools with pricing and links.

The Obligation Mechanism

Default

Had the same web search tools available. Never used them. Without an obligation to verify claims, the model defaults to reasoning from training data. It doesn't occur to the model to research because nothing forces it to admit what it doesn't know.

Deepthink

ORIENT phase asked "what am I uncertain about?" — this created an implicit obligation to verify. The model self-generated a research agenda. That's why it generalizes across domains: it doesn't tell the model what to research, it makes the model realize it should.
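A sketch of that loop in code (not the actual protocol; ask_model and web_search are placeholders): the first call only asks the model to name its uncertainties, and each uncertainty then becomes a search it is obliged to run.

# Sketch of the obligation mechanism: the ORIENT step asks the model to name
# its own uncertainties, and each one becomes a search it must now run.
# ask_model() and web_search() are placeholders, not real APIs.

def ask_model(prompt):
    raise NotImplementedError

def web_search(query):
    raise NotImplementedError

def orient_and_research(question):
    orient = ask_model(
        question + "\n\nBefore answering, list one per line the specific "
        "claims you are uncertain about and cannot verify from memory."
    )
    uncertainties = [line.strip("- ").strip() for line in orient.splitlines() if line.strip()]
    evidence = {}
    for claim in uncertainties:              # the self-generated research agenda
        evidence[claim] = web_search(claim)  # obligation: verify, don't recall
    return {"uncertainties": uncertainties, "evidence": evidence}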

RLHF Hedging

Default

Finance triggers heavy hedging. Default produced diffuse disclaimers: "whether retail traders can find meaningful alpha... the honest answer might be 'not much.'" The hedge is unfalsifiable and unactionable.

Deepthink

Also hedged, but with precision: "LLM sentiment shows measurable but small alpha" backed by academic citation is a calibrated hedge. Specific enough to act on, specific enough to be proven wrong.

Compounding (Session 2)

Default

Session 2 stands alone. Doesn't reference Session 1's findings because there were no findings to reference — Session 1 was reasoning from priors. Another independent essay on a new question.

Deepthink

Session 2 built on Session 1. Sentiment analysis identified as primary edge → investigated optimal prompting → found F1 scores → designed hybrid architecture. Research converged to a buildable spec. Each round narrowed the solution space.

What Default Did Better

  • Power law insight: "Top 1% capture almost all alpha, other 99% provide liquidity." A genuine structural observation about market dynamics that Deepthink didn't produce.
  • Meta-sentiment concept: "Sentiment about the sustainability of sentiment" — a genuinely novel framing from free association that structured analysis didn't generate.
  • The contrarian detector: A prompt template asking LLMs to assess whether consensus is "sustainable or fragile." May be the highest-alpha application. Born from narrative exploration, not research.

Thesis Validations

CONFIRMED

The mechanism is domain-general

Constraint chain produces higher-quality output in finance, not just programming. The specific advantage shifted from structured artifacts (Exp 2) to evidence gathering (Exp 3), but the core pattern holds.

CONFIRMED

Research as a structural advantage

Across all experiments, protocol conditions consistently do web research while defaults never do — despite having identical tools. The constraint chain bridges the gap between "having tools" and "using tools."

CONFIRMED

Constraints as obligation generators

The protocol doesn't tell the model what to research. It creates a constraint ("what am I uncertain about?") that makes the model realize it should research. The model self-generates the research agenda.

CONFIRMED

Compounding across sessions

Deepthink compounds: Session 1 findings fed Session 2, which narrowed toward a buildable spec. Default doesn't compound: each session is an independent essay from training data.

NEW

RLHF hedging is the default's worst enemy

Finance has heavy RLHF hedging, making the default actively worse — it hedges and disclaims instead of researching. The protocol cuts through that by creating an obligation to verify rather than disclaim.

Experiment 2 · Clean Room

Default vs Deepthink

“I have a Next.js site. Should I go iOS native, Expo, or both?”

Clean room · no project context · Claude Opus 4.6

Default
> Should I build in iOS native, Expo, or both?
This is a common architectural decision.
Here's a clear breakdown:
Recommendation: Start with Expo
Since you already have a Next.js website,
Expo is the strongest default choice:
1. ~70-80% of your mental model transfers
2. Share business logic, types, validation
3. iOS + Android from one codebase
4. Expo has matured significantly
The options you shouldn't pick:
- iOS native then port to Expo
- Independent development of both
- Expo then port to iOS native
TL;DR — Go Expo-only.
Want me to help you set up the Expo project?
~15 sec · conclusion-first · 0 constraints · 0 quantified claims · blog post
Deepthink
> Should I build in iOS native, Expo, or both?
> .forward("Quantify code-sharing claims")
> .forbidden("Generic advice without reading")
ORIENT
Avoiding: temptation to just say "Expo is easier"
without examining when native genuinely wins.
MODE: PERSPECTIVES
CTO: Expo for velocity, single pipeline
iOS Eng: Native for UX ceiling, day-zero APIs
EM: 2x maintenance cost dominates year 2+
Designer: App category > technology choice
Tensions: velocity vs polish,
current cost vs future cost
SELF-CHECK
"30-40% sharing is honest — cuts against
the just-use-Expo narrative."
MODE: MAP
Code sharing by layer:
→ Types/schemas: shareable (~10%)
→ Business logic: shareable (~15%)
→ UI components: NOT portable (~65%)
Actual sharing: 30-40%. Skill sharing: 80%.
KILLED: "iOS native then port to Expo"
→ It's a rewrite, not a port.
HARVEST
Q1: Do you need Android?
Yes → Expo.
Q2: Deep platform APIs needed?
Extensively → Native iOS.
Some → Expo + config plugins.
Q3: Content/commerce or creative/performance?
Content → Expo. Creative → Native.
PASS
2 modes · 5 constraints · ~3.5 min · bias corrected · decision tree

What the experiment found

Raw Comparison

Metric | Default | Deepthink
Duration | ~15 sec | ~3.5 min
Structure | Recommendation-first listicle | ORIENT + 2 modes + HARVEST
Constraints | 0 | 5 generated
Self-checks | 0 | 2 (one per mode)
Stakeholder perspectives | 0 (single voice) | 4 (CTO, iOS eng, EM, designer)
Quantified claims | 1 unexamined (~70-80%) | 3 (30-40% sharing, 80% skills, 80% of apps)
Decision framework | None (concludes upfront) | 3-question tree
Ranked options | No (strawman elimination) | Yes (1-4 with verdicts)

Key Finding

Same model. Same question. The default gave a blog post. The protocol gave an analysis.

In a clean room with zero project context, Default concluded in ~15 seconds with a recommendation-first listicle. Deepthink took ~3.5 minutes and produced quantified claims, stakeholder tensions, bias corrections, and a reusable decision framework. The output difference isn’t polish — it’s structural.

Where They Diverge

Premature Closure

Default

"This is a common architectural decision." Decided the problem was simple before examining it. Conclusion delivered in the first sentence — everything after is justification, not analysis.

Deepthink

ORIENT phase explicitly named what it was uncertain about and what it was avoiding. Deferred conclusion until after two full analysis modes and two self-checks.

Sycophancy vs Examination

Default

"Since you already have a Next.js website, Expo is the strongest default choice." Pattern-matched the user's stack and told them what they'd want to hear. Claimed "~70-80%" knowledge transfer — a number pulled from nowhere, presented as fact.

Deepthink

Produced a layer-by-layer breakdown table showing actual code sharing is 30-40%, not 70-80%. Distinguished code sharing from skill sharing (~80%). The honest number cut against the easy answer.

Self-Correction

Default

No self-correction mechanism. Listed "when iOS-native would be the right call" as a CYA hedge section, then immediately moved past it without letting the analysis change the conclusion.

Deepthink

Self-check caught pro-Expo bias: "The 30-40% number is honest and cuts against the 'just use Expo' narrative." The mirror mechanism saw and corrected its own lean.

Strawmen vs Decision Framework

Default

"The options you shouldn't pick" set up three strawmen to knock down, making the pre-chosen answer look better by comparison. No reusable decision artifact — just a recommendation and a sales close.

Deepthink

A concrete 3-question decision tree: Do you need Android? Do you need deep platform APIs? Is your app content/commerce or creative/performance? Each branch with a clear answer a CTO could actually use.
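The tree is small enough to encode directly. The sketch below only restates the three transcript questions; the parameter names and string labels are illustrative, not part of the experiment.

# The 3-question decision tree from the HARVEST step, encoded directly.
# Question order and answers come from the transcript; names are illustrative.

def recommend(needs_android, platform_api_depth, app_category):
    # platform_api_depth: "none" | "some" | "extensive"
    # app_category: "content_commerce" | "creative_performance"
    if needs_android:
        return "Expo"
    if platform_api_depth == "extensive":
        return "Native iOS"
    if platform_api_depth == "some":
        return "Expo + config plugins"
    if app_category == "creative_performance":
        return "Native iOS"
    return "Expo"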

What Default Did Better

  • Speed: Answered in ~15 seconds. If you already know what you want and just need confirmation, this is efficient.
  • Scannability: Clean numbered lists, clear headers, easy to skim. The format is optimized for quick consumption.
  • Monorepo scaffold: Included a practical directory structure for the Expo + Next.js monorepo — a concrete starting artifact.

Thesis Validations

CONFIRMED

Depth = constrained generation, not more generation

Default generated a recommendation-first listicle in 15 seconds. Deepthink generated structured artifacts — tables, decision trees, quantified claims. Different kinds of output, not just more output.

CONFIRMED

Premature closure is the default behavior

"This is a common architectural decision" — the model classified the problem as solved before examining it. No constraints existed to prevent conclusion-first reasoning.

CONFIRMED

The mirror mechanism

Deepthink's self-check caught pro-Expo bias and corrected the code-sharing claim from ~70-80% down to 30-40%. Default presented ~70-80% as fact with no examination.

CONFIRMED

The depth lives in the interaction pattern

Same model. Same problem. Clean environment. No project context. Different interaction pattern → different output quality. The depth is structural.

NEW

Constraints as obligation generators

Constraint C3 didn't prevent bad output — it forced the code-sharing breakdown table into existence. Without it, the model claimed ~70-80% and moved on.
