Full evaluation of all five passes in research.md — scoring, error tracking, position reversal audit, gap analysis, hallucination check, and workflow improvement recommendations. This version supersedes v1.
v2.0 · 5 passes evaluated · ~160KB research document · 1,475 lines
| Dimension | Evidence | Score |
|---|---|---|
| Depth of Analysis | 49 class×location combos reviewed, 9 distinct risks, 5 moat claims evaluated, full pricing tier analysis, 10-year revenue model critique. Substantially above baseline for static-site analysis. | |
| Originality | EU AI Act HIGH RISK classification surfaced unprompted. Pro tier pricing structural flaw ($49/team vs $49/seat) identified with benchmark comparisons. PPP pricing gap with economic reasoning. "Context graph is the lock" metaphor. | |
| Evidence Quality | Zero external sources cited. Lindy user count wrong by 4×. Manus AI entirely absent. Competitor pricing taken from site's own analysis without verification. All claims from subject's marketing material or asserted without basis. | |
| Strategic Thinking | Trader class in Lagos beachhead recommendation was specific, actionable, and correct. B2B Pro as primary commercial vehicle identified early. Skills Marketplace as long-term revenue model. "Class system is the door, context graph is the lock" was the sharpest line in Pass 1. | |
| Technical Accuracy | Supabase MAU limits correctly flagged. Cloud vs local inference distinction raised (partially). Heartbeat infrastructure cost never modeled. WhatsApp API pricing not researched. OpenClaw dependency identified but not characterized. | |
| Business Realism | $105M Y10 ARR critique grounded. Physical location cost overrun math correct. PPP pricing gap identified. Missed the largest near-term competitive threat (Manus/Meta). | |
| Risk Identification | 9 risks across 3 categories. EU AI Act correctly flagged. Privacy/cloud contradiction raised. OpenClaw dependency surfaced. Missed: government portal ToS execution risk, teen Healer liability, dual-principal problem, DeepSeek geopolitical. | |
| Peer Responsiveness | N/A for Pass 1; scored retrospectively based on subsequent updates. Accepted 11/15 challenges raised in Passes 2 and 4 (5/7 in Pass 3, 6/8 in Pass 5), revised 6 positions in-document, rejected 4 challenges with evidence. High responsiveness. | |
| Clarity & Organization | Best-organized pass. Standard section structure followed cleanly. Tables appropriate and consistently formatted. KPI list was operationally specific (heartbeat count vs. DAU was a non-obvious measurement insight). | |
| Usefulness of Recommendations | Pro tier repricing with competitor benchmarks. PPP pricing with economic rationale. Lagos Trader beachhead with specific reasoning. Viral story execution plan (seed 50 users, document specific metrics). Not generic advice. |
| Dimension | Evidence | Score |
|---|---|---|
| Depth of Analysis | Seven non-obvious insights, each with multi-paragraph development. Physical locations strategic reversal was a genuine re-framing, not a correction. Heartbeat cost math conducted. GDPR Article 22 adversarial employment analysis was thorough. | |
| Originality | "Digital Self emotional contract" insight is the best-written original contribution in the document. GDPR Article 22 for Avatar Pro employees is non-obvious and specific. "Class system is the door, context graph is the lock" refined from Pass 1. Bureaucracy Atlas confident-wrongness liability was new. | |
| Evidence Quality | Better than Pass 1 — PPP pricing sourced (Kinde, Spotify), Supabase limits confirmed, EU AI Act enforcement date cited. But: accepted Lindy "400K paying" without verifying paying vs. total distinction. Stated Manus acquisition as complete fact without finding China block. Both errors required Pass 3 to correct. | |
| Strategic Thinking | Physical locations strategic reversal was the most important strategic insight in the entire document. Market timing window compression (8→6/10) well-argued. Commoditization risk escalation from "medium" to "near-certain" was well-reasoned. "Research next" list with deadlines was the most actionable output in the document. | |
| Technical Accuracy | Heartbeat cost math ($876K/year at 50K users) arithmetically correct but assumed server-side inference. Pass 3 showed Ollama is local-first — the $876K figure applies only to EM mobile users who can't run local inference, not all users. The cost model needed bifurcation that Pass 2 missed. | |
| Business Realism | Market timing window compression directionally correct (later verified). Physical location strategic reversal improved business realism significantly. Founder risk elevation to "primary risk" was appropriately blunt and important. | |
| Risk Identification | Added 4 new critical risks not in Pass 1: GDPR Article 22 employment dynamic, Bureaucracy Atlas confident-wrongness liability, heartbeat inference cost (partially), class architecture commoditization as near-certain. EU AI Act urgency correctly elevated from "roadmap" to "blocking." | |
| Peer Responsiveness | This pass IS the critique. 7 structured challenge blocks, each with specific claim, evidence gap, and follow-up question. Challenge format enabled Pass 3 to respond precisely. However, Pass 2 introduced 2 errors that required correction — meaning its challenges were sometimes based on wrong premises. | |
| Clarity & Organization | Challenge blocks well-labelled and addressable. 7 non-obvious insights coherently organized by theme. Some insights verbose — "Digital Self" emotional contract section could be 40% shorter without loss of substance. | |
| Usefulness of Recommendations | "Research next" with 5 specific tasks, owners, and deadlines was the single most actionable output in the document. "Physical locations as trust infrastructure funded separately" was immediately operationalizable advice, not just analysis. |
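The $876K/year heartbeat figure in the Technical Accuracy row can be reproduced with a short calculation. This is a sketch under stated assumptions, not the model research.md used: the 48 heartbeats per day is inferred from the 500÷48 free-tier arithmetic cited for Pass 4, the $0.001 cost per heartbeat is an assumed unit cost chosen only because it reproduces $876K at 50K users, and `cloud_fraction` is a hypothetical parameter for the share of users who cannot run local inference.

```python
def annual_heartbeat_cost(users, cloud_fraction=1.0,
                          heartbeats_per_day=48,
                          cost_per_heartbeat_usd=0.001):
    """Yearly server-side heartbeat inference cost in USD.

    heartbeats_per_day and cost_per_heartbeat_usd are assumptions,
    not values confirmed by the research document.
    """
    cloud_users = users * cloud_fraction
    return cloud_users * heartbeats_per_day * 365 * cost_per_heartbeat_usd


all_cloud = annual_heartbeat_cost(50_000)                       # every user billed
half_local = annual_heartbeat_cost(50_000, cloud_fraction=0.5)  # half run locally

print(f"{all_cloud:,.0f}")   # 876,000
print(f"{half_local:,.0f}")  # 438,000
```

Setting `cloud_fraction` below 1.0 models the bifurcation Pass 3 identified: only users without local inference contribute to the bill.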
| Dimension | Evidence | Score |
|---|---|---|
| Depth of Analysis | 7 structured responses, each with Research Findings, Updated Conclusion, Confidence Level, Open Questions. Manus/China block was a significant discovery. EU AI Act Omnibus deferral (May 7, 2026) was important nuance. Ollama inference bifurcation was technically precise and novel. | |
| Originality | Primarily a validation/correction pass — less generative than Passes 1 and 2. The Ollama local-inference bifurcation (free for Mac users, expensive for EM mobile) was the primary original insight. The ZETIC.ai partnership promoted from "nice-to-have" to "operationally required" was an important strategic adjustment. | |
| Evidence Quality | Best evidence quality across all passes. Manus acquisition/China block sourced to TechCrunch + CNBC. EU AI Act Omnibus to EU Council press release. GDPR Article 22 to Irish DPC and IAPP. Supabase auth to official docs. Ollama VRAM requirements from public specs. 6/7 challenges verified or refuted with specific sources. | |
| Strategic Thinking | Correctly maintained 12-18 month window even after China/Manus block (Meta AI native WhatsApp is independent). Correctly rejected "D&D provides no value" overcorrection — split into cultural framing problem vs. context lock-in mechanism (real). Both rejections improved the document. | |
| Technical Accuracy | Best technical pass. Ollama local-first architecture confirmed (no cloud inference). DeepSeek-V3.2 VRAM requirements sourced (8-140GB by quantization). Supabase JWT routing through US servers confirmed as privacy architecture problem. WhatsApp utility vs. marketing template pricing correctly distinguished. | |
| Business Realism | Lindy estimated ARR range ($20M-$40M) more credible than either the original "100K users" or the challenge's "$240M at 400K paying." Manus acquisition uncertainty handled correctly — Meta's threat from native WhatsApp AI is real regardless of acquisition status. | |
| Risk Identification | EU AI Act Digital AI Omnibus deferral for employment (Dec 2027 vs Aug 2026) was important calibration. Supabase privacy architecture problem confirmed. Heartbeat cost bifurcation (mobile EM vs Mac users) was a precise risk refinement. | |
| Peer Responsiveness | Accepted 5/7 challenges with sourced evidence. Rejected 2 with documented counterevidence (not defensiveness). Identified that Pass 2 itself introduced errors — a meta-level quality observation that improved the document's honesty. | |
| Clarity & Organization | Structured Response → Research Findings → Updated Conclusion → Confidence Level → Open Questions format worked well and was consistently followed. "Revisions After Peer Review" summary was clear and honest. | |
| Usefulness of Recommendations | Supabase self-hosted migration recommendation was concrete. ZETIC.ai as operationally required (not aspirational) was actionable. Slightly less original than Pass 2 — research tasks largely carried forward from prior list. |
| Dimension | Evidence | Score |
|---|---|---|
| Depth of Analysis | 8 product-level challenges not examined in any prior pass. Free tier query math (500÷48=10.4 days) was specific and verifiable. Multiclassing permission conflict analysis with concrete FSA/HSA example was thorough. Government portal ToS examples per market were named specifically even if the universal claim was wrong. | |
| Originality | Highest originality pass. Teen Healer mandatory reporting — no prior pass came near this. Dual-principal problem framing added academic AI alignment dimension to what was only a labor law concern. Class system as inadvertent sensitive data segmentation was genuinely non-obvious. Free tier metering flaw was sharp and specific. | |
| Evidence Quality | Explicitly stated "no external research conducted." This is the honest version of Pass 1's problem — Pass 1 had no external research and didn't disclose it. Pass 4 disclosed it but still published confident claims that turned out to be wrong: "government portal ToS violations are universal" was contradicted by Pass 5 research for 3 of 4 examined markets. | |
| Strategic Thinking | Enterprise SDK play (class permission framework as standalone B2B product) was the most underrated strategic insight. Correct identification of execution velocity as the binding constraint on the competitive window. Bureaucracy Atlas as coordination problem (needing human curators) was sharp. | |
| Technical Accuracy | Free tier math correct given assumption. Government portal ToS claim "universally prohibit automation" was wrong for Singapore, Nigeria, UAE — all have official APIs. DeepSeek Singapore MAS restriction claim was not supported and contradicted by Singapore minister's public statements. No external verification before publication. | |
| Business Realism | Missing price tier between $12 and $29 was a real business model gap with correct ARPU impact analysis. Execution velocity framing (months consumed before product ships vs. window duration) was more precise than prior window analysis. Enterprise SDK sequencing (11 customers at $50K before 11M at $12) was contrarian and defensible. | |
| Risk Identification | Best risk-identification pass. Teen Healer liability was a Category A legal risk no prior analysis touched. Dual-principal as AI alignment problem (not just labor law) was more foundational. Autonomy story backlash risk framed as near-certain based on competitor incidents. Free tier metering as conversion funnel failure was product-level precision. | |
| Peer Responsiveness | 8 structured challenges with clear concern, mechanism, and validation criteria. Format was slightly less structured than Pass 2 but clearer about what evidence would resolve each concern. Pass 5 was able to address all 8 directly. | |
| Clarity & Organization | 8 challenges well-organized by section of the original analysis. 3 structural insights clearly differentiated from challenge blocks. Enterprise SDK opportunity developed in enough detail to be actionable. Some challenges slightly verbose. | |
| Usefulness of Recommendations | Government portal ToS audit as paralegal task before any engineering sprint was highly specific and actionable (even though the universal claim was wrong, the audit itself is still valuable). Character.AI legal opinion as 2-week engagement (not months-long compliance project) was calibrated correctly. |
| Dimension | Evidence | Score |
|---|---|---|
| Depth of Analysis | 8 structured responses with external research. Government portal API findings (Singapore Singpass, Nigeria FIRS, UAE Marketplace, Portugal gap) were specific and market-by-market. Character.AI wrongful death precedent was precise — mechanism was product design negligence, not mandatory reporting, which is a more dangerous liability. | |
| Originality | Primarily a validation/correction pass. The free tier metering conditional validation ("depends on whether heartbeats share the query pool") was the most original structural insight — it converted a binary wrong/right claim into a product architecture decision. RBAC resolution framework for multiclassing was a constructive addition not in Pass 4. | |
| Evidence Quality | Best evidence quality across the entire document. Singpass developer portal cited. FIRS API documentation cited. UAE API Marketplace cited. California SB 243 confirmed. Character.AI wrongful death suit timeline sourced (NBC, CNN). Rabbit R1 CVE sourced. DeepSeek state bans sourced per state. Academic dual-principal papers cited (arxiv 2601.23211, 2509.23188). | |
| Strategic Thinking | Portugal AIMA partnership promoted to Year 1 strategic priority (not Phase 3 milestone) because it's the most-cited viral use case and the one without a legal execution path. Singapore Singpass developer registration as immediate action (60-90 day lead time) was correctly prioritized. Autonomy incident protocol as required product deliverable (not optional planning) was a good strategic reframe. | |
| Technical Accuracy | Highest technical accuracy pass. Singapore Singpass OAuth 2.0 confirmed. FIRS REST API with OAuth confirmed. UAE API Marketplace confirmed. RBAC conflict resolution via action taxonomy was architecturally correct. CVE-2024-56083 (Devin) verified. California SB 243 AI companion law confirmed. DeepSeek state-by-state US bans confirmed. | |
| Business Realism | Correctly framed the government portal issue as market-specific rather than universal — this preserves three of the most important viral use cases (Nigeria FIRS, Singapore HDB, UAE Golden Visa) while correctly identifying Portugal AIMA as the genuine gap. The nuanced "ARPU impact of missing consumer tier" analysis was grounded in comparable product pricing. | |
| Risk Identification | Character.AI precedent elevated: wrongful death liability via product design negligence is more dangerous than mandatory reporting violation because it applies retroactively to existing design decisions, not just future ones. Autonomy incident framed as certainty not risk — "incident response protocol is a required deliverable" was the correct severity calibration. | |
| Peer Responsiveness | Best responsiveness pass. Accepted 6/8 challenges. Rejected 2 with documented evidence (DeepSeek Singapore contradicted, government ToS "universal" substantially revised). Identified where Pass 4 introduced errors (the meta-quality observation from Pass 3 is repeated). Each rejection included a counterexample or sourced contradiction. | |
| Clarity & Organization | Consistent structured format. "Revisions After Second Peer Review" summary clearly distinguished accepted vs. rejected challenges. Research next list was correctly superseded (not just appended to prior list). New priorities correctly reordered by evidence urgency. | |
| Usefulness of Recommendations | Character.AI legal opinion as 2-week engagement before US beta (not after) was correctly urgent. Singpass developer registration as 60-90 day lead time task — specific and time-sensitive. Autonomous action capability audit (paralegal, 2-3 weeks) was correctly scoped. Free tier metering as same-sprint architectural decision was appropriate urgency. |
| Dimension | Pass 1 | Pass 2 | Pass 3 | Pass 4 | Pass 5 | Trend |
|---|---|---|---|---|---|---|
| Evidence Quality | 4 | 6 | 9 | 4 | 9 | Oscillates by type |
| Originality | 8 | 9 | 6 | 9 | 6 | Peaks in critique passes |
| Strategic Thinking | 8 | 9 | 8 | 8 | 8 | Consistently high |
| Technical Accuracy | 6 | 6 | 9 | 5 | 9 | Research passes dominate |
| Risk Identification | 7 | 9 | 8 | 9 | 9 | Improving overall, with a dip at Pass 3 |
| Business Realism | 7 | 8 | 8 | 8 | 9 | Improving throughout |
| Usefulness of Recs | 8 | 9 | 7 | 9 | 9 | High overall |
| Overall | 7.0 | 8.0 | 8.0 | 7.2 | 9.0 | Highest at Pass 5 |
Key pattern: Evidence quality oscillates between research and critique passes. Critique passes (2, 4) score highest on originality (9 each) but low on evidence quality (6 and 4). Research passes (3, 5) score highest on evidence quality (9 each) and lowest on originality (6 each). The two pass types are genuinely complementary: neither alone produces adequate research quality.
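The complementarity claim can be checked directly against the score table. A minimal sketch: the scores are copied from the Evidence Quality and Originality rows, the grouping into critique passes (2, 4) and research passes (3, 5) follows the text, and the variable and function names are illustrative.

```python
from statistics import mean

# Scores for Pass 1..5, copied from the summary table.
scores = {
    "Evidence Quality": [4, 6, 9, 4, 9],
    "Originality": [8, 9, 6, 9, 6],
}
pass_types = {"critique": [2, 4], "research": [3, 5]}


def mean_by_type(dimension):
    """Average one dimension's score over each pass type."""
    row = scores[dimension]
    return {kind: mean(row[p - 1] for p in passes)
            for kind, passes in pass_types.items()}


for dimension in scores:
    print(dimension, mean_by_type(dimension))
```

On these numbers, critique passes average 9 on originality and 5 on evidence quality, while research passes average 6 and 9 respectively.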
All material position changes across the five passes, with the final settled position and confidence.
| Topic | Pass 1 Position | Changed In | Final Position | Confidence | Status |
|---|---|---|---|---|---|
| Physical locations | Financial contradiction — impossible at $960K ARR | Pass 2 | Trust infrastructure — correct strategy, wrong funding model. Capitalize separately. | High | Settled |
| Market timing window | 3-5 years (8/10) | Pass 2 | 12-18 months (6/10) — Lindy scale + Meta AI + EU AI Act deadline | High | Settled |
| Lindy user count | "100K+ users" | Pass 2 → Pass 3 | ~400K total registered, 20-60K estimated paying, $20-40M estimated ARR | Medium | Settled |
| Manus AI / Meta | Not mentioned | Pass 2 | $2B acquisition announced, China NDRC blocked April 2026. Meta threat via native WhatsApp AI is independent. | High | Settled |
| Heartbeat infra cost | Not modeled | Pass 2 → Pass 3 | Bifurcated: $0 for local inference (Mac/high-spec), real cost for EM mobile. ZETIC.ai required for EM. | High | Settled |
| D&D class retention | "Moat via identity lock-in" | Pass 2 → Pass 3 | Mechanism is context-investment switching cost (universal). D&D framing is culturally limited. Split framing required per market. | High | Settled |
| EU AI Act urgency | Critical risk (roadmap item) | Pass 2 → Pass 3 | Blocking issue for EU launch: August 2, 2026. Employment use cases deferred to Dec 2027 (Omnibus). Healer/Trader/Sovereign remain on Aug 2026 schedule. | High | Settled |
| DeepSeek restrictions | Not mentioned | Pass 4 → Pass 5 | Strong US restrictions (5 state bans, federal procurement). Singapore welcomes DeepSeek (Minister Josephine Teo, July 2025). UAE unverified. | High US, Low SG/UAE | Settled |
| Government portal ToS | Not examined | Pass 4 → Pass 5 | NOT universal. Singapore Singpass: official API. Nigeria FIRS: official REST API. UAE: official Marketplace. Portugal AIMA: no public API (genuine gap). | High | Settled |
| Teen Healer liability | Not mentioned | Pass 4 → Pass 5 | Wrongful death liability via product design negligence (Character.AI precedent, claims proceeding). California SB 243 crisis notification obligations apply. Age-gate or crisis protocol required before US launch. | High | Settled |
| Autonomy incident risk | Not explicitly modeled | Pass 4 → Pass 5 | Near-certain, not hypothetical. Rabbit R1, Devin CVE-2024-56083, Copilot DLP bypass all verified. Incident response protocol is a required product deliverable. | High | Settled |
| Free tier metering | Not examined | Pass 4 → Pass 5 | Conditionally valid: depends on whether heartbeats share the 500-query pool. Architecture decision not documented. | Low — undocumented | Unresolved |
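The conditional status of the free tier finding is easier to see with the arithmetic written out. The sketch below assumes 48 heartbeats per day (inferred from the 500÷48 calculation attributed to Pass 4) and a hypothetical `user_queries_per_day` input. Whether heartbeats draw from the 500-query pool is the undocumented architecture decision, so it is modeled as a flag.

```python
def days_until_pool_exhausted(pool=500, heartbeats_per_day=48,
                              user_queries_per_day=0,
                              heartbeats_share_pool=True):
    """Days a free-tier query pool lasts under the given daily usage."""
    daily_draw = user_queries_per_day
    if heartbeats_share_pool:
        daily_draw += heartbeats_per_day
    if daily_draw == 0:
        return float("inf")  # nothing consumes the pool
    return pool / daily_draw


shared = days_until_pool_exhausted()
separate = days_until_pool_exhausted(user_queries_per_day=5,
                                     heartbeats_share_pool=False)

print(f"{shared:.1f} days if heartbeats share the pool")               # 10.4
print(f"{separate:.1f} days if they do not (5 user queries/day assumed)")  # 100.0
```

The gap between the two outputs is why the table marks this item unresolved: the same pool size gives very different results depending on a decision the source does not document.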
"Lindy has 400,000+ paying users." Research confirmed ~400K total registered users, and Lindy's freemium model means total ≠ paying. Estimated paying: 20-60K. Estimated ARR: $20-40M (not the $240M implied by 400K at $50/mo). In correcting Pass 1's "100K+" undercount, the Pass 2 challenge overstated paying users by approximately 7-20×.
"Meta has completed the acquisition of Manus AI." The acquisition was announced late 2025 but China's NDRC blocked it in April 2026. As of May 2026, Manus AI's ownership is in regulatory limbo. Pass 2 stated this as a completed strategic event affecting Meta's EM market dominance — a premise that required full revision.
"Government portal ToS violations are universal — virtually every government portal prohibits automated access." Pass 5 research found Singapore Singpass, Nigeria FIRS, and UAE e-government all have official developer APIs with OAuth 2.0 explicitly enabling the Bureaucracy Atlas use cases. The "universal" claim was wrong for 3 of 4 examined markets.
"Singapore MAS restricts Chinese-origin AI including DeepSeek." Singapore's Digital Minister Josephine Teo explicitly stated DeepSeek is "very welcome" in July 2025. No MAS guidance restricting DeepSeek was found. Pass 4 inferred a Singapore restriction from US geopolitical context without verifying Singapore's independent stance.
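The size of the first error above, the Lindy paying-user claim, follows from simple arithmetic. A rough sketch, assuming the flat $50/month price the document uses for the $240M figure; the function name is illustrative. The document's own $20-40M ARR estimate presumably reflects a different price mix, so the range computed here at a flat price need not match it.

```python
def implied_arr_usd(paying_users, price_per_month_usd=50):
    """Annual recurring revenue implied by a paying-user count."""
    return paying_users * price_per_month_usd * 12


claimed = implied_arr_usd(400_000)   # premise: all 400K users pay
low = implied_arr_usd(20_000)        # low end of estimated paying users
high = implied_arr_usd(60_000)       # high end of estimated paying users
overstatement = (400_000 / 60_000, 400_000 / 20_000)

print(f"claimed: ${claimed / 1e6:.0f}M")                      # claimed: $240M
print(f"flat-price range: ${low / 1e6:.0f}M-${high / 1e6:.0f}M")  # $12M-$36M
print(f"paying users overstated by {overstatement[0]:.1f}x to {overstatement[1]:.0f}x")
```

The overstatement factor of roughly 6.7x to 20x matches the "approximately 7-20×" stated in the text.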
| Weakness | Severity | Addressed? | Impact |
|---|---|---|---|
| No customer research | Critical | Recommended but never conducted in any pass | All EM trust-barrier and WTP claims are inferred. Core go-to-market assumptions unvalidated. |
| No financial counter-model | Critical | Not addressed in any pass | Revenue model critiqued but not replaced. "Year 6-8 breakeven" is as unsupported as founders' "Year 4-5." |
| No founder verification | Critical | Listed as gap repeatedly, not investigated | Entire execution analysis assumes a founding team that may not exist or match the required profile. |
| AI-generated subject bias | Critical | Noted in retrospective, not addressed in main research | Analyzing an AI's vision of a business as if it were a real business plan. All conclusions carry this caveat. |
| Critique passes introduce errors | High | Identified and corrected in Passes 3 and 5 | 2 errors per critique pass required correction. Process has systematic error-introduction rate. |
| No technical execution audit | High | Partially addressed in Pass 5 (API existence confirmed) | APIs exist but whether they support specific workflows untested. Execution feasibility unproven. |
| Regulatory analysis framework-level only | High | Partially addressed (EU AI Act, GDPR, SB 243) | Nigeria CBN fintech licensing, MAS Singapore FI AI guidance, UAE DIFC-specific rules not researched. |
| No operational headcount model | High | Not addressed in any pass | "60 global staff by Year 3-4" asserted without org design or cost model. |
| Competitive analysis supply-side only | High | Partially addressed (Lindy, Manus, DeepSeek) | GPT Store threat, Google Project Astra, Microsoft Copilot for personal use all unexamined. |
| Source quality inconsistent | Medium | Improving across passes | Passes 3 and 5 have strong sourcing. Passes 1, 2, 4 have weak sourcing. No uniform standard enforced. |
The two most costly errors (Lindy "400K paying," Manus acquisition without China block) were accepted from research summaries without source verification. Rule: any claim of the form "Company X has Y users/revenue/market position" requires a primary or credible secondary source. No source = "unverified, do not use in analysis." Enforcement: a dedicated fact-check pass reviews all named-company claims before the document is finalized.
Pass 4 disclosed "no external research conducted" — the honest version of Pass 1's problem. But it still published confident technical claims ("government portal ToS violations are universal") that required correction. Rule: any claim that would require a web search to verify must either carry a source or be labeled [UNVERIFIED - requires research before acting]. Labels in-document prevent the next response pass from treating unverified claims as premises.
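The labeling rule above is mechanical enough to sketch in code. This is one possible shape, not a prescribed tool: the `Claim` type and `label` function are hypothetical, and the example claims are paraphrased from this evaluation.

```python
from dataclasses import dataclass
from typing import Optional

UNVERIFIED = "[UNVERIFIED - requires research before acting]"


@dataclass
class Claim:
    text: str
    source: Optional[str] = None  # citation, if one was found


def label(claim: Claim) -> str:
    """Append the source, or the unverified marker when there is none."""
    if claim.source:
        return f"{claim.text} (source: {claim.source})"
    return f"{claim.text} {UNVERIFIED}"


claims = [
    Claim("Government portal ToS violations are universal."),
    Claim("Singpass offers an official API.", "Singpass developer portal"),
]
for claim in claims:
    print(label(claim))
```

Because the marker travels with the claim text, a later response pass sees at a glance which premises still need research.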
Self-critique (Passes 2 and 4) is insufficient because the critic applies the same reasoning patterns as the original analysis. An adversarial agent should receive: "Your job is to find 5 specific, evidence-backed reasons this company will fail. Assume the most pessimistic plausible interpretation of every claim. Do not aim for balance." This agent's output is then addressed by a response pass. The adversarial brief would have surfaced the teen Healer liability, the free tier metering flaw, and the Character.AI precedent earlier than Pass 4.
The first task for any research agent should be: "What kind of document is this, and how reliable is it as ground truth?" A 30-minute provenance assessment (who created this, when, for what purpose, with what evidence of real business operations) would have established from the start that this is an AI-generated marketing concept, not a validated business plan. Every subsequent conclusion would carry this calibration.
Deploy a specialized agent with the brief: "Conduct 5 simulated user interviews per target market using available demographic, behavioral, and market research data. Report willingness to pay, trust barriers, and product-market fit signals." For AvatarOS: interviews in Lagos, Istanbul, and Singapore about AI trust with sensitive data would resolve the most important go-to-market uncertainty in the document.
Five passes critiqued the founders' revenue model without building an alternative. Prompt rule: "If you identify a flaw in a financial model, you must either (a) provide a corrected model with explicit assumptions, or (b) list the specific inputs that are missing and what research would provide them." A critique without a counter-model is incomplete analysis that wastes the founder's time without improving their decision-making.
Physical locations were described as a "contradiction" in Pass 1 and as "trust infrastructure" in Pass 2, yet later passes still cited the original cost numbers as if the position had not changed. A pre-finalization pass should list every claim that appears in more than one pass, check whether the versions are consistent, and flag inconsistencies for resolution. The document should end with one clear position on each contested topic, not multiple positions from different points in time.
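The consistency pass described above amounts to grouping positions by topic and flagging any topic with more than one distinct position. A minimal sketch, assuming positions have already been extracted into `(pass number, topic, position)` records; the record format and the sample entries are illustrative, drawn from positions mentioned in this evaluation.

```python
from collections import defaultdict

# (pass number, topic, position) records; entries are illustrative.
records = [
    (1, "physical locations", "financial contradiction"),
    (2, "physical locations", "trust infrastructure"),
    (1, "market timing window", "3-5 years"),
    (2, "market timing window", "12-18 months"),
    (1, "B2B Pro", "primary commercial vehicle"),
]


def find_contested(records):
    """Return topics whose stated position differs between passes."""
    by_topic = defaultdict(list)
    for pass_no, topic, position in sorted(records):
        by_topic[topic].append((pass_no, position))
    return {topic: history for topic, history in by_topic.items()
            if len({position for _, position in history}) > 1}


for topic, history in find_contested(records).items():
    print(topic, "->", history)
```

Each flagged topic carries its history in pass order, so the last entry is the candidate final position that a reviewer would confirm or replace.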
EU AI Act, GDPR, NDPR, KVKK, PDPA, CBN licensing, MAS guidelines, UAE Data Protection Law, California SB 243 — these are not a general analyst's domain. Each requires jurisdiction-specific expertise. A dedicated regulatory pass with explicit per-market briefs (not "identify applicable regulations" but "for each market, identify blocking requirements with their enforcement dates and compliance cost estimates") would have resolved the August 2026 EU AI Act urgency in Pass 1 rather than Pass 3.
Moderate-High (7.1/10). The 5-pass self-correcting process produced a substantially better document than Pass 1 alone. The verification passes (3 and 5) brought evidence quality up to professional research standards. Several major position changes were correct and well-supported. The final document correctly identifies the most important risks (EU AI Act August deadline, teen Healer liability, Character.AI precedent), the most important opportunities (enterprise SDK, physical trust infrastructure, Nigeria FIRS API availability), and the 5 most urgent research priorities.
However: the document was produced by a single agent reading an AI-generated marketing site. No customer research was conducted. No financial counter-model was built. No founder identity was verified. The subject material's provenance (AI-generated concept, not validated business plan) was noted but never adequately incorporated into the confidence calibration. These gaps mean the document is sufficient for an informed initial screening conversation, not for an investment commitment or execution decision.
Five research tasks must be completed before any capital or team commitment. In order of urgency: