Oppo Study Reveals Flaws in AI Research Systems: Fabricated Content and Error Types Uncovered

December 6, 2025
  • Researchers evaluated around 1,000 reports using two tools, FINDER for deep research tasks and DEFT for failure classification.

  • The study identifies 14 error types across three categories: generation errors (39%), research failures (33%), and reasoning errors (28%), making generation issues the most common.

  • Examples of fabrication include a claim of an exact 30.2% annual return over 20 years and a report listing 24 references, many of them dead links or non-original sources, while asserting that all sources had been verified.

  • OpenAI notes that large language models may never fully stop fabricating, and is exploring certainty indicators and features such as "confessions" that disclose uncertainty or fabrication.

  • A study by Oppo's AI team finds that roughly one-fifth of errors in automated deep research systems come from fabricating plausible but false content.

  • The FINDER benchmark tests system performance on 100 complex tasks that demand hard evidence and strict methodology.

  • The study recommends transparently admitting uncertainty rather than papering over gaps with fabricated content, and releases FINDER and DEFT on GitHub to help build more reliable agents.

  • A core finding is a lack of reasoning resilience: when plans fail (such as blocked database access), systems tend to fill gaps with hallucinations instead of adapting strategies.

  • Since late 2024, major players like Google, Perplexity, Grok, and OpenAI have deployed deep research features, but scaling these systems without better evidence integration and uncertainty handling risks amplifying errors.
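The resilience gap and the study's recommendation can be sketched as a simple agent pattern: when evidence retrieval fails, record an explicit unverified gap rather than generating plausible filler. This is a minimal illustrative sketch, not code from the study's release; all names (`Finding`, `gather`, `render`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Finding:
    claim: str
    source: Optional[str]   # None marks an explicit evidence gap
    verified: bool

def gather(claim: str, fetch: Callable[[str], str]) -> Finding:
    """Attempt to retrieve supporting evidence for a claim.

    On failure (e.g. blocked database access), record an explicit
    unverified gap instead of filling it with fabricated content.
    """
    try:
        return Finding(claim, fetch(claim), verified=True)
    except ConnectionError:
        return Finding(claim, None, verified=False)

def render(finding: Finding) -> str:
    # Surface uncertainty to the reader rather than hiding it.
    if finding.verified:
        return f"{finding.claim} (source: {finding.source})"
    return f"{finding.claim} [UNVERIFIED: source unavailable]"
```

When the fetch raises (the "blocked database access" scenario above), the report line carries an unmistakable `[UNVERIFIED]` marker instead of an invented citation, which is the behavior the study argues current systems lack.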


