<?xml version="1.0"?>
<News hasArchived="false" page="1" pageCount="1" pageSize="10" timestamp="Sun, 19 Apr 2026 17:02:39 -0400" url="https://dev.my.umbc.edu/groups/csee/posts.xml?tag=openai">
  <NewsItem contentIssues="true" id="149205" important="false" status="posted" url="https://dev.my.umbc.edu/groups/csee/posts/149205">
  <Title>Popular AIs head-to-head: OpenAI beats DeepSeek on sentence-level reasoning</Title>
  <Tagline>Article by UMBC Prof. Manas Gaur from The Conversation</Tagline>
  <Body>
    <![CDATA[
    <div class="html-content"><div><img src="https://images.theconversation.com/files/659376/original/file-20250402-56-rft1hz.jpg?ixlib=rb-4.1.0&amp;rect=0%2C360%2C7086%2C3977&amp;q=45&amp;auto=format&amp;w=754&amp;fit=clip" style="max-width: 100%; height: auto;">
            <br>DeepSeek’s language AI rocked the tech industry, but it comes up short on one measure.
              <span><a href="https://www.gettyimages.com/detail/news-photo/this-illustration-photograph-shows-screens-displaying-the-news-photo/2195925950" rel="nofollow external" class="bo">Lionel Bonaventure/AFP via Getty Images</a></span>
            <br>
        
    
      <h4><span><hr></span></h4><h4><span><a href="https://theconversation.com/profiles/manas-gaur-2312608" rel="nofollow external" class="bo">Manas Gaur</a></span></h4></div><div>
    
      <p>ChatGPT and other AI chatbots based on large language models are known to occasionally make things up, including <a href="https://doi.org/10.1001/jamanetworkopen.2023.27647" rel="nofollow external" class="bo">scientific and</a> <a href="https://doi.org/10.48550/arXiv.2405.20362" rel="nofollow external" class="bo">legal citations</a>. It turns out that measuring how accurate an AI model’s citations are is a good way of assessing the model’s reasoning abilities.</p>
    
    <p>An AI model “reasons” by breaking down a query into steps and working through them in order. Think of how you learned to solve math word problems in school.</p>
    
    <p>Ideally, to generate citations an AI model would understand the key concepts in a document, generate a ranked list of relevant papers to cite, and provide convincing reasoning for how each suggested paper supports the corresponding text. It would highlight specific connections between the text and the cited research, clarifying why each source matters.  </p>
    
    <p>The question is, can today’s models be trusted to make these connections and provide clear reasoning that justifies their source choices? The answer goes beyond citation accuracy to address how useful and accurate large language models are for any information retrieval purpose.</p>
    
    <p>I’m a <a href="https://scholar.google.co.in/citations?hl=en&amp;user=VJ8ZdCEAAAAJ&amp;view_op=list_works&amp;sortby=pubdate" rel="nofollow external" class="bo">computer scientist</a>. My colleagues (researchers from the AI Institute at the University of South Carolina, Ohio State University and the University of Maryland, Baltimore County) and I have developed the <a href="https://doi.org/10.48550/arXiv.2405.02228" rel="nofollow external" class="bo">Reasons benchmark</a> to test how well large language models can automatically generate research citations and provide understandable reasoning.</p>
    
    <p>We used the benchmark to <a href="https://doi.org/10.48550/arXiv.2405.02228" rel="nofollow external" class="bo">compare the performance</a> of two popular AI reasoning models, DeepSeek’s R1 and OpenAI’s o1. Though DeepSeek <a href="https://www.theguardian.com/business/2025/jan/27/tech-shares-asia-europe-fall-china-ai-deepseek" rel="nofollow external" class="bo">made headlines</a> with its stunning <a href="https://theconversation.com/why-building-big-ais-costs-billions-and-how-chinese-startup-deepseek-dramatically-changed-the-calculus-248431" rel="nofollow external" class="bo">efficiency and cost-effectiveness</a>, the Chinese upstart has a way to go to match OpenAI’s reasoning performance.</p>
    
    <h2>Sentence specific</h2>
    
    <p>The accuracy of citations has a lot to do with whether the AI model is reasoning about information <a href="https://doi.org/10.48550/arXiv.2405.17980" rel="nofollow external" class="bo">at the sentence level</a> rather than paragraph or document level. Paragraph-level and document-level citations can be thought of as throwing a large chunk of information into a large language model and asking it to provide many citations. </p>
    
    <p>In this process, the large language model overgeneralizes and misinterprets individual sentences. The user ends up with citations that <a href="https://doi.org/10.48550/arXiv.2409.02897" rel="nofollow external" class="bo">explain the whole paragraph or document</a>, not the relatively fine-grained information in the sentence.</p>
    
    <p>Further, reasoning suffers when you ask the large language model to read through an entire document. These models mostly rely on memorized patterns, and they are typically better at finding relevant information at the beginning and end of longer texts <a href="https://doi.org/10.48550/arXiv.2307.03172" rel="nofollow external" class="bo">than in the middle</a>. This makes it difficult for them to fully understand all the important information throughout a long document.</p>
    
    <p>Large language models get confused because paragraphs and documents hold a lot of information, which affects citation generation and the reasoning process. Consequently, reasoning from large language models over paragraphs and documents becomes more like <a href="https://doi.org/10.48550/arXiv.2411.17375" rel="nofollow external" class="bo">summarizing or paraphrasing</a>.</p>
    
    <p>The Reasons benchmark addresses this weakness by examining large language models’ citation generation and reasoning. </p>
    
    
                <div class="embed-container"><iframe src="https://www.youtube.com/embed/kQZzYMHre0U?wmode=transparent&amp;start=0" frameborder="0" webkitAllowFullScreen="webkitAllowFullScreen" mozallowfullscreen="mozallowfullscreen" allowFullScreen="allowFullScreen">[Video]</iframe></div>
                <span>How DeepSeek R1 and OpenAI o1 compare generally on logic problems.</span>
              
    
    <h2>Testing citations and reasoning</h2>
    
    <p>Following the release of DeepSeek R1 in January 2025, we wanted to examine its accuracy in generating citations and its quality of reasoning and compare it with OpenAI’s o1 model. We created a paragraph that had sentences from different sources, gave the models individual sentences from this paragraph, and asked for citations and reasoning. </p>
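    <p>The per-sentence protocol described above can be sketched in a few lines. This is an illustrative outline, not the benchmark's actual code; <code>query_model</code> is a hypothetical stand-in for a call to o1 or R1.</p>

```python
import re

def split_sentences(paragraph):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def build_prompt(sentence):
    """Ask the model for a citation plus its reasoning for ONE sentence."""
    return (
        "For the following sentence, cite the single most relevant research "
        "paper and explain why it supports the claim.\n"
        f"Sentence: {sentence}"
    )

def attribute_paragraph(paragraph, query_model):
    """Query the model once per sentence rather than once for the whole paragraph."""
    return [(s, query_model(build_prompt(s))) for s in split_sentences(paragraph)]
```

    <p>The contrast with paragraph-level attribution is the loop: each sentence gets its own query, so the model cannot blur several claims into one overgeneralized citation.</p>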
    
    <p>To start our test, we developed a small test bed of about 4,100 research articles on four key topics related to human brains and computer science: neurons and cognition, human-computer interaction, databases and artificial intelligence. We evaluated the models using two measures: F-1 score, which measures how accurate the provided citation is, and hallucination rate, which measures how sound the model’s reasoning is, that is, how often it <a href="https://theconversation.com/what-are-ai-hallucinations-why-ais-sometimes-make-things-up-242896" rel="nofollow external" class="bo">produces an inaccurate or misleading response</a>. </p>
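    <p>As a rough illustration of the two measures (not the benchmark's exact scoring code), citation F-1 and hallucination rate can be computed like this:</p>

```python
def citation_f1(predicted, gold):
    """F-1 score: harmonic mean of precision and recall over cited papers."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)            # correctly cited papers
    precision = tp / len(predicted)       # fraction of citations that are right
    recall = tp / len(gold)               # fraction of right citations found
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def hallucination_rate(judgements):
    """Fraction of responses judged inaccurate or misleading (True = hallucinated)."""
    return sum(judgements) / len(judgements) if judgements else 0.0
```

    <p>F-1 rewards a model only when it cites the right papers and avoids spurious ones; hallucination rate simply counts how often a human or automated judge flags the reasoning as wrong, so lower is better.</p>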
    
    <p>Our testing revealed <a href="https://doi.org/10.48550/arXiv.2405.02228" rel="nofollow external" class="bo">significant performance differences</a> between OpenAI o1 and DeepSeek R1 across different scientific domains. OpenAI’s o1 did well connecting information between different subjects, such as understanding how research on neurons and cognition connects to human-computer interaction and then to concepts in artificial intelligence, while remaining accurate. Its performance metrics consistently outpaced DeepSeek R1’s across all evaluation categories, especially in reducing hallucinations and successfully completing assigned tasks. </p>
    
    <p>OpenAI o1 was better at combining ideas semantically, whereas R1 focused on making sure it generated a response for every attribution task, which in turn increased hallucination during reasoning. OpenAI o1 had a hallucination rate of approximately 35% compared with DeepSeek R1’s rate of nearly 85% in the attribution-based reasoning task.</p>
    
    <p>In terms of accuracy and linguistic competence, OpenAI o1 scored about 0.65 on the F-1 test, which means it was right about 65% of the time when answering questions. It also scored about 0.70 on the BLEU test, which measures how well a language model writes in natural language. These are pretty good scores. </p>
    
    <p>DeepSeek R1 scored lower, with about 0.35 on the F-1 test, meaning it was right about 35% of the time. However, its BLEU score was only about 0.2, which means its writing wasn’t as natural-sounding as that of OpenAI’s o1. This shows that o1 was better at presenting information in clear, natural language.</p>
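    <p>For readers unfamiliar with BLEU, the idea behind the metric can be sketched from scratch. This simplified version uses only unigrams and bigrams (real BLEU typically goes up to 4-grams and averages over a corpus), but it shows the core mechanism: clipped n-gram precision against a reference text, discounted by a brevity penalty.</p>

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate, reference, max_n=2):
    """BLEU-style score: geometric mean of clipped n-gram precisions,
    times a brevity penalty for candidates shorter than the reference."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each n-gram's count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean
```

    <p>A score near 1.0 means the generated text closely matches the reference phrasing; a score near 0.2, like R1's, means much of the wording diverges from natural reference text.</p>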
    
    <h2>OpenAI holds the advantage</h2>
    
    <p>On other benchmarks, DeepSeek R1 <a href="https://doi.org/10.1038/d41586-025-00229-6" rel="nofollow external" class="bo">performs on par</a> with OpenAI o1 on math, coding and scientific reasoning tasks. But the substantial difference on our benchmark suggests that o1 provides more reliable information, while R1 struggles with factual consistency. </p>
    
    <p>Though we included other models in our comprehensive testing, the performance gap between o1 and R1 specifically highlights the current competitive landscape in AI development, with OpenAI’s offering maintaining a significant advantage in reasoning and knowledge integration capabilities.</p>
    
    <p>These results suggest that OpenAI still has a leg up when it comes to source attribution and reasoning, possibly due to the nature and volume of the data it was trained on. The company recently announced its <a href="https://doi.org/10.1038/d41586-025-00377-9" rel="nofollow external" class="bo">deep research tool</a>, which can create reports with citations, ask follow-up questions and provide reasoning for the generated response. </p>
    
    <p>The jury is still out on the tool’s value for researchers, but the caveat remains for everyone: Double-check all citations an AI gives you.</p>
    
      <p><span><a href="https://theconversation.com/profiles/manas-gaur-2312608" rel="nofollow external" class="bo">Manas Gaur</a>, Assistant Professor of Computer Science and Electrical Engineering, <em><a href="https://theconversation.com/institutions/university-of-maryland-baltimore-county-1667" rel="nofollow external" class="bo">University of Maryland, Baltimore County</a></em></span></p>
    
      <p>This article is republished from <a href="https://theconversation.com" rel="nofollow external" class="bo">The Conversation</a> under a Creative Commons license. Read the <a href="https://theconversation.com/popular-ais-head-to-head-openai-beats-deepseek-on-sentence-level-reasoning-249109" rel="nofollow external" class="bo">original article</a>.</p>
    </div></div>
]]>
  </Body>
  <Summary>DeepSeek’s language AI rocked the tech industry, but it comes up short on one measure. ChatGPT and other...</Summary>
  <Website>https://theconversation.com/popular-ais-head-to-head-openai-beats-deepseek-on-sentence-level-reasoning-249109</Website>
  <TrackingUrl>https://dev.my.umbc.edu/api/v0/pixel/news/149205/guest@my.umbc.edu/c453928d24259c1806390603f2878a73/api/pixel</TrackingUrl>
  <Tag>ai</Tag>
  <Tag>chatgpt</Tag>
  <Tag>deepseek</Tag>
  <Tag>large-language-model</Tag>
  <Tag>llm</Tag>
  <Tag>openai</Tag>
  <Group token="csee">Computer Science and Electrical Engineering</Group>
  <GroupUrl>https://dev.my.umbc.edu/groups/csee</GroupUrl>
  <AvatarUrl>https://assets3-dev.my.umbc.edu/system/shared/avatars/groups/000/000/099/d117dca133c64bf78a4b7696dd007189/xsmall.png?1314043393</AvatarUrl>
  <AvatarUrl size="original">https://assets1-dev.my.umbc.edu/system/shared/avatars/groups/000/000/099/d117dca133c64bf78a4b7696dd007189/original.png?1314043393</AvatarUrl>
  <AvatarUrl size="xxlarge">https://assets1-dev.my.umbc.edu/system/shared/avatars/groups/000/000/099/d117dca133c64bf78a4b7696dd007189/xxlarge.png?1314043393</AvatarUrl>
  <AvatarUrl size="xlarge">https://assets4-dev.my.umbc.edu/system/shared/avatars/groups/000/000/099/d117dca133c64bf78a4b7696dd007189/xlarge.png?1314043393</AvatarUrl>
  <AvatarUrl size="large">https://assets3-dev.my.umbc.edu/system/shared/avatars/groups/000/000/099/d117dca133c64bf78a4b7696dd007189/large.png?1314043393</AvatarUrl>
  <AvatarUrl size="medium">https://assets1-dev.my.umbc.edu/system/shared/avatars/groups/000/000/099/d117dca133c64bf78a4b7696dd007189/medium.png?1314043393</AvatarUrl>
  <AvatarUrl size="small">https://assets2-dev.my.umbc.edu/system/shared/avatars/groups/000/000/099/d117dca133c64bf78a4b7696dd007189/small.png?1314043393</AvatarUrl>
  <AvatarUrl size="xsmall">https://assets3-dev.my.umbc.edu/system/shared/avatars/groups/000/000/099/d117dca133c64bf78a4b7696dd007189/xsmall.png?1314043393</AvatarUrl>
  <AvatarUrl size="xxsmall">https://assets3-dev.my.umbc.edu/system/shared/avatars/groups/000/000/099/d117dca133c64bf78a4b7696dd007189/xxsmall.png?1314043393</AvatarUrl>
  <Sponsor>Computer Science and Electrical Engineering</Sponsor>
  <PawCount>0</PawCount>
  <CommentCount>0</CommentCount>
  <CommentsAllowed>true</CommentsAllowed>
  <PostedAt>Thu, 17 Apr 2025 16:28:01 -0400</PostedAt>
  <EditAt>Thu, 17 Apr 2025 16:44:15 -0400</EditAt>
</NewsItem>
</News>
