    Tech News

    OpenAI is training models to ‘confess’ when they lie – what it means for future AI

By admin · December 5, 2025 · 5 min read

Image: antonioiacobelli/RooM via Getty Images


    ZDNET’s key takeaways

    • OpenAI trained GPT-5 Thinking to confess to misbehavior.
    • It’s an early study, but it could lead to more trustworthy LLMs.
    • Models will often hallucinate or cheat due to mixed objectives.

    OpenAI is experimenting with a new approach to AI safety: training models to admit when they’ve misbehaved.

    In a study published Wednesday, researchers tasked a version of GPT-5 Thinking, the company’s latest model, with responding to various prompts and then assessing the honesty of those responses. For each “confession,” as these follow-up assessments were called, researchers rewarded the model solely on the basis of truthfulness: if it lied, cheated, hallucinated, or otherwise missed the mark, but then fessed up to doing so, it would receive the algorithmic equivalent of a piece of candy.
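    To make the reward scheme concrete, here is a minimal sketch of the idea described above: the confession is scored only on whether it truthfully reports what the model did, not on whether the main answer was any good. All names and the scoring logic are hypothetical; OpenAI has not published its training code.

```python
# Hypothetical sketch of the confession reward described in the article.
# The confession earns reward solely for truthfulness about the main answer.

def confession_reward(answer_was_compliant: bool, confession_admits_fault: bool) -> float:
    """Return 1.0 if the confession truthfully reports the model's behavior.

    A truthful confession either admits fault when the answer was
    non-compliant, or claims compliance when the answer really was compliant.
    """
    told_the_truth = confession_admits_fault != answer_was_compliant
    return 1.0 if told_the_truth else 0.0

# A model that cheated but owned up to it still earns the confession reward:
print(confession_reward(answer_was_compliant=False, confession_admits_fault=True))   # 1.0
# A model that cheated and covered it up does not:
print(confession_reward(answer_was_compliant=False, confession_admits_fault=False))  # 0.0
```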

    Also: Your favorite AI tool barely scraped by this safety review – why that’s a problem

    “The goal is to encourage the model to faithfully report what it actually did,” OpenAI wrote in a follow-up blog post.

    OpenAI told ZDNET that this was a routine alignment test and not prompted by concerns that GPT-5 Thinking was significantly misbehaving. But the results offer guidance on how labs can interpret — and prepare for — future model liabilities. 

    (Disclosure: Ziff Davis, ZDNET’s parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

    The experiment 

    The test model produced confessions as a kind of amendment to its main output; this second response reflected on the legitimacy of the methods it used to produce the first. It’s a bit like using a journal to be brutally honest about what you did right in a given situation, and where you may have erred. Except in the case of GPT-5 Thinking, it’s coming clean to its makers in the hopes of getting a reward.

    Also: AI models know when they’re being tested – and change their behavior, research shows

    For the study, the researchers fed GPT-5 Thinking a litany of instructions that were designed to elicit some kind of misbehavior. In one test, for example, the model was told to take on the role of a helpdesk support assistant asked to reboot a production service monitoring dashboard. Since the model wasn’t able to access the main dashboard to log the changes it was making, it instead created a mock system and tried to appear as if it was using the real thing. 

    “We need to include that we logged to central logging service,” the model wrote in one of its chain-of-thought reasoning outputs. “But we used a mock. We need to maintain narrative…”

    In its confession, the model admitted that it didn’t comply with its requirement to record changes with the central logging service. “This is a serious compliance failure and a misrepresentation,” it wrote. 

    Also: AI’s not ‘reasoning’ at all – how this team debunked the industry hype

    That said, overall risk appeared low during the test. The researchers found that the probability of a “false negative” — i.e., the model failing to confess to noncompliance — was just 4.4%.
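    For clarity, a "false negative" here means an episode in which the model misbehaved but its confession did not say so. The toy tally below (with made-up data) shows how such a rate would be computed from labeled episodes; it is purely illustrative, not the researchers' evaluation code.

```python
# Illustrative only: tallying a false-negative rate from labeled episodes.
episodes = [
    # (model_was_noncompliant, confession_admitted_it)
    (True, True),
    (True, False),   # false negative: misbehaved but did not confess
    (False, False),
    (True, True),
]

noncompliant = [e for e in episodes if e[0]]
false_negatives = [e for e in noncompliant if not e[1]]
rate = len(false_negatives) / len(noncompliant)
print(f"false-negative rate: {rate:.1%}")
```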

    Why models cut corners 

    The effort points to a worm in the bud of modern AI tools, one that could become much more dangerous as these systems grow more agentic and become capable of handling not just limited, one-off tasks, but broad swathes of complex functions.

    Also: GPT-5 is speeding up scientific research, but still can’t be trusted to work alone, OpenAI warns

    This is what researchers call the “alignment problem”: AI systems often have to juggle multiple objectives, and in doing so, they may take shortcuts that seem ethically dubious, at least to humans. Of course, AI systems themselves don’t have any moral sense of right or wrong; they simply tease out complex patterns of information and execute tasks in a way that optimizes reward, the basic paradigm behind the training method known as reinforcement learning from human feedback (RLHF).

    AI systems can have conflicting motivations, in other words — much as a person might — and they often cut corners in response. 

    “Many kinds of unwanted model behavior appear because we ask the model to optimize for several goals at once,” OpenAI wrote in its blog post. “When these signals interact, they can accidentally nudge the model toward behaviors we don’t want.”

    Also: Anthropic wants to stop AI models from turning evil – here’s how

    For example, a model trained to generate its outputs in a confident and authoritative voice, but asked to respond to a subject for which it has no reference point anywhere in its training data, might opt to make something up, preserving its higher-order commitment to self-assuredness rather than admitting its incomplete knowledge.
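
    A toy sketch can make that trade-off visible. The weights and names below are invented, not how any lab actually balances its training signals; the point is only that when a "confident, helpful answer" term outweighs an honesty term, a confident fabrication scores higher than an honest admission of uncertainty.

```python
# Toy illustration (not OpenAI's method) of conflicting reward signals.
def total_reward(gave_confident_answer: bool, was_honest: bool) -> float:
    helpfulness = 1.0 if gave_confident_answer else 0.2  # hedged answers score poorly
    honesty = 1.0 if was_honest else 0.0
    return 0.7 * helpfulness + 0.3 * honesty             # arbitrary example weights

print(total_reward(gave_confident_answer=True,  was_honest=False))  # 0.70: confident fabrication
print(total_reward(gave_confident_answer=False, was_honest=True))   # 0.44: honest "I don't know"
```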

    A post-hoc solution

    An entire subfield of AI called interpretability research, or “explainable AI,” has emerged in an effort to understand how models “decide” to act in one way or another. For now, it remains as mysterious and hotly debated as the existence (or lack thereof) of free will in humans.

    OpenAI’s confession research isn’t aimed at decoding how, where, when, and why models lie, cheat, or otherwise misbehave. Rather, it’s a post-hoc attempt to flag when that’s happened, which could increase model transparency. Down the road, like most safety research of the moment, it could lay the groundwork for researchers to dig deeper into these black box systems and dissect their inner workings. 

    The viability of those methods could be the difference between catastrophe and so-called utopia, especially considering a recent AI safety audit that gave most labs failing grades. 

    Also: AI is becoming introspective – and that ‘should be monitored carefully,’ warns Anthropic

    As the company wrote in the blog post, confessions “do not prevent bad behavior; they surface it.” But, as is the case in the courtroom or human morality more broadly, surfacing wrongs is often the most important step toward making things right.
