Reward Hacking Mitigation using Verifiable Composite Rewards

Publisher

Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB ’25)

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has recently shown that large language models (LLMs) can develop their own reasoning without direct supervision. However, applications in the medical domain, specifically for question answering, are susceptible to significant reward hacking during the reasoning phase. Our work addresses two primary forms of this behavior: i) providing a final answer without preceding reasoning, and ii) employing non-standard reasoning formats to exploit the reward mechanism. To mitigate these, we introduce a composite reward function with specific penalties for these behaviors. Our experiments show that utilizing RLVR with our proposed reward model leads to better-formatted reasoning with less reward hacking and good accuracy compared to the baselines. This approach marks a step toward reducing reward hacking and enhancing the reliability of models utilizing RLVR.
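The idea of a composite reward with penalties for the two behaviors named in the abstract can be illustrated with a minimal sketch. This is not the authors' reward function: the `<think>`/`<answer>` tag format, the function name `composite_reward`, the minimum reasoning length, and all weights and penalty values are hypothetical choices made only for illustration.

```python
import re

# Assumed (hypothetical) output format: reasoning in <think>...</think>,
# followed by the final answer in <answer>...</answer>.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
FULL_RE = re.compile(r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*", re.DOTALL)


def composite_reward(completion, gold, min_reasoning_words=5,
                     w_correct=1.0, p_no_reasoning=0.5, p_bad_format=0.5):
    """Correctness reward minus penalties for the two hacking behaviors.

    All weights and thresholds are illustrative assumptions.
    """
    # Verifiable correctness term: exact match on the extracted answer.
    answer_match = ANSWER_RE.search(completion)
    predicted = answer_match.group(1).strip() if answer_match else ""
    correct = predicted != "" and predicted.upper() == gold.strip().upper()
    reward = w_correct if correct else 0.0

    # Penalty (i): a final answer with no (or almost no) preceding reasoning.
    think_match = THINK_RE.search(completion)
    reasoning = think_match.group(1) if think_match else ""
    if len(reasoning.split()) < min_reasoning_words:
        reward -= p_no_reasoning

    # Penalty (ii): output that does not follow the expected overall format.
    if FULL_RE.fullmatch(completion) is None:
        reward -= p_bad_format

    return reward


good = ("<think>Fever and a new murmur in an IV drug user suggest "
        "endocarditis.</think><answer>B</answer>")
shortcut = "<answer>B</answer>"

print(composite_reward(good, "B"))      # correct, with reasoning, well formatted
print(composite_reward(shortcut, "B"))  # correct, but both penalties apply
```

Under these assumed values, a correct answer that skips the reasoning earns no more than a wrong but well-formatted one, which is the kind of incentive such penalties are meant to create.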

Description

This article was originally published in Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. The version of record is available at: https://doi.org/10.1145/3765612.3767230. This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). BCB ’25, Philadelphia, PA, USA. ©2025 Copyright held by the owner/author(s).

Citation

Mirza Farhan Bin Tarek and Rahmatollah Beheshti. 2025. Reward Hacking Mitigation using Verifiable Composite Rewards. In Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB ’25), October 11–15, 2025, Philadelphia, PA, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3765612.3767230

Creative Commons license

Except where otherwise noted, this item's license is described as Attribution 4.0 International