Reward Hacking Mitigation using Verifiable Composite Rewards
Date
2025-10-11
Publisher
Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has recently shown that large language models (LLMs) can develop their own reasoning without direct supervision. However, applications in the medical domain, specifically for question answering, are susceptible to significant reward hacking during the reasoning phase. Our work addresses two primary forms of this behavior: i) providing a final answer without preceding reasoning, and ii) employing non-standard reasoning formats to exploit the reward mechanism. To mitigate these, we introduce a composite reward function with specific penalties for these behaviors. Our experiments show that using RLVR with our proposed reward model leads to better-formatted reasoning with less reward hacking and good accuracy compared to the baselines. This approach marks a step toward reducing reward hacking and enhancing the reliability of models that use RLVR.
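The composite reward idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual reward terms, so everything below is an assumption made for illustration: the `<think>`/`<answer>` tag format, the function name `composite_reward`, the penalty weights, and the minimum reasoning length are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical expected format: reasoning in <think> tags, then the final
# answer in <answer> tags. The paper's real format may differ.
EXPECTED_FORMAT = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def composite_reward(completion, gold_answer, min_reasoning_words=5,
                     no_reasoning_penalty=0.5, bad_format_penalty=0.5):
    """Return a scalar reward for one completion (illustrative weights only)."""
    match = EXPECTED_FORMAT.match(completion)
    if match is None:
        # Behavior (ii): non-standard format. In this sketch the completion
        # gets the format penalty and no correctness credit.
        return -bad_format_penalty

    reasoning = match.group(1).strip()
    answer = match.group(2).strip()

    # Verifiable correctness term: case-insensitive exact match.
    reward = 1.0 if answer.lower() == gold_answer.strip().lower() else 0.0

    # Behavior (i): a final answer with little or no preceding reasoning.
    if len(reasoning.split()) < min_reasoning_words:
        reward -= no_reasoning_penalty

    return reward
```

In this sketch a correct answer with no reasoning earns less than a correct answer with reasoning, and an unparseable completion earns a negative reward, so neither shortcut is the most rewarding strategy.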
Description
This article was originally published in Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. The version of record is available at: https://doi.org/10.1145/3765612.3767230
This work is licensed under a Creative Commons Attribution 4.0 International License: https://creativecommons.org/licenses/by/4.0/
BCB ’25, Philadelphia, PA, USA. © 2025 Copyright held by the owner/author(s).
Keywords
Reinforcement Learning, Reward Hacking, Large Language Models
Citation
Mirza Farhan Bin Tarek and Rahmatollah Beheshti. 2025. Reward Hacking Mitigation using Verifiable Composite Rewards. In Proceedings of the 16th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (BCB ’25), October 11–15, 2025, Philadelphia, PA, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3765612.3767230
