KAKA22/CodeRM-8B

Parameters: 8B
Tensor type: FP8
Context length: 32,768 tokens
Dec 25, 2024
License: apache-2.0

CodeRM-8B: Unit Test Generation Model

CodeRM-8B is an 8-billion-parameter model fine-tuned from Llama3.1-8B-Instruct with a 32,768-token context length. Its primary function is to generate high-quality Python unit tests, particularly for evaluating candidate code solutions. The model was trained on a specialized dataset of 60,000 synthetic Python unit tests, generated by Llama3.1-70B-Instruct from established code instruction tuning datasets such as CodeFeedback-Filtered-Instruction and TACO.
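
As a rough illustration, the snippet below shows how the model might be loaded and prompted with the transformers library. The prompt wording and generation settings are assumptions for this sketch, not taken from the model card; consult the repository for the exact format used during fine-tuning.

```python
# Minimal sketch: prompt CodeRM-8B for unittest-style tests via transformers.
# Prompt phrasing and decoding parameters below are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KAKA22/CodeRM-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

solution = '''
def add(a: int, b: int) -> int:
    return a + b
'''

messages = [
    {"role": "user", "content": (
        "Write Python unit tests using the unittest library for the following "
        "code solution:\n" + solution
    )},
]

# Build the chat-formatted prompt and generate the test code.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```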

Key Capabilities & Performance

  • Efficient Unit Test Generation: CodeRM-8B is optimized for generating Python test cases built on the standard unittest library for a given code solution.
  • Reward Modeling: The model performs strongly in a best-of-N reward modeling setup, where its generated unit tests are used to select the best code solution from multiple candidates (see the sketch after this list). Despite its smaller size, CodeRM-8B achieves results comparable to Llama3.1-70B-Instruct on benchmarks such as HumanEval Plus, MBPP Plus, and LiveCodeBench.
  • High-Quality Tests: Evaluations show that CodeRM-8B's generated unit tests achieve high accuracy and F1 scores, with competitive False Acceptance Rates (FAR) and False Rejection Rates (FRR) when distinguishing correct from incorrect code solutions.
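
To make the reward-modeling setup concrete, here is a minimal sketch of how generated unit tests could score candidate solutions in a best-of-N pipeline. The harness below (run_tests, best_of_n, and the exec-based execution) is a hypothetical illustration under simplifying assumptions, not the authors' evaluation code.

```python
# Illustrative best-of-N selection: score each candidate solution by how many
# model-generated unit tests it passes, then keep the top-scoring candidate.
# Helper names and the exec-based harness are assumptions for this sketch.
import unittest


def run_tests(solution_code: str, test_code: str) -> int:
    """Execute the candidate solution plus generated tests; return the passed-test count."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # load the candidate solution
        exec(test_code, namespace)      # load the generated unittest cases
    except Exception:
        return 0                        # candidates that fail to execute score zero

    # Collect every TestCase subclass defined by the generated test code.
    suite = unittest.TestSuite()
    loader = unittest.TestLoader()
    for obj in namespace.values():
        if isinstance(obj, type) and issubclass(obj, unittest.TestCase):
            suite.addTests(loader.loadTestsFromTestCase(obj))

    result = unittest.TestResult()
    suite.run(result)
    return result.testsRun - len(result.failures) - len(result.errors)


def best_of_n(candidates: list[str], test_code: str) -> str:
    """Return the candidate that passes the most generated unit tests."""
    return max(candidates, key=lambda code: run_tests(code, test_code))
```

In a real deployment the candidate code and tests would be executed in a sandbox rather than with exec in the host process.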

Use Cases

  • Automated Code Evaluation: Ideal for systems requiring automated generation of unit tests to validate code correctness.
  • Code Reward Modeling: Can be integrated into larger systems where unit tests act as a reward signal for selecting the best code solutions from multiple candidates.
  • Developer Tooling: Helps developers quickly generate comprehensive unit tests for their Python functions.