A comprehensive benchmark to evaluate and improve the fundamental numerical reasoning abilities of large language models using diverse synthetic and real-world datasets.