understanding-rlhf

Learning from preferences is a common paradigm for fine-tuning language models. Yet, many algorithmic design decisions come into play. Our new work finds that approaches employing on-policy sampling or negative gradients outperform offline, maximum likelihood objectives.
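To make the contrast in the description concrete, the sketch below compares a maximum likelihood objective on the preferred response with a contrastive objective that also carries a negative gradient on the rejected response. It is only an illustration and is not taken from the repository: the function names, the scalar log-probabilities, the `beta` parameter, and the simplified logistic loss (no reference model) are all assumptions.

```python
import math


def mle_loss(logp_chosen: float) -> float:
    """Maximum likelihood on the preferred response only (hypothetical helper).

    The rejected response never appears, so it receives no gradient.
    """
    return -logp_chosen


def contrastive_loss(logp_chosen: float, logp_rejected: float, beta: float = 1.0) -> float:
    """Logistic loss on the log-probability margin (assumed, simplified form)."""
    margin = beta * (logp_chosen - logp_rejected)
    return math.log1p(math.exp(-margin))


def contrastive_grads(logp_chosen: float, logp_rejected: float, beta: float = 1.0):
    """Gradients of contrastive_loss with respect to the two log-probabilities.

    Returns (d_loss/d_logp_chosen, d_loss/d_logp_rejected).
    Not hardened against overflow for very large margins.
    """
    margin = beta * (logp_chosen - logp_rejected)
    weight = beta / (1.0 + math.exp(margin))  # beta * sigmoid(-margin)
    # Gradient descent raises logp_chosen (negative gradient entry) and
    # lowers logp_rejected (positive gradient entry): the "negative gradient".
    return -weight, weight


if __name__ == "__main__":
    g_chosen, g_rejected = contrastive_grads(-1.0, -1.0)
    print("MLE loss:", mle_loss(-1.0))
    print("contrastive loss:", contrastive_loss(-1.0, -1.0))
    print("contrastive grads:", g_chosen, g_rejected)
```

On-policy sampling, the other ingredient named in the description, would mean drawing the compared responses from the current model rather than from a fixed dataset; that part is not shown here.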

Last push: 26mo ago

Recent commits

Update README.md (9d5844e, Anikait Singh, 26mo ago)
license (2c0823f, Anikait Singh, 26mo ago)
llm experiments (1a24ca8, Anikait Singh, 26mo ago)
bandit experiments (6af8169, Anikait Singh, 26mo ago)

Top contributors

Asap7772 (4 commits)