Training language models to follow instructions with human feedback - Summary
Blog post from Portkey
The paper describes a method for aligning language models with user intent: supervised fine-tuning on labeler-written demonstrations, followed by reinforcement learning from human feedback (RLHF) against a reward model trained on labeler preference rankings. The resulting models, known as InstructGPT, show gains in truthfulness and a reduction in toxic output generation, with only minimal performance regressions on public NLP datasets. The study highlights the mismatch between the standard language modeling objective (predicting the next token) and the goal of following user instructions, and emphasizes that public NLP datasets do not accurately represent real-world usage of language models. Notably, labelers prefer outputs from the 1.3B-parameter InstructGPT model over those of the 175B-parameter GPT-3, despite it having over 100x fewer parameters, showcasing the potential of human feedback for improving language model alignment.
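To make the reward-model stage concrete, here is a minimal sketch of the pairwise preference objective typically used in RLHF: the reward model is trained so that a labeler-preferred completion scores higher than a rejected one for the same prompt. This is not code from the paper; `reward_model`, `chosen_ids`, and `rejected_ids` are hypothetical names for a scalar-reward module and tokenized (prompt + completion) inputs.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise preference loss: -log sigmoid(r(x, y_w) - r(x, y_l)).

    reward_model maps a batch of token-id sequences to one scalar reward each;
    chosen_ids holds labeler-preferred completions, rejected_ids the others.
    """
    r_chosen = reward_model(chosen_ids)      # reward for preferred completions
    r_rejected = reward_model(rejected_ids)  # reward for rejected completions
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

A reward model trained this way then supplies the scalar signal that the policy model is optimized against (with PPO in the paper) in the final stage.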