Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling


This paper has been accepted at the Data Problems for Foundation Models workshop at ICLR 2024.

Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP), which uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format" to jointly pre-train LLMs on real and synthetic rephrases. First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by ~3x. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answering accuracy across 13 tasks by more than 2%. Second, we investigate the impact of the rephrasing style on the performance of the model, offering insights into how the composition of the training data can impact the performance of LLMs in OOD settings. Our gains are attributed to the fact that rephrased synthetic data (i) incorporates style diversity that closely reflects downstream evaluation style, and (ii) has higher "quality" than web-scraped data.
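The core data pipeline described above can be sketched in a few lines: format each web document into a style-specific rephrasing prompt for an instruction-tuned model, then interleave the real documents with their synthetic rephrases for joint pre-training. This is a minimal illustrative sketch; the style names follow the paper's examples, but the exact prompt wording, the `build_rephrase_prompt` and `interleave` helpers, and the 1:1 mixing scheme are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a WRAP-style data pipeline. The prompt text and
# the real/synthetic mixing strategy here are illustrative assumptions.

REPHRASE_STYLES = {
    # Paraphrase styles named in the paper; prompt wording is illustrative.
    "wikipedia": "Rewrite the following passage in the style of a Wikipedia article:",
    "qa": "Convert the following passage into a question-answer format:",
}

def build_rephrase_prompt(document: str, style: str) -> str:
    """Format one web document into a prompt for the rephrasing model."""
    return f"{REPHRASE_STYLES[style]}\n\n{document}"

def interleave(real_docs: list[str], synthetic_docs: list[str]) -> list[str]:
    """Alternate real and synthetic documents for joint pre-training.

    A simple 1:1 interleaving; any leftover documents from the longer
    list are appended at the end.
    """
    mixed: list[str] = []
    for real, synthetic in zip(real_docs, synthetic_docs):
        mixed.extend([real, synthetic])
    shorter = min(len(real_docs), len(synthetic_docs))
    longer_list = real_docs if len(real_docs) > len(synthetic_docs) else synthetic_docs
    mixed.extend(longer_list[shorter:])
    return mixed
```

In practice the prompts would be sent to an instruction-tuned model (e.g. via a generation API) to produce the synthetic rephrases; that step is omitted here to keep the sketch self-contained.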
