
Why Pay More? Open-Source LLMs Are Catching Up to the AI Giants

5/19/24

Editorial team at Bits with Brains

Proprietary models like GPT-4 have traditionally been considered the gold standard for high-quality outputs, but open-source LLMs like WizardLM-2 continue to close the gap.

Proprietary models often benefit from extensive resources and advanced training techniques, enabling them to excel in tasks requiring precise instruction-following and large context windows. However, open-source LLMs are catching up, particularly in summarization and fine-tuning. For instance, models like Llama3 70B have shown competitive performance in these areas, beginning to challenge the dominance of proprietary models.


One key difference between open and proprietary models is the size of the context window, which is the amount of text or data the model can consider at one time when generating responses. This is effectively the model's "memory span" for a given interaction. A larger context window allows the model to process and retain more information, generating more coherent and contextually relevant outputs over longer conversations or documents.
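
To make this concrete, here is a minimal Python sketch of how an application might trim chat history to fit a fixed context window. The 1.33-tokens-per-word ratio is a rough rule of thumb, not a real tokenizer, and the 16K budget is just an example value.

```python
# Minimal sketch: keep only as much chat history as fits a model's
# context window. Token counts are approximated (~1.33 tokens per
# word), which is a crude heuristic, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    # Rough approximation: 1 word ~= 1.33 tokens on average.
    return int(len(text.split()) * 1.33)

def trim_history(messages: list[str], max_tokens: int = 16_000) -> list[str]:
    """Drop the oldest messages until the remainder fits the window."""
    kept: list[str] = []
    budget = max_tokens
    # Walk backwards so the most recent turns are preserved intact.
    for msg in reversed(messages):
        cost = estimate_tokens(msg)
        if cost > budget:
            break  # stop here to keep the retained history contiguous
        kept.append(msg)
        budget -= cost
    return list(reversed(kept))

history = ["turn 1 ...", "turn 2 ...", "turn 3 ..."]
print(trim_history(history, max_tokens=16_000))
```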


Another critical difference is their ability to follow instructions precisely. Proprietary models often incorporate reinforcement learning from human feedback, improving how faithfully they follow explicit instructions. In contrast, open LLMs may require additional fine-tuning to reach similar precision. However, innovative approaches, such as using task-specific LLMs to instruct other LLMs, have been proposed to bridge this gap.
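
As a rough illustration of that instructor pattern, the sketch below has a small model rewrite a vague request into an explicit instruction before a larger model executes it. It assumes an OpenAI-compatible endpoint (OpenRouter here) and the model IDs shown; verify both against your provider before use.

```python
# Sketch of the "task-specific LLM instructs another LLM" idea:
# a small model first makes a vague request explicit, then a larger
# model executes the sharpened instruction.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

def chat(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

request = "summarize our Q3 numbers"
# Step 1: a small instructor model makes the task precise.
instruction = chat(
    "microsoft/wizardlm-2-7b",  # assumed model ID; check your provider
    f"Rewrite this request as one explicit, step-by-step instruction "
    f"for another assistant:\n{request}",
)
# Step 2: a stronger worker model follows the sharpened instruction.
answer = chat("microsoft/wizardlm-2-8x22b", instruction)
print(answer)
```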


WizardLM is a series of open-source large language models developed by Microsoft that demonstrate state-of-the-art performance on complex tasks. The latest version, WizardLM-2, released in April 2024, includes three models: WizardLM-2 8x22B, WizardLM-2 70B, and WizardLM-2 7B.


WizardLM-2 8x22B, a mixture-of-experts model, is the most advanced of the three, rivaling the performance of leading proprietary models like GPT-4 on highly complex tasks. WizardLM-2 70B offers top-tier reasoning capabilities, while the smaller WizardLM-2 7B is the fastest and achieves performance comparable to open-source models ten times its size.
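
The "8x22B" naming reflects the mixture-of-experts design: a learned router activates only a few experts per token, so the compute per token is far lower than the total parameter count suggests. The toy NumPy sketch below shows the routing idea only; real MoE layers differ in many details.

```python
# Toy sketch of mixture-of-experts routing: a router scores each
# expert per token and only the top-k experts run, so active
# parameters per token stay far below the full count.
# Shapes and values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

x = rng.normal(size=d_model)                       # one token's hidden state
router = rng.normal(size=(d_model, n_experts))     # learned routing weights
experts = rng.normal(size=(n_experts, d_model, d_model))  # toy expert FFNs

logits = x @ router
chosen = np.argsort(logits)[-top_k:]               # pick the top-k experts
weights = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()  # softmax

# Output is the gate-weighted sum of only the chosen experts' outputs.
y = sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))
print(f"routed token to experts {chosen.tolist()} with weights {weights.round(2)}")
```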


WizardLM uses a novel training method called Evol-Instruct. Starting from an initial set of instructions, Evol-Instruct rewrites them step-by-step into more complex ones using operations like adding constraints, deepening, concretizing, increasing reasoning steps, and complicating input. This lets AI, rather than manual human effort, generate large amounts of high-quality instruction data across difficulty levels. The instruction expansions are generated using models like ChatGPT.
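
A minimal sketch of a single evolution step might look like the following. The prompt wording is an illustrative paraphrase of the named operations, not the paper's exact template, and the endpoint/model ID are assumptions.

```python
# One Evol-Instruct-style step: an LLM rewrites a seed instruction
# under a randomly chosen evolution operation. Prompt wording is
# illustrative, not the paper's exact template.
import random
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

OPERATIONS = [
    "add one more constraint or requirement",
    "deepen: increase the depth and breadth of the inquiry",
    "concretize: replace general concepts with more specific ones",
    "increase the number of reasoning steps required",
    "complicate the input with extra data the answer must use",
]

def evolve(instruction: str, model: str = "openai/gpt-3.5-turbo") -> str:
    op = random.choice(OPERATIONS)
    prompt = (
        f"Rewrite the instruction below into a more complex version. "
        f"Apply this operation: {op}. Keep it answerable by a human.\n\n"
        f"Instruction: {instruction}"
    )
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

seed = "Explain how photosynthesis works."
for _ in range(3):  # evolve step-by-step, as Evol-Instruct does
    seed = evolve(seed)
print(seed)
```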


Microsoft built a fully AI-powered synthetic training system that uses progressive learning to train WizardLM-2. Its "AI Align AI" (AAA) framework fosters collaborative learning among cutting-edge LLMs, which enhance one another's capabilities through simulated interactions and peer learning.
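
Microsoft has described AAA only at a high level, so the following is a speculative sketch of what one "peer learning" round could look like, not the actual pipeline: one model drafts an answer, a peer critiques it, and the revised answer is kept as synthetic training data. All prompts, model choices, and the helper are assumptions.

```python
# Speculative illustration of peer learning between two LLMs:
# draft -> peer critique -> revision, with the improved pair kept
# as one synthetic training example. Not Microsoft's actual system.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

def chat(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def peer_learning_round(task: str, student: str, peer: str) -> dict:
    draft = chat(student, task)
    critique = chat(peer, f"Critique this answer to '{task}':\n{draft}")
    revised = chat(
        student,
        f"Task: {task}\nYour draft: {draft}\n"
        f"Peer feedback: {critique}\nWrite an improved answer.",
    )
    # The (task, revised) pair becomes one synthetic training example.
    return {"instruction": task, "output": revised}
```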


In human evaluations on a complexity-balanced test set, WizardLM's outputs were preferred over OpenAI's ChatGPT on high-complexity instructions, demonstrating Evol-Instruct's ability to improve LLMs' handling of complex tasks. A human-preference evaluation found WizardLM-2's capabilities very close to cutting-edge models like GPT-4 and significantly ahead of other open-source models. In GPT-4 automatic evaluations, WizardLM achieves more than 90% of ChatGPT's performance on 17 of 29 skills.


On other benchmarks, WizardLM-2 8x22B is highly competitive with advanced proprietary models like GPT-4 and Claude 3. WizardLM-2 7B and 70B are currently the top-performing models in the 7B-70B size range, and they are small enough to run locally on many workstations.


WizardLM excels at complex reasoning, knowledge-based question answering, and mathematical problem-solving. Its strong performance on coding tasks also makes it well-suited for assisting software developers. The context window is limited to 16K tokens, or around 12K words, but this is adequate for many use cases (see the API sketch after this list), including:

  • Complex data analysis and insight generation

  • Developing virtual assistants and chatbots

  • Automating customer support

  • Intelligent code generation and bug fixing

  • Advanced text summarization and report writing

  • Creative writing aid for storytelling and worldbuilding

  • Solving complex mathematical and scientific problems
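
As a starting point for one of those use cases, here is a hedged sketch of a summarization call against WizardLM-2 8x22B through an OpenAI-compatible endpoint, with a crude guard for the 16K-token window. The model ID comes from the OpenRouter listing in the sources below; check availability and pricing before relying on it.

```python
# Hedged sketch of one listed use case: text summarization via an
# OpenAI-compatible endpoint, with a rough context-window guard.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

MAX_TOKENS = 16_000  # WizardLM-2's context window

def summarize(document: str) -> str:
    # Rough guard: ~1.33 tokens per word; chunk upstream if this trips.
    if len(document.split()) * 1.33 > MAX_TOKENS:
        raise ValueError("document exceeds the 16K-token context window")
    resp = client.chat.completions.create(
        model="microsoft/wizardlm-2-8x22b",
        messages=[{"role": "user",
                   "content": f"Summarize the key points:\n\n{document}"}],
    )
    return resp.choices[0].message.content

print(summarize("Open-source LLMs are closing the gap with proprietary models..."))
```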

One of the most compelling arguments for open LLMs is their cost-effectiveness. Open-source models are not just marginally cheaper but can be radically more affordable than proprietary models. This cost advantage makes certain applications economically feasible that would otherwise be financially prohibitive using proprietary models. For example, fine-tuning open-source models can be significantly less expensive, allowing organizations to customize models to their specific needs without incurring high costs.
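
The sketch below illustrates why fine-tuning can be so much cheaper with open weights: a LoRA adapter trains well under 1% of the parameters instead of the full model. It uses the Hugging Face transformers and peft libraries; the model ID is an assumption (the WizardLM-2 weights have moved around since release, so look for a current mirror), and the hyperparameters are illustrative, not a tuned recipe.

```python
# Sketch of parameter-efficient fine-tuning with LoRA: only small
# adapter matrices are trained, so GPU cost drops sharply.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed model ID; substitute a current WizardLM-2 mirror.
base = AutoModelForCausalLM.from_pretrained("microsoft/WizardLM-2-7B")

config = LoraConfig(
    r=16,                                 # adapter rank: size/quality knob
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base
```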


Further, open-source models offer transparency into their underlying code, enabling organizations to inspect and validate the model's functionality. This transparency fosters innovation by allowing developers to experiment and build upon existing models. Additionally, the active community support surrounding open-source projects provides a collaborative environment for problem-solving and quicker issue resolution.


These factors are contributing to the growing popularity of open LLMs, particularly among startups and smaller organizations with limited budgets and those keen to maximize their margins. Entire communities of researchers and developers are actively working on improving the quality of open LLMs, expanding their context windows, and integrating function templates. These advancements are expected to bring open LLMs on par with, or possibly surpass, their proprietary counterparts in certain aspects.


Moreover, the use of synthetic data for training LLMs is likely to become more prevalent as human-generated data becomes increasingly hard to obtain. This approach has already shown promising results, and further refinement could lead to even more powerful models. Additionally, the hybrid approach of using both open-source and proprietary LLMs can provide a good cost/performance tradeoff, allowing organizations to leverage the strengths of both types of models.


Sources:

[1] "WizardLM 2 - First Open Model Outperforming GPT-4," https://youtu.be/OvzSyaN3zKg?si=bAhE-navYYX8VmD9

[2] https://favtutor.com/articles/wizardlm-2-benchmarks/

[3] https://deepinfra.com/microsoft/WizardLM-2-7B

[4] https://openrouter.ai/models/microsoft/wizardlm-2-8x22b

[5] https://agi-sphere.com/wizardlm/

[6] https://www.ankursnewsletter.com/p/the-impact-of-wizard-and-falcon-on

[7] https://huggingface.co/WizardLM

[8] https://huggingface.co/posts/WizardLM/329547800484476

[9] https://sapling.ai/llm/llama3-vs-wizard

[10] https://ai.plainenglish.io/wizardlm-large-pre-trained-language-models-to-follow-complex-instructions-0004337de34e?gi=39c33735b5c1

[11] https://www.marktechpost.com/2024/04/16/wizardlm-2-an-open-source-ai-model-that-claims-to-outperform-gpt-4-in-the-mt-bench-benchmark/

[12] https://wizardlm.github.io/WizardLM2/

[13] https://github.com/nlpxucan/WizardLM

[14] https://www.microsoft.com/en-us/research/publication/wizardlm-empowering-large-language-models-to-follow-complex-instructions/

[15] https://www.reddit.com/r/LocalLLaMA/comments/1c9s4mf/wizardlm28x22b_seems_to_be_the_strongest_open_llm/

[16] https://openreview.net/forum?id=CfXh93NDgH

[17] https://github.com/nlpxucan/WizardLM/blob/main/WizardCoder/README.md

[18] https://betterprogramming.pub/fine-tuning-my-first-wizardlm-lora-ca75aa35363d

[19] https://ar5iv.labs.arxiv.org/html/2304.12244

[20] https://www.reddit.com/r/LocalLLaMA/comments/1c554ot/i_hope_everybody_grabbed_the_new_wizardlm_models/

