
How to Leverage Open-Source LLMs in Your Project

Practical Advice from Experts: Fine-Tuning, Deployment, and Best Practices

In the previous deep dive, we explored the leading open-source LLMs: LLaMA, Falcon, and Llama 2, as well as their chatbot counterparts: Falcon-40B-Instruct, Llama 2-Chat, and FreeWilly 2. Now the question is: how can you integrate these impressive models into your projects?

To gather varied insights, we asked practitioners who work with LLMs daily how to effectively utilize existing models, fine-tune and deploy them efficiently, and avoid common mistakes and obstacles. You will get perspectives from:

  1. Edward Beeching: co-creator of the Hugging Face Open LLM Leaderboard and a machine learning research scientist at Hugging Face;

  2. Rajiv Shah, a machine learning engineer at Hugging Face;

  3. Aniket Maurya, a developer advocate at Lightning AI;

  4. Lianmin Zheng, a Ph.D. student at UC Berkeley, one of the contributors to Vicuna;

  5. Devis Lucato, a principal architect at Microsoft working on Semantic Kernel.

Highlights

Fine-tuning

When you’re thinking about fine-tuning an LLM, the most common advice is to first determine whether you actually need to fine-tune or whether an existing model will do. Using an existing model can save you significant time and resources.
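For many tasks, a ready-made model from the Hugging Face Hub is enough. Here is a minimal sketch; the task and default model are illustrative, so browse the Hub for one that matches your use case:

```python
# A minimal sketch of reusing an existing model instead of fine-tuning one.
# The transformers pipeline pulls a ready-made model from the Hugging Face Hub.
from transformers import pipeline

# "sentiment-analysis" is illustrative; pick the task and model that fit your project.
classifier = pipeline("sentiment-analysis")
print(classifier("Open-source LLMs saved us weeks of work!"))
# -> [{'label': 'POSITIVE', 'score': 0.99...}]
```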

If fine-tuning is necessary, there are several points to keep in mind:

  • It is crucial to have a deep understanding of the dataset you will use for fine-tuning, including its nuances and biases.

  • The most frequent problem is fitting the model on a single GPU. That’s why it’s important to use parameter-efficient fine-tuning techniques like Low-Rank Adaptation of Large Language Models (LoRA) and LLM Adapters. Also, use low precision such as bfloat16 (Brain Floating Point, bf16), or the 4-bit precision introduced in the QLoRA paper. The Parameter-Efficient Fine-Tuning (PEFT) package is a good starting point (see the sketch after this list).

  • Play around with the LLM configurations, like max sequence length and temperature, and check how they affect generation quality. Then learn why in more detail.
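To make the parameter-efficient point concrete, here is a minimal LoRA setup with the PEFT package; the model name, rank, and target modules are illustrative assumptions, not prescriptions:

```python
# A minimal sketch of a LoRA fine-tuning setup with the PEFT package.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative model; any causal LM from the Hub works (Llama 2 is gated and needs access).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, common for Llama-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of all parameters
```

Only the small adapter matrices are trained; the base weights stay frozen, which is what lets the job fit on a single GPU.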

Advice for beginners:
  • Know which LLM is well suited to your particular task.

  • Start small.

  • Engage with the community to stay updated on the latest developments and best practices.

  • Experiment with various techniques, hyperparameters, and datasets to gain a deeper understanding of the model's behavior and performance.

  • Document your work.

Now, let’s dive deeper into the insights from each expert!

Edward Beeching, a Machine Learning Research Scientist and Co-Creator of the Hugging Face Open LLM Leaderboard

  • What essential tips and tricks would you recommend to those who wish to fine-tune open-source LLMs? What about deploying?

When fine-tuning open-source LLMs, it is crucial to have a deep understanding of the dataset being used, including its nuances and biases. Most users opt for state-of-the-art open-source models like Llama 2 for their fine-tuning tasks. Recently, an SFT trainer was integrated into TRL, which might be of interest to readers. More information can be found at: https://huggingface.co/docs/trl/sft_trainer
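For reference, a hedged sketch of what supervised fine-tuning with TRL looks like; the exact arguments vary across TRL versions, and the dataset and model names are illustrative:

```python
# A minimal sketch of supervised fine-tuning (SFT) with TRL's SFTTrainer.
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")  # any dataset with a text column

trainer = SFTTrainer(
    model="facebook/opt-350m",  # a model name or a preloaded model object
    train_dataset=dataset,
    dataset_text_field="text",  # the column containing the training text
    max_seq_length=512,
)
trainer.train()
```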

Deploying fine-tuned models can present challenges, especially concerning hardware limitations within organizations. In some cases, using smaller, overtrained models might be a better option to reduce inference costs. An insightful discussion on this topic can be found at: https://www.harmdevries.com/post/model-size-vs-compute-overhead/

  • What are the most common challenges that practitioners encounter while utilizing open-source LLMs, and what advice can you offer to overcome these issues?

The most common obstacles practitioners encounter when utilizing open-source LLMs often stem from a lack of awareness about the model's limitations, biases, and issues related to contextual understanding and out-of-distribution data. To address these challenges, it is essential to develop a comprehensive evaluation suite that thoroughly tests the model against such cases before it is deployed for users.

  • What recommendations would you give to individuals who are just beginning their journey into working with open-source LLMs?

For those who are new to working with open-source LLMs, building a strong foundation in machine learning, natural language processing, and Python programming is vital. Hugging Face offers a dedicated course on this subject.

To stay up-to-date with the latest advancements in open-source LLMs, the Open LLM Leaderboard serves as the primary source for ranking the most recent models.

To connect with Edward Beeching and learn more about his work, see his personal website.

Rajiv Shah, a Machine Learning Engineer at Hugging Face

  • What essential tips and tricks would you recommend to those who wish to fine-tune open-source LLMs? What about deploying?

The first question is: do you need to fine-tune a model at all?

The Hugging Face Hub has hundreds of thousands of pre-trained models for various tasks. Could your problem be solved by one of them, or by prompting an existing LLM?

For fine-tuning LLMs, there are several efficient strategies that take less compute and run faster. The PEFT package is typically my starting point. It includes techniques like Low-Rank Adaptation of LLMs (LoRA), which is gaining popularity.

My final suggestion is to use learning curves as you tune your model. Keep track of the methods you use, how much data you are using, and the model’s performance. By tracking all of this, you can make a better quantitative assessment of whether to keep fine-tuning, get more data, or stop because you have hit a point of diminishing returns.
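One way to put this into practice is to fine-tune on growing slices of your data and record a validation score for each run. The sketch below uses hypothetical `fine_tune` and `evaluate` helpers standing in for your own training and evaluation routines:

```python
# A sketch of building a learning curve: train on growing data slices and
# record validation performance to spot diminishing returns.
# NOTE: fine_tune, evaluate, base_model, train_data, and validation_data are
# hypothetical placeholders for your own code and data.
data_fractions = [0.1, 0.25, 0.5, 1.0]
results = []
for frac in data_fractions:
    subset = train_data.select(range(int(len(train_data) * frac)))
    model = fine_tune(base_model, subset)      # your training loop
    score = evaluate(model, validation_data)   # your validation metric
    results.append({"fraction": frac, "examples": len(subset), "score": score})
    print(results[-1])
# If the score barely moves between the 0.5 and 1.0 runs, more data
# (or more fine-tuning) is unlikely to help much.
```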

  • To address a question from one of our subscribers, who mentioned latency issues while using Hugging Face and AWS JumpStart: do you have any suggestions?

Latency brings up a great point: when using an API from a provider, latency comes from both the model itself and the network connection. If you want more control, you can host the model yourself, which lets you reduce network latency and provision more compute to ensure faster response times.

(Everyone complains about latency, whether they’re using AWS, OpenAI, or Hugging Face – it’s never fast enough.)
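Self-hosting can be as simple as wrapping a model in a small web service on hardware close to your users. A minimal sketch follows; the framework choice and model are illustrative, and a production setup would add batching, streaming, and monitoring:

```python
# A minimal sketch of self-hosting a model behind an HTTP endpoint with FastAPI.
# Save as server.py and run with: uvicorn server:app --host 0.0.0.0 --port 8000
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="gpt2")  # swap in your own fine-tuned model

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(prompt: Prompt):
    out = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
    return {"generated_text": out[0]["generated_text"]}
```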

  • What recommendations would you give to individuals who are just beginning their journey into working with open-source LLMs?

There is a diverse set of LLMs. Each has its background and strengths. Be aware of them when deciding what LLMs to use for a particular task. If you want to work on using agents, you will want a large, sophisticated LLM. If you are learning fine-tuning, a smaller LLM will let you work faster. If you are interested in code generation, choosing an LLM that has been pre-trained on code is a great starting point.

To connect with Rajiv Shah and learn more about his work, you can use his social media handle, @rajistics, which works everywhere. He mostly makes TikTok videos.

Aniket Maurya, a Developer Advocate at Lightning AI

  • What are the general tips and tricks you can give those who want to fine-tune open-source LLMs? What about deploying?

For fine-tuning open-source LLMs, the best approach today is to use parameter-efficient fine-tuning techniques like LoRA and Adapters, which reduce the GPU memory needed for fine-tuning. Also, it is important to use low precision if the model can't fit on a single GPU: you can use bfloat16 (bf16), and, following the QLoRA paper, even 4-bit precision.
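As a hedged sketch, loading a model in 4-bit precision (QLoRA-style) with transformers and bitsandbytes looks roughly like this; the model name is illustrative, and bitsandbytes plus accelerate must be installed:

```python
# A minimal sketch of loading a model in 4-bit precision, as popularized by QLoRA.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, introduced in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # illustrative; any causal LM from the Hub
    quantization_config=bnb_config,
    device_map="auto",
)
```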

  • What are the most frequent problems that practitioners face when trying to use open-source LLMs? What advice could you give about overcoming these problems?

The most frequent problem is fitting the model on a single GPU. The solution is to use low precision like bf16, or even the 4-bit precision from the QLoRA paper.

  • Could you give any recommendations for those who are just starting to work with open-source LLMs?

Start by fine-tuning on a custom dataset and play around with the LLM configurations, like max sequence length and temperature, and check how they affect generation quality. Then learn why in more detail. I have a blog post on fine-tuning LLMs on custom datasets.
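A quick way to build that intuition is to sweep a setting like temperature and compare the outputs side by side. A minimal sketch (gpt2 is just a small illustrative model):

```python
# A small sketch for experimenting with generation settings.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Explain LoRA in one sentence:", return_tensors="pt")
for temperature in (0.2, 0.7, 1.2):
    output = model.generate(
        **inputs,
        do_sample=True,           # sampling is required for temperature to matter
        temperature=temperature,  # higher values give more diverse, riskier text
        max_new_tokens=64,
    )
    print(f"--- temperature={temperature} ---")
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```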

To connect with Aniket Maurya and learn more about his work, you can use this link.

Devis Lucato, a Principal Architect at Microsoft Working on Semantic Kernel

At Semantic Kernel, we primarily focus on foundation models like OpenAI GPT, which offer a wide range of capabilities such as handling text, planning actions, and understanding intent. However, the SDK is also open to integrating other models, such as Bart, Claude, Hugging Face models, and fine-tuned models. While our team doesn't work extensively with fine-tuned models, I can share some insights based on past experiences and discussions with customers and the open-source community.

  • What are the general tips and tricks you can give those who want to fine-tune open-source LLMs? What about deploying?

    • Start with a pre-trained model: utilize models like GPT-3.5 and GPT-4 as a base to save time and computational resources.

    • Choose appropriate datasets: select a dataset that is representative of your task and large enough to cover various scenarios without causing overfitting.

    • Monitor progress: track the model's performance on a validation set to ensure effective learning and avoid overfitting.

    • Gradual unfreezing: unfreeze the model layers one by one, starting with the last/outermost layers, to retain pre-trained knowledge while adapting to the new task (see the sketch after this list).

    • Optimize the model: compress the model using quantization or pruning techniques to reduce its size and improve inference speed, while balancing compression with quality loss.

    • Ensure scalability: design a deployment architecture that can handle variable workloads and scale in/out as needed.

    • Monitor and update: Continuously monitor the deployed model's performance and update it with new data or fine-tune it as required.
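To make the gradual-unfreezing point concrete, here is a minimal PyTorch sketch; gpt2 is illustrative, and layer attribute names differ between architectures:

```python
# A minimal sketch of gradual unfreezing: freeze everything, then unfreeze
# the last transformer blocks first and work backward as training progresses.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# 1. Freeze all parameters.
for param in model.parameters():
    param.requires_grad = False

# 2. Unfreeze the last two transformer blocks (GPT-2 exposes them as model.transformer.h).
for block in model.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True

# 3. Keep the output head trainable. (Note: GPT-2 ties the LM head to the
#    input embeddings, so this also unfreezes the embedding matrix.)
for param in model.lm_head.parameters():
    param.requires_grad = True

# Later in training, repeat step 2 on earlier blocks (e.g. h[-4:]) to
# unfreeze the model gradually.
```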

  • What are the most frequent problems that practitioners face when trying to use open-source LLMs? What advice could you give about overcoming these problems?

Some common issues practitioners face when using open-source LLMs include:

  • Reduced model capability, e.g. a smaller attention window and a weaker ability to understand intent.

  • Cost/complexity of IT operations to deploy a model and run it for a wide audience.

  • High computational requirements, e.g. the need for specialized hardware.

  • Insufficient or biased data.

  • Overfitting: models may learn to perform well on the training data but struggle to generalize to new examples.

  • Unclear licensing and usage rights: open-source LLMs may have varying usage restrictions and licenses, which can limit their application in certain scenarios.

To overcome these problems:

  • Leverage cloud-based platforms or GPU clusters to address computational limitations.

  • Focus on collecting high-quality, diverse, and representative data for your task.

  • Use a validation set to monitor model performance and adjust parameters to avoid overfitting (see the sketch after this list).

  • Familiarize yourself with the licensing terms and usage rights of the open-source LLMs in use.
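For the validation-monitoring point, here is a hedged sketch with the transformers Trainer and early stopping; the model and datasets are hypothetical placeholders you would supply yourself:

```python
# A minimal sketch of validation monitoring with early stopping.
# NOTE: model, train_dataset, and eval_dataset are hypothetical placeholders.
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="epoch",   # evaluate on the validation set every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,   # keep the checkpoint with the best eval metric
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # stop after 2 stagnant evals
)
trainer.train()
```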

  • Could you give any recommendations for those who are just starting to work with open-source LLMs?

Recommendations for beginners:

  • Learn the basics: understand the fundamentals of deep learning, NLP, and the specific LLM architecture, including details like latency, throttling, cost, and privacy.

  • Start small: begin with smaller models and tasks before progressing to larger and more complex LLMs.

  • Experiment: try various techniques, hyperparameters, and datasets to gain a deeper understanding of the model's behavior and performance.

  • Engage with the community: join forums, attend conferences, and collaborate with others in the field to stay updated on the latest developments and best practices.

  • Document your work: keep detailed records of your experiments, results, and insights to facilitate future work and collaboration.

To connect with Devis Lucato and learn more about their work, you can use their LinkedIn page.

Lianmin Zheng, a Ph.D. student at UC Berkeley, one of the contributors to Vicuna

  • How can researchers and practitioners efficiently leverage the power of Vicuna? Could you provide any tips and tricks?

Vicuna can be used for multi-modal research and safety research, and several cool projects have been built on top of it.

  • What problems did you face while fine-tuning Vicuna? I think the answer may help other practitioners in their work.

  1. How to get good data and clean it? We use shared conversation data from ShareGPT.

  2. How to evaluate the models? We use MT-bench and Chatbot Arena.

  • What tasks does Vicuna handle best?

Vicuna is good at general chat but not at math or coding. See some analysis here.

To connect with Lianmin Zheng and learn more about his work, see his personal website. More information about the Vicuna model can be found on the official website.

Conclusion

In this episode, we shared practical advice from ML experts who build, fine-tune, and integrate LLMs into various projects. Their insights shed light on the nuances of fine-tuning, efficient deployment, and best practices for beginners. Whether you are a seasoned professional or just starting out in the field, we hope you found helpful information and links for your projects. Please feel free to share your insights so we can update this guide: you can reply to this email, leave a comment, or send us a note at [email protected].

The next episode will conclude our current series on LLMs. We will dive into the topic of the AI bubble. What does it mean, and are we in one? Stay tuned.

Be sure you are subscribed, and share this series with everyone who can benefit from it. You can do so via our referral system 👇 Your referrals will add up and eventually lead to some great gifts 🤍 Thank you for your support!

