Function Calling: Meta AI Llama 3 70B on Groq vs GPT-3.5 and GPT-4

Apr 30, 2024

Explore the capabilities of Meta AI Llama 3 running on Groq's hardware, and see how Llama 3's speed compares with OpenAI models in function calling.

Introduction

Hello everyone, I'm Artem, the founder of ChatLabs. I'm excited to share the latest developments from Meta. On April 18th, Meta launched its latest AI model, Meta AI Llama 3, which can now run on Groq’s advanced computing hardware. We quickly took the opportunity to test the model and evaluate how it performs in real-world applications. This is our second test of Llama 3. This time we are testing function calling, comparing the speed of Llama 3 70B against the most popular LLMs from OpenAI, GPT-3.5 Turbo and GPT-4 Turbo.

What is Meta AI Llama 3?

Meta AI Llama 3 is the newest large language model in Meta’s AI portfolio, engineered to balance performance effectively across several metrics. It ranks third in intelligence among its counterparts but is notably superior in speed and cost-efficiency. This makes it a good choice for those seeking quick and economical AI solutions.

The model is available in two configurations, one with 8 billion parameters and another with 70 billion. The parameter count is a rough measure of the model's complexity and learning capacity. Currently, Llama 3 is geared primarily towards text generation, and Meta has marked this iteration as a significant enhancement over previous versions. The model not only delivers more diverse responses but also has fewer false refusals, better reasoning abilities, and more precise code generation. For this test, we used the more advanced Llama 3 70B model.

What is Groq?

Groq is a key but lesser-known player in the AI hardware field. Its technology boosts the efficiency and speed of AI operations, perfectly complementing the capabilities of Llama 3. This integration allows AI platforms using Groq’s hardware to achieve faster processing, which is essential for real-time applications and extensive operational needs.

Testing Methodology

At ChatLabs, we look beyond the basic chatting capabilities of AI models; we also assess how efficiently and quickly they operate. This test measures how much time Llama 3 takes to handle function calling compared to its competitors, which helps us evaluate its speed.

We used two challenging prompts focused on function calling. The first prompt required the models to draft an engaging travel blog about a trip to Hawaii. In the second, we asked the models to answer the question "What is Microsoft Phi-3?" with links to trustworthy resources. Each prompt was tested four times to make the results more robust and consistent.
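To make the procedure concrete, here is a minimal sketch of a timing loop that runs each prompt four times and records the wall-clock latency of every call. It is an illustration, not necessarily the code used in this test: `time_prompt` and `fake_model` are hypothetical names, the prompt strings are paraphrased from the description above, and `fake_model` sleeps briefly instead of calling a real API.

```python
import time
from typing import Callable, Dict, List


def time_prompt(call_model: Callable[[str], str], prompt: str, runs: int = 4) -> List[float]:
    """Return the wall-clock latency in seconds of each of `runs` calls."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        call_model(prompt)  # full round trip, including any tool calls
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real API client: sleeps instead of using the network.
    time.sleep(0.01)
    return f"response to: {prompt}"


# Paraphrased prompts; the exact wording used in the test is an assumption.
PROMPTS = [
    "Write an engaging travel blog about a trip to Hawaii.",
    "What is Microsoft Phi-3? Include links to trustworthy resources.",
]

results: Dict[str, List[float]] = {p: time_prompt(fake_model, p) for p in PROMPTS}
```

Swapping `fake_model` for a function that calls a real provider would give one list of four latencies per prompt and per model, which is the raw material for the comparison below.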

What is function calling?

In this test, we looked at how Llama 3 on Groq handles function calling: in other words, how quickly an LLM can recognize that a request is not just a simple text query but one that requires calling an external service (such as internet search or image generation), and then provide a comprehensive response.
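As a rough illustration of what function calling looks like in code, the sketch below declares two tools in the JSON-schema style used by OpenAI's chat completions API and dispatches a tool call to a local Python function. Everything here is assumed for the example: the tool names `web_search` and `generate_image`, their stub implementations, and the simulated tool call (no model or network is involved).

```python
import json
from typing import Any, Callable, Dict

# Hypothetical tool declarations that would be sent to the model with the prompt.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the internet and return relevant links.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "generate_image",
            "description": "Generate an image from a text description.",
            "parameters": {
                "type": "object",
                "properties": {"description": {"type": "string"}},
                "required": ["description"],
            },
        },
    },
]


def web_search(query: str) -> str:
    # Stub: a real implementation would call a search service.
    return json.dumps({"query": query, "links": ["https://example.com/result"]})


def generate_image(description: str) -> str:
    # Stub: a real implementation would call an image generation service.
    return json.dumps({"description": description, "url": "https://example.com/image.png"})


REGISTRY: Dict[str, Callable[..., str]] = {
    "web_search": web_search,
    "generate_image": generate_image,
}


def dispatch(tool_call: Dict[str, Any]) -> str:
    """Run the local function named in a model's tool call."""
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])  # arguments arrive as a JSON string
    return REGISTRY[name](**args)


# Simulated tool call, shaped like what a model might return for the Phi-3 question.
simulated_call = {
    "id": "call_1",
    "type": "function",
    "function": {"name": "web_search", "arguments": '{"query": "Microsoft Phi-3"}'},
}
tool_result = dispatch(simulated_call)
```

In a real exchange the tool result is sent back to the model, which then writes the final answer. The time measured in this test covers that whole round trip.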

Performance Comparison

Now, let’s explore how Llama 3 70B compares to two other well-known AI models on the market, GPT-3.5 Turbo and GPT-4 Turbo:

This graph compares how quickly three different AI APIs respond to tasks. The AIs are GPT-4-Turbo, Llama 3 70B Instruct, and GPT-3.5-Turbo. The graph shows three types of response times for each AI:

Average Response Time: This is the typical speed at which each AI responds.
Median Response Time (p50): This shows the middle value of response times, meaning half the responses are faster and half are slower than this value.
90th Percentile Response Time (p90): This is the response time that 90% of responses beat, while the remaining 10% are slower. It helps us understand how slow the slowest responses tend to be.
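The three metrics above can be computed from a list of latencies as in the sketch below. The sample values are made up to show how a few slow responses affect each metric; they are not measurements from this test. The `percentile` helper uses the nearest-rank method, which is one of several common definitions.

```python
import math
import statistics
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies: List[float]) -> Dict[str, float]:
    return {
        "average": statistics.mean(latencies),
        "p50": statistics.median(latencies),
        "p90": percentile(latencies, 90),
    }


# Illustrative latencies in seconds (not real data): eight quick calls, two slow ones.
sample = [1.0, 1.2, 1.1, 0.9, 1.0, 1.3, 1.1, 9.0, 1.0, 8.0]
summary = summarize(sample)
```

With these numbers the median stays at 1.1 seconds while the p90 jumps to 8.0 seconds, which is why a model can feel fast on most requests and still show a large spike in its p90.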

Here's what we can learn from the graph:

GPT-4-Turbo and GPT-3.5-Turbo: These two AIs are pretty quick and consistent. Their average, median, and 90th percentile times are close to each other, which means you can expect similar speeds most of the time.
Llama 3 70B Instruct: This AI is generally quick, but sometimes it can be much slower, especially in more complex tasks, as shown by the high spike in its 90th percentile time.

Overall Comparison:

Consistency: GPT-4-Turbo and GPT-3.5-Turbo deliver quick responses more reliably.
Occasional Delays: Llama 3 might have occasional slower responses but generally performs well.

In simple terms, the GPT models are like fast and reliable sports cars, while Llama 3 is like a fast car that occasionally gets stuck in traffic. This is helpful to know depending on how important speed and consistency are for what you need the AI to do.

Read details in the document.

Test Lab Description

To provide a clearer picture of our testing environment, we conducted these evaluations using a MacBook Pro 13-inch equipped with an M1 Pro chip, and a standard consumer internet connection from Comcast with a download speed of 42 MB/s. It's important for us to test these models in conditions that mimic everyday use, so you can understand how they might perform in your own projects.

Llama 3 with ChatLabs

One of the advantages of using ChatLabs is our commitment to making cutting-edge AI technology accessible. Meta AI Llama 3, enhanced by Groq's technology, is now available on our platform alongside over 30 other advanced large language models such as GPT-4, Claude 3, Mistral, Gemini Pro, Perplexity, and others. This access allows you to experiment and find the best fit for your needs, whether you're developing a chatbot, analyzing text data, or creating other innovative applications.

Conclusion

During our testing, the p90 response time on Groq was very high, which makes the model too unpredictable for production use. For now, OpenAI appears to be the winner, despite being much slower and more expensive.

However, we understand that the Groq team is facing very high demand due to its popularity and has devoted all its resources to fixing the situation. Once the delay issue is resolved, we will be able to say with confidence that Llama 3 on Groq is the most efficient model in terms of price, quality, and speed.

At ChatLabs, we’re always on top of the latest advancements, providing you with the tools and knowledge you need to make the best choices for your AI projects. We’ll keep testing these models and sharing what we learn with you, so look out for more updates!

I hope this overview of our new test with Llama 3 70B and Groq’s technology has been useful. If you’re interested in the specifics and want to see all our test results, check out our detailed Google Spreadsheet. You can find the code in a ChatLabs pull request. And as always, if you have any questions or just want to talk about AI, feel free to reach out.

See you next time, Artem
