
Static LLM Benchmarks Are Not Enough

Guillermo Sanchez-Brizuela
February 6, 2024
5 min read

The LLM landscape is moving incredibly fast, with new models coming out every week. In the last few weeks alone, Mamba showed that structured state space models can be more parameter-efficient than transformers, Mixtral showed that mixture-of-experts models can outperform monolithic models such as Llama2, and MIQU (leaked from Mistral) suggests that open-source LLMs may be fast approaching GPT-4-level capabilities. With so many providers offering different LLMs, and also offering different endpoints for the same LLMs, all with varying costs and runtime performance, there has never been a more urgent need for objective benchmarks.

Valiant efforts have recently been made, such as Anyscale’s LLMPerf leaderboard and Martian Router’s LLM Inference Provider Leaderboard. However, these benchmarks take the form of static tables. In this post, we argue that static benchmarks are simply not enough, and that benchmarking data must be presented across time in order to make any meaningful comparisons.

Transient Systems

When benchmarking hardware, it’s fair to assume that the runtime performance of the hardware will be the same whether you test it today or tomorrow. Static benchmarks such as MLPerf rely on this assumption. The metrics measured can be assigned to the hardware once, and then these static metrics and scores are considered intrinsic to the hardware.

However, public endpoints for LLMs do not behave like this at all. From the perspective of an end user, the runtime performance varies drastically over time, for a number of reasons. Unlike a piece of hardware, an endpoint does not represent a static system. It is instead a gateway into a black-box system whose internals are hidden from the user. Factors that can (and will) change over time, and that affect the runtime performance of this black-box system, include:

  • Traffic going to the endpoint
  • Load balancing configuration
  • The number of devices currently reserved behind the scenes
  • Updates to the underlying software
  • Updates to the underlying hardware
  • Network speeds

As a result, the runtime performance of these endpoints is best thought of as time series data, rather than as a fixed, static metric. To illustrate this point, consider the data presented below, which shows the tokens/second for several different providers of Llama2 70B throughout a single day.

[Figure: output tokens/second throughout a single day, for several providers of Llama2 70B]
Had we taken a single set of measurements at 03:30 AM, we would have concluded that Together AI had the fastest endpoints. However, had we instead taken the measurements at 12:30 PM on the very same day, we would have concluded that Together AI was slower than Anyscale, Perplexity, and Replicate.

Each data point presented above is averaged over large input sequences and over several concurrent requests, so the variations observed throughout the day are not measurement noise. They come from transience in the underlying system itself, driven by factors such as the overall traffic to the endpoint, the number of GPUs reserved at that moment in time, and the network speed.
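To make this concrete, the sketch below shows how one such data point might be collected: each sample averages output tokens/second over several concurrent requests with a long prompt, and samples are collected repeatedly to build a time series. It assumes an OpenAI-compatible chat completions endpoint; the base URL, API key, model id, prompt, and sampling cadence are illustrative placeholders rather than AI Bench’s actual configuration.

```python
# Minimal sketch: sample output tokens/second as a time series.
# Assumes an OpenAI-compatible /chat/completions endpoint; the base URL,
# API key, model id, prompt and cadence are placeholders for illustration.
import time
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="https://provider.example.com/v1", api_key="YOUR_KEY")
PROMPT = "Summarize the history of deep learning in as much detail as possible."

def tokens_per_second() -> float:
    """Send one request and return its output tokens/second."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="llama-2-70b-chat",  # endpoint-specific model id (assumption)
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=512,
    )
    elapsed = time.perf_counter() - start
    return resp.usage.completion_tokens / elapsed

def sample(concurrency: int = 8) -> float:
    """Average tokens/second over several concurrent requests, as in the plot above."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        rates = list(pool.map(lambda _: tokens_per_second(), range(concurrency)))
    return sum(rates) / len(rates)

# One (timestamp, tokens/second) point every 30 minutes builds the time series.
series = []
for _ in range(48):
    series.append((time.time(), sample()))
    time.sleep(30 * 60)
```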

What About a Scoreboard?

We observe similar trends to the graph above across all metrics, models, and providers. With such transient data being so common, this raises the question: do static runtime scoreboards make any sense at all? In our view, static scoreboards for runtime performance are not especially helpful, and they can disguise the fact that the metrics are constantly changing, sometimes on an hour-by-hour basis.

Our benchmarks present the raw data across time for the key metrics: input cost, output cost, time-to-first-token (TTFT), output-tokens-per-second, inter-token-latency (ITL), end-to-end-latency (E2E), and cold-start time, with tables presenting the most recent values.
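As a rough illustration of how the latency metrics relate to one another, the sketch below derives TTFT, ITL, and E2E latency from the arrival times of streamed chunks. It again assumes an OpenAI-compatible streaming endpoint and treats each streamed chunk as roughly one token; this is an illustration of the metric definitions, not AI Bench’s exact measurement code.

```python
# Sketch: derive TTFT, ITL and E2E latency from a single streamed response.
# Assumes an OpenAI-compatible streaming endpoint and treats each streamed
# chunk as roughly one token; real measurement code should count true tokens.
import time
from openai import OpenAI

client = OpenAI(base_url="https://provider.example.com/v1", api_key="YOUR_KEY")

def latency_metrics(prompt: str, model: str = "llama-2-70b-chat") -> dict:
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=256,
    )
    arrivals = []  # timestamp of each content-bearing chunk
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            arrivals.append(time.perf_counter())
    ttft = arrivals[0] - start                  # time-to-first-token
    e2e = arrivals[-1] - start                  # end-to-end latency
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    itl = sum(gaps) / len(gaps)                 # mean inter-token latency
    return {"ttft": ttft, "itl": itl, "e2e": e2e}
```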

Given the inherently transient and highly volatile nature of these metrics, we avoid scoreboards. Instead, we present the raw data and leave it to the user to leverage this data to make genuinely informed decisions about the endpoint they would like to use for their application.

In reality, every application depends on each of these metrics to a different extent. For some time-critical applications, the ITL is of paramount importance; for other non-time-critical applications, it’s only about minimizing cost. Going further, for input-heavy applications such as document summarization, the input cost matters most, whereas for output-heavy applications such as content creation, the output cost matters most. For applications that need to feel very responsive to the end user, the TTFT is most important. If these raw metrics are ever combined to create “scores”, then this should always be done in a task-specific manner. We therefore leave all such “scoring” to the user for now.
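To illustrate what a task-specific combination could look like, here is a minimal sketch of a weighted score over the raw metrics. The metric names, reference values, and weights are assumptions chosen for illustration, not a recommended scheme.

```python
# Sketch: a task-specific score over the raw metrics (lower is better).
# Metric names, reference values and weights are illustrative assumptions.
def task_score(metrics: dict, weights: dict, reference: dict) -> float:
    """Each metric is divided by a reference value (e.g. the median across
    providers) so that units cancel, then combined with task-specific weights."""
    return sum(w * metrics[k] / reference[k] for k, w in weights.items())

# A latency-critical chat assistant might weight ITL and TTFT heavily...
chat_weights = {"itl": 5.0, "ttft": 3.0, "output_cost": 1.0}
# ...while offline document summarization mostly cares about input cost.
summarization_weights = {"input_cost": 10.0, "e2e": 0.5}
```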

Get Involved

Our benchmarking logic, named AI Bench, is fully open source. The full benchmarking methodology is also explained here in detail. We strongly welcome and encourage feedback from the community!

As the next step, we plan on creating dynamic and customizable scoring systems, where the user can specify the relative importance of each metric, with soft and hard constraints, before we propose the best model for them based on their preferences and their specific use case.
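As a rough sketch of that idea, endpoint selection could first enforce the user’s hard constraints and then rank the remaining endpoints by a soft, weighted preference score. The field names and constraint format below are assumptions for illustration, not the planned implementation.

```python
# Sketch: filter endpoints by hard constraints, then rank the survivors by a
# user-weighted soft score. Field names and constraint format are assumptions.
def weighted_score(metrics: dict, weights: dict, reference: dict) -> float:
    """Lower is better; metrics are normalized by reference values so units cancel."""
    return sum(w * metrics[k] / reference[k] for k, w in weights.items())

def recommend(endpoints: list[dict], hard_limits: dict,
              weights: dict, reference: dict) -> dict:
    """endpoints: [{"name": ..., "metrics": {...}}, ...]
    hard_limits: upper bounds that must hold, e.g. {"ttft": 0.5, "output_cost": 2.0}."""
    feasible = [
        e for e in endpoints
        if all(e["metrics"][k] <= bound for k, bound in hard_limits.items())
    ]
    # Soft preferences: the lowest weighted score among feasible endpoints wins.
    return min(feasible, key=lambda e: weighted_score(e["metrics"], weights, reference))
```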

Stay tuned for more updates as we work on our dynamic scoring and recommendation features! For feedback, please email us at hello@unify.ai or simply tag us on Twitter if you have any feature requests, comments, or suggestions. We’d love to hear from you 😊

About the Author
Guillermo Sanchez-Brizuela
Unify | ML Engineer, Head of Deployment

Guillermo has led predictive analytics and AI research projects, and earned a Master’s with a focus on Deep Learning, Big Data, and Machine Learning from UVa. His work bridges Deep Learning research and AI deployment.

