Leaner large language models could enable efficient local use on phones and laptops
Summary
The new algorithm, developed by engineers at Princeton and Stanford Engineering, works by trimming redundancies and reducing the precision of an LLM’s layers of information.
December 2024
Large language models (LLMs) are increasingly automating tasks like translation, text classification and customer service. But tapping into an LLM’s power typically requires users to send their requests to a centralized server — a process that’s expensive, energy-intensive and often slow.
Now, researchers have introduced a technique for compressing an LLM’s reams of data, which could increase privacy, save energy and lower costs.
The new algorithm, developed by engineers at Princeton and Stanford Engineering, works by trimming redundancies and reducing the precision of an LLM’s layers of information. This type of leaner LLM could be stored and accessed locally on a device like a phone or laptop and could provide performance nearly as accurate and nuanced as an uncompressed version.
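The general recipe behind this kind of compression can be sketched in a few lines: store a weight matrix at reduced numerical precision, then add a small low-rank correction that recovers some of what the coarser representation discards. The sketch below is a generic NumPy illustration of that idea, not the researchers' actual CALDERA procedure, which is calibration-aware and more involved; the function names, bit-width and rank are hypothetical choices for demonstration only.

```python
import numpy as np

def quantize(W, bits=4):
    """Symmetric uniform quantization of a matrix to the given bit-width.
    (Illustrative only; real quantizers are typically more elaborate.)"""
    levels = 2 ** (bits - 1) - 1          # e.g., 7 levels per sign at 4 bits
    scale = np.abs(W).max() / levels
    return np.round(W / scale).clip(-levels, levels) * scale

def compress(W, rank=8, bits=4):
    """Approximate W as a low-precision backbone plus a low-rank correction."""
    Q = quantize(W, bits)                 # coarse, low-precision copy of W
    # Fit a rank-`rank` correction to the quantization residual via SVD.
    U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
    L = U[:, :rank] * s[:rank]            # tall factor, scaled by singular values
    R = Vt[:rank, :]                      # wide factor
    return Q, L, R                        # reconstruct as Q + L @ R

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))       # stand-in for one LLM weight matrix
Q, L, R = compress(W)
err = np.linalg.norm(W - (Q + L @ R)) / np.linalg.norm(W)
print(f"relative reconstruction error: {err:.3f}")
```

The low-rank term is where the "trimming redundancies" intuition enters: a few dominant directions of the residual capture most of the information lost to reduced precision, at a small storage cost.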
“Any time you can reduce the computational complexity, storage and bandwidth requirements of using AI models, you can enable AI on devices and systems that otherwise couldn’t handle such compute- and memory-intensive tasks,” said study coauthor Andrea Goldsmith, dean of Princeton’s School of Engineering and Applied Science and Arthur LeGrand Doty Professor of Electrical and Computer Engineering.
“When you use ChatGPT, whatever request you give it goes to the back-end servers of OpenAI, which process all of that data, and that is very expensive,” said coauthor Rajarshi Saha, a Stanford Engineering Ph.D. student. “So, you want to be able to do this LLM inference using consumer GPUs [graphics processing units], and the way to do that is by compressing these LLMs.” Saha’s graduate work is coadvised by Goldsmith and coauthor Mert Pilanci, an assistant professor at Stanford Engineering.
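A back-of-envelope calculation shows why compression is the key to running inference on consumer GPUs: a 7-billion-parameter model stored at 16 bits per weight needs roughly 14 GB for its weights alone, more than many consumer GPUs offer, while 4-bit storage cuts that to about 3.5 GB. The tiny sketch below just does that arithmetic; the function name is ours, and it counts weights only, ignoring activations and other runtime memory.

```python
def model_memory_gb(n_params, bits_per_weight):
    """Approximate weight-storage footprint in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

for bits in (16, 8, 4, 2):
    print(f"7B params @ {bits:>2}-bit: {model_memory_gb(7e9, bits):5.2f} GB")
```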
The researchers will present their new algorithm, CALDERA, which stands for Calibration Aware Low precision DEcomposition with low Rank Adaptation, at the Conference on Neural Information Processing Systems (NeurIPS) in December. Saha and colleagues began this compression research not with LLMs themselves, but with the large collections of information that are used to train LLMs and other complex AI models, such as those used for image classification. This technique, a forerunner to the new LLM compression approach, was published in 2023.
Excerpted from Princeton Engineering's “Leaner large language models could enable efficient local use on phones and laptops,” by Molly Sharlach.
Published: Dec 4, 2024, 3:28 p.m.
Updated: Dec 4, 2024, 3:32 p.m.