Research Article

LLM Serving Optimization Techniques: A Comprehensive Analysis

Authors

  • Venkata Siva Prasad Bharathula, University of Florida, USA

Abstract

This article presents a comprehensive analysis of optimization techniques for serving Large Language Models (LLMs), addressing the critical challenges posed by their exponential growth in size and computational requirements. It examines four key areas of optimization: hardware acceleration, serving architecture design, model compression, and dynamic scaling strategies. The article synthesizes findings from multiple studies demonstrating significant improvements in memory efficiency, throughput, latency, and cost-effectiveness through innovative approaches, including parameter-centric memory management, near-storage processing, adaptive batching, model parallelism, quantization, pruning, and intelligent caching. It also explores promising future directions in hardware-software co-design and advanced compiler optimizations that could further democratize access to these powerful models. The collective impact of these techniques enables more efficient deployment of LLMs across diverse computing environments, from high-performance data centers to resource-constrained edge devices.

Article information

Journal

Journal of Computer Science and Technology Studies

Volume (Issue)

7 (5)

Pages

174-181

Published

2025-05-30

How to Cite

Venkata Siva Prasad Bharathula. (2025). LLM Serving Optimization Techniques: A Comprehensive Analysis. Journal of Computer Science and Technology Studies, 7(5), 174-181. https://doi.org/10.32996/jcsts.2025.7.5.23

Keywords

Memory optimization, hardware acceleration, quantization, dynamic batching, model parallelism