Table of contents

Generation with LLMs
- Autoregressive generation
- Transformer architecture
- LM decoding methods
Methods for accelerating LLM inference
LLM inference performance measurements and discussion

Generation with LLMs

Autoregressive generation

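An inference-time procedure that calls the model repeatedly, starting from an initial input (prompt, prefix) and appending each generated token to the input of the next call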

https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/assisted-generation/gif_1_1080p.mov
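The loop described above can be sketched roughly as follows. This is a minimal illustration, not a real LLM: `toy_next_token_probs` is a hypothetical stand-in for the model's forward pass, and `generate`, `max_new_tokens`, and `eos_id` are names chosen for this example only.

```python
from typing import Callable, List, Optional


def toy_next_token_probs(tokens: List[int], vocab_size: int = 5) -> List[float]:
    """Hypothetical stand-in for an LLM forward pass.

    Returns a probability distribution over the next token, with most of
    the mass on (last token + 1) mod vocab_size.
    """
    target = (tokens[-1] + 1) % vocab_size
    rest = 0.2 / (vocab_size - 1)
    return [0.8 if t == target else rest for t in range(vocab_size)]


def generate(
    prompt: List[int],
    next_token_probs: Callable[[List[int]], List[float]],
    max_new_tokens: int,
    eos_id: Optional[int] = None,
) -> List[int]:
    """Autoregressive generation: one model call per new token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Call the model on everything generated so far.
        probs = next_token_probs(tokens)
        # Pick the most likely token here; other selection rules are possible.
        next_id = max(range(len(probs)), key=probs.__getitem__)
        # Feed the chosen token back in as input for the next call.
        tokens.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return tokens


print(generate([0], toy_next_token_probs, max_new_tokens=4))  # [0, 1, 2, 3, 4]
```

Each iteration depends on the token chosen in the previous one, which is why the new tokens cannot be produced in parallel within a single sequence.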

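Each model call outputs a probability distribution over the next token; how the next token is selected from this distribution matters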

https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/assisted-generation/gif_2_1080p.mov
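As a rough illustration of the selection step, the sketch below contrasts two common choices: greedy selection (always take the most likely token) and temperature sampling. The helper names `softmax`, `greedy`, and `sample` and the example logits are assumptions made for this example, not part of any particular library.

```python
import math
import random
from typing import List, Optional


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits: List[float]) -> int:
    """Deterministic: always return the index of the highest score."""
    return max(range(len(logits)), key=logits.__getitem__)


def sample(
    logits: List[float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Stochastic: draw an index in proportion to its probability."""
    rng = rng or random.Random()
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [1.0, 3.0, 2.0]  # made-up scores for a 3-token vocabulary
print(greedy(logits))
print(sample(logits, temperature=1.0, rng=random.Random(0)))
```

Greedy selection is reproducible but can produce repetitive text, while sampling trades determinism for variety. As the temperature approaches zero, sampling behaves more and more like greedy selection.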

Transformer architecture


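At every generation step, attention for the new query is computed against the key and value (KV) matrices of the input so far, so those matrices can be stored and reused instead of recomputed → KV caching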
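A minimal single-head sketch of the idea is shown below. It assumes identity projections (each token's query, key, and value are its embedding itself) purely to keep the example short; a real transformer applies learned projection matrices and uses many heads and layers. `attend` and `KVCache` are hypothetical names for this illustration.

```python
import math
from typing import List

Vector = List[float]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query over the given keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    """Keeps the keys and values of all tokens processed so far."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, q: Vector, k: Vector, v: Vector) -> Vector:
        # Only the new token's key and value are added; earlier ones are reused.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


# Made-up 2-dimensional token embeddings (assumption: q = k = v = embedding).
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_outputs = [cache.step(x, x, x) for x in xs]

# Recomputing attention for the last token from scratch gives the same result.
full_output = attend(xs[-1], xs, xs)
print(cached_outputs[-1], full_output)
```

With the cache, each step only has to produce the key and value for the newest token, at the cost of memory that grows with the sequence length.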