LLMs Explained: DocLLM by JPMorgan AI Research

Ching (Chingis)
5 min read · Jan 10, 2024

Big for finance 🚀!

JPMorgan recently announced DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Their new LLM can understand documents such as invoices, financial reports, and contracts.

The key 🔑? It avoids expensive image encoders and focuses exclusively on bounding box information to incorporate the spatial layout structure.
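
To make this concrete, here is a minimal sketch of how such layout-aware input could be represented: each OCR token carries four normalized box coordinates instead of any pixel features. The class name `LayoutAwareEmbedding`, the dimensions, and the linear projection of the box coordinates are illustrative assumptions, not DocLLM's exact design:

```python
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    """Illustrative sketch (assumed names/shapes): pair each token's
    text embedding with an embedding of its bounding box, so spatial
    layout enters without any image encoder."""

    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.bbox_proj = nn.Linear(4, d_model)  # (x1, y1, x2, y2) -> d_model

    def forward(self, token_ids, bboxes):
        # token_ids: (batch, seq) vocabulary indices
        # bboxes: (batch, seq, 4) box coordinates normalized to [0, 1]
        text = self.token_emb(token_ids)   # textual semantics
        layout = self.bbox_proj(bboxes)    # spatial layout
        return text, layout  # separate streams for the model to fuse
```

The takeaway is that spatial structure costs only four numbers per token rather than an image encoder's pixel features, which is what keeps the extension lightweight.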

They demonstrate that DocLLM outperforms state-of-the-art (SOTA) LLMs on 14 out of 16 datasets across all tasks and generalizes well to 4 out of 5 previously unseen datasets.

Model Architecture

DocLLM is built on an auto-regressive transformer language model with a causal decoder structure.

It is composed of stacked transformer blocks, where each block contains a multi-head self-attention layer and a fully connected feed-forward network (see the sketch below).
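
Here is a minimal PyTorch sketch of one such block under standard assumptions (pre-norm layout, GELU activation, default dimensions); DocLLM's actual attention additionally incorporates the spatial embeddings, which is not shown here:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One transformer block: multi-head self-attention + feed-forward.
    A minimal sketch of the causal decoder structure described above;
    dimensions, norm placement, and activation are assumptions, not
    the paper's exact configuration."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each position may attend only to earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out              # residual around self-attention
        x = x + self.ff(self.norm2(x))  # residual around feed-forward
        return x
```

Stacking several of these blocks, with the causal mask applied in every forward pass as above, gives the decoder-only structure the article describes.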

