LLMs Explained: DocLLM by JPMorgan AI Research

Ching (Chingis)
5 min read · Jan 10, 2024

Big for finance 🚀!

JPMorgan recently announced DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents, taking into account both textual semantics and spatial layout. Their new LLM can understand documents such as invoices, financial reports, and contracts.

The key 🔑? It avoids expensive image encoders and focuses exclusively on bounding box information to incorporate the spatial layout structure.
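
To make this concrete, here is a minimal sketch of how such layout-aware input could be represented: each OCR token carries four normalized box coordinates instead of any pixel features. The class name `LayoutAwareEmbedding`, the dimensions, and the linear projection of the box coordinates are illustrative assumptions, not DocLLM's exact design:

```python
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    """Illustrative sketch (assumed names/shapes): pair each token's
    text embedding with an embedding of its bounding box, so spatial
    layout enters without any image encoder."""

    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.bbox_proj = nn.Linear(4, d_model)  # (x1, y1, x2, y2) -> d_model

    def forward(self, token_ids, bboxes):
        # token_ids: (batch, seq) vocabulary indices
        # bboxes: (batch, seq, 4) box coordinates normalized to [0, 1]
        text = self.token_emb(token_ids)   # textual semantics
        layout = self.bbox_proj(bboxes)    # spatial layout
        return text, layout  # separate streams for the model to fuse
```

The takeaway is that spatial structure costs only four numbers per token rather than an image encoder's pixel features, which is what keeps the extension lightweight.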

They demonstrate that DocLLM outperforms state-of-the-art (SOTA) LLMs on 14 out of 16 datasets across all tasks and generalizes well to 4 out of 5 previously unseen datasets.

Model Architecture

DocLLM is built on an auto-regressive transformer language model with a causal decoder structure.

It is composed of stacked transformer blocks, where each block contains a multi-head self-attention layer and a fully connected feed-forward network (see the sketch below).
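
Here is a minimal PyTorch sketch of one such block under standard assumptions (pre-norm layout, GELU activation, default dimensions); DocLLM's actual attention additionally incorporates the spatial embeddings, which is not shown here:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One transformer block: multi-head self-attention + feed-forward.
    A minimal sketch of the causal decoder structure described above;
    dimensions, norm placement, and activation are assumptions, not
    the paper's exact configuration."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each position may attend only to earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out              # residual around self-attention
        x = x + self.ff(self.norm2(x))  # residual around feed-forward
        return x
```

Stacking several of these blocks, with the causal mask applied in every forward pass as above, gives the decoder-only structure the article describes.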

