LLMs Explained: LLaMA and Its Architecture (Part 1)
Meta’s LLaMA took the AI research world by storm in February 2023, followed by the commercially licensed Llama 2 in July and Code Llama in August. With the introduction of LLaMA, the first major free ‘open source’ LLM, open-source AI began to have a moment 😵‍💫.
According to Meta, the open-source AI community has fine-tuned and released over 7,000 LLaMA derivatives on the Hugging Face platform since the model’s release 🚀.
Let’s delve deep into the workings of the groundbreaking LLaMA, a beacon of open-source AI 👀. This blog post will elucidate the technical design of LLaMA and highlight the key differences that set it apart from its counterparts 🔎.
Pre-training Data
The training dataset is a mixture of several sources chosen to cover a diverse set of domains, with the following sampling proportions from the LLaMA paper: CommonCrawl (67%), C4 (15%), GitHub (4.5%), Wikipedia (4.5%), Books (4.5%), ArXiv (2.5%), and StackExchange (2%).
Comments: all of the data is publicly available 👨‍💻. As the proportions above show, although LLaMA was trained on Wikipedia data covering 20 languages, most of the data the model was exposed to comes from CommonCrawl, which was filtered down to English-only pages 🔤. Overall, the training dataset contains roughly 1.4T tokens after BPE tokenization.
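To make the mixture concrete, here is a minimal sketch of how a pre-training data loader might decide which source to draw the next training document from, weighted by the sampling proportions above. The MIXTURE dictionary and the sample_source helper are illustrative assumptions for this sketch; Meta has not released its actual data-loading pipeline.

```python
import random
from collections import Counter

# Illustrative sampling proportions, taken from the LLaMA paper's pre-training mixture.
# The source names and this whole loader are assumptions for the sketch --
# the real pipeline used by Meta is not public.
MIXTURE = {
    "CommonCrawl":   0.670,
    "C4":            0.150,
    "GitHub":        0.045,
    "Wikipedia":     0.045,
    "Books":         0.045,
    "ArXiv":         0.025,
    "StackExchange": 0.020,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document, weighted by the mixture."""
    sources, weights = zip(*MIXTURE.items())
    return rng.choices(sources, weights=weights, k=1)[0]

if __name__ == "__main__":
    # Sanity check: 100k draws should roughly reproduce the target proportions.
    rng = random.Random(0)
    counts = Counter(sample_source(rng) for _ in range(100_000))
    for name, weight in MIXTURE.items():
        print(f"{name:13s}  target={weight:.3f}  empirical={counts[name] / 100_000:.3f}")
```

Running the check shows the empirical frequencies landing within a fraction of a percent of the targets, which is all the sketch is meant to illustrate: the model sees CommonCrawl-derived text far more often than any other source.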
Architecture
LLaMA is based on the transformer architecture; however, the authors leverage various improvements that were proposed…