Meta’s large language model LLaMA took the AI research world by storm in February 2023, followed by the commercially usable Llama 2 in July and Code Llama in August. With the introduction of LLaMA, the first major free ‘open source’ LLM, open-source AI began to have a moment 😵‍💫.
According to Meta, the open-source AI community has fine-tuned and released over 7,000 LLaMA derivatives on the Hugging Face platform since the model’s release 🚀.
Let’s take a closer look at how LLaMA works 👀. This blog post walks through its technical design and highlights the key differences that set it apart from its counterparts 🔎.
The training dataset is a mixture of several sources, covering a diverse set of domains (sampling proportions as reported in the LLaMA paper):

| Dataset | Sampling proportion |
| --- | --- |
| CommonCrawl | 67.0% |
| C4 | 15.0% |
| GitHub | 4.5% |
| Wikipedia | 4.5% |
| Books | 4.5% |
| ArXiv | 2.5% |
| StackExchange | 2.0% |
Comments: all data is open-source 👨‍💻. As the table shows, although LLaMA was trained on Wikipedia data covering 20 languages, most of the data the model was exposed to comes from CommonCrawl, which was filtered to keep English content only 🔤. Overall, the training dataset contains roughly 1.4T tokens after BPE tokenization.
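As a quick aside, LLaMA’s BPE tokenizer is trained with SentencePiece. Here is a minimal sketch of counting tokens with it; the `tokenizer.model` path is a placeholder for the file distributed alongside the model weights:

```python
import sentencepiece as spm

# Load LLaMA's SentencePiece BPE model (the path is a placeholder;
# the tokenizer.model file ships with the weights).
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

text = "Open-source LLMs are having a moment."
ids = sp.encode(text)

print(len(ids), ids)   # number of BPE tokens and their ids
print(sp.decode(ids))  # decodes back to the original text
```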
LLaMA is based on the transformer architecture; however, the authors incorporate several improvements that were proposed and used in other models such as GPT-3 and PaLM 🤖.
Note: Like GPT-3, LLaMA uses the Transformer’s decoder-only architecture 💡.
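To make “decoder-only” concrete: each position may only attend to itself and earlier positions, which is typically enforced with a causal attention mask. A minimal sketch (illustrative, not LLaMA’s actual code):

```python
import torch

# Causal mask for a decoder-only transformer: position i may attend
# only to positions j <= i. Adding -inf above the diagonal makes the
# attention softmax assign zero weight to future tokens.
seq_len = 4
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```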
Pre-normalization [GPT3]. To improve the training stability, the authors use RMSNorm and normalize the input of each transformer sub-layer, instead of normalizing the output.
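Here is a minimal PyTorch sketch of RMSNorm, closely following the reference implementation in Meta’s llama repository (the `eps` default and the float32 upcast are choices taken from that reference code):

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps                               # small constant for numerical stability
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-dimension gain

    def _norm(self, x: torch.Tensor) -> torch.Tensor:
        # Divide each element by the root mean square over the last dimension.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize in float32 for stability, cast back, then apply the gain.
        return self._norm(x.float()).type_as(x) * self.weight
```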
As seen above, RMSNorm divides each element of x by the root mean square of the elements of x: the square root of the mean of the squares (squaring ensures the values are non-negative), with a small constant eps added for numerical stability. It then multiplies the normalized output by the learnable self.weight. Unlike LayerNorm, RMSNorm skips mean subtraction and the bias term, making it simpler and slightly cheaper to compute.
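A quick sanity check of the sketch above: since self.weight starts at ones, the output initially equals the normalized input, so each hidden vector should have approximately unit root mean square:

```python
norm = RMSNorm(dim=8)
x = torch.randn(2, 5, 8)   # (batch, seq_len, hidden_dim)
y = norm(x)
print(y.pow(2).mean(-1))   # ≈ 1.0 everywhere: unit RMS per vector
print(y.shape)             # torch.Size([2, 5, 8])
```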