Large Language Models History

What are Language Models?

A language model (LM) is a model that predicts the next word given a sequence of preceding words.

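The development of LMs can be broadly classified into five stages: rule-based models, statistical language models (SLMs), neural language models, pre-trained language models (PLMs), and large language models (LLMs).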

In the first stage, language models were rule-based: the grammatical rules of a specific language were used to predict the next word in a sentence.
For example, in English, "I" is followed by "am", not "are", while "They" can be followed by "have" or "are".
However, there are many exceptions, and handling all the language rules is tricky.
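The rule-based idea above can be sketched as a small lookup table. This is only an illustration: the `RULES` table and the `allowed_next_words` helper are hypothetical names, and the rules themselves are a tiny hand-picked subset of English, not a real grammar.

```python
# Hypothetical hand-written rules: subject pronoun -> verbs allowed to follow it.
# A real grammar would need far more rules, plus their exceptions.
RULES = {
    "i": ["am", "have"],
    "they": ["are", "have"],
    "he": ["is", "has"],
}


def allowed_next_words(previous_word):
    """Return the words the rules allow after previous_word (empty if no rule exists)."""
    return RULES.get(previous_word.lower(), [])


print(allowed_next_words("I"))     # rule found for "I"
print(allowed_next_words("They"))  # rule found for "They"
print(allowed_next_words("We"))    # no rule was written for "We"
```

The last call returns an empty list because nobody wrote a rule for "We", which is the weakness described above: every case and every exception has to be listed by hand.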

In the second stage, statistical language models (SLMs), a large corpus of text was analyzed, and the probability of each word following a given sequence of words was estimated statistically.
For example, the number of times "am" appears after "I" is counted, and the resulting probability is compared with that of other words such as "are" or "is".
In a more advanced SLM, the n-gram model, the probability of the next word is estimated not from the single previous word but from the last bi-gram (two words) or tri-gram (three words).
However, in English a single word can have multiple meanings depending on the context of the sentence, and an SLM cannot determine that context.
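The counting approach described above can be sketched in a few lines. This is a minimal illustration of the simplest case, where the next word is predicted from the single previous word. The toy corpus and the function names `train_counts` and `next_word_probabilities` are assumptions made for this example, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def train_counts(sentences):
    """Count how often each word follows each previous word in the corpus."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word with the word that comes right after it.
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probabilities(counts, prev):
    """Turn the raw follower counts of prev into relative frequencies."""
    followers = counts.get(prev.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # prev was never seen with a following word
    return {word: count / total for word, count in followers.items()}


# Toy corpus, made up for illustration only.
corpus = [
    "I am here",
    "I am happy",
    "I was late",
    "they are here",
]

counts = train_counts(corpus)
probs = next_word_probabilities(counts, "I")
print(probs)
```

In this toy corpus "am" follows "I" in two of three cases, so it gets the highest probability, while "are" never follows "I" and is absent from the result. Extending the key from one previous word to the last two or three words gives the longer-context n-gram variant mentioned above.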

In the third stage, neural language models, neural networks calculate the probability of the next word, building on word representations such as Word2Vec (Word to Vector).
Examples: RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory).

In the fourth stage, pre-trained language models (PLMs), ELMo (context-aware word embeddings) and self-attention through the Transformer architecture raised the performance bar for NLP tasks. Examples: BERT and GPT-2.
These models were trained on large amounts of text, which increased their context awareness.

The fifth stage is the large language model (LLM), and there is only a thin line between a PLM and an LLM.
Scaling up the model size and training data size of PLMs led to the discovery of new emergent abilities. Examples: ChatGPT, LLaMA, Claude.
An LLM differs from a PLM in three broad ways:
- Emergent abilities
- Prompting/Conversational Interface
- Scale: engineering and research problems must be solved to attain it.