Minimal explanation of tied embeddings in language models with a compact real-model demo.
Published September 17, 2025
Tied embeddings (also known as weight tying) are a small but powerful trick in language models: the same matrix is used both for the input token embedding and for the output softmax projection. This reduces the parameter count and often improves perplexity.
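In PyTorch terms, tying is just pointing the output projection at the embedding matrix. Here is a minimal sketch; the module and dimensions are made up for illustration, not taken from any real model:

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy decoder-only LM skeleton, just to show where the tying happens."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)              # input embedding, shape (V, D)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # output softmax layer, weight shape (V, D)
        # Weight tying: the output head reuses the embedding matrix.
        self.lm_head.weight = self.embed.weight

    def forward(self, token_ids):
        h = self.embed(token_ids)   # (batch, seq, D); a real model would run transformer blocks here
        return self.lm_head(h)      # logits over the vocabulary, (batch, seq, V)

model = TinyLM()
print(model.embed.weight is model.lm_head.weight)  # True: one matrix, stored once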
The trick was introduced in 2017 (Press & Wolf; Inan et al.), and it was already applied in the original Attention Is All You Need Transformer.
Many popular models, such as GPT-2 and BERT, also use tied embeddings. Not all open-source LLMs adopt the practice, however: LLaMA and Mistral, for example, ship with tie_word_embeddings = False in their reference configurations.
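You can check this directly from a model's Hugging Face config without downloading the weights. A quick sketch, with the model IDs chosen only as examples (access requirements may vary):

from transformers import AutoConfig

# tie_word_embeddings in the config tells you whether a model ties its embeddings.
for model_id in ["gpt2", "mistralai/Mistral-7B-v0.1", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"]:
    cfg = AutoConfig.from_pretrained(model_id)
    print(model_id, "->", getattr(cfg, "tie_word_embeddings", None))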
A real example
To see tied embeddings in action, let’s look at TinyLlama 1.1B.
It is small enough to load on a laptop, but large enough that parameter savings are obvious.
First, we load the model as-is:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def count_params(model):
    return sum(p.numel() for p in model.parameters())

def count_unique_params(model):
    # count each underlying storage only once (handles shared/tied weights)
    seen = set()
    total = 0
    for p in model.parameters():
        ptr = p.data_ptr()
        if ptr not in seen:
            seen.add(ptr)
            total += p.numel()
    return total

def embeddings_share_storage(model):
    # True when the input embedding and the output head point at the same tensor storage
    out = model.get_output_embeddings()
    inp = model.get_input_embeddings()
    return out is not None and inp is not None and out.weight.data_ptr() == inp.weight.data_ptr()

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
baseline = AutoModelForCausalLM.from_pretrained(model_id)

print("Baseline params:", count_params(baseline))
print("Unique params :", count_unique_params(baseline))
print("Embeddings tied?", embeddings_share_storage(baseline))
By default the model loads with separate weight matrices for the input embeddings and the output head, so the large vocabulary matrix conceptually exists twice. Counting the saving takes a little care: when two modules share the same Parameter, PyTorch's parameters() iterator already yields it only once, so the plain total can quietly absorb the sharing. The unique count, keyed on storage pointers, makes the saving explicit either way.
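A toy illustration of that counting subtlety (the shapes here are arbitrary):

import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)
        self.head = nn.Linear(4, 10, bias=False)
        self.head.weight = self.emb.weight  # share one Parameter between both modules

toy = Toy()
# parameters() yields the shared Parameter only once ...
print(sum(p.numel() for p in toy.parameters()))           # 40
# ... while summing per-module weights counts the matrix twice.
print(toy.emb.weight.numel() + toy.head.weight.numel())   # 80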
Let’s now tie the embeddings and check the parameter counts again.
tied = AutoModelForCausalLM.from_pretrained(model_id)

# tie input and output embeddings: point the output head at the input embedding matrix
tied.get_output_embeddings().weight = tied.get_input_embeddings().weight
tied.config.tie_word_embeddings = True  # keep the config consistent with the new sharing
if hasattr(tied, "tie_weights"):
    tied.tie_weights()

print("Tied params :", count_params(tied))
print("Unique params :", count_unique_params(tied))
print("Embeddings tied?", embeddings_share_storage(tied))
Now the input and output embeddings share the same storage.
The unique parameter count drops by exactly V × D, where V is the vocabulary size and D the embedding dimension.
In TinyLlama this means a saving of about 65 million parameters, one full copy of the embedding matrix. Note that overwriting the output head of a model that was pretrained with untied weights does change its predictions; the saving comes truly for free only when a model is trained with tied weights from the start. Let's double-check that the arithmetic holds:
V, D = baseline.get_input_embeddings().weight.shape
expected = V * D
observed = count_unique_params(baseline) - count_unique_params(tied)

print(f"Vocabulary size V={V}, embedding dim D={D}")
print(f"Expected saving : {expected:,}")
print(f"Observed saving : {observed:,}")
This simple one-line change makes the model leaner, saving memory and on-disk checkpoint size. The table below summarises the effect on parameter counts:
| Model    | Total params  | Unique params | Embeddings tied |
|----------|---------------|---------------|-----------------|
| Baseline | 1,100,048,384 | 1,100,048,384 | No              |
| Tied     | 1,034,512,384 | 1,034,512,384 | Yes             |
Saving: 65,536,000 parameters (= V × D = 32,000 × 2,048).
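To put that in memory terms, here is a rough back-of-the-envelope estimate, assuming 2 bytes per parameter for fp16/bf16 weights (use 4 for float32):

saved_params = 32_000 * 2_048   # V x D
bytes_per_param = 2             # fp16 / bf16; 4 for float32
print(f"~{saved_params * bytes_per_param / 2**20:.0f} MiB saved")  # ~125 MiB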
That is why tied embeddings are the default in many architectures, even if some modern open-source models leave them disabled. In practice, teams may keep the embeddings untied to allocate capacity differently and retain architectural flexibility.