Minimal explanation of tied embeddings in language models with a compact real-model demo.
Published September 17, 2025
Tied embeddings (also known as weight tying) are a small but powerful trick in language models: the same matrix is used both for the input token embedding and for the output softmax projection. This reduces the parameter count and often improves perplexity.
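In PyTorch terms, tying is just pointing the output projection at the embedding matrix. Here is a minimal sketch; the module and dimensions are made up for illustration, not taken from any real model:

import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """Toy decoder-only LM skeleton, just to show where the tying happens."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)              # input embedding, shape (V, D)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # output softmax layer, weight shape (V, D)
        # Weight tying: the output head reuses the embedding matrix.
        self.lm_head.weight = self.embed.weight

    def forward(self, token_ids):
        h = self.embed(token_ids)   # (batch, seq, D); a real model would run transformer blocks here
        return self.lm_head(h)      # logits over the vocabulary, (batch, seq, V)

model = TinyLM()
print(model.embed.weight is model.lm_head.weight)  # True: one matrix, stored once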
The trick was introduced in 2017 (Press & Wolf; Inan et al.), and it was already applied in the original Attention Is All You Need Transformer.
Many popular models, such as GPT-2 and BERT, also use tied embeddings. Not all open-source LLMs adopt the practice, however: LLaMA and Mistral, for example, ship with tie_word_embeddings = False in their reference configurations.
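You can check this directly from a model's Hugging Face config without downloading the weights. A quick sketch, with the model IDs chosen only as examples (access requirements may vary):

from transformers import AutoConfig

# tie_word_embeddings in the config tells you whether a model ties its embeddings.
for model_id in ["gpt2", "mistralai/Mistral-7B-v0.1", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"]:
    cfg = AutoConfig.from_pretrained(model_id)
    print(model_id, "->", getattr(cfg, "tie_word_embeddings", None))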
A real example
To see tied embeddings in action, let’s look at TinyLlama 1.1B.
It is small enough to load on a laptop, but large enough that parameter savings are obvious.
First, we load the model as-is:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def count_params(model):
    return sum(p.numel() for p in model.parameters())

def count_unique_params(model):
    # count each underlying storage only once (handles shared/tied weights)
    seen = set()
    total = 0
    for p in model.parameters():
        ptr = p.data_ptr()
        if ptr not in seen:
            seen.add(ptr)
            total += p.numel()
    return total

def embeddings_share_storage(model):
    # True when the input embedding and the output head point at the same tensor storage
    out = model.get_output_embeddings()
    inp = model.get_input_embeddings()
    return out is not None and inp is not None and out.weight.data_ptr() == inp.weight.data_ptr()

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
baseline = AutoModelForCausalLM.from_pretrained(model_id)

print("Baseline params:", count_params(baseline))
print("Unique params :", count_unique_params(baseline))
print("Embeddings tied?", embeddings_share_storage(baseline))
By default the model loads with separate weight matrices for the input embeddings and the output head, so the large vocabulary matrix conceptually exists twice. Counting the saving takes a little care: when two modules share the same Parameter, PyTorch's parameters() iterator already yields it only once, so the plain total can quietly absorb the sharing. The unique count, keyed on storage pointers, makes the saving explicit either way.
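A toy illustration of that counting subtlety (the shapes here are arbitrary):

import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(10, 4)
        self.head = nn.Linear(4, 10, bias=False)
        self.head.weight = self.emb.weight  # share one Parameter between both modules

toy = Toy()
# parameters() yields the shared Parameter only once ...
print(sum(p.numel() for p in toy.parameters()))           # 40
# ... while summing per-module weights counts the matrix twice.
print(toy.emb.weight.numel() + toy.head.weight.numel())   # 80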
Let’s now tie the embeddings and check the parameter counts again.
tied = AutoModelForCausalLM.from_pretrained(model_id)

# tie input and output embeddings: point the output head at the input embedding matrix
tied.get_output_embeddings().weight = tied.get_input_embeddings().weight
tied.config.tie_word_embeddings = True  # keep the config consistent with the new sharing
if hasattr(tied, "tie_weights"):
    tied.tie_weights()

print("Tied params :", count_params(tied))
print("Unique params :", count_unique_params(tied))
print("Embeddings tied?", embeddings_share_storage(tied))
Now the input and output embeddings share the same storage.
The unique parameter count drops by exactly V × D, where V is the vocabulary size and D the embedding dimension.
In TinyLlama this means a saving of about 65 million parameters, one full copy of the embedding matrix. Note that overwriting the output head of a model that was pretrained with untied weights does change its predictions; the saving comes truly for free only when a model is trained with tied weights from the start. Let's double-check that the arithmetic holds:
V, D = baseline.get_input_embeddings().weight.shape
expected = V * D
observed = count_unique_params(baseline) - count_unique_params(tied)

print(f"Vocabulary size V={V}, embedding dim D={D}")
print(f"Expected saving : {expected:,}")
print(f"Observed saving : {observed:,}")
This simple one-line change makes the model leaner, saving memory and on-disk checkpoint size. The table below summarises the effect on parameter counts:
| Model    | Total params  | Unique params | Embeddings tied |
|----------|---------------|---------------|-----------------|
| Baseline | 1,100,048,384 | 1,100,048,384 | No              |
| Tied     | 1,034,512,384 | 1,034,512,384 | Yes             |
Saving: 65,536,000 parameters (= V × D = 32,000 × 2,048).
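To put that in memory terms, here is a rough back-of-the-envelope estimate, assuming 2 bytes per parameter for fp16/bf16 weights (use 4 for float32):

saved_params = 32_000 * 2_048   # V x D
bytes_per_param = 2             # fp16 / bf16; 4 for float32
print(f"~{saved_params * bytes_per_param / 2**20:.0f} MiB saved")  # ~125 MiB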
That is why tied embeddings are the default in many architectures, even if some modern open-source models leave them disabled. In practice, teams may keep the embeddings untied to allocate capacity differently and retain architectural flexibility.