Cross-Attention

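Motivation: Why Do We Need Cross-Attention?
In the Transformer encoder-decoder architecture:

The encoder processes the input sequence (e.g., the source text)
The decoder generates the output sequence (e.g., the translation)
Question: how does the decoder "consult" the encoder's information?

→ This is what cross-attention is for
Reference: Attention Is All You Need (Vaswani et al., 2017)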

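The Attention Mechanism: Q / K / V

Every attention mechanism is built around three kinds of vectors:

Symbol	Full name	Meaning
Q	Query	What am I looking for?
K	Key	The label each position offers for matching
V	Value	The content that is actually retrieved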

Formula:

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

Dot product of Q and K → measures relevance (scaled by √d_k) → softmax turns the scores into weights → weighted sum of V

Reference: Attention Is All You Need (Vaswani et al., 2017)
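
The formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation; the function names (softmax, attention) and the toy shapes are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) relevance scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # (n_q, d_v), (n_q, n_k)

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries, d_k = 4 (toy sizes)
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 2))   # 5 values, d_v = 2
out, weights = attention(Q, K, V)
print(out.shape, weights.shape)
```

Each row of weights says how much one query draws from each of the 5 positions; out is the corresponding weighted mix of the value vectors.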

Self-Attention

Q, K, and V all come from the same sequence

Input sequence X
    ↓
Q = X·Wq
K = X·Wk     ← all three are computed from the same X
V = X·Wv
    ↓
Attention(Q, K, V)

Effect: every token in the sequence can "see" the other tokens in the same sequence

LLMs (GPT, Claude, etc.) rely mainly on self-attention to process the input prompt
Reference: Understanding and Coding Self-Attention in LLMs (Sebastian Raschka)
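
As a rough sketch of the diagram above, the snippet below projects one sequence X into Q, K, and V and attends over it. The weight matrices are random stand-ins for learned parameters, and the sizes (n_tokens, d_model, d_k) are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Q, K, and V are three different projections of the SAME sequence X.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 6          # toy sizes
X = rng.normal(size=(n_tokens, d_model))  # one embedding per token
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(weights.shape)  # one row per token, one column per token
```

The weight matrix is square because queries and keys index the same tokens.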

Cross-Attention

Q comes from one sequence; K and V come from another

Sequence A (e.g., the decoder's hidden states)  →  Q
Sequence B (e.g., the encoder's output)         →  K, V
    ↓
Attention(Q, K, V)

→ A "queries" B for information
→ Q asks the question; K and V, drawn from the other source, supply the answer

If A = B, cross-attention reduces to self-attention
Reference: Cross-Attention vs Self-Attention Explained (AIML.com)
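
A minimal sketch of the two-sequence case, assuming random stand-in weights and toy sizes. The function name cross_attention is hypothetical; passing the same array for A and B gives the self-attention special case mentioned above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(A, B, Wq, Wk, Wv):
    Q = A @ Wq                 # queries come from sequence A
    K, V = B @ Wk, B @ Wv      # keys and values come from sequence B
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model, d_k = 8, 6
A = rng.normal(size=(3, d_model))   # e.g. 3 decoder hidden states
B = rng.normal(size=(5, d_model))   # e.g. 5 encoder outputs
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = cross_attention(A, B, Wq, Wk, Wv)
# Feeding the same sequence on both sides is ordinary self-attention.
self_out, self_weights = cross_attention(A, A, Wq, Wk, Wv)
print(weights.shape, self_weights.shape)
```

The output always has one row per element of A, while the number of columns in the weight matrix follows the length of B.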

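Self-Attention vs. Cross-Attention

	Self-Attention	Cross-Attention
Source of Q	Same sequence	Sequence A
Source of K/V	Same sequence	Sequence B
Function	Modeling relationships within a sequence	Fusing information across sequences
Position in the Transformer	Encoder and decoder	Decoder only
Typical applications	LLM language understanding	Translation, image generation

Reference: Attention Is All You Need (Vaswani et al., 2017)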

Where It Sits in the Transformer Decoder

The original Transformer (seq2seq translation task):

Encoder
  ├─ Self-Attention (attends to its own sequence)
  └─ Feed Forward

Decoder
  ├─ Masked Self-Attention (attends to its own sequence, but only to the current and earlier positions)
  ├─ Cross-Attention ← Q from the decoder, K/V from the encoder
  └─ Feed Forward

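Cross-attention lets the decoder consult the entire encoder output at every generation step
→ In machine translation, every generated word can "look back" at the source text

Reference: Attention Is All You Need (Vaswani et al., 2017)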
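
The "only looks left" rule of masked self-attention can be sketched with a causal mask: scores for future positions are set to negative infinity before the softmax, so their weights become zero. The function name and sizes below are assumptions for illustration. Cross-attention applies no such mask over encoder positions, since the whole source sequence is already known.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_weights(Q, K):
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # True above the diagonal marks "future" positions to hide.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 6))   # 4 decoder positions (toy sizes)
K = rng.normal(size=(4, 6))
w = causal_attention_weights(Q, K)
print(np.round(w, 2))  # lower-triangular: row i only uses positions 0..i
```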

Application to Image Generation

Cross-Attention in Stable Diffusion

Text-to-image models (such as Stable Diffusion) use cross-attention to inject the text prompt into the image generation process:

Text prompt
    ↓ Text Encoder (CLIP)
Text token embeddings → K, V

Latent image features (inside the U-Net) → Q

Cross-Attention(Q=latent, K=text, V=text)
    ↓
Each image region "attends to" the relevant text tokens

→ Q comes from the image, K/V come from the text → the prompt guides image generation

Reference: Towards Understanding Cross and Self-Attention in Stable Diffusion (Liu et al., 2024)
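
To make the shapes concrete, here is a toy sketch of the flow above: a latent feature map is flattened so every spatial location becomes one query, and the text token embeddings supply the keys and values. All sizes and weight matrices are made-up stand-ins, not the real Stable Diffusion dimensions or parameters, and the real U-Net uses multi-head attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
H, W, C = 8, 8, 16        # toy latent feature map (hypothetical sizes)
n_text, d_text = 7, 12    # toy token count and text embedding width
d_attn = 10               # toy attention width

latent = rng.normal(size=(H, W, C))       # stand-in for U-Net features
text = rng.normal(size=(n_text, d_text))  # stand-in for text encoder output

Wq = rng.normal(size=(C, d_attn))
Wk = rng.normal(size=(d_text, d_attn))
Wv = rng.normal(size=(d_text, C))

Q = latent.reshape(H * W, C) @ Wq   # one query per image location
K = text @ Wk                       # one key per text token
V = text @ Wv                       # one value per text token

weights = softmax(Q @ K.T / np.sqrt(d_attn))   # (H*W, n_text)
out = (weights @ V).reshape(H, W, C)           # text-conditioned features
print(weights.shape, out.shape)
```

Column j of weights, reshaped to H × W, is the cross-attention map of text token j.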

What Can a Cross-Attention Map Do?

A cross-attention map records which text token influenced which region of the image

prompt: "a red cat sitting on a chair"

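cross-attention map of "cat" → the cat's location in the image lights up
cross-attention map of "chair" → the chair's location lights up

DAAM (Diffusion Attentive Attribution Maps):
Upsamples and aggregates the cross-attention word-pixel scores
→ Produces pixel-level attribution maps that explain where each word influenced the Stable Diffusion output

Reference: What the DAAM: Interpreting Stable Diffusion Using Cross Attention (Tang et al., 2022)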
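
The "upsample and aggregate" idea can be illustrated roughly as follows. This is not the DAAM implementation: the attention weights are random stand-ins, the helper token_heatmap is hypothetical, and nearest-neighbor upsampling via np.kron is swapped in for simplicity. It only shows how per-token maps at different resolutions can be brought to one size and summed.

```python
import numpy as np

def token_heatmap(weight_maps, token_index, out_size):
    """Upsample one token's attention maps to out_size x out_size and sum them."""
    total = np.zeros((out_size, out_size))
    for weights, side in weight_maps:
        # Column token_index holds that token's score at every location.
        m = weights[:, token_index].reshape(side, side)
        factor = out_size // side
        # Nearest-neighbor upsampling: repeat each cell factor x factor times.
        total += np.kron(m, np.ones((factor, factor)))
    return total

rng = np.random.default_rng(0)
n_text = 7
weight_maps = []
for side in (4, 8):   # two toy resolutions, standing in for different layers
    w = rng.random(size=(side * side, n_text))
    w /= w.sum(axis=1, keepdims=True)   # rows sum to 1, like softmax output
    weight_maps.append((w, side))

heatmap = token_heatmap(weight_maps, token_index=2, out_size=16)
print(heatmap.shape)
```

High values in heatmap mark the regions where the chosen token received the most attention across the aggregated maps.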

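Research Significance of Cross-Attention

The analysis by Liu et al. (2024) found:

Cross-attention maps contain object attribution information
→ Modifying them tends to cause image-editing failures (object attributes interfere with one another)

Self-attention maps are what preserve geometric and shape details
→ For text-guided image editing, modifying self-attention works better

Cross-attention  → determines "which word influences which region" (semantic alignment)
Self-attention   → determines "image structure and shape" (geometric consistency)

Reference: Towards Understanding Cross and Self-Attention in Stable Diffusion (Liu et al., 2024)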

Summary

Self-Attention
  Q, K, and V are all projections of the same sequence
  → Modeling within a sequence (how LLMs understand language)

Cross-Attention
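  Q comes from sequence A; K/V come from sequence B
  → Cross-modal / cross-sequence information fusion
  → Lets the Transformer decoder consult the encoder
  → Lets image generation align the image with the text prompt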