Mixture of Agents: Ensembling multiple LLMs like an ensemble classifier

MoA has several LLMs propose answers and another LLM synthesize them: like ensemble learning, but with models instead of decision trees.

Ensemble learning is a classic technique: instead of one decision tree, you train 100 trees and let them vote. Why not do the same with LLMs? Instead of asking GPT-4 alone, you ask GPT-4, Claude, and Llama at the same time, then let one model synthesize the best answer. That is Mixture of Agents (MoA).

MoA is not a new model. It is an architecture pattern: a way of arranging multiple LLMs so that they collaborate. The result: on AlpacaEval 2.0, MoA using only open-source models reaches a 65.1% length-controlled win rate, ahead of GPT-4 Omni (57.5%).

The problem

Single-model approaches have a fundamental problem, in the spirit of the no-free-lunch theorem: no LLM is good at everything. GPT-4 may be strong at reasoning but over-refuse some prompts. Claude writes clean code but may fall short in creative writing. Llama is open but hallucinates more.

Experiments in the MoA paper show that when an LLM is shown answers from its "competitors", even imperfect ones, its output quality goes up. The authors call this the collaborativeness of LLMs: a model can use peer outputs much like in-context examples.

The question is: how do you orchestrate multiple models systematically, rather than copy-pasting by hand?

Core idea

MoA splits the pipeline into layers. Each layer is a group of LLM agents, called "proposers", that run in parallel on the same prompt. The outputs of one layer become context for the next layer.

Architecture

Layer 1 (Proposers):
├── Agent A (Qwen-72B)    → Response A
├── Agent B (Llama-70B)   → Response B
└── Agent C (Mixtral-8x7B) → Response C

Layer 2 (Aggregator):
└── Agent D receives: [Prompt + Response A + Response B + Response C]
    → Synthesized Output

Layer 3 (optional): Final synthesis

The mechanism works like iterative refinement. Layer 1 generates diverse perspectives. Layer 2 sees all of those perspectives and synthesizes them. At each layer, every agent is "reinforced" by the collective intelligence of the previous layer.
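The layered flow above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's or any vendor's implementation: a "model" is assumed to be any callable from prompt string to response string, and `with_references`, `run_layer`, `mixture_of_agents`, `make_stub`, and `stub_aggregator` are hypothetical names introduced here. The stubs stand in for real API clients.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Assumption: a "model" is any callable mapping a prompt to a response.
# In a real system each one would wrap an API client.
Model = Callable[[str], str]


def with_references(prompt: str, references: List[str]) -> str:
    """Append the previous layer's responses to the original prompt."""
    if not references:
        return prompt
    numbered = "\n".join(f"[{i}] {r}" for i, r in enumerate(references, start=1))
    return f"{prompt}\n\nResponses from the previous layer:\n{numbered}"


def run_layer(models: List[Model], prompt: str) -> List[str]:
    """Run every model of one layer on the same prompt, in parallel."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        # pool.map preserves the order of `models` in its results.
        return list(pool.map(lambda m: m(prompt), models))


def mixture_of_agents(
    prompt: str, proposer_layers: List[List[Model]], aggregator: Model
) -> str:
    """Feed each layer the previous layer's outputs, then aggregate."""
    references: List[str] = []
    for layer in proposer_layers:
        references = run_layer(layer, with_references(prompt, references))
    return aggregator(with_references(prompt, references))


def make_stub(name: str) -> Model:
    """Hypothetical stand-in for a real proposer model."""
    return lambda prompt: f"{name} answered ({len(prompt)} chars of input)"


def stub_aggregator(prompt: str) -> str:
    """Hypothetical aggregator: just counts the references it was given."""
    n_refs = sum(1 for line in prompt.splitlines() if line.startswith("["))
    return f"synthesis of {n_refs} responses"


layer_1 = [make_stub("A"), make_stub("B"), make_stub("C")]
print(mixture_of_agents("What is MoA?", [layer_1], stub_aggregator))
```

Adding a second proposer layer is just another list inside `proposer_layers`; each later layer sees the original prompt plus the previous layer's responses.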

Why it works

1. Bayesian Model Averaging in-context

When the aggregator sees several responses, it implicitly does something like Bayesian model averaging. Consensus suggests an answer is likely correct; disagreement signals that closer scrutiny is needed. The aggregator also learns from wrong answers, which act as negative examples ("this approach does not work").

2. Error Cancellation via Diversity

Suppose Agent A hallucinates an entity name, Agent B gets code syntax wrong, and Agent C misses context. As long as these errors are uncorrelated, they tend to cancel out when the aggregator combines the responses: the signal remains and the noise averages away.

3. Self-MoA: Same Model, Different Seeds

An interesting finding from the Self-MoA paper (2025): using the same model with different random seeds also works. Why? Because LLM sampling is stochastic. Each generation is a "draw" from the model's output distribution, and an ensemble of K draws from the same distribution still reduces variance.
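The variance argument can be illustrated with a toy simulation. This is only a sketch under a strong simplifying assumption: each sampled response is reduced to a scalar "quality score" drawn from a Gaussian, and the ensemble is modeled as a plain average. A real aggregator does not average scores, so read the numbers as intuition, not as a prediction. `sample_quality`, `mean_of_k`, and `spread` are hypothetical helpers.

```python
import random
import statistics


def sample_quality(rng: random.Random) -> float:
    """Assumed stand-in for the quality of one sampled response."""
    return rng.gauss(0.7, 0.1)


def mean_of_k(rng: random.Random, k: int) -> float:
    """Average quality of an ensemble of k independent draws."""
    return statistics.fmean([sample_quality(rng) for _ in range(k)])


def spread(k: int, trials: int = 5000, seed: int = 0) -> float:
    """Standard deviation of the ensemble mean across many trials."""
    rng = random.Random(seed)
    return statistics.pstdev([mean_of_k(rng, k) for _ in range(trials)])


for k in (1, 4, 16):
    print(f"K={k:2d}  std of ensemble mean = {spread(k):.3f}")
```

For independent draws the standard deviation of the mean shrinks roughly as 1/sqrt(K). Note that averaging reduces variance only: any bias shared by every draw from the same model stays in place.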

Key Numbers

Configuration	AlpacaEval 2.0 LC Win Rate
GPT-4 Omni (single)	57.5%
MoA (open-source models only)	65.1%
Self-MoA (same model, 4 samples)	+6.6% over heterogeneous MoA

That's it. The idea is simple: several models propose, one model aggregates. Layers can be stacked. Diversity drives quality.

Why it works: deeper intuition

The Recognition vs Generation Asymmetry

LLMs are often better at judging than at generating. Give GPT-4 an answer and it spots flaws easily. Give it a blank page and ask it to "write something brilliant", and the task is much harder.

MoA exploits this asymmetry: proposers generate (the hard task), while the aggregator judges and synthesizes (the easier task). The aggregator does not need much creativity; it mainly needs recognition.

In-Context Learning from Peers

When the aggregator receives N responses from the proposers, it treats them like few-shot examples. Each response is a demonstration of one way to solve the problem. The aggregator learns:

Common patterns = likely correct
Unique claims = need verification
Structured reasoning = follow structure
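One way to hand peer responses to the aggregator is a prompt template that spells out the three heuristics above. The wording of `AGGREGATOR_INSTRUCTIONS` and the `build_aggregator_prompt` helper are illustrative assumptions, not the prompt used in the MoA paper or in any production API.

```python
from typing import List

# Hypothetical instruction text; tune the wording for your own aggregator.
AGGREGATOR_INSTRUCTIONS = (
    "You are given several candidate responses to the same question.\n"
    "Synthesize them into one high-quality answer.\n"
    "- Points that most responses agree on are likely correct.\n"
    "- Claims made by only one response need verification before you keep them.\n"
    "- If a response uses clear step-by-step reasoning, follow that structure.\n"
    "Some responses may be wrong or biased; do not copy them blindly."
)


def build_aggregator_prompt(question: str, responses: List[str]) -> str:
    """Assemble instructions, the question, and numbered peer responses."""
    if not responses:
        raise ValueError("the aggregator needs at least one proposer response")
    blocks = [
        f"Response {i}:\n{text.strip()}"
        for i, text in enumerate(responses, start=1)
    ]
    parts = [AGGREGATOR_INSTRUCTIONS, f"Question:\n{question.strip()}"]
    parts.extend(blocks)
    parts.append("Final answer:")
    return "\n\n".join(parts)


print(build_aggregator_prompt(
    "What is the capital of France?",
    ["Paris.", "The capital is Paris.", "Lyon, I think."],
))
```

Numbering the responses instead of naming the source models is a deliberate choice in this sketch: it keeps the aggregator from favoring a response because of which model wrote it.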

Mathematical Intuition

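Suppose each proposer has accuracy $p$. If errors are independent, the probability that at least one proposer is correct is $1 - (1-p)^n$. With $p = 0.7$ and $n = 4$: $1 - 0.3^4 = 0.9919 \approx 99.2\%$. This is an upper bound on what the aggregator can achieve, since it still has to recognize the correct answer among the candidates.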

Of course, in practice errors are not fully independent (shared training data, similar architectures). But higher diversity means lower correlation, which means a stronger ensemble effect.
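The arithmetic above, and the effect of correlated errors, can be checked with a short script. The correlation model is an assumption made for illustration only: with probability `rho` all proposers share a single outcome (all right or all wrong together), otherwise they succeed or fail independently. `p_at_least_one_correct` and `simulate` are hypothetical helper names.

```python
import random


def p_at_least_one_correct(p: float, n: int) -> float:
    """Closed form for independent proposers: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n


def simulate(p: float, n: int, rho: float,
             trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of P(at least one proposer is correct).

    Assumed toy model: with probability rho the proposers all share one
    outcome; otherwise each is correct independently with probability p.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if rng.random() < rho:
            any_correct = rng.random() < p
        else:
            any_correct = any(rng.random() < p for _ in range(n))
        hits += any_correct
    return hits / trials


print(f"closed form, independent: {p_at_least_one_correct(0.7, 4):.4f}")
for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho:.1f}  simulated: {simulate(0.7, 4, rho):.3f}")
```

Under this toy model the probability is `rho * p + (1 - rho) * (1 - (1 - p)^n)`, so fully correlated proposers (`rho = 1`) are no better than a single model at accuracy `p`.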

Practical implications

Benchmark Results

Metric	Single Model	MoA	Improvement
AlpacaEval 2.0	57.5% (GPT-4)	65.1%	+7.6 pts
MT-Bench Reasoning	7.2	7.8	+0.6
Cost per query	1×	4-8×	—

Who Uses It

Together AI: Production MoA API ("50 lines of code" implementation)
RAG systems: MoA for multi-perspective retrieval synthesis
Research teams: Synthetic data generation with high-quality outputs

Trade-offs

Aspect	Single Model	MoA
Quality	Baseline	Higher
Latency	1 LLM call	4-8 LLM calls
Cost	1×	4-8×
Use case	Real-time chat	Batch tasks, evaluation

Latency bottleneck: MoA is a poor fit for real-time chat. Each query needs 4-8 LLM calls, and even when the proposers within a layer run in parallel, the aggregator still has to wait for all of them. Best use cases: synthetic data generation, offline evaluation, and high-stakes decisions where quality matters more than speed.
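A back-of-the-envelope model makes the latency and cost trade-off concrete. It rests on two simplifying assumptions: every call costs the same (in practice the aggregator's prompt is longer, so its call is more expensive), and proposers within a layer either all run in parallel or all run one after another. `moa_cost_and_latency` is a hypothetical helper and the latencies are made-up numbers.

```python
from typing import List, Tuple


def moa_cost_and_latency(
    layers: List[List[float]], aggregator_latency: float
) -> Tuple[int, float, float]:
    """Return (number of calls, parallel latency, sequential latency).

    layers[i] holds the per-call latencies in seconds of the proposers
    in layer i; every layer is assumed to be non-empty.
    """
    calls = sum(len(layer) for layer in layers) + 1
    # In parallel, a layer is as slow as its slowest proposer.
    parallel = sum(max(layer) for layer in layers) + aggregator_latency
    # Sequentially, every call adds its full latency.
    sequential = sum(sum(layer) for layer in layers) + aggregator_latency
    return calls, parallel, sequential


calls, parallel, sequential = moa_cost_and_latency([[2.0, 3.0, 2.5]], 4.0)
print(f"{calls} calls, {parallel:.1f}s parallel, {sequential:.1f}s sequential")
```

With three proposers and one aggregator this gives 4 calls (roughly 4× cost), 7.0 s when the proposers run in parallel and 11.5 s when they run sequentially, against 4.0 s for the aggregator model alone.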

Garbage in, garbage out: if all the proposers are weak (for example, several 7B models), the aggregator amplifies noise rather than signal.

Context window limits: the aggregator has to fit N responses into its context. With 4 proposers × 1000 tokens each, that is 4000 tokens for the responses alone. Scaling further requires compression layers.
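The token arithmetic can be turned into a simple budget check. This is a sketch: the 8192-token window, 200-token prompt, and 1000-token output reserve below are assumed example values, and `tokens_for_responses` and `max_tokens_per_response` are hypothetical helpers.

```python
def tokens_for_responses(n_proposers: int, tokens_per_response: int) -> int:
    """Tokens the aggregator spends just on reading proposer outputs."""
    return n_proposers * tokens_per_response


def max_tokens_per_response(
    context_window: int, prompt_tokens: int, output_reserve: int, n_proposers: int
) -> int:
    """Largest per-proposer response length that still fits the aggregator.

    output_reserve is the room kept free for the aggregator's own answer.
    """
    if n_proposers <= 0:
        raise ValueError("need at least one proposer")
    available = context_window - prompt_tokens - output_reserve
    if available <= 0:
        raise ValueError("no room left for proposer responses")
    return available // n_proposers


print(tokens_for_responses(4, 1000))                 # 4000, as in the text
print(max_tokens_per_response(8192, 200, 1000, 4))   # per-proposer cap
```

A cap like this can be passed to each proposer as its maximum output length, which is a cruder alternative to a dedicated compression layer.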

Self-MoA Insight

An important finding from 2025 research: a homogeneous ensemble (same model, different samples) can outperform a heterogeneous ensemble. This is counter-intuitive, but it makes sense:

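Heterogeneous models bring different biases, but also different output styles and different readings of the context, so the aggregator has to "translate" between formats. Mixing in weaker models also lowers the average quality of the proposals, which the Self-MoA paper identifies as a key cost of mixing.
Same model: same style, same "language", so aggregation is seamless.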

Practical implication: instead of paying for GPT-4 + Claude + Llama API calls, you can sample from GPT-4 four times at different temperatures.
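A Self-MoA loop is structurally simpler than the heterogeneous version: one sampler called several times, then one aggregation step. The sketch below assumes a sampler with the signature `(prompt, temperature, seed) -> response`, which is an invented interface rather than any provider's API. `stub_sample` is a deterministic stand-in, and `stub_aggregate` uses majority voting only as a placeholder for an LLM aggregator.

```python
from collections import Counter
from typing import Callable, List

# Assumed interfaces, introduced for this example only.
Sampler = Callable[[str, float, int], str]
Aggregator = Callable[[str, List[str]], str]


def self_moa(
    prompt: str,
    sample: Sampler,
    aggregate: Aggregator,
    n_samples: int = 4,
    temperature: float = 0.7,
) -> str:
    """Draw n_samples responses from one model, then aggregate them."""
    candidates = [sample(prompt, temperature, seed) for seed in range(n_samples)]
    return aggregate(prompt, candidates)


def stub_sample(prompt: str, temperature: float, seed: int) -> str:
    """Deterministic stand-in for a stochastic model: wrong on every 4th seed."""
    return "Lyon" if seed % 4 == 3 else "Paris"


def stub_aggregate(prompt: str, candidates: List[str]) -> str:
    """Placeholder aggregator: majority vote instead of an LLM synthesis."""
    return Counter(candidates).most_common(1)[0][0]


print(self_moa("What is the capital of France?", stub_sample, stub_aggregate))
```

To vary the temperature per sample instead of the seed, the list comprehension in `self_moa` could iterate over a list of temperatures; which of the two works better is an empirical question this sketch does not answer.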

Going deeper

Original paper

Mixture-of-Agents Enhances Large Language Model Capabilities (Wang et al., 2024), arXiv:2406.04692: the original MoA architecture and AlpacaEval benchmarks

Related articles on TroiSinh

Same cluster

Agentic AI & Tool Use: LLMs call functions and plan multi-step tasks. MoA is a simple form of agent orchestration.
Multi-Agent Frameworks: AutoGen and CrewAI are frameworks for building complex multi-agent systems. MoA is a simpler pattern that is easier to implement.
LLM-as-Judge: using an LLM to evaluate LLMs. The MoA aggregator is in effect performing LLM-as-Judge on peer outputs.

Read next

RAG: Retrieval-Augmented Generation. MoA can be combined with RAG: the retrievers act as "proposers" and the LLM is the aggregator.
Mixture of Experts: MoE is an architecture in which a single model routes tokens to different expert weights. MoA routes queries to entirely different models.

External resources

Together AI MoA Guide — docs.together.ai/docs/mixture-of-agents — Implementation "50 lines of code"
Self-MoA Paper — arXiv:2502.00674 — Homogeneous ensemble analysis
