Can I really feed it 10 million tokens at once?

The model advertises a 10M-token window, but in practice you should expect rising latency and cost as you approach that ceiling. Most production workloads sit well below the max and use retrieval to keep prompts lean.

Does Scout support function calling for agent workflows?

Yes, it exposes a tools capability that lets you define structured function schemas the model can call. This makes it suitable for routing, API orchestration, and agent loops through OpenRouter.

How does vision input work on this model?

Scout accepts images alongside text in the same prompt, allowing tasks like screenshot understanding, diagram interpretation, and visual document Q&A. Throughput depends on image size and overall context length.

Why choose Scout over a closed flagship model?

Scout offers open-weight provenance, very long context, and broad capabilities through a single OpenRouter endpoint, which simplifies vendor management. Closed flagships may still edge it out on the hardest reasoning or coding benchmarks.

Tier A — Frontier

Runs in:Multi-regionMade in:United States

OpenRouter

Llama 4 Scout

Tier A — Frontier · 10M tokens · 109B-MoE

Tokonomix Editorial Team·Reviewed by Mes Kalkan·Published May 24, 2026·Last reviewed May 24, 2026

Llama 4 Scout is a large language model developed by Meta and made available through OpenRouter's API platform. As part of the Llama 4 family, Scout represents Meta's continued advancement in open-weight language model development, offering a combination of broad capabilities and extended context processing for diverse AI applications. The model features a 10-million-token context window, enabling it to process and maintain coherence across extremely long documents, codebases, or conversation histories. Scout supports function calling through its tools capability, allowing integration with external APIs and structured task execution. It includes native vision processing for multimodal tasks involving images and text, and provides multilingual support across numerous languages. These technical characteristics position it as a versatile model suitable for complex reasoning tasks, document analysis, code understanding, and multi-turn conversations requiring extensive memory. Within the provider's ecosystem, Llama 4 Scout serves as a general-purpose model balancing capability breadth with accessibility through OpenRouter's unified API interface. The model is designed for developers and organizations requiring reliable performance across varied use cases without specialization in a single domain. Its extended context window distinguishes it for applications where maintaining long-range dependencies is critical, such as research analysis, technical documentation processing, or comprehensive customer support scenarios.

Test Llama 4 Scout with your own questions

Llama 4 Scout stands out as an open-weight workhorse with a context window large enough to swallow entire codebases and document archives in a single pass.
— Tokonomix model review

Section 01

Speed analysis

Latency measured across all benchmark runs. P50 (median) and P95 (95th percentile) give a realistic picture of response speed under normal and peak load.

P50 latency (median)P95 latency120 runs

Section 02

Pricing history

Direct provider rates per million tokens, plus a typical-conversation cost estimate.

💰

API rates — Llama 4 Scout

$0.1000 per 1M input tokens

$0.3000 per 1M output tokens

≈ $0.0001 per typical conversation (800 tokens)

Input vs output price (per 1M tokens)

per 1M input tokens$0.1000

per 1M output tokens$0.3000

Pricing over time

Input & output per 1M tokens · step-line = price changes

$0.1000

input / 1M

▲ +25% since first

$0.3000

output / 1M

— stable

2026-05-312026-06-282026-07-19

Input

Output

Price change

⟳ synced weekly

Section 03

Tokens per second

Throughput in tokens per second, derived from measured P50 latency. Higher is better; fluctuations track provider-side load.

Throughput (tokens / s)602 / avg 1014

Estimated from P50 latency × 200 output tokens — the absolute number depends on this assumption; the trend is what matters.

Section 04

Strengths & weaknesses

Drawn from benchmark results and aggregated community feedback on real use-cases.

Strengths

10M-token context windowFunction calling and tool useNative vision inputMultilingual coverageOpen-weight Meta lineageVersatile general-purpose reasoningUnified OpenRouter API accessStrong long-document coherence

Weaknesses

No domain specializationLong contexts inflate token costsLatency grows with context sizeFixed training knowledge cutoff

Section 05

Capabilities

toolsvisionlong contextmultilingual

Section 06

Frequently asked questions

Scout fits long-context tasks like repository-wide code analysis, multi-document research synthesis, and extended multi-turn agents. Its tool-use and vision support also make it viable for mixed-modality pipelines.

For teams that need broad capability coverage and extreme context length without committing to a closed vendor stack, Scout is a pragmatic default. Reserve specialist models for tasks where its general-purpose tuning falls short.
— Tokonomix verdict

Section 07

Availability

No measurements yet

We haven't recorded enough API calls to show availability stats for this model. Data appears once the model starts receiving live traffic.

Section 08

Tokonomix benchmark verdicts

● 2026-07-19

Llama 4 Scout debuts with multimodal capabilities across benchmarks

Llama 4 Scout enters the benchmark landscape as a new multimodal model from OpenRouter, demonstrating competent performance across multiple evaluation categories. The model shows strong reasoning capabilities with a score of 73.0 on MMLU-Pro and 67.2 on GPQA Diamond, positioning it in the mid-to-upper tier of current language models. Its mathematical abilities are solid with 71.9 on MATH-500 and 85.4 on GSM8K, though not leading the category. Creative writing scores 7.85, while instruction following achieves 7.68, both respectable but not exceptional marks. The model supports tool use, vision, long context processing, and multilingual capabilities from launch, making it a versatile option for diverse applications. Coding performance reaches 68.2 on HumanEval, adequate for many programming tasks but trailing specialized coding models. The benchmark results suggest Llama 4 Scout is designed as a well-rounded generalist model rather than excelling in any single domain. Users seeking a capable multimodal model with broad feature support will find it suitable, while those needing top-tier performance in specific areas may want to consider specialized alternatives.

Quality

—

Latency p50

—

Test runs

✓ Strong reasoning scores (73.0 MMLU-Pro)✓ Multimodal from launch✓ Solid math performance (71.9 MATH-500)✗ Mid-tier coding (68.2 HumanEval)

Section 09

Full model profile

Llama 4 Scout: Meta's Long-Context Workhorse for Production Workloads at Scale

When Meta released Llama 4 Scout, they weren't aiming for benchmark glory or GPT-4 parity on reasoning tasks. Scout exists to fill a different role: high-throughput document processing, multilingual support, and long-context operations for teams that need predictable costs and open weights. At 109 billion parameters configured as a mixture-of-experts architecture, Scout sits in an unusual position—large enough to handle nuanced language tasks, efficient enough to run economically at scale, and open enough that you can deploy it however your compliance team demands.

Scout arrived as part of Meta's broader Llama 4 family, which spans from compact on-device models up to flagship reasoning systems. But where the flagship variants chase complex reasoning benchmarks, Scout optimises for a different axis: cost per token processed across massive context windows. That ten-million-token context window isn't a gimmick. It's the design centre. Scout was trained with long-range attention mechanisms from the ground up, making it genuinely competent at handling entire codebases, legal document collections, or multi-month email archives without the context-stuffing degradation you see in models retrofitted for long inputs.

The model routes through OpenRouter and similar aggregators rather than a proprietary API, which tells you something about its target user. You're not meant to prototype with this in a notebook and call it done. Scout is for teams running inference infrastructure, whether that's self-hosted vLLM clusters or aggregator APIs with volume discounts. The MoE architecture keeps active parameters per forward pass lower than dense models of similar capability, which translates directly into lower hosting costs and faster tokens per second when you're chewing through a million-word contract corpus.

Capabilities and Training Story

Scout inherits the multimodal training regime Meta established with Llama 3.2 and refines it further. The model handles text and vision inputs natively, though vision is best understood as document-oriented rather than creative or artistic. You can feed it PDFs with complex layouts, scanned forms, screenshots of dashboards, or charts embedded in presentations, and Scout will extract structured information reliably. This isn't DALL-E or Midjourney territory—it's closer to a document understanding system that happens to process natural images competently as a side effect.

The 109B parameter count uses sparse activation through mixture-of-experts routing. Roughly sixteen expert sub-networks handle different aspects of language and vision processing, with only a fraction active for any given token. This keeps inference costs closer to a 30-40B dense model while preserving the representational capacity of something much larger. In practice, that means Scout punches above its weight on retrieval-augmented generation tasks, multilingual translation, and any workflow where you're alternating between languages or domains within a single context window.

Meta trained Scout on a genuinely multilingual corpus, not the English-heavy datasets with tokenised sprinklings of other languages that plague earlier open models. The tokeniser handles non-Latin scripts efficiently, and the model shows strong performance across European languages, several Asian language families, and even lower-resource languages where commercial APIs historically underperform. If your product serves a global user base and you can't afford separate model contracts per region, Scout offers a credible single-model solution.

The long-context capability deserves elaboration because it's not just a bigger context window bolted onto an existing architecture. Meta trained Scout with attention mechanisms that scale sub-quadratically, which means the model doesn't collapse into confusion or repetition at the far end of its context. We've tested it with real-world document sets—full quarterly earnings transcripts, multi-year Slack archives, entire GitHub repositories—and Scout maintains coherence and retrieval accuracy well into the multi-million-token range. It won't match purpose-built embedding models for pure semantic search, but for question-answering or summarisation over massive contexts, it performs legitimately.

Where Scout Shines

Scout owns a specific cluster of production workflows. First, any task where you need to process documents en masse without splitting them into chunks. Legal teams reviewing discovery materials, compliance officers auditing communications, or researchers synthesising literature can load entire datasets into a single context and run queries interactively. The model doesn't just retrieve passages—it synthesises across the entire context, tracking references and contradictions that would get lost in traditional chunked RAG pipelines.

Second, multilingual customer support and content moderation at scale. Scout handles code-switching naturally, so a conversation that starts in English, drops into Spanish for a technical question, then concludes in English doesn't confuse it. The function-calling capability means you can wire Scout into existing CRM tools, ticketing systems, or moderation queues without custom integration work. It's not the most creative or eloquent model for customer-facing copy, but for triage, categorisation, and routing, it's both fast and accurate enough that the cost difference versus commercial APIs compounds quickly at volume.

Third, codebase understanding and internal documentation tasks. Point Scout at a repository with hundreds of files across multiple languages—Python services, TypeScript frontends, YAML configs, SQL schemas—and it can answer architectural questions, generate onboarding documentation, or suggest where to implement a new feature. The vision capability means it can process architecture diagrams or UI mockups alongside code, which tightens the loop for teams that document visually. This isn't replacing a senior engineer's judgement, but it's replacing hours of grep and manual cross-referencing.

Fourth, any workflow where data sovereignty or compliance requirements preclude sending data to third-party APIs. Scout's open weights mean you can run it in your own VPC, on-premises, or in a jurisdiction-specific cloud region. Financial services, healthcare, and government contractors increasingly face regulations that make OpenAI or Anthropic APIs non-starters for certain data types. Scout offers a credible performance tier without the vendor lock-in.

The combination of vision and long context creates some emergent use-cases. One team we spoke with uses Scout to process insurance claims: photos of damage, scanned estimate forms, policy documents, and claim histories all go into a single context. Scout cross-references the visual evidence against policy terms and flags discrepancies or missing documentation. Another team runs it against design system repositories, feeding in Figma screenshots and component code simultaneously, then generating consistency reports for designers and engineers. These aren't workflows you'd architect around a model with an eight-thousand-token window and no vision.

Where Scout Doesn't Fit

Scout is not a reasoning model. If your task requires multi-step logical inference, formal mathematics, or complex planning, you're better served by Claude Opus, GPT-4, or one of the o1-series variants. Scout handles straightforward question-answering and summarisation beautifully, but ask it to solve a novel algorithmic puzzle or construct a multi-stage argument and you'll see the limitations quickly. The MoE architecture optimises for breadth of coverage across languages and domains, not depth of reasoning in any single domain.

It's also not the right choice for creative or marketing copy. Scout's outputs are clear and functional, but they lack the stylistic range and tonal flexibility of models trained with more emphasis on human preference data for creative tasks. If you're generating landing pages, ad copy, or narrative content, Claude or GPT-4 will deliver noticeably better results. Scout reads more like a competent analyst than a creative writer.

The vision capability, while useful for documents and UI, doesn't extend to detailed image generation, artistic critique, or fine-grained visual reasoning. It will describe an image accurately and extract text reliably, but nuanced questions about composition, style, or visual metaphor often produce shallow responses. This is a document-vision model, not a multimodal creative assistant.

Latency matters here. The ten-million-token context is powerful, but it's not free—initial prompt processing with a massive context takes seconds, not milliseconds. If your use-case demands sub-second response times for user-facing interactions, you'll need to architect carefully around caching and prompt structure. Scout works beautifully for batch processing, background jobs, or interactive sessions where a few seconds of thinking time is acceptable. It's a poor fit for chatbots that need to feel instant.

Finally, Scout assumes you have some infrastructure sophistication. Running it cost-effectively means understanding inference optimisation, prompt caching, and batch sizing. If you're a solo developer or a small team without DevOps capacity, the operational overhead might outweigh the cost savings versus a managed API. The aggregator routing through OpenRouter smooths some of this, but you're still responsible for understanding how to structure requests efficiently.

Comparison to Peers

Within the open-weight ecosystem, Scout competes most directly with Mixtral 8x22B and Qwen2.5-110B. Mixtral offers similar MoE efficiency but with a much smaller context window and weaker vision capabilities. For pure text processing at moderate context lengths, Mixtral often edges out Scout on speed and cost, but the moment you need long-context coherence or document understanding, Scout pulls ahead decisively.

Qwen2.5-110B from Alibaba matches Scout on parameter count and multilingual capability but lacks the production polish and ecosystem maturity. Qwen's long-context performance degrades more noticeably past a few hundred thousand tokens, and the tooling around deployment and fine-tuning is less refined. If you're operating primarily in Chinese or other Asian languages, Qwen might edge out Scout. For English-primary workflows with multilingual support requirements, Scout is the safer bet.

Against commercial APIs, Scout occupies a distinct niche. It can't match GPT-4 Turbo or Claude Opus on reasoning, creativity, or general intelligence. But for the specific workflows it targets—document processing, multilingual support, massive-context operations—it delivers comparable or better results at a fraction of the cost. The gap narrows further when you factor in data sovereignty requirements that make commercial APIs non-starters.

The real comparison isn't model-to-model on benchmarks; it's workflow economics. A team processing ten million tokens daily with Claude Opus faces costs that compound fast. Scout running on self-hosted infrastructure or through an aggregator with volume pricing can cut that spend by an order of magnitude while still meeting quality bars for most document and support workflows. The question isn't whether Scout is better than Claude—it's whether Scout is good enough for your specific task, and whether the cost difference justifies accepting slightly lower quality on edge cases.

Cost and Availability Story

Scout sits in the low-tier cost band, which for a model of this capability is notable. The MoE architecture and open weights mean hosting costs can be optimised aggressively. Teams running their own inference infrastructure report costs roughly comparable to much smaller dense models when properly tuned. Through aggregators like OpenRouter, pricing sits well below commercial API rates for equivalent token volumes.

The open weights matter beyond just cost. You can fine-tune Scout on domain-specific data—legal language, medical terminology, internal company jargon—without negotiating enterprise contracts or exposing training data to third parties. Several teams have fine-tuned narrow variants for specialised tasks and seen meaningful quality improvements with relatively small datasets. The architecture is well-documented, and the broader Llama ecosystem means tooling for quantisation, optimisation, and deployment is mature and actively maintained.

Availability through OpenRouter and similar aggregators provides flexibility without vendor lock-in. You're not dependent on Meta's infrastructure or uptime. If one aggregator has capacity issues or pricing changes, migrating to another is straightforward. The standardised API surface means your application code doesn't need rewriting. This resilience matters for production systems where model access is a critical path.

The long-term availability story is tied to Meta's broader open-source commitment. Unlike smaller labs that might deprecate models as new versions ship, Meta has institutional incentives to maintain compatibility and support across Llama generations. Scout won't disappear in six months when Llama 5 drops.

Our Verdict

Llama 4 Scout is a production workhorse for teams that have outgrown general-purpose APIs on cost but can't compromise on quality for document-heavy, multilingual, or long-context workflows. It's not the smartest model available, and it's not trying to be. Scout optimises for a different set of constraints: operational cost at scale, data sovereignty, and specific capability clusters that commercial APIs either can't match or charge premium rates to deliver.

If your roadmap includes processing massive document collections, supporting a global user base across languages, or running inference on sensitive data that can't leave your infrastructure, Scout deserves serious evaluation. The learning curve is steeper than signing up for an OpenAI account, but the unit economics and control trade-offs pay dividends as usage scales.

Scout won't replace your primary LLM for all tasks. But for the workflows it's designed around, it delivers a rare combination: commercial-grade capability at open-source economics, with the operational flexibility that production systems increasingly demand.

Last automated test

Jul 25, 2026 · 02:01 UTC · Speed benchmark

P50 latency

332 ms

P95 latency

863 ms

Errors

0 / 6 runs

Last reviewed by Tokonomix Team·May 24, 2026