Building a network optimization model used to take a team of analysts working together for days. Last month, I did it myself with Claude Opus in a couple of hours. But GPT-5.4 performed better on carrier agreement review, and the two models needed very different amounts of iteration on the optimization code.
Standard benchmarks — ARC-AGI 2, SWE-Bench, multimodal rankings — don't predict real operational performance. March 2026 brought simultaneous releases from OpenAI (GPT-5.4), Anthropic (Claude Opus 4.6), and Google (Gemini 2.5 Pro), but benchmark leadership didn't correlate with supply chain task success.
## Models Tested
Direct evaluation:
- Claude Opus 4.6 (Anthropic)
- GPT-5.4 (OpenAI)
- Perplexity
Research-based comparison:
- Gemini 2.5 Pro (Google)
- DeepSeek-V3
- Qwen3-235B (Alibaba)
- Llama 4 (Meta)
## Task 1: Demand Forecasting from SKU-Level Sales Data
**Winner: Claude Opus 4.6**
Claude excelled at writing analytical Python code for demand decomposition, SKU segmentation, and seasonal pattern extraction. The code ran correctly on first iteration with sound methodological choices. GPT-5.4 produced competent but formulaic output requiring more iteration. Perplexity's web-context integration added narrative value but sacrificed analytical precision.
The real value wasn't prediction itself but generating reusable analytical code that transforms raw data into actionable insights within hours rather than days.
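The kind of decomposition code described above can be sketched in a few lines. This is a minimal, pure-Python illustration with toy data; the function names and the sample demand series are my own invention, not the model's actual output.

```python
from statistics import mean

def seasonal_indices(monthly_sales):
    """Multiplicative seasonal indices: the average for each calendar
    month divided by the overall average (classical decomposition)."""
    overall = mean(monthly_sales)
    return [mean(monthly_sales[m::12]) / overall for m in range(12)]

def deseasonalize(monthly_sales, indices):
    """Divide out the seasonal pattern so the underlying trend is visible."""
    return [s / indices[i % 12] for i, s in enumerate(monthly_sales)]

# Two years of toy SKU demand with a summer peak and ~5% YoY growth
base = [100, 95, 110, 120, 140, 160, 180, 175, 150, 130, 110, 100]
sales = base + [round(v * 1.05) for v in base]

idx = seasonal_indices(sales)
flat = deseasonalize(sales, idx)
print(f"peak month index: {idx.index(max(idx))}")  # 6 -> July
```

A multiplicative model suits demand that scales with volume; for series that can hit zero, an additive decomposition is the safer choice.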
## Task 2: Carrier Discount Agreement Review
**Winner: GPT-5.4**
GPT demonstrated superior performance on complex commercial documents with layered conditional pricing tiers. It accurately captured discount structures, service-level dependencies, and volume-based incentives — details worth tens of thousands annually.
Perplexity's routing to Claude Sonnet (not Opus) missed granular details. The distinction matters: Sonnet's "mostly right" performance on high-stakes financial review creates unacceptable risk. Gemini 2.5 Pro showed promise on scanned/multimodal documents; Qwen3 performed well on bilingual contracts.
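To make the stakes concrete, here is a hypothetical tiered-discount calculation. The tier thresholds and spend figures are invented for illustration, but they capture the kind of volume-based incentive a review has to get exactly right.

```python
# Hypothetical carrier tiers: (minimum annual volume, discount rate).
# Misreading one threshold shifts the whole year's effective rate.
tiers = [(0, 0.00), (10_000, 0.08), (50_000, 0.15)]

def discount_for(volume):
    """Return the discount of the highest tier the volume qualifies for."""
    rate = 0.0
    for threshold, d in tiers:
        if volume >= threshold:
            rate = d
    return rate

annual_spend = 120_000
annual_volume = 55_000
savings = annual_spend * discount_for(annual_volume)
print(discount_for(annual_volume), savings)  # 0.15 18000.0
```

At these invented numbers, confusing the 50,000-unit tier with the 10,000-unit tier misstates savings by $8,400 a year, which is exactly why "mostly right" extraction is unacceptable here.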
## Task 3: Network Optimization Model Development
**Winner: Claude Opus 4.6 (clear gap)**
Claude produced efficient, logically clean optimization code with correct constraint formulation on first attempts. Variable naming proved intuitive, constraints grouped logically, and business reasoning appeared in comments.
GPT-5.4 required iteration to fix indexing errors and scaling limitations. DeepSeek-V3 and Llama 4 generated functional code but needed more debugging cycles.
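A toy version of the modeling problem: choose which distribution centers to open to minimize fixed plus shipping cost. A production model would hand this to an LP/MIP solver; this brute-force sketch over a three-DC network (all costs hypothetical) just shows the constraint logic the models had to formulate.

```python
from itertools import combinations

# Hypothetical network: 3 candidate DCs serving 4 demand regions
fixed_cost = {"DC_A": 500, "DC_B": 400, "DC_C": 650}
ship_cost = {  # cost per unit from DC to region
    "DC_A": {"N": 2, "S": 7, "E": 3, "W": 6},
    "DC_B": {"N": 6, "S": 2, "E": 5, "W": 3},
    "DC_C": {"N": 4, "S": 4, "E": 2, "W": 2},
}
demand = {"N": 100, "S": 80, "E": 60, "W": 90}

def network_cost(open_dcs):
    """Fixed cost of open DCs plus cheapest single-sourcing per region."""
    if not open_dcs:
        return float("inf")  # every region must be served
    total = sum(fixed_cost[d] for d in open_dcs)
    for region, qty in demand.items():
        total += qty * min(ship_cost[d][region] for d in open_dcs)
    return total

# Enumerate every subset of DCs (feasible only at toy scale)
best = min(
    (s for r in range(1, 4) for s in combinations(fixed_cost, r)),
    key=network_cost,
)
print(best, network_cost(best))
```

Enumeration works for three candidates; at realistic scale the same objective and constraints become a mixed-integer program, which is where clean formulation separates the models.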
## Performance Scorecard
| Task | Winner | Runner-up |
|---|---|---|
| Demand Forecasting (Python) | Claude Opus 4.6 | GPT-5.4 |
| Document Review | GPT-5.4 | Perplexity/Sonnet |
| Optimization Code | Claude Opus 4.6 | GPT-5.4 |
## Pricing Reality (Batch Rates per Million Tokens)
| Model | Input Cost | Output Cost | Best For |
|---|---|---|---|
| DeepSeek-V3 | $0.27 | $1.10 | High-volume routine analysis |
| Qwen3-235B | $0.30 | $1.20 | Multilingual document work |
| Gemini 2.5 Pro | $0.75 | $3.50 | Multimodal extraction |
| GPT-5.4 | $1.50 | $6.00 | Commercial documents |
| Claude Opus 4.6 | $2.00 | $10.00 | Deep analysis, optimization |
| Llama 4 (self-hosted) | ~$0.50 | ~$2.00 | Data-sensitive environments |
Token prices dropped roughly 80% in one year, and the most expensive model costs nearly 10x the cheapest, making task-specific routing economically practical.
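Using the batch prices from the table, a quick back-of-envelope comparison shows why routing pays. The monthly token volumes here are hypothetical.

```python
# Batch prices per million tokens, taken from the table above
prices = {  # model: (input, output) in USD per 1M tokens
    "deepseek-v3": (0.27, 1.10),
    "gpt-5.4": (1.50, 6.00),
    "claude-opus-4.6": (2.00, 10.00),
}

def job_cost(model, input_tokens, output_tokens):
    """Total USD cost for a workload at a model's batch rates."""
    pin, pout = prices[model]
    return (input_tokens * pin + output_tokens * pout) / 1_000_000

# Hypothetical month of routine work: 200M input, 20M output tokens
all_opus = job_cost("claude-opus-4.6", 200e6, 20e6)
routed = job_cost("deepseek-v3", 200e6, 20e6)
print(f"${all_opus:,.0f} vs ${routed:,.0f}")
```

At these assumed volumes, sending routine bulk work to the cheapest capable model cuts the monthly bill from $600 to $76, leaving the premium models for the tasks where they actually win.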
## The Decision Framework
### 1. Map workflows with precision
"Improve operations" fails as a specification. "Extract discount tiers from carrier agreements and compare against current spend by service level" enables model selection.
### 2. Match models to task types
- Analytical coding + optimization → Claude Opus 4.6
- Commercial document review → GPT-5.4
- Scanned/multimodal documents → Gemini 2.5 Pro
- Multilingual operations → Qwen3-235B
- High-volume routine tasks → DeepSeek-V3
### 3. Test on representative data
My scenarios reflect my career experience; your operation requires its own pilot using actual document formats, terminology, and data characteristics.
### 4. Use multiple models
Smart supply chain teams route tasks: Claude for optimization, GPT for documents, DeepSeek for bulk processing. The price collapse makes this practical.
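A routing layer can start as a simple lookup table. The task labels and model identifiers below are illustrative, not real API model names.

```python
# Hypothetical router implementing the matching rules above
ROUTES = {
    "optimization": "claude-opus-4.6",
    "analytical_coding": "claude-opus-4.6",
    "contract_review": "gpt-5.4",
    "scanned_documents": "gemini-2.5-pro",
    "multilingual": "qwen3-235b",
    "bulk_processing": "deepseek-v3",
}

def route(task_type, default="deepseek-v3"):
    """Pick a model per task; fall back to the cheapest for unknown work."""
    return ROUTES.get(task_type, default)

print(route("contract_review"))  # gpt-5.4
print(route("weekly_summary"))   # deepseek-v3 (fallback)
```

Defaulting unknown tasks to the cheapest model keeps costs bounded; escalate to a premium model only when the cheap one's output fails review.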
### 5. Fix data governance first
I've never seen an AI project fail because of the model. I've seen dozens fail because of the data. Duplicate records, inconsistent coding, and format variations between systems represent the real bottleneck. Clean data precedes tool selection.
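Much of that cleanup is unglamorous code like this: a sketch of SKU normalization and de-duplication, run on invented records with hypothetical field names.

```python
# Toy records with the classic problems: inconsistent coding, duplicates
records = [
    {"sku": "AB-1001", "qty": 10},
    {"sku": "ab 1001", "qty": 10},   # same SKU, different coding
    {"sku": "CD-2002", "qty": 5},
]

def normalize_sku(sku):
    """Uppercase, strip, and unify separators so 'ab 1001' == 'AB-1001'."""
    return sku.strip().upper().replace(" ", "-")

seen, clean = set(), []
for rec in records:
    key = (normalize_sku(rec["sku"]), rec["qty"])
    if key not in seen:  # drop exact duplicates after normalization
        seen.add(key)
        clean.append({**rec, "sku": normalize_sku(rec["sku"])})

print(len(clean))  # 2 rows survive
```

Run this kind of normalization before any model sees the data; no amount of model quality rescues an analysis built on duplicated, inconsistently coded records.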
## The Bottom Line
Fifteen years across four continents revealed a consistent pattern: organizations extracting maximum value from technology aren't those with the best tools but those completing foundational work first — data governance, process mapping, change management.
No single model will solve your supply chain. The answer is in the implementation. It always has been.
Originally published on LinkedIn
