Building a network optimization model used to take a team of analysts working together for days. Last month, I did it myself with Claude Opus in a couple of hours. But GPT-5.4 performed better on carrier agreement review, and the two models needed very different amounts of iteration on the optimization code.
Standard benchmarks — ARC-AGI 2, SWE-Bench, multimodal rankings — don't predict real operational performance. March 2026 brought simultaneous releases from OpenAI (GPT-5.4), Anthropic (Claude Opus 4.6), and Google (Gemini 2.5 Pro), but benchmark leadership didn't correlate with supply chain task success.
## Models Tested
Direct evaluation:
- Claude Opus 4.6 (Anthropic)
- GPT-5.4 (OpenAI)
- Perplexity
Research-based comparison:
- Gemini 2.5 Pro (Google)
- DeepSeek-V3
- Qwen3-235B (Alibaba)
- Llama 4 (Meta)
## Task 1: Demand Forecasting from SKU-Level Sales Data
**Winner: Claude Opus 4.6**
Claude excelled at writing analytical Python code for demand decomposition, SKU segmentation, and seasonal pattern extraction. The code ran correctly on first iteration with sound methodological choices. GPT-5.4 produced competent but formulaic output requiring more iteration. Perplexity's web-context integration added narrative value but sacrificed analytical precision.
The real value wasn't prediction itself but generating reusable analytical code that transforms raw data into actionable insights within hours rather than days.
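The kind of decomposition code described above can be sketched in a few lines. This is a minimal, pure-Python illustration with toy data; the function names and the sample demand series are my own invention, not the model's actual output.

```python
from statistics import mean

def seasonal_indices(monthly_sales):
    """Multiplicative seasonal indices: the average for each calendar
    month divided by the overall average (classical decomposition)."""
    overall = mean(monthly_sales)
    return [mean(monthly_sales[m::12]) / overall for m in range(12)]

def deseasonalize(monthly_sales, indices):
    """Divide out the seasonal pattern so the underlying trend is visible."""
    return [s / indices[i % 12] for i, s in enumerate(monthly_sales)]

# Two years of toy SKU demand with a summer peak and ~5% YoY growth
base = [100, 95, 110, 120, 140, 160, 180, 175, 150, 130, 110, 100]
sales = base + [round(v * 1.05) for v in base]

idx = seasonal_indices(sales)
flat = deseasonalize(sales, idx)
print(f"peak month index: {idx.index(max(idx))}")  # 6 -> July
```

A multiplicative model suits demand that scales with volume; for series that can hit zero, an additive decomposition is the safer choice.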
## Task 2: Carrier Discount Agreement Review
**Winner: GPT-5.4**
GPT demonstrated superior performance on complex commercial documents with layered conditional pricing tiers. It accurately captured discount structures, service-level dependencies, and volume-based incentives — details worth tens of thousands annually.
Perplexity's routing to Claude Sonnet (not Opus) missed granular details. The distinction matters: Sonnet's "mostly right" performance on high-stakes financial review creates unacceptable risk. Gemini 2.5 Pro showed promise on scanned/multimodal documents; Qwen3 performed well on bilingual contracts.
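To make the stakes concrete, here is a hypothetical tiered-discount calculation. The tier thresholds and spend figures are invented for illustration, but they capture the kind of volume-based incentive a review has to get exactly right.

```python
# Hypothetical carrier tiers: (minimum annual volume, discount rate).
# Misreading one threshold shifts the whole year's effective rate.
tiers = [(0, 0.00), (10_000, 0.08), (50_000, 0.15)]

def discount_for(volume):
    """Return the discount of the highest tier the volume qualifies for."""
    rate = 0.0
    for threshold, d in tiers:
        if volume >= threshold:
            rate = d
    return rate

annual_spend = 120_000
annual_volume = 55_000
savings = annual_spend * discount_for(annual_volume)
print(discount_for(annual_volume), savings)  # 0.15 18000.0
```

At these invented numbers, confusing the 50,000-unit tier with the 10,000-unit tier misstates savings by $8,400 a year, which is exactly why "mostly right" extraction is unacceptable here.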
## Task 3: Network Optimization Model Development
**Winner: Claude Opus 4.6 (clear gap)**
Claude produced efficient, logically clean optimization code with correct constraint formulation on first attempts. Variable naming proved intuitive, constraints grouped logically, and business reasoning appeared in comments.
GPT-5.4 required iteration to fix indexing errors and scaling limitations. DeepSeek-V3 and Llama 4 generated functional code but needed more debugging cycles.
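A toy version of the modeling problem: choose which distribution centers to open to minimize fixed plus shipping cost. A production model would hand this to an LP/MIP solver; this brute-force sketch over a three-DC network (all costs hypothetical) just shows the constraint logic the models had to formulate.

```python
from itertools import combinations

# Hypothetical network: 3 candidate DCs serving 4 demand regions
fixed_cost = {"DC_A": 500, "DC_B": 400, "DC_C": 650}
ship_cost = {  # cost per unit from DC to region
    "DC_A": {"N": 2, "S": 7, "E": 3, "W": 6},
    "DC_B": {"N": 6, "S": 2, "E": 5, "W": 3},
    "DC_C": {"N": 4, "S": 4, "E": 2, "W": 2},
}
demand = {"N": 100, "S": 80, "E": 60, "W": 90}

def network_cost(open_dcs):
    """Fixed cost of open DCs plus cheapest single-sourcing per region."""
    if not open_dcs:
        return float("inf")  # every region must be served
    total = sum(fixed_cost[d] for d in open_dcs)
    for region, qty in demand.items():
        total += qty * min(ship_cost[d][region] for d in open_dcs)
    return total

# Enumerate every subset of DCs (feasible only at toy scale)
best = min(
    (s for r in range(1, 4) for s in combinations(fixed_cost, r)),
    key=network_cost,
)
print(best, network_cost(best))
```

Enumeration works for three candidates; at realistic scale the same objective and constraints become a mixed-integer program, which is where clean formulation separates the models.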
## Performance Scorecard
| Task | Winner | Runner-up |
|---|---|---|
| Demand Forecasting (Python) | Claude Opus 4.6 | GPT-5.4 |
| Document Review | GPT-5.4 | Perplexity/Sonnet |
| Optimization Code | Claude Opus 4.6 | GPT-5.4 |
## Pricing Reality (Batch Rates per Million Tokens)
| Model | Input Cost | Output Cost | Best For |
|---|---|---|---|
| DeepSeek-V3 | $0.27 | $1.10 | High-volume routine analysis |
| Qwen3-235B | $0.30 | $1.20 | Multilingual document work |
| Gemini 2.5 Pro | $0.75 | $3.50 | Multimodal extraction |
| GPT-5.4 | $1.50 | $6.00 | Commercial documents |
| Claude Opus 4.6 | $2.00 | $10.00 | Deep analysis, optimization |
| Llama 4 (self-hosted) | ~$0.50 | ~$2.00 | Data-sensitive environments |
Token prices dropped roughly 80% in one year, and the most expensive model costs nearly 10x the cheapest, making task-specific routing economically practical.
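Using the batch prices from the table, a quick back-of-envelope comparison shows why routing pays. The monthly token volumes here are hypothetical.

```python
# Batch prices per million tokens, taken from the table above
prices = {  # model: (input, output) in USD per 1M tokens
    "deepseek-v3": (0.27, 1.10),
    "gpt-5.4": (1.50, 6.00),
    "claude-opus-4.6": (2.00, 10.00),
}

def job_cost(model, input_tokens, output_tokens):
    """Total USD cost for a workload at a model's batch rates."""
    pin, pout = prices[model]
    return (input_tokens * pin + output_tokens * pout) / 1_000_000

# Hypothetical month of routine work: 200M input, 20M output tokens
all_opus = job_cost("claude-opus-4.6", 200e6, 20e6)
routed = job_cost("deepseek-v3", 200e6, 20e6)
print(f"${all_opus:,.0f} vs ${routed:,.0f}")
```

At these assumed volumes, sending routine bulk work to the cheapest capable model cuts the monthly bill from $600 to $76, leaving the premium models for the tasks where they actually win.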
## The Decision Framework
### 1. Map workflows with precision
"Improve operations" fails as a specification. "Extract discount tiers from carrier agreements and compare against current spend by service level" enables model selection.
### 2. Match models to task types
- Analytical coding + optimization → Claude Opus 4.6
- Commercial document review → GPT-5.4
- Scanned/multimodal documents → Gemini 2.5 Pro
- Multilingual operations → Qwen3-235B
- High-volume routine tasks → DeepSeek-V3
### 3. Test on representative data
My scenarios reflect my career experience; your operation requires its own pilot using actual document formats, terminology, and data characteristics.
### 4. Use multiple models
Smart supply chain teams route tasks: Claude for optimization, GPT for documents, DeepSeek for bulk processing. The price collapse makes this practical.
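A routing layer can start as a simple lookup table. The task labels and model identifiers below are illustrative, not real API model names.

```python
# Hypothetical router implementing the matching rules above
ROUTES = {
    "optimization": "claude-opus-4.6",
    "analytical_coding": "claude-opus-4.6",
    "contract_review": "gpt-5.4",
    "scanned_documents": "gemini-2.5-pro",
    "multilingual": "qwen3-235b",
    "bulk_processing": "deepseek-v3",
}

def route(task_type, default="deepseek-v3"):
    """Pick a model per task; fall back to the cheapest for unknown work."""
    return ROUTES.get(task_type, default)

print(route("contract_review"))  # gpt-5.4
print(route("weekly_summary"))   # deepseek-v3 (fallback)
```

Defaulting unknown tasks to the cheapest model keeps costs bounded; escalate to a premium model only when the cheap one's output fails review.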
### 5. Fix data governance first
I've never seen an AI project fail because of the model. I've seen dozens fail because of the data. Duplicate records, inconsistent coding, and format variations between systems represent the real bottleneck. Clean data precedes tool selection.
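Much of that cleanup is unglamorous code like this: a sketch of SKU normalization and de-duplication, run on invented records with hypothetical field names.

```python
# Toy records with the classic problems: inconsistent coding, duplicates
records = [
    {"sku": "AB-1001", "qty": 10},
    {"sku": "ab 1001", "qty": 10},   # same SKU, different coding
    {"sku": "CD-2002", "qty": 5},
]

def normalize_sku(sku):
    """Uppercase, strip, and unify separators so 'ab 1001' == 'AB-1001'."""
    return sku.strip().upper().replace(" ", "-")

seen, clean = set(), []
for rec in records:
    key = (normalize_sku(rec["sku"]), rec["qty"])
    if key not in seen:  # drop exact duplicates after normalization
        seen.add(key)
        clean.append({**rec, "sku": normalize_sku(rec["sku"])})

print(len(clean))  # 2 rows survive
```

Run this kind of normalization before any model sees the data; no amount of model quality rescues an analysis built on duplicated, inconsistently coded records.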
## The Bottom Line
Fifteen years across four continents revealed a consistent pattern: organizations extracting maximum value from technology aren't those with the best tools but those completing foundational work first — data governance, process mapping, change management.
No single model will solve your supply chain. The answer is in the implementation. It always has been.
Originally published on LinkedIn
