Tegy Model Cost & Intelligence Delta

Executive comparison of Qwen (the current Tegy runtime), Gemini Flash-Lite, GLM 5.2, the latest Mistral, and the latest Anthropic Sonnet for Tegy's StrategyOS-heavy agent workload.

Date: 2026-06-22
Current runtime: qwen/qwen3.7-plus
Provider route: OpenRouter via Cloudflare AI Gateway
Turn cap observed locally: $30

Executive Read

Tegy currently runs on qwen/qwen3.7-plus, not Gemini Flash-Lite. Qwen and Gemini Flash-Lite are effectively the low-cost tier. Gemini Flash-Lite is slightly cheaper on input ($0.25 vs $0.32 per 1M tokens) but slightly more expensive on output ($1.50 vs $1.28), so on output-heavy StrategyOS turns Qwen can still come out a little cheaper. GLM 5.2 sits between the cheap tier and the premium tier: roughly 3.1x Qwen's token price with a ~1.05M context window, but text-only in the current OpenRouter catalog.
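
The Qwen vs Gemini Flash-Lite crossover can be made concrete. The following is a minimal sketch that uses only the list prices quoted in this report; the helper name is illustrative, and caching, retries, and gateway fees are ignored.

```python
# List prices in USD per 1M tokens, as quoted in this report.
QWEN = {"input": 0.32, "output": 1.28}
GEMINI_FLASH_LITE = {"input": 0.25, "output": 1.50}


def breakeven_output_ratio(a, b):
    """Output/input token ratio at which models a and b cost the same.

    Solves a_in + r * a_out == b_in + r * b_out for r.
    Hypothetical helper for illustration only.
    """
    return (a["input"] - b["input"]) / (b["output"] - a["output"])


ratio = breakeven_output_ratio(QWEN, GEMINI_FLASH_LITE)
print(f"Break-even at {ratio:.3f} output tokens per input token")
```

Under these prices the break-even sits at about 0.32 output tokens per input token, or roughly 318K output tokens per 1M input tokens. Turns that emit more than that favor Qwen; turns that emit less favor Gemini Flash-Lite.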

Mistral Medium 3.5 is meaningfully more expensive (roughly 4.7-5.9x Qwen's token price) and has a smaller 256K context window, but it is positioned as stronger for agentic, coding, and multi-tool workflows. Sonnet 4.6 remains the quality benchmark and likely the best behavioral fit, but at roughly 9-12x Qwen's token price it should be treated as a benchmark/rescue model, not the default, unless we accept a major gross-margin hit.

Recommendation: keep Qwen as the production default for now, run a controlled bakeoff against Gemini 3.1 Flash-Lite and Mistral Medium 3.5 on real Tegy tasks, and use Sonnet 4.6 only as the gold-standard comparator or explicit premium path.

Model Cards

Qwen3.7 Plus

current

qwen/qwen3.7-plus

  • Price: $0.32 / $1.28 per 1M input/output tokens.
  • Context: 1M tokens.
  • Vision: yes, text+image input.
  • OpenRouter modalities: text, image → text.

Gemini 3.1 Flash-Lite

cheapest input

google/gemini-3.1-flash-lite

  • Price: $0.25 / $1.50 per 1M input/output tokens.
  • Context: 1.05M tokens.
  • Vision: yes, plus file/audio/video input.
  • Best use: high-volume, latency/cost-sensitive work.

Mistral Medium 3.5

agentic contender

mistralai/mistral-medium-3-5

  • Price: $1.50 / $7.50 per 1M input/output tokens.
  • Context: 256K tokens.
  • Vision: yes, text+image+file input.
  • Best use: focused agentic/tool tasks where context fits.

GLM 5.2

text-only challenger

z-ai/glm-5.2

  • Price: $1.00 / $4.00 per 1M input/output tokens.
  • Context: 1.05M tokens.
  • Vision: no in the current catalog; text-only input.
  • Best use: text-heavy long-context strategy runs if behavior is strong enough.

Claude Sonnet 4.6

premium

anthropic/claude-sonnet-4.6

  • Price: $3.00 / $15.00 per 1M input/output tokens.
  • Context: 1M tokens on current catalog/API surfaces.
  • Vision: yes, text+image+file input.
  • Best use: quality benchmark, premium/rescue mode, hard reasoning.

Cost Deltas

| Model | Input / 1M | Output / 1M | Input vs Qwen | Output vs Qwen | 1M in + 200K out | 1M in + 500K out |
|---|---|---|---|---|---|---|
| Qwen3.7 Plus | $0.32 | $1.28 | 1.00x | 1.00x | $0.576 | $0.960 |
| Gemini 3.1 Flash-Lite | $0.25 | $1.50 | 0.78x | 1.17x | $0.550 | $1.000 |
| Mistral Medium 3.5 | $1.50 | $7.50 | 4.69x | 5.86x | $3.000 | $5.250 |
| GLM 5.2 | $1.00 | $4.00 | 3.13x | 3.13x | $1.800 | $3.000 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 9.38x | 11.72x | $6.000 | $10.500 |

Tegy is token hungry. On long StrategyOS turns, output tokens and repeated context dominate. Qwen and Gemini Flash-Lite are in the same economic band; Mistral and Sonnet change the unit economics materially.
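
The scenario columns above follow directly from the per-token prices. The following sketch reproduces them for any token mix; the `turn_cost` helper is illustrative, and it applies list prices only, with no cache discounts or gateway fees.

```python
# USD per 1M tokens (input, output), copied from the Cost Deltas table.
PRICES = {
    "Qwen3.7 Plus": (0.32, 1.28),
    "Gemini 3.1 Flash-Lite": (0.25, 1.50),
    "Mistral Medium 3.5": (1.50, 7.50),
    "GLM 5.2": (1.00, 4.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
}


def turn_cost(model, input_tokens, output_tokens):
    """List-price cost in USD for one turn; no caching or discounts applied."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Output-heavy scenario: 1M input tokens plus 500K output tokens.
baseline = turn_cost("Qwen3.7 Plus", 1_000_000, 500_000)
for model in PRICES:
    cost = turn_cost(model, 1_000_000, 500_000)
    print(f"{model:<22} ${cost:.3f}  {cost / baseline:.2f}x Qwen")
```

Swapping in token counts from real Tegy turn logs, once available, would give a more representative blended multiple than the two fixed scenarios.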

Capabilities & Vision

| Model | Catalog modality | Vision | File / PDF suitability | Context risk |
|---|---|---|---|---|
| Qwen3.7 Plus | text+image → text | Yes | Images yes; file/PDF handling depends on Tegy's extraction/attachment path. | Low: 1M context. |
| Gemini 3.1 Flash-Lite | text+image+file+audio+video → text | Yes | Best broad multimodal coverage in this comparison. | Low: ~1.05M context. |
| Mistral Medium 3.5 | text+image+file → text | Yes | Good fit for screenshots/docs if the turn fits 256K context. | Medium/high: 256K context may be tight for StrategyOS-heavy runs. |
| GLM 5.2 | text → text | No | Not suitable for image/screenshot turns unless Tegy extracts all content to text first. | Low: ~1.05M context. |
| Claude Sonnet 4.6 | text+image+file → text | Yes | Strong fit for docs, images, and nuanced strategy reasoning. | Low: 1M context. |
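
The modality and context constraints above can be expressed as a simple eligibility check. This is a sketch under stated assumptions: `ModelSpec`, `CATALOG`, and `eligible_models` are hypothetical names, not part of Tegy's runtime, and the context sizes are the approximate figures from this report, not exact provider limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    slug: str
    context_tokens: int    # approximate catalog context window
    modalities: frozenset  # accepted input types


# Values transcribed from this report; context sizes are approximate.
CATALOG = [
    ModelSpec("qwen/qwen3.7-plus", 1_000_000, frozenset({"text", "image"})),
    ModelSpec("google/gemini-3.1-flash-lite", 1_050_000,
              frozenset({"text", "image", "file", "audio", "video"})),
    ModelSpec("mistralai/mistral-medium-3-5", 256_000,
              frozenset({"text", "image", "file"})),
    ModelSpec("z-ai/glm-5.2", 1_050_000, frozenset({"text"})),
    ModelSpec("anthropic/claude-sonnet-4.6", 1_000_000,
              frozenset({"text", "image", "file"})),
]


def eligible_models(needed_modalities, estimated_tokens):
    """Return slugs whose catalog entry covers the turn's inputs and size."""
    needed = frozenset(needed_modalities)
    return [
        m.slug for m in CATALOG
        if needed <= m.modalities and estimated_tokens <= m.context_tokens
    ]


# A 400K-token turn with a screenshot attached.
print(eligible_models({"text", "image"}, 400_000))
```

For that example turn, Mistral Medium 3.5 drops out on context size and GLM 5.2 drops out on vision, leaving Qwen, Gemini Flash-Lite, and Sonnet. A real check would also need to leave headroom for output tokens and tool results.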

Intelligence / Behavior Delta

Practical ranking for Tegy

  1. Claude Sonnet 4.6: best expected instruction following, tool-use behavior, and consultant-quality narrative; too expensive as default.
  2. Mistral Medium 3.5: plausible agentic/tool-use upgrade over Qwen/Gemini; context window is the main concern.
  3. GLM 5.2: plausible long-context text-only challenger; useful only if attachment extraction covers the workflow because it lacks vision.
  4. Qwen3.7 Plus: current baseline; strong economics and 1M context; acceptable default if behavior is good enough.
  5. Gemini 3.1 Flash-Lite: best broad multimodal/cost candidate; likely weaker for nuanced strategy judgment than Sonnet and probably weaker than Mistral for long-horizon agentic work.

What this means operationally

  • If we optimize for gross margin: keep Qwen or test Gemini Flash-Lite.
  • If we optimize for text-only long-context runs: test GLM 5.2, but do not use it for image/screenshot workflows.
  • If we optimize for capability at moderate cost: test Mistral Medium 3.5, but only on tasks under 256K context.
  • If we optimize for best possible output: Sonnet 4.6 wins, but should be premium/rescue/benchmark because of price.
  • If vision/file support matters broadly: Gemini Flash-Lite has the widest listed input modalities.

Recommendation

  1. Do not switch default directly to Sonnet. Use it as the benchmark and maybe a paid premium/rescue tier.
  2. Keep Qwen as current default until a live bakeoff says otherwise. It is cheap, image-capable, and has 1M context.
  3. Run Gemini Flash-Lite as the first challenger. It is economically close to Qwen, has broader multimodal support, and could improve PDF/file-heavy flows if Claude SDK/OpenRouter compatibility is stable.
  4. Run GLM 5.2 as the text-only long-context challenger. It is cheaper than Mistral/Sonnet and has a ~1.05M context window, but it cannot cover vision turns.
  5. Run Mistral Medium 3.5 as the quality challenger. It may improve tool-use/agentic reliability, but its 256K context makes it risky for Tegy's longest strategy runs.
  6. Measure on real Tegy tasks, not generic benchmarks. Cover Sprinta PDF/DOCX handling, questionnaire emergence, StrategyOS tool use, multi-agent behavior, artifact generation, cost, latency, and Cloudflare AI Gateway (CF AIG) cache behavior.

Proposed Bakeoff Criteria

| Criterion | Why it matters | Pass signal |
|---|---|---|
| StrategyOS usage | Tegy is not a generic chatbot. | Model naturally invokes/uses StrategyOS structures where appropriate. |
| Questionnaire behavior | Consultant-like intake is core UX. | Asks useful questions without fake/forced UI responses. |
| Document analysis | PDF/DOCX workflows are central. | Correctly handles Sprinta PDF and DOCX tests. |
| Vision | User screenshots and deck images matter. | Accepts image attachments through current route without provider errors. |
| Long-turn stability | StrategyOS can run for many minutes and many tokens. | No premature no-output, budget, or context failures under realistic cap. |
| Cost per successful answer | Raw token price is not enough. | Cheaper model is only better if success rate and retries stay acceptable. |
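
The last criterion can be turned into a number. The following is a minimal sketch, assuming failed turns are simply retried and that each attempt succeeds independently at a constant rate, which is a simplification. Both function names are illustrative, and the success rates in the example are placeholders, not measured results.

```python
def cost_per_success(cost_per_attempt, success_rate):
    """Expected spend per successful answer, retrying until success.

    Assumes independent attempts with a constant success rate, so the
    expected number of attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


def required_success_rate(cheap_cost, pricey_cost, pricey_success_rate):
    """Minimum success rate at which the cheaper model still wins on cost."""
    return cheap_cost * pricey_success_rate / pricey_cost


# Per-attempt costs use the 1M in + 500K out scenario from this report.
# The 60% and 95% success rates are placeholders for bakeoff measurements.
qwen = cost_per_success(0.960, 0.60)
mistral = cost_per_success(5.250, 0.95)
floor = required_success_rate(0.960, 5.250, 0.95)
print(f"Qwen ${qwen:.2f}, Mistral ${mistral:.2f}, Qwen break-even rate {floor:.1%}")
```

With these placeholder rates, Qwen stays cheaper per successful answer unless its success rate falls below roughly 17%. Cost alone therefore rarely flips the decision; retry latency and user-visible failures are more likely to be the deciding factors in the bakeoff.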

Sources