When is Cloud API the cheapest option for LLM deployment?

Cloud API is most cost-effective for small models (8B parameters) at moderate query volumes (under 300K queries/month). It can remain the cheapest option above that range: at 500K queries/month, a small 8B model costs approximately €10K over 3 years via Cloud API, compared to €17K for Cloud GPU and €27K for On-Premise.

When does On-Premise become more cost-effective than Cloud?

On-Premise wins for medium and large models (70B+ parameters) at 500K+ queries/month. The breakeven typically occurs at 18 months for 70B models. For large models (671B), On-Premise can save 62-82% over 3 years compared to Cloud alternatives.

What are the real GPU rental costs in 2025?

October 2025 pricing: RTX 5090 from $0.30/hr (consumer, small models), H100 from $1.85/hr with 1-3yr commitment (enterprise), H200 from $3.79/hr on-demand (datacenter). Lambda Labs offers the cheapest H100 committed rate at $1.85/hr.

How does model size affect TCO advantage?

The larger the model, the stronger the case for On-Premise. Small 8B models: Cloud API is cheapest (€10K/3yr). Medium 70B models: On-Premise wins (€83K vs €91K Cloud GPU). Large 671B models: On-Premise dominates with 62-82% savings (€209K vs €545K Cloud GPU or €1.13M Cloud API).
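The 62-82% range can be checked directly from the three 671B figures quoted above. The sketch below is only that arithmetic; the function name `savings_pct` is a hypothetical helper, not something the calculator exposes.

```python
def savings_pct(on_prem_tco: float, alternative_tco: float) -> float:
    """Percentage saved by on-premise relative to an alternative deployment."""
    return (1 - on_prem_tco / alternative_tco) * 100


# 3-year TCO figures for the 671B case quoted above, in thousands of EUR.
ON_PREM, CLOUD_GPU, CLOUD_API = 209, 545, 1130

vs_gpu = savings_pct(ON_PREM, CLOUD_GPU)
vs_api = savings_pct(ON_PREM, CLOUD_API)
print(f"vs Cloud GPU: {vs_gpu:.0f}%, vs Cloud API: {vs_api:.0f}%")
```

The lower bound comes from the Cloud GPU comparison and the upper bound from the Cloud API comparison.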

TCO Calculator Expert

Compare Cloud API, Cloud GPU Rental, and On-Premise deployments with real-time hardware validation and cost breakdown. Built for senior architects making production decisions.

Start here

Three steps, one clean comparison

01. Set the workload once.
Enter queries and tokens in Cloud API. The same demand syncs across Cloud GPU and On-Premise.
02. Align the model pair.
Select the provider and the open-source match. VRAM sizing and GPU validation update automatically.
03. Review results.
Check the 3-year chart, breakeven, and recommendations before exporting.

Quick profiles

Pick a profile to prefill everything

Profiles sync across all tabs and can be edited anytime.

💱 Display Currency: 1 USD = 0.92 EUR

Cloud API 3-year TCO: €0
Cloud GPU 3-year TCO: €0
On-Premise 3-year TCO: €0

📊 Default Scenario: Enterprise with Existing Datacenter (500K queries/month)

Profile: Mid-sized enterprise, 500K queries/month, Claude 3.7 Sonnet equivalent (Llama 3.3 70B FP8), 40% GPU discount, industrial power (€0.12/kWh), automated DevOps (0.05 FTE), existing datacenter.

Real pricing (Oct 2025): Cloud API €227K/3yr ($3 input / $15 output per M tokens). Lambda H100 @ $1.85/hr = €91K/3yr (2× H100 80GB). On-premise 2× H200: €44K capex + €39K opex = €83K total. Breakeven at ~18 months.

Key Insight: The larger the model, the stronger the case for on-premise. Small models (8B): Cloud API is cheapest (€10K). Medium models (70B): on-premise wins (€83K vs €91K Cloud GPU). Large models (671B): on-premise dominates with 62-82% savings. Self-hosting starts to make financial sense at around 500K queries/month, and over a horizon of 5+ years the on-premise advantage grows further.

Guided view: hide advanced inputs for a clean flow.

Cloud API Configuration

Guided view hides advanced assumptions. Toggle above for full detail.

Step 1 of 3

📊 Workload

Queries per month

🔄 Auto-synced with Cloud GPU scenario

Avg input tokens per query (typical RAG: context + question)

Avg output tokens per query (⚠️ output tokens cost 2-5× more than input tokens)

Peak concurrent requests

💰 Pricing

Provider / Model

Price per 1M input tokens (USD)

Price per 1M output tokens (USD)

Egress bandwidth (GB/month): typical for RAG with document retrieval

Egress price (USD/GB)
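As a rough illustration of how these workload and pricing inputs might combine, the sketch below computes a monthly and 3-year API bill in USD. It assumes the default scenario's 500K queries/month, a 1200 input + 600 output token split, and prices of $3 and $15 per million input and output tokens. Egress is set to zero purely for simplicity, and `api_monthly_cost_usd` is a hypothetical name rather than the calculator's own function.

```python
def api_monthly_cost_usd(
    queries: int,
    in_tokens: int,
    out_tokens: int,
    in_price_per_m: float,
    out_price_per_m: float,
    egress_gb: float = 0.0,
    egress_price_per_gb: float = 0.0,
) -> float:
    """Monthly API bill: token charges plus optional egress, in USD."""
    input_cost = queries * in_tokens / 1e6 * in_price_per_m
    output_cost = queries * out_tokens / 1e6 * out_price_per_m
    egress_cost = egress_gb * egress_price_per_gb
    return input_cost + output_cost + egress_cost


# Assumed default workload: 500K queries, 1200 input + 600 output tokens each.
monthly_usd = api_monthly_cost_usd(500_000, 1200, 600, 3.0, 15.0)
three_year_usd = monthly_usd * 36
print(f"monthly: ${monthly_usd:,.0f}, 3-year: ${three_year_usd:,.0f}")
```

With these inputs, output tokens account for most of the bill even though there are half as many of them, which is the point of the output-token warning above.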

💵 Cost Summary

Input tokens cost: €0

Output tokens cost: €0

Egress bandwidth: €0

Monthly Total: €0

Annual Total: €0

3-Year TCO: €0

Cloud GPU Configuration

📊 Workload

Queries per month

🔄 Auto-synced with API scenario

Avg tokens per query (input + output): 1200 input + 600 output = 1800 total

🚀 Performance:

Throughput: 480 tokens/sec
Max queries/hour: 1,570
GPU utilization: 87%

🤖 Model Selection

Model

Quantization

Context Window

📊 Total VRAM Required: 88 GB

Model weights: 70 GB
KV cache: 18 GB
Safety margin (20%): 18 GB
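The breakdown above can be reproduced with a few lines of arithmetic. In this sketch the displayed 88 GB total is taken to be weights plus KV cache, with the 20% margin computed on that sum; that reading is an assumption inferred from the numbers shown. The 1 byte per parameter for FP8 and the 80 GB per-GPU capacity are also assumptions, and both function names are hypothetical.

```python
import math


def vram_breakdown_gb(params_billion: float, bytes_per_param: float,
                      kv_cache_gb: float, margin: float = 0.20) -> dict:
    """Split VRAM demand into weights, KV cache and a safety margin (GB)."""
    weights = params_billion * bytes_per_param  # 1e9 params x bytes, in GB
    base = weights + kv_cache_gb
    return {
        "weights": weights,
        "kv_cache": kv_cache_gb,
        "base": base,                     # weights + KV cache
        "margin": base * margin,          # headroom on top of the base
        "with_margin": base * (1 + margin),
    }


def gpus_needed(required_gb: float, vram_per_gpu_gb: float) -> int:
    """Smallest GPU count whose combined VRAM covers the requirement."""
    return math.ceil(required_gb / vram_per_gpu_gb)


# Assumed: 70B parameters at FP8 (1 byte each), 18 GB KV cache as displayed.
sizing = vram_breakdown_gb(70, 1.0, 18)
gpu_count = gpus_needed(sizing["with_margin"], 80)  # assumed 80 GB per GPU
print(sizing, gpu_count)
```

Under these assumptions a single 80 GB card does not cover even the base requirement, so two are needed with or without the margin.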

🖥️ Hardware

GPU Type

Number of GPUs

Cloud Provider

Hourly rate (USD/GPU): auto-filled from provider, or enter custom rate

Reserved vs On-Demand discount (%)
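A minimal sketch of how the hardware inputs above could translate into a rental bill, assuming the GPUs run 24/7 and the discount is applied to the hourly rate. The example uses 2× H100 at the committed $1.85/hr rate, so no further discount is applied, and the page's 0.92 USD-to-EUR rate. The function name `gpu_rental_cost_usd` is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # always-on rental


def gpu_rental_cost_usd(hourly_rate: float, num_gpus: int,
                        years: float = 3, discount_pct: float = 0.0) -> float:
    """Total rental cost for GPUs running around the clock, in USD."""
    effective_rate = hourly_rate * (1 - discount_pct / 100)
    return effective_rate * num_gpus * HOURS_PER_YEAR * years


USD_TO_EUR = 0.92  # display rate shown on the page

rental_usd = gpu_rental_cost_usd(1.85, 2)
rental_eur = rental_usd * USD_TO_EUR
print(f"3-year rental: ${rental_usd:,.0f} = €{rental_eur:,.0f}")
```

This covers GPU rental only, before storage and egress are added.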

💾 Storage & Networking

Model storage (auto-calculated)

Vector DB storage (GB/month): $0.10/GB/month typical

Egress bandwidth (GB/month)

💵 Cost Summary

GPU rental (24/7): €0

Storage: €0

Egress bandwidth: €0

Monthly Total: €0

Annual Total: €0

3-Year TCO: €0

On-Premise Configuration

🤖 Model Selection

Model

Quantization

Context Window

📊 Total VRAM Required: 88 GB

Model weights: 70 GB
KV cache: 18 GB
Safety margin (20%): 18 GB

🖥️ Hardware Capex

GPU Type

Number of GPUs: 2× H200 = sufficient for Llama 3.3 70B FP8

B2B discount (%): 30-40% typical for multi-GPU enterprise orders

Server chassis + motherboard (USD)

CPU (USD): AMD EPYC 9554 or similar

RAM (GB)

RAM price (USD/GB)

NVMe SSD (TB)

NVMe price (USD/GB)

Total Capex: $0
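The capex total is presumably a sum over the fields above. The sketch below shows one plausible way to add them up, assuming the B2B discount applies to the GPUs only and converting the NVMe capacity from TB to GB before applying the per-GB price. Every price in the example is a placeholder chosen for illustration, not a quote from the page, and `total_capex_usd` is a hypothetical name.

```python
def total_capex_usd(
    gpu_unit_price: float,
    num_gpus: int,
    b2b_discount_pct: float,
    chassis: float,
    cpu: float,
    ram_gb: float,
    ram_price_per_gb: float,
    nvme_tb: float,
    nvme_price_per_gb: float,
) -> float:
    """Sum hardware purchase costs; the discount is assumed to cover GPUs only."""
    gpus = gpu_unit_price * num_gpus * (1 - b2b_discount_pct / 100)
    ram = ram_gb * ram_price_per_gb
    nvme = nvme_tb * 1000 * nvme_price_per_gb  # capacity entered in TB, price per GB
    return gpus + chassis + cpu + ram + nvme


# Placeholder inputs for illustration only.
capex = total_capex_usd(
    gpu_unit_price=30_000, num_gpus=2, b2b_discount_pct=40,
    chassis=5_000, cpu=4_000,
    ram_gb=512, ram_price_per_gb=4.0,
    nvme_tb=4, nvme_price_per_gb=0.10,
)
print(f"Total capex: ${capex:,.0f}")
```

If the discount is meant to cover the whole order rather than the GPUs alone, the multiplier would move to the final sum instead.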

⚡ Power & Cooling

GPU TDP (watts each)

System overhead (CPU+RAM+storage, watts)

PUE (Power Usage Effectiveness): 1.2 = excellent datacenter, 1.6 = office

Electricity cost (EUR/kWh): industrial rate (€0.12) vs residential (€0.18-0.30)

Operating hours per day

Monthly Power: €0
Total TDP: 0W × PUE × hours
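The "Total TDP × PUE × hours" line suggests a formula along the following lines. This is a sketch under stated assumptions: the 700 W per GPU and 500 W system overhead are illustrative values not taken from the page, a month is treated as 365/12 days, and `monthly_power_cost_eur` is a hypothetical name. PUE 1.2 and €0.12/kWh are the default scenario's values.

```python
def monthly_power_cost_eur(
    gpu_tdp_w: float,
    num_gpus: int,
    overhead_w: float,
    pue: float,
    eur_per_kwh: float,
    hours_per_day: float = 24.0,
    days_per_month: float = 365 / 12,
) -> float:
    """Monthly electricity cost, including cooling overhead via PUE."""
    it_load_w = gpu_tdp_w * num_gpus + overhead_w   # total TDP in watts
    facility_kw = it_load_w * pue / 1000            # scale by PUE, convert to kW
    kwh = facility_kw * hours_per_day * days_per_month
    return kwh * eur_per_kwh


# Assumed hardware draw: 2 GPUs at 700 W each plus 500 W for the rest of the server.
power_eur = monthly_power_cost_eur(700, 2, 500, pue=1.2, eur_per_kwh=0.12)
print(f"Monthly power: €{power_eur:.2f}")
```

Because PUE multiplies the whole load, moving the same server from a 1.2 datacenter to a 1.6 office raises the bill by a third.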

🔧 Operational Costs (Annual)

Maintenance contract (% of capex)

IT staff allocation (FTE): 0.05 FTE = ~2 hr/week (fully automated k8s/vLLM)

Average IT salary (EUR/year): shared DevOps/MLOps team cost allocation

ISP bandwidth (Mbps)

ISP monthly cost (EUR): incremental cost (existing datacenter connectivity)
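One way the annual operating inputs above might be combined is sketched below. The €44K capex and 0.05 FTE come from the default scenario; the 10% maintenance rate, €80K salary, €100/month ISP cost, €200/month power cost and 40-hour work week are placeholders for illustration. Both function names are hypothetical.

```python
def fte_hours_per_week(fte: float, full_time_hours: float = 40.0) -> float:
    """Convert an FTE fraction into weekly hours (40 h week assumed)."""
    return fte * full_time_hours


def annual_opex_eur(capex_eur: float, maintenance_pct: float, fte: float,
                    salary_eur: float, isp_monthly_eur: float,
                    power_monthly_eur: float) -> dict:
    """Break annual operating cost into its components, in EUR."""
    parts = {
        "maintenance": capex_eur * maintenance_pct / 100,
        "staff": fte * salary_eur,
        "isp": isp_monthly_eur * 12,
        "power": power_monthly_eur * 12,
    }
    parts["total"] = sum(parts.values())
    return parts


# Placeholder inputs except capex (44K) and FTE (0.05).
opex = annual_opex_eur(44_000, maintenance_pct=10, fte=0.05,
                       salary_eur=80_000, isp_monthly_eur=100,
                       power_monthly_eur=200)
print(opex, fte_hours_per_week(0.05))
```

The staffing term is the most sensitive input here: raising the allocation from 0.05 to 0.5 FTE would multiply that component tenfold.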

💵 Cost Summary

Capex (amortized 36 months): €0

Power & cooling: €0

Maintenance: €0

IT staff: €0

ISP: €0

Monthly Total: €0

Annual Total: €0

3-Year TCO: €0

📈 3-Year TCO Comparison

Metric	Cloud API	Cloud GPU	On-Premise
Monthly Cost	€0	€0	€0
3-Year TCO	€0	€0	€0
Breakeven vs API	-	-	-
ROI over 3 years	-	-	-
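The breakeven and ROI rows can be approximated with a simple linear model: on-premise pays its capex up front and then recovers it through lower monthly costs. The sketch below assumes that model and defines ROI as 3-year savings divided by on-premise TCO; both choices are assumptions, as are the function names. The inputs are the rounded default-scenario figures (€44K capex, €39K opex, €227K Cloud API, €91K Cloud GPU over 3 years).

```python
import math


def breakeven_months(capex: float, on_prem_monthly_opex: float,
                     alternative_monthly_cost: float) -> float:
    """Months until cumulative on-premise cost drops below the alternative."""
    monthly_saving = alternative_monthly_cost - on_prem_monthly_opex
    if monthly_saving <= 0:
        return math.inf  # on-premise never catches up
    return capex / monthly_saving


def roi_pct(on_prem_tco: float, alternative_tco: float) -> float:
    """Savings over the period as a percentage of on-premise TCO (assumed definition)."""
    return (alternative_tco - on_prem_tco) / on_prem_tco * 100


MONTHS = 36
CAPEX, OPEX_3Y = 44_000, 39_000          # on-premise, EUR
API_3Y, GPU_3Y = 227_000, 91_000         # cloud alternatives, EUR

be_api = breakeven_months(CAPEX, OPEX_3Y / MONTHS, API_3Y / MONTHS)
be_gpu = breakeven_months(CAPEX, OPEX_3Y / MONTHS, GPU_3Y / MONTHS)
roi_api = roi_pct(CAPEX + OPEX_3Y, API_3Y)
print(f"breakeven vs API: {be_api:.1f} mo, vs Cloud GPU: {be_gpu:.1f} mo, "
      f"ROI vs API: {roi_api:.0f}%")
```

With these rounded inputs the linear model gives about 8 months against Cloud API and about 30 months against Cloud GPU, so the ~18-month breakeven quoted for the default scenario presumably rests on a different baseline or on inputs not shown here.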

💡 Recommendations

📥 Export Results