Executive Summary

  • On-prem summarization pipeline for regulated R&D documents: no data egress, auditable processing.
  • 3.6× faster p95 batch latency (4.3s to 1.2s) through TensorRT-LLM optimization and batch scheduling.
  • Quality controls with evaluation gates and internal watermarking for safe reuse.
  • Predictable unit economics: 58% lower TCO than cloud APIs at the target volume.
  • Hybrid retrieval (BM25 + embeddings) to reduce omissions and improve factuality.

Before / After

Metric                Before   After    Improvement
p95 latency (batch)   4.3s     1.2s     3.6×
Cost / 1k tokens      €0.045   €0.019   -58%
Data egress           Yes      0        Eliminated
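
As a quick check, the improvement figures can be re-derived from the before and after values reported above. This is a minimal sketch; the function names are illustrative and not part of the project.

```python
def speedup(before: float, after: float) -> float:
    """How many times faster 'after' is compared with 'before'."""
    return before / after


def pct_change(before: float, after: float) -> float:
    """Relative change from 'before' to 'after', in percent."""
    return (after - before) / before * 100


# Values taken from the Before / After figures in this case study.
latency_speedup = speedup(4.3, 1.2)      # p95 batch latency, seconds
cost_change = pct_change(0.045, 0.019)   # cost per 1k tokens, EUR

print(f"{latency_speedup:.1f}x")  # prints 3.6x
print(f"{cost_change:.0f}%")      # prints -58%
```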

Timeline

W1-2

Corpus + evaluation harness

Data inventory, redaction rules, and an eval set focused on factuality and omission rates.

W3-5

Optimization + quality gates

TensorRT-LLM serving, batch scheduling, hybrid retrieval, and automated regression checks.

W6

Production hardening

Watermarking, audit logs, access controls, and operational dashboards.

Decisions & Trade-offs

Serving

Choice: TensorRT-LLM for batch throughput
Alternatives: vLLM
Why: Maximizes throughput and cost efficiency for batch workloads.
Risks: More complex build/upgrade pipeline.

Retrieval

Choice: Hybrid BM25 + embeddings
Alternatives: embeddings-only
Why: Reduces omissions and improves coverage on technical terms.
Risks: Needs careful weighting and evaluation.
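
One common way to combine lexical and embedding scores is a weighted sum over normalized scores. The sketch below illustrates that idea only; the case study does not state which fusion method was used, and the `alpha` weight, function names, and sample scores are all hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and cosine scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(
    bm25: dict[str, float],
    dense: dict[str, float],
    alpha: float = 0.5,  # hypothetical weight: 1.0 = BM25 only, 0.0 = embeddings only
) -> list[tuple[str, float]]:
    """Fuse two score sets; a document missing from one side scores 0 there."""
    b, d = min_max(bm25), min_max(dense)
    fused = {
        doc: alpha * b.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
        for doc in set(b) | set(d)
    }
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Illustrative scores only, not project data.
bm25_scores = {"doc_a": 12.0, "doc_b": 3.0, "doc_c": 0.5}
dense_scores = {"doc_b": 0.82, "doc_c": 0.80, "doc_d": 0.40}
ranking = hybrid_rank(bm25_scores, dense_scores, alpha=0.5)
```

The weight is the part that needs evaluation: sweeping `alpha` against the omission-focused eval set is one way to pick it, rather than fixing it by hand.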

Security

Choice: Zero egress + watermarking + audit logs
Alternatives: cloud LLM APIs
Why: Protects R&D IP and enables regulated workflows.
Risks: Higher responsibility for patching and ops.

Stack & Architecture

Models

  • Fine-tuned summarization model
  • Bi-encoder embeddings (768D)

Serving

  • TensorRT-LLM
  • Nightly batch scheduler

Vector

  • PGVector

Security

  • Air-gapped updates
  • Watermarking
  • Audit logs

SLO & KPI

Batch p95 ≤ 1.5s

✓ Achieved 1.2s

Data egress = 0

✓ Enforced
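
A latency SLO of this kind can be checked mechanically from recorded batch latencies. The following is a minimal sketch using the nearest-rank percentile definition; the helper names and sample data are hypothetical, and the project's actual monitoring stack may compute p95 differently.

```python
def p95(samples: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not samples:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_slo(samples: list[float], threshold_s: float = 1.5) -> bool:
    """True when the p95 latency is within the SLO threshold (seconds)."""
    return p95(samples) <= threshold_s


# Illustrative latencies in seconds, not measured project data.
latencies = [1.0] * 95 + [2.0] * 5
slo_ok = meets_slo(latencies)
```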

ROI & Unit Economics

Formula: ROI = (ΔProd + ΔQuality + Risk avoided) − (amortized Capex + Opex)
  • TCO down 58% vs cloud APIs at the target volume
  • 3.6× faster batch processing, measured at p95 latency
  • Zero egress reduces risk for regulated R&D content
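
The ROI formula can be written as a small function. This sketch reads the capex term as capital expenditure spread evenly over an amortization period, which is an assumption, and every input value below is a placeholder rather than a project figure.

```python
def roi(
    delta_prod: float,         # productivity gain per period
    delta_quality: float,      # quality gain per period
    risk_avoided: float,       # value of avoided risk per period
    capex: float,              # total capital expenditure
    amortization_periods: float,  # assumed: periods over which capex is spread
    opex: float,               # operating cost per period
) -> float:
    """Benefits minus costs for one period, all in the same currency unit."""
    if amortization_periods <= 0:
        raise ValueError("amortization_periods must be positive")
    benefits = delta_prod + delta_quality + risk_avoided
    costs = capex / amortization_periods + opex
    return benefits - costs


# Placeholder inputs for illustration only.
example_roi = roi(
    delta_prod=100.0,
    delta_quality=50.0,
    risk_avoided=25.0,
    capex=120.0,
    amortization_periods=3,
    opex=60.0,
)
print(example_roi)  # prints 75.0
```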

Risks & Mitigations

Risk: Omission / factuality regressions → Mitigation: automated eval gates + regression reports.
Risk: Ops overhead for on-prem serving → Mitigation: hardened release pipeline and observability-first rollout.
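
An automated eval gate can be as simple as comparing a candidate build's metrics against the current baseline and blocking the release on regression. The sketch below assumes two metrics, factuality and omission rate, as named in the evaluation harness; the `EvalResult` type, the `tolerance` value, and the sample scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    factuality: float     # share of summaries judged factual; higher is better
    omission_rate: float  # share of summaries missing key content; lower is better


def gate(baseline: EvalResult, candidate: EvalResult, tolerance: float = 0.01) -> list[str]:
    """Return the list of regressions; an empty list means the gate passes."""
    failures = []
    if candidate.factuality < baseline.factuality - tolerance:
        failures.append(
            f"factuality regressed: {baseline.factuality:.3f} -> {candidate.factuality:.3f}"
        )
    if candidate.omission_rate > baseline.omission_rate + tolerance:
        failures.append(
            f"omission rate regressed: {baseline.omission_rate:.3f} -> {candidate.omission_rate:.3f}"
        )
    return failures


# Illustrative scores only, not project results.
baseline = EvalResult(factuality=0.90, omission_rate=0.05)
passing = gate(baseline, EvalResult(factuality=0.91, omission_rate=0.04))
failing = gate(baseline, EvalResult(factuality=0.85, omission_rate=0.09))
```

Returning the failure messages, rather than a bare boolean, lets the same check feed the regression report.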

Lessons learned

  • Hybrid retrieval reduces silent omissions on technical R&D terms.
  • Batch workloads reward build discipline and stable inference configs.
  • Governance (watermarking, audit logs) is a product feature in regulated domains.

Testimonials

"We kept all sensitive research on-prem and improved throughput without sacrificing quality."

— R&D Engineering Manager
