Executive Summary
- On-prem summarization pipeline for regulated R&D documents: no data egress, auditable processing.
- 3.6× faster p95 through TensorRT-LLM optimization and batch scheduling.
- Quality controls with evaluation gates and internal watermarking for safe reuse.
- Predictable unit economics: -58% TCO vs cloud APIs on the target volume.
- Hybrid retrieval (BM25 + embeddings) to reduce omissions and improve factuality.
Before / After
Metric
Before
After
Improvement
p95 latency (batch)
4.3s
1.2s
3.6×
Cost / 1k token
€0.045
€0.019
-58%
Data egress
Yes
0
Eliminated
Timeline
W1-2
Corpus + evaluation harness
Data inventory, redaction rules, and an eval set focused on factuality and omission rates.
W3-5
Optimization + quality gates
TensorRT-LLM serving, batch scheduling, hybrid retrieval, and automated regression checks.
W6
Production hardening
Watermarking, audit logs, access controls, and operational dashboards.
Decisions & Trade-offs
Serving
Choice: TensorRT-LLM for batch throughput
Alternatives: vLLM
Why: Maximizes throughput and cost efficiency for batch workloads.
Risks: More complex build/upgrade pipeline.
Retrieval
Choice: Hybrid BM25 + embeddings
Alternatives: embeddings-only
Why: Reduces omissions and improves coverage on technical terms.
Risks: Needs careful weighting and evaluation.
Security
Choice: Zero egress + watermarking + audit logs
Alternatives: cloud LLM APIs
Why: Protects R&D IP and enables regulated workflows.
Risks: Higher responsibility for patching and ops.
Stack & Architecture
Models
- Fine-tuned summarization model
- Bi-encoder embeddings (768D)
Serving
- TensorRT-LLM
- Nightly batch scheduler
Vector
- PGVector
Security
- Air-gapped updates
- Watermarking
- Audit logs
SLO & KPI
Batch p95 ≤ 1.5s
✓ Achieved 1.2s
Data egress = 0
✓ Enforced
ROI & Unit Economics
Formula: ROI = (ΔProd + ΔQuality + Risk avoided) − (Capex/amm + Opex)
- ΔTCO ↓ 58% vs cloud APIs on the target volume
- 3.6× faster processing on the p95 workload
- Zero egress reduces risk for regulated R&D content
Risks & Mitigations
Risk: Omission / factuality regressions → Mitigation: automated eval gates + regression reports.
Risk: Ops overhead for on-prem serving → Mitigation: hardened release pipeline and observability-first rollout.
Lessons learned
- Hybrid retrieval reduces silent omissions on technical R&D terms.
- Batch workloads reward build discipline and stable inference configs.
- Governance (watermarking, audit logs) is a product feature in regulated domains.
Testimonials
"We kept all sensitive research on-prem and improved throughput without sacrificing quality."
— R&D Engineering Manager