ScreenLeak: PII redaction on screen recording telemetry

A multi-modal benchmark measuring how well today’s tools redact PII from screen telemetry, screenshots, and computer-use traces

Try it — redact PII in your browser

Paste a captured string or drop in a screenshot and watch the actual local models black out PII, right here. Everything runs in your browser — nothing is uploaded.

Text redactor v45 · 278 MB INT8

One captured fragment per line — window titles, terminal output, OCR, chat (exactly how screenpipe redacts each string as it's captured). Catches API keys, passwords, connection strings, emails, repos…


Image redactor rfdetr_v11 · 109 MB

Finds and blacks out PII regions in a screenshot — names, IDs, addresses, secrets and more. Pick a sample or upload your own. Works best on clean, standard app UIs; unusual or low-quality screens may be missed or over-boxed.


Zero-leak rate — local models vs frontier & cloud

Text PII · desktop telemetry strings

Gemini 3.1 Pro	91.0%
GPT-5.5	90.7%
Claude Opus 4.7	87.8%
pii-redactor · local	86.7%
Google Cloud DLP	37.7%
Microsoft Presidio	35.4%

Image PII regions · IoU ≥ 0.30

pii-image-redactor · local	98.9%
Gemini 3.1 Pro	4.2%
GPT-5.5	3.2%
Google Cloud DLP	2.6%
Claude Opus 4.7	2.1%
Microsoft Presidio	0.5%

Zero-leak = share of items where every PII span (text) or region (image) is caught. Local models run fully offline (~10 ms text · ~120 ms image). Full methodology, confidence intervals & per-framework breakdowns in the leaderboard.
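A minimal sketch of the zero-leak metric for text. It assumes each gold PII span and each predicted redaction is a half-open character range, and that a gold span counts as caught only when the predictions cover every one of its characters. That coverage rule and the names `span_caught` / `zero_leak_rate` are illustrative assumptions; the repo's scorer may match spans differently.

```python
from typing import Sequence, Tuple

Span = Tuple[int, int]  # half-open character range [start, end)


def span_caught(gold: Span, predicted: Sequence[Span]) -> bool:
    """Assumed rule: a gold span is caught only if the union of predicted
    redaction spans covers every character in it."""
    start, end = gold
    covered = [False] * (end - start)
    for p_start, p_end in predicted:
        # Mark the part of the gold span that this prediction overlaps.
        for i in range(max(start, p_start), min(end, p_end)):
            covered[i - start] = True
    return all(covered)


def zero_leak_rate(items: Sequence[Tuple[Sequence[Span], Sequence[Span]]]) -> float:
    """items holds one (gold_spans, predicted_spans) pair per captured string.
    A string is leak-free only if every one of its gold spans is caught."""
    if not items:
        raise ValueError("need at least one item")
    leak_free = sum(
        all(span_caught(g, preds) for g in gold) for gold, preds in items
    )
    return leak_free / len(items)


items = [
    ([(0, 5), (10, 20)], [(0, 5), (10, 20)]),  # both spans fully redacted
    ([(3, 8)], [(3, 6)]),                      # chars 6-7 left visible: leak
    ([(0, 4)], [(0, 100)]),                    # over-redaction still counts as caught
]
print(zero_leak_rate(items))  # 0.6666666666666666
```

The third item shows why this metric never penalizes over-redaction: blacking out the whole string still catches the span. Strings with no gold spans count as leak-free under this sketch.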

Runs entirely in your browser via transformers.js (text) and onnxruntime-web (image) — nothing is uploaded. Models: pii-redactor · pii-image-redactor. Synthetic samples only — no real PII.

Headline — composite compliance coverage

Each adapter is scored on every surface where it operates. Composite = unweighted mean across the three surfaces; trace is the weakest surface in every row and drags each composite down.

Framework	Text (`v45_phase3`)	Image (`rfdetr_v11`)	Trace (`gpt5`)	Composite
HIPAA	91.8%	98.8%	76.0%	88.9%
GDPR	90.2%	98.8%	68.0%	85.7%
CCPA	90.2%	98.8%	68.0%	85.7%
SOC 2	88.0%	98.9%	68.0%	85.0%
PCI DSS	88.7%	100.0%	78.3%	89.0%
DPDPA	91.6%	98.8%	72.0%	87.5%

The same label-subset dict (scoring/frameworks.py) is applied across all three sub-benches. Numbers are zero-leak rates on the private val sets (422 text · 221 image · 25 trace). Full breakdown: results/framework_coverage.md.
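Because the composite is an unweighted mean, it can be checked by hand. The sketch below recomputes it from the per-surface rates in the table (the dict is copied from the table, not loaded from scoring/frameworks.py) and reports the weakest surface for each framework.

```python
from typing import Dict, Tuple

# (text, image, trace) zero-leak rates in %, copied from the headline table.
SURFACE_RATES: Dict[str, Tuple[float, float, float]] = {
    "HIPAA": (91.8, 98.8, 76.0),
    "GDPR": (90.2, 98.8, 68.0),
    "CCPA": (90.2, 98.8, 68.0),
    "SOC 2": (88.0, 98.9, 68.0),
    "PCI DSS": (88.7, 100.0, 78.3),
    "DPDPA": (91.6, 98.8, 72.0),
}


def composite(rates: Tuple[float, ...]) -> float:
    """Unweighted mean across surfaces."""
    return sum(rates) / len(rates)


for framework, rates in SURFACE_RATES.items():
    surfaces = dict(zip(("text", "image", "trace"), rates))
    weakest = min(surfaces, key=surfaces.get)
    print(f"{framework:<8} composite {composite(rates):.1f}%  weakest: {weakest}")
```

An unweighted mean lets the near-perfect image column offset part of the trace gap; a min-based composite would instead report the trace number directly.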

Per-surface — three different problems, three different profiles

1. Text: frontier models detect PII fine. So does a 278 MB local model.

n=422 desktop telemetry strings (window titles, accessibility (AX) nodes, OCR fragments), hand-labeled across 13 categories (the 13th, private_sensitive, covers GDPR Art. 9 special-category data / non-Safe-Harbor PHI). 95% bootstrap CIs in parentheses:

Model	Zero-leak	macro-F1
Gemini 3.1 Pro	91.0% (88.1 – 93.9%)	0.847
GPT-5.5	90.7% (87.8 – 93.6%)	0.847
Claude Opus 4.7	87.8% (84.1 – 91.0%)	0.809
`v45_phase3` ⭐ local	86.7% framework-avg	0.78
`privacy_filter_ft_v6` (1.4 B params)	80.9% (76.5 – 84.9%)	0.724
Google Cloud DLP	37.7%	0.236
Microsoft Presidio	35.4%	0.199
Regex baseline	33.9%	0.565

v45_phase3 is a 278 MB INT8 ONNX model (an xlm-roberta-base fine-tune) that runs offline at 9 ms p50 on CPU: within 5 points of the frontier APIs, at zero per-call cost. The two established PII products, Google Cloud DLP and Microsoft's open-source Presidio, barely beat the regex baseline on zero-leak and trail it on macro-F1. Both are built for documents, not screen telemetry.
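The text intervals are 95% bootstrap CIs. Below is a minimal percentile-bootstrap sketch over per-item 0/1 leak-free outcomes, assuming plain resampling of items with replacement; the resample count, seed and percentile indexing are illustrative choices, not the repo's settings. The example uses 384 leak-free items out of 422, which is what Gemini 3.1 Pro's 91.0% implies. Endpoints depend on the seed and on the resampling scheme, so don't expect to reproduce the table to the decimal.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(outcomes: Sequence[int], n_resamples: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for a zero-leak rate.

    outcomes: one entry per item, 1 if leak-free and 0 if any PII leaked.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample items with replacement and record each resample's rate.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo = rates[int(alpha / 2 * n_resamples)]
    hi = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


outcomes = [1] * 384 + [0] * 38  # 384/422 leak-free, i.e. 91.0%
lo, hi = bootstrap_ci(outcomes)
print(f"{100 * sum(outcomes) / len(outcomes):.1f}% ({100 * lo:.1f} - {100 * hi:.1f}%)")
```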

2. Images: the same tools can’t find PII in pixels. A specialized detector can.

n=190 PII-bearing screenshots of realistic app UIs; a region counts as caught at IoU ≥ 0.30. 95% Wilson CIs in parentheses:

Model	Zero-leak	Oversmash
`rfdetr_v11` (local, 28 M params)	98.9% (96.2 – 99.7%)	0.0%
Gemini 3.1 Pro	4.2% (2.1 – 8.1%)	9.7%
GPT-5.5	3.2% (1.5 – 6.7%)	22.6%
Google Cloud DLP	2.6% (1.1 – 6.0%)	19.4%
Tesseract OCR + 16 regexes	2.6% (1.1 – 6.0%)	3.2%
Claude Opus 4.7	2.1% (0.8 – 5.3%)	35.5%
Microsoft Presidio	0.5% (0.1 – 2.9%)	48.4%
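The image intervals are 95% Wilson score intervals, which stay inside [0, 1] and behave better than the normal approximation near 0% and 100%, where most of this table sits. A sketch with the textbook formula; the success counts are back-derived from the reported rates on n=190 (98.9%, 4.2%, 0.5%), and the printed intervals match the table.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion; z = 1.96 gives 95%."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - margin, center + margin


# Counts implied by the reported rates on n=190: 98.9%, 4.2%, 0.5%.
for k in (188, 8, 1):
    lo, hi = wilson_interval(k, 190)
    print(f"{k}/190: {100 * k / 190:.1f}% ({100 * lo:.1f} - {100 * hi:.1f}%)")
# 188/190: 98.9% (96.2 - 99.7%)
# 8/190: 4.2% (2.1 - 8.1%)
# 1/190: 0.5% (0.1 - 2.9%)
```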

Methodology, briefly

Synthetic data only. No real PII, no real users: all names, emails, phone numbers, IDs and secrets are fictional, and canonical placeholders are used where they exist (e.g. SSN 123-45-6789).
Pixel-precise gold on the image bench. Annotation error sits well within the slack of the IoU ≥ 0.30 match threshold.
Strict gold integrity. Every gold item is verified to appear verbatim at injection time, and CI enforces the check.
CIs. 95% bootstrap on text and trace, 95% Wilson on image. n=25 on trace, n=190 on image, n=345 on text. Trace CIs are wide, so the trace ranking is directional, not decisive.
Shared framework dict. scoring/frameworks.py is the single source of truth for HIPAA / GDPR / CCPA / SOC 2 / PCI DSS / DPDPA across all three sub-benches.
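A sketch of the IoU ≥ 0.30 region match used on the image bench. It assumes axis-aligned boxes as (x1, y1, x2, y2) in pixels, and that a screenshot is leak-free only when every gold region has at least one predicted box at or above the threshold. Whether the repo uses this any-match rule or one-to-one matching isn't stated here, so treat the matching policy as an assumption.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def screenshot_leak_free(gold: Sequence[Box], preds: Sequence[Box],
                         threshold: float = 0.30) -> bool:
    """Assumed any-match rule: every gold region needs some predicted box
    with IoU >= threshold."""
    return all(any(iou(g, p) >= threshold for p in preds) for g in gold)


name_field = (10, 10, 110, 30)
print(iou(name_field, (20, 12, 120, 32)))                       # ~0.68: slightly shifted box
print(screenshot_leak_free([name_field], [(20, 12, 120, 32)]))  # True
print(screenshot_leak_free([name_field], [(10, 10, 30, 30)]))   # False: IoU is only 0.2
```

A box that covers only the first fifth of a name field overlaps the gold region but scores IoU 0.2, so it counts as a leak.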

Full methodology, threat model, limitations, and per-category breakdowns are in the repo.

What this is not

Not a capability benchmark. A model that refuses to do anything scores 100% zero-leak and is useless. Use WebArena / OSWorld / GAIA for capability.
Not a vendor pitch. The scoring code is Apache 2.0 and the sample corpus is CC-BY 4.0. The full val sets sit in a private companion repo to prevent contamination of future evaluations, not for monetization.
Not exhaustive. v0 ships 422 text, 221 image and 25 trace val cases, so numbers are directional. Planned for v0.1: an adversarial prompt-injection split, a larger trace corpus, broader image-bench category coverage, multilingual cases and more adapters.

Run it yourself

git clone https://github.com/screenpipe/screenleak
cd screenleak && make install

export ANTHROPIC_API_KEY=...  OPENAI_API_KEY=...  GOOGLE_API_KEY=...

make bench-text  ADAPTER=claude          # or: gpt5, gemini, v45_phase3, gcp_dlp, regex, …
make bench-image ADAPTER=rfdetr          # or: claude, gpt5, gemini, regex_ocr, …
make bench-trace ADAPTER=claude          # or: gpt5, gemini

# Per-compliance-framework breakdowns
python text/src/framework_coverage.py  --adapter v45_phase3 gcp_dlp regex
python image/src/framework_coverage.py --adapter rfdetr

The adapter interface is documented in CONTRIBUTING.md. PRs that add new models are welcome.

Cite this

@misc{screenleak2026,
  title  = {ScreenLeak: A Multi-Modal Benchmark for PII Redaction in Computer-Use AI},
  author = {Beaumont, Louis},
  year   = {2026},
  howpublished = {\url{https://github.com/screenpipe/screenleak}},
}

Louis Beaumont (Screenpipe) — louis@screenpi.pe