SF's Digital Records Push Stalls on a Surprisingly Simple Problem: Duplicate Images
City officials, archivists, and technology experts say redundant photo files are quietly choking San Francisco's push to digitize decades of public records.

San Francisco's Department of Technology has been working since 2023 to consolidate the city's sprawling digital archive — permits, planning documents, property records — into a single accessible platform. Now, a less glamorous obstacle has surfaced at the center of that effort: thousands upon thousands of duplicate image files clogging the system and driving up storage costs before the project reaches full deployment.
The problem is not abstract. City records managers say that redundant scanned images of the same documents — created when multiple departments digitized overlapping paper files without coordination — now account for a meaningful share of the archive's total storage burden. At rates that cloud vendors currently charge for enterprise-grade storage in the Bay Area market, even a few extra terabytes translate into recurring annual costs that compound quickly for a city already facing budget pressure.
The timing is pointed. San Francisco's Controller's Office has been conducting a broader review of technology spending across city departments throughout the first half of 2026, and the digitization program sits squarely in that scope. Housing advocates at the Tenderloin Housing Clinic and planning reform groups that monitor the city's pipeline of residential permits depend on fast, accurate document retrieval — delays tied to bloated, poorly deduplicated archives have real downstream effects on how quickly permit histories can be verified.
The San Francisco Planning Department, headquartered at 49 South Van Ness Avenue, processes tens of thousands of permit applications annually. Staff there have flagged that searches for legacy records — particularly from pre-2010 paper files that were scanned in batches — sometimes return multiple copies of the same image, forcing manual review that slows turnaround. The San Francisco Public Library's San Francisco History Center at the main branch on Larkin Street faces a parallel version of the issue with its digitized photograph collections, where duplicate uploads from separate donor batches have created redundancy in the public-facing catalog.
Technology consultants working with Bay Area municipal clients say the duplicate-image problem is common in large-scale digitization projects that lacked a unified ingest standard from the start. The fix involves running automated hash-matching or perceptual comparison algorithms across file libraries to flag near-identical images, then applying a human review layer before deletion — a process that is straightforward in principle but labor-intensive at scale. For a collection running into the hundreds of thousands of files, even a well-tuned deduplication pass can take weeks of compute time followed by weeks of staff verification.
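As a rough illustration of the hash-matching step described above, the following minimal Python sketch groups files whose bytes are identical and builds a review queue instead of deleting anything. It is not the city's tooling: the file names, the sample bytes, and the helper names `find_exact_duplicates` and `build_review_queue` are all hypothetical. It also covers only exact matches; near-identical scans would need a perceptual comparison, which is not shown here.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(files):
    """Group file names by the SHA-256 digest of their contents.

    `files` maps a (hypothetical) file name to its raw bytes.
    Returns only the groups that contain more than one file.
    """
    groups = defaultdict(list)
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        groups[digest].append(name)
    # Sort names inside each group so the output is stable.
    return [sorted(names) for names in groups.values() if len(names) > 1]


def build_review_queue(duplicate_groups):
    """Propose one file to keep per group; nothing is deleted here.

    A human reviewer is expected to confirm each proposal.
    """
    queue = []
    for names in duplicate_groups:
        queue.append({"keep": names[0], "review_for_deletion": names[1:]})
    return queue


# Illustrative data only: two departments scanned the same permit.
sample = {
    "dept_a/permit_0001.tif": b"scan-bytes-A",
    "dept_b/permit_0001.tif": b"scan-bytes-A",
    "dept_a/permit_0002.tif": b"scan-bytes-B",
}

queue = build_review_queue(find_exact_duplicates(sample))
for item in queue:
    print(item)
```

Keeping the deletion decision out of the code mirrors the human review layer the consultants describe: the script only proposes which copy to keep.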
City officials have not publicly characterized the duplicate-image issue as a crisis, but the Department of Technology's 2025-2026 capital plan, which is a public document, lists data quality remediation as a priority line item within the digitization program. Archival technology specialists who advise public agencies — including consultants affiliated with the Society of American Archivists — generally recommend that municipalities adopt an ingest protocol that checks for duplicates at the point of upload rather than retroactively, a standard that San Francisco's earlier digitization waves did not consistently apply.
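A check at the point of upload, as the specialists recommend, could look something like the short Python sketch below. It is an assumption-laden illustration, not a description of any system San Francisco runs: the `IngestIndex` class, its `ingest` method, and the file names are hypothetical, and the index lives in memory rather than in a real database.

```python
import hashlib


class IngestIndex:
    """Hypothetical in-memory index that rejects byte-identical uploads."""

    def __init__(self):
        # Maps SHA-256 digest -> name of the first file stored with it.
        self._seen = {}

    def ingest(self, name, data):
        """Return ("stored", name) for new content, or
        ("duplicate", existing_name) if identical bytes were already stored."""
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is not None:
            return ("duplicate", existing)
        self._seen[digest] = name
        return ("stored", name)


index = IngestIndex()
first = index.ingest("dept_a/permit_0001.tif", b"scan-bytes-A")
second = index.ingest("dept_b/permit_0001.tif", b"scan-bytes-A")
print(first)
print(second)
```

The design choice here is to report which earlier file matched rather than silently dropping the upload, so a records custodian can decide which copy carries the authoritative metadata.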
Digital preservation researchers note that the stakes extend beyond budget. Duplicate files without consistent metadata can create legal ambiguity in public records requests under California's Public Records Act, particularly when two near-identical scans carry different timestamps or version tags. That ambiguity is not hypothetical in a city where property disputes, environmental review challenges, and permit appeals regularly hinge on archived documents.
For residents and businesses navigating San Francisco's permit process — whether pulling a renovation permit in the Excelsior or checking a conditional use record in the Castro — the practical advice from records professionals is consistent: if a document search returns multiple results for the same record, request clarification from the relevant department before relying on any single file. The city's 311 service line remains the starting point for routing those questions to the correct custodian.
The Department of Technology has indicated that a deduplication phase is planned for the third quarter of 2026. Whether the timeline holds will depend partly on budget authorization still working its way through the Board of Supervisors.
Published by The Daily San Francisco