San Francisco's city-managed digital archives are riddled with duplicate images — tens of thousands of redundant photographs, scanned documents, and public records files spread across at least a half-dozen municipal systems — and the departments responsible for cleaning them up are now facing a hard deadline and harder choices.
The problem has been building for years but grew acute after the Department of Technology's 2024 consolidation of legacy storage servers, which swept content from agencies including the Planning Department, the Recreation and Parks Department, and the San Francisco Public Library into a shared infrastructure. What planners expected to be a straightforward migration instead surfaced a sprawling mess of duplicated image assets, some appearing dozens of times under different filenames.
Why it matters now: the city is under pressure to finalize its digital records compliance strategy before a state archival audit window opens later this year. California's Public Records Act mandates that agencies maintain findable, accessible records — but duplicate-heavy archives slow search, inflate storage costs, and can generate conflicting versions of the same official document, creating liability.
Where the Decisions Are Being Made
At the San Francisco Public Library's main branch on Larkin Street, staff in the San Francisco History Center have been dealing with the downstream consequences longest. The Center holds digitized photographs of the city dating to the 1850s, and curators there have flagged that automated deduplication tools — the kind now being pitched by several vendors to the Department of Technology — risk conflating near-identical images that are actually distinct historical records, such as two photographs taken seconds apart on Market Street in 1906, after the earthquake.
The Planning Department, headquartered on Mission Street, faces a different version of the same decision. Its digital case files for permit applications in neighborhoods like the Mission District and the Tenderloin frequently contain multiple scans of the same submitted document, uploaded at different stages of review. Since March 2026, staff there have been piloting a hash-based deduplication protocol that flags files with identical data signatures for removal without relying on AI interpretation. That approach is cheaper and more transparent, but it misses near-duplicates, such as slightly different scans of the same page.
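For readers curious how hash-based matching works in principle, here is a minimal Python sketch. It is not the Planning Department's actual protocol, whose details have not been published; the choice of SHA-256, the function names, and the sample filenames are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def find_exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group filenames whose contents are byte-for-byte identical."""
    by_digest: dict[str, list[str]] = defaultdict(list)
    for name, data in files.items():
        by_digest[content_digest(data)].append(name)
    # Only digests shared by two or more files indicate duplicates.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical in-memory stand-ins for scanned permit documents.
sample_files = {
    "permit_scan_v1.tif": b"page-one-scan-bytes",
    "permit_copy_final.tif": b"page-one-scan-bytes",  # exact copy, new name
    "permit_rescan.tif": b"page-one-scan-bytes!",     # one byte differs
}

duplicate_groups = find_exact_duplicates(sample_files)
print(duplicate_groups)
# → [['permit_copy_final.tif', 'permit_scan_v1.tif']]
```

The renamed copy is caught because its bytes are identical, while the rescan that differs by a single byte is not grouped with the others. That is the near-duplicate blind spot in miniature.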
The Recreation and Parks Department, which manages imagery across more than 220 parks and facilities including McLaren Park and the Panhandle, has taken a third path: a freeze on new uploads to shared drives while it audits its existing holdings manually. That freeze, which began in May, has already frustrated staff who manage programming at sites like the Sunset Reservoir and Precita Park.
The Costs, the Tools, and the Trade-offs
Three vendors have submitted proposals to the Department of Technology, with pricing ranging from roughly $80,000 to $240,000 for city-wide deduplication licenses, according to procurement documents filed with the city's supplier portal. The higher-end tools use machine learning to identify near-duplicate images, not just exact copies. The lower-cost options are rule-based and faster but generate more false positives, meaning human reviewers still have to make the final call on flagged files — which adds labor costs that aren't reflected in the sticker price.
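To make the near-duplicate idea concrete, the sketch below uses a difference hash, a simple perceptual fingerprint compared by counting differing bits. It is a stand-in chosen for brevity, not the machine-learning approach the vendors are proposing, and every name, pixel grid, and threshold in it is hypothetical. It assumes images have already been shrunk to tiny grayscale grids.

```python
def dhash_bits(pixels: list[list[int]]) -> list[int]:
    """Difference hash: 1 where a pixel is brighter than its right neighbor."""
    return [
        1 if row[i] > row[i + 1] else 0
        for row in pixels
        for i in range(len(row) - 1)
    ]


def hamming_distance(a: list[int], b: list[int]) -> int:
    """Count the positions where two equal-length bit lists differ."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(a: list[list[int]], b: list[list[int]],
                      max_distance: int = 2) -> bool:
    """Flag two images whose hashes differ in at most max_distance bits."""
    return hamming_distance(dhash_bits(a), dhash_bits(b)) <= max_distance


# Hypothetical 4x5 grayscale grids (0-255), assumed already downscaled.
scan_a = [
    [10, 40, 90, 60, 20],
    [200, 180, 120, 130, 90],
    [15, 15, 80, 70, 75],
    [0, 50, 50, 255, 10],
]
scan_b = [  # the same page rescanned slightly brighter, with a little noise
    [15, 45, 95, 65, 25],
    [205, 185, 125, 135, 95],
    [20, 22, 85, 75, 80],
    [5, 55, 54, 255, 15],
]
other_image = [  # unrelated content
    [90, 60, 20, 10, 40],
    [90, 130, 120, 180, 200],
    [75, 70, 80, 15, 15],
    [10, 255, 50, 50, 0],
]

print(is_near_duplicate(scan_a, scan_b))       # the rescan is flagged
print(is_near_duplicate(scan_a, other_image))  # the unrelated image is not
```

The threshold is the policy lever. Loosen it and more rescans are caught, but so are genuinely distinct records such as two photographs taken seconds apart, which is why flagged files still go to a human reviewer.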
The San Francisco Public Library's tech unit has separately applied for a California State Library grant to fund a dedicated digital archivist position, a role that does not currently exist in the city's personnel structure. A decision on that grant is expected before September 2026.
The key decisions ahead all converge on one question: who is accountable for the archive's integrity. Right now, no single office owns the problem across agencies. A cross-departmental working group convened by the Chief Data Officer has met three times since February, but it has not yet issued formal recommendations. The group's next scheduled session is in late July, and city officials familiar with the process say that meeting is likely to produce a recommendation on which deduplication approach — AI-assisted, hash-based, or manual — becomes the citywide standard.
For residents who rely on public records — journalists pulling permit histories in the Tenderloin, researchers at the History Center, neighborhood groups tracking park maintenance in the Outer Sunset — the practical consequence of delay is continued friction and occasional missing or garbled records. The technical fix exists. The question is whether City Hall can agree on who pays for it and who owns the outcome.