SF's Aging Digital Archive Has a Duplicate Image Problem. Here's What Comes Next.
City agencies and cultural institutions face a cascade of decisions about how to clean up redundant photo records before a costly new storage contract locks them in.

San Francisco's municipal digital infrastructure is heading toward a reckoning. Across at least a dozen city departments — from the Department of Public Works to the San Francisco Public Library's digital collections branch on Larkin Street — duplicate images have accumulated inside legacy content management systems for years, inflating storage costs and making record retrieval slower and less reliable. Now, with a new centralized cloud storage contract expected to go before the Board of Supervisors for approval before the end of Q3 2026, the clock is running on a critical choice: clean up the archive first, or migrate the mess and pay to store it indefinitely.
The timing matters because San Francisco is not alone in confronting this. Municipal governments nationwide have struggled to audit digital assets as photo libraries ballooned through smartphone-era documentation practices and the explosion of open-data mandates after 2012. In San Francisco's case, the problem is amplified by the sheer volume of images generated by programs like the Healthy Streets Operation Center — the multi-agency homelessness response unit that dispatches teams across SoMa, the Tenderloin, and the Mission — which routinely photographs encampment sites before and after intervention. Those images frequently enter multiple department databases simultaneously.
The core question facing the city's Department of Technology, which oversees the centralized infrastructure contract, is whether to run a deduplication sweep before migration or after. Running it before is cheaper in the long run but requires freezing certain database functions for days — potentially weeks — at affected agencies. Running it after means paying cloud storage rates on duplicate files that could number in the hundreds of thousands. Storage costs in enterprise cloud contracts typically run between $0.02 and $0.05 per gigabyte per month, and city IT officials have estimated the overall archive in question runs to several hundred terabytes, though a precise public figure has not been released.
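To give a sense of scale, the arithmetic behind that trade-off can be sketched in a few lines. The per-gigabyte range comes from the figures above; the 300-terabyte archive size and the 15 percent duplicate share are hypothetical placeholders chosen for illustration, not numbers released by the city.

```python
def monthly_storage_cost(archive_tb, duplicate_share, rate_per_gb):
    """Return (total_cost, duplicate_cost) in dollars per month."""
    # Assumes decimal units: 1 TB = 1,000 GB.
    archive_gb = archive_tb * 1000
    total = archive_gb * rate_per_gb
    return total, total * duplicate_share


# Hypothetical inputs: a 300 TB archive in which 15% of bytes are duplicates.
ARCHIVE_TB = 300
DUPLICATE_SHARE = 0.15

# Low and high ends of the typical enterprise rate range cited in the article.
for rate in (0.02, 0.05):
    total, dup = monthly_storage_cost(ARCHIVE_TB, DUPLICATE_SHARE, rate)
    print(f"${rate:.2f}/GB: total ${total:,.0f}/month, "
          f"of which duplicates ${dup:,.0f}/month")
```

Under those assumed inputs, the duplicate files alone would cost between roughly $900 and $2,250 a month to store, a recurring charge for as long as they remain in the archive.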
The San Francisco Arts Commission, which manages a separate image library of public murals, installations, and Civic Center-area events, is one of the institutions watching the process most closely. Its digital collections span more than two decades and include documentation of pieces along the Mission District's Balmy Alley as well as installations in Yerba Buena Gardens. Archivists there have flagged that automated deduplication tools can misidentify near-duplicate images (slightly different crops of the same photograph, for instance) as redundant copies, potentially discarding files that have distinct archival value.
That concern is not hypothetical. In 2023, the city of Los Angeles lost a subset of public works photo documentation after an automated deduplication process misclassified time-stamped variants of the same infrastructure images as redundant files. San Francisco's technology officials have pointed to that episode in internal discussions as a cautionary benchmark, according to publicly posted meeting minutes from the city's Committee on Information Technology.
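The distinction archivists are worried about, between a byte-for-byte copy and a near-duplicate that may deserve to be kept, can be illustrated with a small sketch. It is not the method used by any city vendor. It assumes images have already been reduced to tiny grayscale grids of the same size, and the function names, the sample grids, and the review threshold are all hypothetical.

```python
import hashlib


def exact_digest(pixels):
    """SHA-256 of the raw pixel values; changes if any single pixel changes."""
    flat = bytes(v for row in pixels for v in row)
    return hashlib.sha256(flat).hexdigest()


def average_hash(pixels):
    """One bit per pixel: 1 if brighter than the image's mean, else 0."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return [1 if v > mean else 0 for v in flat]


def hamming(a, b):
    """Count the positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(a, b, near_threshold=2):
    """Label a pair of images; the threshold of 2 bits is an arbitrary choice."""
    if exact_digest(a) == exact_digest(b):
        return "exact duplicate"
    if hamming(average_hash(a), average_hash(b)) <= near_threshold:
        return "near-duplicate: route to human review"
    return "distinct"


# Hypothetical 4x4 grayscale grids (values 0-255).
original = [[10, 10, 200, 200] for _ in range(4)]
# Same picture with one dark corner pixel, standing in for a timestamp overlay.
stamped = [row[:] for row in original]
stamped[3][3] = 0
# A mirrored layout, standing in for an unrelated photograph.
different = [[200, 200, 10, 10] for _ in range(4)]

print(classify(original, original))
print(classify(original, stamped))
print(classify(original, different))
```

The design choice that matters here is the middle branch: a tool that deletes anything below the similarity threshold would discard the time-stamped variant, while one that flags it for review leaves the decision to an archivist.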
The Department of Technology has until approximately October 1, 2026 to finalize the migration scope. That gives agencies a narrow window — roughly 13 weeks — to audit their own holdings, flag images that must be preserved regardless of duplication status, and agree on a shared metadata tagging standard that would make future deduplication more reliable. The SF Public Library, whose digital branch maintains the San Francisco Historical Photograph Collection dating to the 1850s, has already begun that internal audit. Other departments have not publicly confirmed similar efforts.
Three decisions will define how this plays out. First, whether the Board of Supervisors attaches a mandatory pre-migration audit requirement to the storage contract approval. Second, whether the Department of Technology selects a deduplication vendor with a track record in public-sector archival work, rather than a general-purpose enterprise tool. Third, whether smaller agencies — including the Recreation and Parks Department, which manages photo documentation for over 220 parks across the city — are given dedicated staff time to participate in the audit or left to manage it with existing workloads.
None of those decisions have been made publicly yet. The window to make them correctly is closing faster than the archive is growing — though that archive, by all indications, is still growing every day.
Published by The Daily San Francisco