San Francisco's major public institutions are sitting on a growing backlog of duplicate digital images across city-managed archives, and administrators at several departments must decide this summer whether to invest in automated deduplication tools, contract the work out, or simply let the problem compound. The issue has quietly become a budget and governance headache as the city pushes to digitize more records under its open-data mandate.
The timing matters. San Francisco's Department of Technology has been expanding its DataSF platform throughout 2025 and into 2026, uploading historical photographs, permit records and planning documents at a faster clip than at any point since the program launched. That acceleration has a predictable side effect: when staff migrate files from legacy servers into centralized repositories, duplicate images follow. Without a systematic policy for identifying and removing them, storage costs climb and public search tools return cluttered, unreliable results.
Where the Backlog Is Building
The San Francisco Public Library's San Francisco History Center, based at the main branch on Larkin Street in Civic Center, has been among the most visible institutions wrestling with the problem. The center's digitized collection spans more than 200,000 images, ranging from earthquake-era photographs to mid-century planning surveys. Library staff have acknowledged in public budget presentations that duplicated files entered the collection during a 2023 migration from an older content-management system, though the library has not released a precise count of affected records.
The San Francisco Planning Department's property records portal, which covers parcels from the Tenderloin to the Bayview, faces a parallel challenge. Permit photos uploaded by contractors and inspectors frequently arrive as near-identical duplicates — the same exterior shot submitted twice under different file names. Planning staff have flagged this in internal workflow reviews, but a department-wide policy for automated image matching has not been adopted as of July 2026.
The San Francisco Arts Commission, which manages a public-art database covering more than 4,000 works installed across the city, has been piloting a deduplication script on its image repository since January 2026. That pilot covers roughly 18,000 files and is expected to produce a report to commissioners by September 2026.
The Decisions That Will Define the Next Six Months
Three choices will shape how the city handles this going forward. First, department heads must decide whether to adopt a unified citywide standard for image hashing or continue letting each agency manage its own files in isolation. Hashing is a technical process that computes a compact fingerprint for each file: exact copies produce identical fingerprints, and with perceptual hashing, near-exact copies produce similar ones, so both can be flagged automatically. A unified standard would require coordination through the Department of Technology and would likely appear as a line item in the fiscal year 2026-27 budget, which the Board of Supervisors is expected to finalize by August.
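For readers who want to see what image hashing looks like in practice, the following is a minimal sketch, not a description of any tool the city uses or plans to use. It assumes two common techniques: a SHA-256 digest of the file bytes to catch exact copies, and a simple "average hash" over a small grayscale grid to catch near-exact ones. The function names, file names and pixel values are all hypothetical, and a real pipeline would first decode and downscale each image with an imaging library.

```python
import hashlib
from collections import defaultdict


def exact_fingerprint(data: bytes) -> str:
    """SHA-256 digest of the raw bytes; byte-identical files share it."""
    return hashlib.sha256(data).hexdigest()


def group_exact_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group file names whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[exact_fingerprint(data)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def average_hash(pixels: list[list[int]]) -> int:
    """One bit per pixel: 1 if brighter than the grid's mean, else 0.

    `pixels` is assumed to be a small grayscale grid (values 0-255)
    already produced by downscaling the image.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small distances suggest near-duplicates."""
    return bin(a ^ b).count("1")


# Hypothetical example: the same permit photo uploaded under two names.
uploads = {
    "facade_a.jpg": b"same-bytes",
    "IMG_0042.jpg": b"same-bytes",
    "mural.jpg": b"other-bytes",
}
exact_groups = group_exact_duplicates(uploads)

# Hypothetical 2x2 grids: a scan, a slightly brighter rescan, an unrelated image.
scan = [[10, 200], [220, 30]]
rescan = [[20, 210], [230, 40]]
unrelated = [[200, 10], [30, 220]]
near_distance = hamming_distance(average_hash(scan), average_hash(rescan))
far_distance = hamming_distance(average_hash(scan), average_hash(unrelated))
```

The exact hash ignores file names entirely, which is why the same shot submitted twice under different names is still caught. The distance threshold for calling two images near-duplicates is a policy choice that each collection would need to tune.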
Second, institutions must weigh the cost of commercial deduplication software against in-house solutions. Licensing fees for enterprise-grade tools from vendors active in the municipal market typically run between $15,000 and $60,000 annually for collections of the size held by the Planning Department or the Public Library, based on publicly available vendor pricing tiers. That is a meaningful sum for departmental IT budgets that have already absorbed cuts following the city's post-pandemic revenue shortfall.
Third, and most consequentially for the public, administrators must establish clear retention rules: when two images are flagged as duplicates, which one gets kept, who reviews the decision, and how the removal is logged for audit purposes. Without that framework, deduplication risks turning into accidental deletion of historically significant variants — a particular concern at the History Center, where two photographs of the same subject taken minutes apart can carry distinct archival value.
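The three retention questions (which copy is kept, who reviews the decision, and how it is logged) can be illustrated with a short sketch. Everything here is an assumption made for illustration: the record fields, the rule that prefers cataloged and then higher-resolution copies, and the JSON log format are hypothetical and do not reflect any adopted city policy. Note that the sketch only flags a file for removal and never deletes anything itself.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ImageRecord:
    record_id: str
    width: int
    height: int
    has_catalog_metadata: bool


@dataclass(frozen=True)
class RetentionDecision:
    kept: str
    flagged_for_removal: str
    rule: str
    reviewer: str
    decided_at: str


def _rank(record: ImageRecord) -> tuple:
    # Hypothetical rule: cataloged copies first, then higher resolution.
    return (record.has_catalog_metadata, record.width * record.height)


def decide_retention(a: ImageRecord, b: ImageRecord, reviewer: str,
                     decided_at: Optional[str] = None) -> RetentionDecision:
    """Pick which of two flagged duplicates to keep and record who signed off."""
    if not reviewer.strip():
        raise ValueError("a named reviewer is required before any removal")
    if _rank(a) == _rank(b):
        kept, other, rule = a, b, "tie: first listed kept pending manual review"
    elif _rank(a) > _rank(b):
        kept, other, rule = a, b, "prefer cataloged, then higher resolution"
    else:
        kept, other, rule = b, a, "prefer cataloged, then higher resolution"
    stamp = decided_at or datetime.now(timezone.utc).isoformat()
    return RetentionDecision(kept.record_id, other.record_id, rule, reviewer, stamp)


def audit_line(decision: RetentionDecision) -> str:
    """Serialize one decision as a single JSON line for an append-only log."""
    return json.dumps(asdict(decision), sort_keys=True)


# Hypothetical records: a cataloged master scan and a smaller stray copy.
master = ImageRecord("HC-001", 4000, 3000, True)
stray = ImageRecord("HC-001-dup", 1024, 768, False)
decision = decide_retention(stray, master, "reviewer-on-duty",
                            decided_at="2026-07-01T00:00:00+00:00")
log_entry = audit_line(decision)
```

Requiring a named reviewer and writing one log line per decision keeps a human in the loop, which matters most for collections where two similar photographs may each carry archival value.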
The Arts Commission pilot will be the first real test case. If its September report shows meaningful storage savings and no unintended data loss, it could become the template other departments adopt before the end of the calendar year. If it surfaces errors, expect a longer debate at the Board of Supervisors' Government Audits and Oversight Committee. Either way, city departments cannot afford to keep deferring the question: storage infrastructure contracts come up for renewal in early 2027, and the costs negotiated then will reflect however much of the problem remains unsolved.