SF's Digital Archive Reckoning: The Key Decisions Ahead on Duplicate Image Replacement
City agencies and cultural institutions face a critical fork in the road as bloated digital collections strain storage budgets and public access systems.

San Francisco's public agencies and cultural institutions are sitting on a growing problem: years of digitization drives have left municipal archives, public libraries, and civic tech platforms loaded with duplicate image files that eat server space, confuse search results, and cost real money to store. The question of what to do next — which images to keep, which to purge, and who decides — is now landing on the desks of department heads and IT directors across the city.
The timing is not accidental. A wave of AI-assisted cataloguing tools, adopted by institutions including the San Francisco Public Library's San Francisco History Center at Civic Center and the California Historical Society on Jackson Street in Pacific Heights, has made large-scale deduplication technically feasible for the first time. But feasibility and policy are different things, and the city has not yet established clear standards for how duplicate replacement decisions get made, who holds veto authority over deletions, and whether the public gets a say before a file disappears from a shared digital collection.
Duplicate image replacement sounds like a routine IT housekeeping task. It is not. When a file is flagged as a duplicate and removed, the copy that replaces it, even one that looks identical on screen, may carry different metadata, a different provenance record, or a lower resolution than the original. In archive management, those distinctions matter enormously. A photograph of the 1906 earthquake stored at two different resolutions is not the same asset twice; it is two assets with different downstream uses.
The San Francisco Public Library system, which operates 28 locations (the Main Library and 27 branches, from the Excelsior to the Richmond District), has been expanding its digital holdings since at least 2019. Storage costs for municipal digital infrastructure in California cities of comparable size have risen alongside cloud pricing increases; Amazon Web Services and Google Cloud both raised baseline storage rates in 2024, a shift that pushed IT budget conversations into territory they had not occupied before. Exactly how much SFPL spends on image storage annually is not publicly itemized in budget documents reviewed for this article, but the broader Department of Technology capital budget has grown year-over-year, reaching figures discussed in Board of Supervisors committee hearings through early 2026.
The risk of getting deduplication wrong is not hypothetical. The Internet Archive, based on Funston Avenue in the Richmond District, has documented cases where automated deduplication tools flagged archival photographs as duplicates based on visual similarity scores alone, without accounting for physical negative condition, colorization differences, or crop variations. Those cases have informed how San Francisco's own archivists approach the question.
Three choices will define what happens next. First, the city needs to decide whether to adopt a centralized deduplication policy that applies across departments, or let each agency manage its own digital collections independently. A unified standard would reduce redundancy but requires political agreement between the Department of Technology, the Public Library Commission, and the City Administrator's Office — three bodies that do not always move at the same speed.
Second, institutions must choose between fully automated duplicate detection and human-reviewed workflows. Automation is cheaper and faster; human review is slower but catches the edge cases that algorithms miss. A hybrid model, where AI flags candidates and a trained archivist approves each deletion, is the approach currently under discussion at several comparable municipal archives on the East Coast, though San Francisco has not formally announced an equivalent program as of July 4, 2026.
Third, and most consequentially, the city must determine public notification standards. If a duplicate image in a searchable public collection is replaced or removed, does the user who has bookmarked or cited that specific URL get notified? Right now, no such standard exists in San Francisco's digital records policy.
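For readers who want to see what the hybrid model described above could look like in practice, here is a minimal Python sketch. It is an illustration under stated assumptions, not any institution's actual system: the `ImageRecord` fields, the `fingerprint` value (assumed to come from an upstream cataloguing tool), and the function names are all hypothetical. The software only flags candidates and lists how they differ; nothing is removed until a reviewer marks a candidate as approved.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ImageRecord:
    # All fields are hypothetical; a real catalogue would carry far more metadata.
    record_id: str
    fingerprint: str   # assumed: similarity fingerprint supplied by a cataloguing tool
    width: int
    height: int
    provenance: str


@dataclass
class Candidate:
    keep: ImageRecord
    remove: ImageRecord
    differences: list = field(default_factory=list)
    approved: bool = False  # set to True only by a human reviewer


def flag_candidates(records):
    """Group records by fingerprint and propose removals for human review."""
    groups = defaultdict(list)
    for record in records:
        groups[record.fingerprint].append(record)

    candidates = []
    for group in groups.values():
        if len(group) < 2:
            continue
        # Propose keeping the highest-resolution copy in each group.
        keep = max(group, key=lambda r: r.width * r.height)
        for other in group:
            if other is keep:
                continue
            diffs = []
            if other.provenance != keep.provenance:
                diffs.append("provenance")
            if (other.width, other.height) != (keep.width, keep.height):
                diffs.append("resolution")
            candidates.append(Candidate(keep, other, diffs))
    return candidates


def apply_approved(records, candidates):
    """Return the collection with only reviewer-approved removals applied."""
    removed = {c.remove.record_id for c in candidates if c.approved}
    return [r for r in records if r.record_id not in removed]


records = [
    ImageRecord("a", "fp-1906", 4000, 3000, "negative scan"),
    ImageRecord("b", "fp-1906", 800, 600, "web copy"),
    ImageRecord("c", "fp-other", 1200, 900, "print scan"),
]
candidates = flag_candidates(records)
unreviewed = apply_approved(records, candidates)  # nothing approved yet
candidates[0].approved = True                     # archivist signs off
reviewed = apply_approved(records, candidates)
```

In this sketch the listed differences are what the archivist would look at before approving: two records that share a fingerprint but differ in provenance or resolution are exactly the cases the article describes as two assets rather than one.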
The California Historical Society is expected to release updated digital stewardship guidelines later this year. The San Francisco Public Library's next Commission meeting, scheduled for later in July at the Main Library on Larkin Street, will include a technology agenda item that archivists expect to address some of these questions. What gets decided there will set the precedent for how the city handles its digital memory — not just today's images, but everything that gets scanned, uploaded, and eventually flagged as redundant in the years ahead.
Published by The Daily San Francisco