The Daily San Francisco

San Francisco news, every day

SF's Digital Archives Face a Reckoning Over Duplicate Images: What Happens Next and the Key Decisions Ahead

City agencies and nonprofits managing San Francisco's public image collections must now choose between costly manual review, automated AI tools, and a hybrid approach — and the clock is ticking.

By San Francisco News Desk · Published 4 July 2026, 12:16 pm

San Francisco's public agencies and cultural institutions are sitting on a growing backlog of duplicate digital images: redundant photographs that clog city servers, slow archival systems, and cost taxpayers money in unnecessary storage fees. The decisions made in the next six to twelve months will shape how the city manages its visual records for decades.

The problem isn't new, but it has become harder to ignore. As city departments accelerated their shift to digital workflows during and after the pandemic, image libraries ballooned. The San Francisco Public Library's San Francisco History Center, which maintains tens of thousands of digitized photographs at the Civic Center branch on Larkin Street, has been working through a cataloguing overhaul. Meanwhile, the San Francisco Municipal Transportation Agency, which generates thousands of images annually for infrastructure documentation, compliance, and public communications, has acknowledged the challenge of maintaining clean, deduplicated archives across multiple internal departments.

Why the Timing Matters

The urgency stems from two converging pressures. First, the city's current cloud storage contracts — many negotiated before the AI boom reshaped pricing and capability — are coming up for renewal cycles beginning in early 2027. Decisions made now about how to clean up existing archives will directly affect the scope and cost of those contracts. Second, San Francisco's Office of Digital Services, based at City Hall, has been piloting AI-assisted document and image management tools as part of a broader push to modernize municipal recordkeeping. How duplicate-image cleanup fits into that pilot will determine whether the city ends up with a scalable solution or a patchwork of incompatible systems across departments.

For cultural institutions, the stakes are different but equally concrete. The Internet Archive, headquartered on Funston Avenue in the Richmond District, manages one of the largest public digital collections on earth and has long grappled with deduplication at scale. Its approach — using hash-matching algorithms to flag identical files before human curators make final calls — has become something of an informal benchmark for Bay Area nonprofits trying to solve the same problem with far smaller teams and budgets. Organizations like the San Francisco Arts Commission, which maintains a public art image database covering more than 4,000 works across the city, are watching how those tools evolve before committing to any vendor.
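For readers curious what hash matching looks like in practice, here is a minimal Python sketch. It is an illustration only, not the Internet Archive's actual code: the function names and the choice of SHA-256 are assumptions made for this example. It groups files whose bytes are identical and returns those groups for a human curator to review, deleting nothing on its own.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def flag_exact_duplicates(root):
    """Group files under `root` by content hash (hypothetical helper).

    Returns a list of groups, each holding two or more paths whose bytes
    are identical. Nothing is deleted; the groups are candidates for review.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[sha256_of_file(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

A byte-level hash only catches exact copies. Two scans of the same photograph saved at different resolutions, or the same image file with different embedded metadata, would hash differently and would not be flagged by this sketch.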

The Decisions Ahead

Three choices are now in front of decision-makers. The first is a fully automated approach, where AI tools flag and delete duplicates with minimal human oversight. It's fast and cheap but carries real risk: archivists warn that metadata differences between two visually identical images can make one copy historically significant and the other redundant, and automated tools don't always catch that distinction.

The second option is manual review, in which trained staff examine flagged files before deletion. The San Francisco Public Library's digital team has used versions of this method for years, but at current backlog volumes, it's slow. Industry estimates for similar-sized municipal archives suggest staff time for manual review can cost between $40 and $80 per hour, and backlogs in the tens of thousands of files can stretch timelines to two or three years.

The third path is a hybrid: AI flags candidates, humans review a statistically sampled subset, and the rest are cleared algorithmically. Several municipal archives in New York and Chicago have moved in this direction over the past two years, and early results suggest it can cut review time by more than half without significantly increasing error rates.
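The hybrid workflow can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not any city's actual process: the 10 percent sample fraction, the 2 percent error threshold, and both function names are hypothetical values chosen for the example.

```python
import random


def split_for_review(flagged_ids, sample_fraction=0.1, seed=0):
    """Split flagged duplicate candidates into a human-review sample and the rest.

    `sample_fraction` and `seed` are illustrative defaults, not recommendations.
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    flagged_ids = list(flagged_ids)
    if not flagged_ids:
        return [], []
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    sample_size = max(1, round(len(flagged_ids) * sample_fraction))
    review = set(rng.sample(flagged_ids, sample_size))
    remainder = [item for item in flagged_ids if item not in review]
    return sorted(review), remainder


def batch_passes(reviewed_outcomes, max_error_rate=0.02):
    """Return True if the share of wrongly flagged items in the sample is acceptable.

    `reviewed_outcomes` holds one boolean per sampled item: True means the
    reviewer confirmed it was a genuine duplicate, False means the flag was wrong.
    """
    if not reviewed_outcomes:
        raise ValueError("at least one reviewed item is required")
    errors = sum(1 for confirmed in reviewed_outcomes if not confirmed)
    return errors / len(reviewed_outcomes) <= max_error_rate
```

In this sketch, the remainder is cleared algorithmically only if the sampled batch passes. If it fails, the whole batch would go back for full manual review.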

For San Francisco, the hybrid model is likely the frontrunner, but it requires buy-in from department heads who are already stretched thin by budget constraints following the 2024 and 2025 rounds of city spending cuts. The Board of Supervisors has not yet taken up the question formally, and no specific budget line for archival deduplication infrastructure has been proposed for the fiscal year that began July 1, 2026. The next step will likely come from the Office of Digital Services, which is expected to present updated recommendations to the City Administrator's Office before the end of August. What those recommendations say, and whether they come with actual funding attached, will determine whether this remains a backroom IT problem or becomes a genuine modernization milestone for a city that has long prided itself on civic innovation.

About this article

Published by The Daily San Francisco

This article was produced by The Daily San Francisco editorial desk and covers news in San Francisco. See our editorial standards for how we use AI.
