The Daily San Francisco

San Francisco news, every day


SF's Digital Archive Reckoning: The Key Decisions Ahead on Duplicate Image Cleanup

City departments and cultural institutions face a cascading backlog of redundant digital assets — and the choices made this summer will shape public records access for years.

By San Francisco News Desk · Published 4 July 2026, 12:06 pm


Photo: McGlashan, C. F. (Charles Fayette), 1847-1931 / Public domain (Wikimedia Commons)

San Francisco's public agencies and nonprofit archives are sitting on tens of thousands of duplicate digital images scattered across incompatible servers, shared drives, and legacy content management systems — and a growing coalition of archivists, city IT staff, and civic tech advocates say the moment to act is now, before a new round of AI-assisted digitization adds even more redundant files to the pile.

The issue isn't abstract housekeeping. California's Public Records Act requires government bodies to produce responsive documents in a timely manner, and when a single photograph exists in fourteen slightly different crops and resolution variants across three different city databases, that obligation gets complicated fast. Librarians at the San Francisco Public Library's History Center, housed in the Main Library on Larkin Street, have been wrestling with exactly this problem since a 2023 digitization push dramatically expanded the center's online collection without a parallel deduplication protocol.

Where the Bottlenecks Are Building

The San Francisco Arts Commission, which manages an image library covering more than 4,000 pieces of public art installed across neighborhoods from the Tenderloin to the Excelsior, flagged the duplicate problem internally last year when staff preparing an updated Civic Art Collection database found that roughly 30 percent of image records had at least one near-identical duplicate. The commission did not provide a cost estimate for cleanup when contacted for this story.

At City Hall, the Department of Technology oversees a citywide digital asset framework that covers everything from infrastructure inspection photos taken along the Central Subway corridor to permit documentation images logged through the Planning Department's Accela system. Sources familiar with the department's work — speaking in their capacity as public records professionals rather than as authorized city spokespersons — say the core technical challenge isn't finding duplicates. Perceptual hash tools can do that in hours. The hard part is deciding which version of a duplicated image is the authoritative record, who has the authority to delete the others, and whether deletion itself creates a legal exposure under state retention schedules.
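
For readers curious about what "finding duplicates" involves, here is a minimal sketch of one common perceptual-hash approach, the average hash. It is an illustration only, not the tooling any city department is known to use. It assumes each image has already been shrunk to a small grayscale grid, and the function names and the `max_distance` threshold are hypothetical choices made for this example.

```python
from typing import List

# Assumption: each image is already reduced to a small grid of
# grayscale values (0-255). Real tools do this resizing step first.
Grid = List[List[int]]


def average_hash(pixels: Grid) -> int:
    """Build a hash with one bit per pixel: 1 if brighter than the grid mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def likely_duplicates(a: Grid, b: Grid, max_distance: int = 2) -> bool:
    """Flag two grids as near-duplicates when their hashes differ in few bits.

    max_distance is a hypothetical threshold chosen for this sketch.
    """
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance


# Illustrative data only: a tiny 4x4 "image", a uniformly brightened
# copy of it, and an unrelated image with the bright side swapped.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
brighter = [[value + 10 for value in row] for row in original]
different = [row[2:] + row[:2] for row in original]

print(likely_duplicates(original, brighter))   # brightened copy is flagged
print(likely_duplicates(original, different))  # unrelated image is not
```

A check like this only groups look-alike files. It says nothing about which copy is the record of note, which is the question the sources describe as the hard part.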

The San Francisco Municipal Transportation Agency, which runs Muni, faces a related but distinct version of the challenge, as does BART, a separate regional transit district. Both maintain photographic documentation of station conditions, ADA compliance surveys, and incident records. BART's infrastructure spans 50 stations system-wide, and internal review processes for image libraries have historically lagged behind operational priorities. Neither agency has published a public timeline for deduplication work.

The Decisions That Matter Most This Summer

Three choices will largely determine how this plays out. First, city departments need to settle on a single metadata standard — the lack of one is the primary reason duplicates accumulate across systems to begin with. The Chief Data Officer's office, which operates out of the Department of Technology at 1 Dr. Carlton B. Goodlett Place, has been developing an updated open data policy framework, but no public release date has been confirmed.

Second, cultural institutions anchored along Civic Center, including the Main Library and the Asian Art Museum, both on Larkin Street, will need to decide whether to handle deduplication in-house or outsource it to one of the several civic tech contractors that have approached the city since San Francisco's AI procurement guidelines were updated in early 2026. That choice carries both budget and data-sovereignty implications.

Third, and most consequentially, officials will need to determine whether images flagged as duplicates should be deleted outright, archived offline, or simply unlisted from public-facing portals. Legal staff across at least two departments are reviewing California Government Code section 34090, which governs destruction of city records, before any mass deletions proceed.

Civic tech advocates at Code for San Francisco, the volunteer brigade that formerly met weekly at GitHub's Mission Street offices and now convenes at various SoMa venues, have pushed for an open-source deduplication audit tool that any department could run against its own image libraries without sending files to a third-party vendor. Whether that proposal gets traction in the next budget cycle (the city's fiscal year 2026-27 began July 1) is the question archivists and transparency advocates are watching most closely right now.

About this article

Published by The Daily San Francisco

This article was produced by The Daily San Francisco editorial desk and covers news in San Francisco. See our editorial standards for how we use AI.
