San Francisco's public agencies and nonprofit archives are sitting on tens of thousands of duplicate digital images scattered across incompatible servers, shared drives, and legacy content management systems. A growing coalition of archivists, city IT staff, and civic tech advocates says the moment to act is now, before a new round of AI-assisted digitization adds even more redundant files to the pile.
The issue isn't abstract housekeeping. California's Public Records Act requires government bodies to produce responsive documents in a timely manner, and when a single photograph exists in fourteen slightly different crops and resolution variants across three different city databases, that obligation gets complicated fast. Librarians at the San Francisco Public Library's History Center on Larkin Street have been wrestling with exactly this problem since a 2023 digitization push dramatically expanded the center's online collection without a parallel deduplication protocol.
Where the Bottlenecks Are Building
The San Francisco Arts Commission, which manages an image library covering more than 4,000 pieces of public art installed across neighborhoods from the Tenderloin to the Excelsior, flagged the duplicate problem internally last year when staff preparing an updated Civic Art Collection database found that roughly 30 percent of image records had at least one near-identical duplicate. The commission did not provide a cost estimate for cleanup when contacted for this story.
At City Hall, the Department of Technology oversees a citywide digital asset framework that covers everything from infrastructure inspection photos taken along the Central Subway corridor to permit documentation images logged through the Planning Department's Accela system. Sources familiar with the department's work — speaking in their capacity as public records professionals rather than as authorized city spokespersons — say the core technical challenge isn't finding duplicates. Perceptual hash tools can do that in hours. The hard part is deciding which version of a duplicated image is the authoritative record, who has the authority to delete the others, and whether deletion itself creates a legal exposure under state retention schedules.
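For readers curious what a perceptual hash actually does, the following is a minimal sketch of one common variant, the average hash, and not a description of any tool the city uses. It assumes each image has already been decoded, converted to grayscale, and shrunk to a tiny grid by an imaging library; the file names, the 4x4 grid size, and the distance threshold are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def average_hash(pixels: Sequence[Sequence[int]]) -> int:
    """Hash a small grayscale grid (values 0-255): one bit per pixel,
    set when the pixel is brighter than the grid's mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def near_duplicates(
    hashes: Dict[str, int], max_distance: int = 2
) -> List[Tuple[str, str]]:
    """Return pairs of names whose hashes differ by at most max_distance bits.
    The threshold is an illustrative assumption, not a recommended value."""
    names = sorted(hashes)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if hamming_distance(hashes[first], hashes[second]) <= max_distance:
                pairs.append((first, second))
    return pairs


# Hypothetical 4x4 grids standing in for downscaled photographs.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [10, 20, 200, 210],
    [15, 25, 205, 215],
]
# The same picture, uniformly brightened (as a re-export might do).
brightened = [[value + 10 for value in row] for row in original]
# An unrelated picture: the tonal pattern is inverted.
unrelated = [[255 - value for value in row] for row in original]

hashes = {
    "a.jpg": average_hash(original),
    "a_bright.jpg": average_hash(brightened),
    "b.jpg": average_hash(unrelated),
}
matches = near_duplicates(hashes)
print(matches)
```

Because the hash records only whether each pixel is above or below the image's own average, a uniformly brightened copy lands on the same hash as the original, while an unrelated image lands far away. The sketch flags candidates only; it says nothing about which copy is the authoritative record, which is the question the sources describe as the hard one.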
The San Francisco Municipal Transportation Agency, which operates Muni, faces a related but distinct version of the challenge, as does BART, a separate regional transit district. Both maintain photographic documentation of station conditions, ADA compliance surveys, and incident records. BART's station infrastructure spans 50 stations system-wide, and internal review processes for image libraries have historically lagged behind operational priorities. Neither agency has published a public timeline for deduplication work.
The Decisions That Matter Most This Summer
Three choices will largely determine how this plays out. First, city departments need to settle on a single metadata standard — the lack of one is the primary reason duplicates accumulate across systems to begin with. The Chief Data Officer's office, which operates out of the Department of Technology at 1 Dr. Carlton B. Goodlett Place, has been developing an updated open data policy framework, but no public release date has been confirmed.
Second, cultural institutions anchored along Civic Center, including the Main Library and the Asian Art Museum on Larkin Street, will need to decide whether to handle deduplication in-house or outsource it to one of the several civic tech contractors that have approached the city since San Francisco's AI procurement guidelines were updated in early 2026. That choice carries both budget and data-sovereignty implications.
Third, and most consequentially, officials will need to determine whether images flagged as duplicates should be deleted outright, archived offline, or simply unlisted from public-facing portals. Legal staff across at least two departments are reviewing California Government Code section 34090, which governs destruction of city records, before any mass deletions proceed.
Civic tech advocates at Code for San Francisco, the volunteer brigade that used to meet weekly at GitHub's former Mission Street offices and now convenes at various SoMa venues, have pushed for an open-source deduplication audit tool that any department could run against its own image libraries without sending files to a third-party vendor. Whether that proposal gets traction in the next budget cycle (the city's fiscal year 2026-27 began July 1) is the question archivists and transparency advocates are watching most closely right now.