The Daily San Francisco


SF's Digital Archive Push Is Drowning in Duplicate Images — and the Numbers Tell Why

City departments and nonprofits are burning through storage budgets and staff hours on a problem that data shows is far larger than most administrators realized.

By San Francisco News Desk · Published 4 July 2026, 12:06 pm


Photo: Dennett, R. E. (Richard Edward), 1857-1921 / Public domain (Wikimedia Commons)

San Francisco's public agencies together hold an estimated tens of millions of digital image files on municipal servers, and a significant share of them are duplicates: redundant copies that cost real money and staff time to store, index, and manage. The problem has moved from a background nuisance to a budget line item, and the data behind it is finally getting attention at City Hall.

The issue lands at an awkward moment. The city is simultaneously pushing a housing-production emergency agenda, managing a fentanyl intervention program across the Tenderloin and SoMa, and absorbing a wave of AI-sector growth in the Mission Bay and Dogpatch corridors. Every one of those initiatives generates documentation — permits, case files, site photographs, aerial surveys — and every one of those documentation workflows creates duplicate image overhead.

What the Storage Numbers Actually Show

A 2025 report from the City Controller's Office cited a Department of Technology estimate that unstructured data, the category that includes scanned documents, photographs, and media files, was the fastest-growing segment of municipal storage demand, increasing by roughly 30 percent year over year across tracked city systems. Duplicate files are a well-documented driver of unstructured data bloat; industry benchmarks from firms like Veritas Technologies have long put duplication rates in large institutional archives at 40 to 70 percent of total stored content.
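To put those percentages in perspective, here is a minimal arithmetic sketch in Python. It compounds the roughly 30 percent annual growth figure and applies the 40 to 70 percent industry duplication range to each year's total. The 100-terabyte baseline is a hypothetical round number chosen for illustration, not a figure from the Controller's report, and applying an industry benchmark to city systems is itself an assumption.

```python
import math
from typing import List, Tuple


def project_storage(baseline: float, annual_growth: float, years: int) -> List[float]:
    """Projected storage for year 0 through `years` under compound growth."""
    return [baseline * (1 + annual_growth) ** year for year in range(years + 1)]


def doubling_time(annual_growth: float) -> float:
    """Years for storage to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def redundant_range(total: float, low: float = 0.40, high: float = 0.70) -> Tuple[float, float]:
    """Low and high estimates of duplicated storage within `total`."""
    return total * low, total * high


if __name__ == "__main__":
    baseline_tb = 100.0  # hypothetical starting point, not a reported figure
    for year, size in enumerate(project_storage(baseline_tb, 0.30, 3)):
        low, high = redundant_range(size)
        print(f"year {year}: {size:.1f} TB total, {low:.1f} to {high:.1f} TB possibly redundant")
    print(f"doubling time at 30% growth: {doubling_time(0.30):.1f} years")
```

At a steady 30 percent, storage demand doubles in a little over two and a half years, which is why duplicate overhead that looks tolerable today compounds quickly.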

At the San Francisco Public Library's digital preservation program, based out of the main branch on Larkin Street at Civic Center, archivists have been working since 2023 to consolidate historical photograph collections digitized under separate grant cycles. The library's digitization effort, which received partial funding through a California State Library grant, produced multiple scan versions of the same physical prints — different resolutions, different color profiles — that now occupy redundant server space. Library staff have not publicly disclosed exact storage figures, but the pattern mirrors what city technology managers describe across departments.
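Scans of the same print at different resolutions do not share identical bytes, so an exact checksum will not match them. One common family of techniques for this case is perceptual hashing. The sketch below shows a simplified "average hash" in plain Python, assuming images have already been decoded into grids of 0 to 255 grayscale values; it is an illustration of the idea, not a description of any tool the library uses, and all function names are hypothetical.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values


def average_hash(pixels: Gray, hash_size: int = 8) -> int:
    """Shrink the image to a hash_size x hash_size grid of block means,
    then set one bit per cell: 1 if the cell is brighter than the grid mean."""
    height, width = len(pixels), len(pixels[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image is smaller than the hash grid")
    cells = []
    for r in range(hash_size):
        top, bottom = r * height // hash_size, (r + 1) * height // hash_size
        for c in range(hash_size):
            left, right = c * width // hash_size, (c + 1) * width // hash_size
            block = [pixels[y][x] for y in range(top, bottom) for x in range(left, right)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits; small distances suggest near-duplicates."""
    return bin(a ^ b).count("1")


def upscale_2x(pixels: Gray) -> Gray:
    """Duplicate every pixel into a 2x2 block (a stand-in for a higher-resolution scan)."""
    return [[v for v in row for _ in range(2)] for row in pixels for _ in range(2)]


if __name__ == "__main__":
    original = [[r * 16 + c for c in range(16)] for r in range(16)]  # synthetic gradient
    rescanned = upscale_2x(original)
    inverted = [[255 - v for v in row] for row in original]
    print("same print, higher resolution:",
          hamming_distance(average_hash(original), average_hash(rescanned)))
    print("different image:",
          hamming_distance(average_hash(original), average_hash(inverted)))
```

In practice an archive would pick a distance threshold below which two scans are flagged for human review rather than deleted automatically, since archivists may want to keep the highest-quality version.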

The San Francisco Planning Department faces a similar crunch. Planning processes thousands of permit applications annually, each of which requires site photographs submitted by project sponsors. The department's permit portal, updated in 2024 under the Mayor's Housing Production Initiative, accepts image uploads but does not automatically check for identical or near-identical files. A single contested project in the Castro or the Richmond can generate dozens of photo submissions showing the same facade from the same angle, filed by different parties on different dates.
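A check for identical files at upload time can be sketched in a few lines of Python. The example below is a hypothetical illustration, not the permit portal's actual code: the class name, method, and upload identifiers are assumptions. It records a SHA-256 digest for each accepted file and reports the earlier upload when the same bytes arrive again. It would catch a re-submitted file, but not two different photographs of the same facade, which differ at the byte level.

```python
import hashlib
from typing import Dict, Optional


class UploadDeduplicator:
    """Tracks content digests of accepted uploads (hypothetical helper)."""

    def __init__(self) -> None:
        # Maps SHA-256 hex digest -> identifier of the first upload with that content.
        self._seen: Dict[str, str] = {}

    def check(self, upload_id: str, data: bytes) -> Optional[str]:
        """Return the identifier of an earlier byte-identical upload,
        or None if this content is new (in which case it is recorded)."""
        digest = hashlib.sha256(data).hexdigest()
        existing = self._seen.get(digest)
        if existing is not None:
            return existing
        self._seen[digest] = upload_id
        return None


if __name__ == "__main__":
    dedup = UploadDeduplicator()
    photo = b"example image bytes"  # placeholder for real file contents
    print(dedup.check("sponsor-upload-1", photo))    # new content
    print(dedup.check("neighbor-upload-7", photo))   # same bytes, flagged
```

A production system would keep the digest index in a database rather than in memory, and would need a policy for what happens on a match, such as linking to the existing file instead of storing a second copy.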

Why Deduplication Is Harder Than It Sounds

The technical fix — running deduplication software against existing archives — is straightforward in theory. In practice, city procurement rules slow the process. Vendors bidding on data management contracts with the city must navigate the Office of Contract Administration, and contracts above $10 million require Board of Supervisors approval, a process that routinely takes six months or longer.

The Department of Technology has piloted deduplication tools on a subset of servers managed out of its data center on South Van Ness Avenue, but a city-wide rollout has not been scheduled as of July 2026. The pilot covered roughly 200 terabytes of data, according to a review of department budget documents for fiscal year 2025-26.

Nonprofit digital archivists working in the city have moved faster. The Internet Archive, headquartered on Funston Avenue near the Presidio, runs its own aggressive deduplication protocols across its Wayback Machine and media collections. Staff there have described the process as essential to keeping the organization's petabyte-scale storage costs manageable — though the Archive's collections dwarf anything a city department holds.

For city administrators watching storage invoices climb, the practical path forward involves three steps: auditing existing holdings with automated hash-comparison tools, establishing upload validation on public-facing portals like the Planning Department's permit system, and writing deduplication requirements into the next round of cloud storage contracts. San Francisco's current cloud infrastructure agreements come up for renewal in late 2027, giving the Department of Technology a real deadline to build those specifications before the next contract cycle locks in terms. That window is narrower than it looks.
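The first of those steps, a hash-comparison audit, can be sketched as follows. This is a minimal Python example under stated assumptions, not any specific tool the city has evaluated: it walks a directory tree, groups files by size as a cheap first filter, then confirms duplicates with SHA-256. The function names are hypothetical.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, List


def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large scans do not fill memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates(root: str) -> List[List[str]]:
    """Return groups of paths under `root` whose contents are byte-identical."""
    by_size: Dict[int, List[str]] = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups: List[List[str]] = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a byte-identical twin
        by_hash: Dict[str, List[str]] = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(group) for group in by_hash.values() if len(group) > 1)
    return sorted(groups)


def reclaimable_bytes(groups: List[List[str]]) -> int:
    """Bytes freed if every group were reduced to a single copy."""
    return sum(os.path.getsize(group[0]) * (len(group) - 1) for group in groups)
```

Calling `find_duplicates` on an archive folder returns lists of matching paths, and `reclaimable_bytes` turns those groups into a storage figure an administrator could set against an invoice. An audit like this only reports; deciding which copy to keep remains a records-management question.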


About this article

Published by The Daily San Francisco

This article was produced by The Daily San Francisco editorial desk and covers news in San Francisco. See our editorial standards for how we use AI.
