SF's Digital Archive Push Is Drowning in Duplicate Images — and the Numbers Tell Why
City departments and nonprofits are burning through storage budgets and staff hours on a problem that data shows is far larger than most administrators realized.

San Francisco's public agencies collectively hold what is estimated to be tens of millions of digital image files across municipal servers, and a significant share of them are duplicates: redundant copies that cost real money and real time to store, index, and manage. The problem has moved from a background nuisance to a budget line item, and the data behind it is finally getting attention at City Hall.
The issue lands at an awkward moment. The city is simultaneously pushing a housing-production emergency agenda, managing a fentanyl intervention program across the Tenderloin and SoMa, and absorbing a wave of AI-sector growth in the Mission Bay and Dogpatch corridors. Every one of those initiatives generates documentation — permits, case files, site photographs, aerial surveys — and every one of those documentation workflows creates duplicate image overhead.
According to a 2025 report from the City Controller's Office, the Department of Technology estimated that unstructured data — the category that includes scanned documents, photographs, and media files — represented the fastest-growing segment of municipal storage demand, increasing by roughly 30 percent year over year across tracked city systems. Duplicate files are a well-documented driver of unstructured data bloat; industry benchmarks from firms like Veritas Technologies have long put duplication rates in large institutional archives at between 40 and 70 percent of total stored content.
At the San Francisco Public Library's digital preservation program, based out of the main branch on Larkin Street at Civic Center, archivists have been working since 2023 to consolidate historical photograph collections digitized under separate grant cycles. The library's digitization effort, which received partial funding through a California State Library grant, produced multiple scan versions of the same physical prints — different resolutions, different color profiles — that now occupy redundant server space. Library staff have not publicly disclosed exact storage figures, but the pattern mirrors what city technology managers describe across departments.
The San Francisco Planning Department faces a similar crunch. Planning processes thousands of permit applications annually, each of which requires site photographs submitted by project sponsors. The department's permit portal, updated in 2024 under the Mayor's Housing Production Initiative, accepts image uploads but does not automatically check for identical or near-identical files. A single contested project in the Castro or the Richmond can generate dozens of photo submissions showing the same facade from the same angle, filed by different parties on different dates.
The technical fix — running deduplication software against existing archives — is straightforward in theory. In practice, city procurement rules slow the process. Vendors bidding on data management contracts with the city must navigate the Office of Contract Administration, and contracts above $10 million require Board of Supervisors approval, a process that routinely takes six months or longer.
The Department of Technology has piloted deduplication tools on a subset of servers managed out of its data center on South Van Ness Avenue, but a city-wide rollout has not been scheduled as of July 2026. The pilot covered roughly 200 terabytes of data, according to department budget documents reviewed for fiscal year 2025-26.
Nonprofit digital archivists working in the city have moved faster. The Internet Archive, headquartered on Funston Avenue near the Presidio, runs its own aggressive deduplication protocols across its Wayback Machine and media collections. Staff there have described the process as essential to keeping the organization's petabyte-scale storage costs manageable — though the Archive's collections dwarf anything a city department holds.
For city administrators watching storage invoices climb, the practical path forward involves three steps: auditing existing holdings with automated hash-comparison tools, establishing upload validation on public-facing portals like the Planning Department's permit system, and writing deduplication requirements into the next round of cloud storage contracts. San Francisco's current cloud infrastructure agreements come up for renewal in late 2027, giving the Department of Technology a real deadline to build those specifications before the next contract cycle locks in terms. That window is narrower than it looks.
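As a rough illustration of what the first step, a hash-comparison audit, involves, here is a minimal Python sketch. It is not the Department of Technology's tooling or any vendor's product; the function names `sha256_of` and `find_duplicates` are hypothetical, and the choice of SHA-256 is an assumption made for the example. It groups files by size first, then hashes only the files that share a size with another file. Note that it flags only byte-identical copies; rescans of the same print at a different resolution or color profile, like those described at the library, would need a separate near-duplicate method.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Map each content hash to the files sharing it (groups of 2+ only)."""
    # Files of different sizes cannot be identical, so bucket by size first
    # and skip hashing anything whose size is unique.
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[sha256_of(path)].append(path)

    # Keep only hashes that appear more than once.
    return {h: sorted(p) for h, p in by_hash.items() if len(p) > 1}
```

The same digest could in principle back the second step: a portal that stores the hash of each accepted upload can compare a new file's hash against that list before saving another copy.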
Published by The Daily San Francisco