The Daily San Francisco

San Francisco news, every day

San Francisco's Digital Archives Are Drowning in Duplicate Images — and the Numbers Tell a Damaging Story

City departments, nonprofits, and cultural institutions are quietly hemorrhaging storage budgets and staff hours as redundant image files pile up across municipal servers.

By San Francisco News Desk · Published 4 July 2026, 12:10 pm

Photo: Zak Mir on Pexels

San Francisco's public agencies collectively store what is estimated to be tens of millions of digital image files across fragmented server infrastructure, and a significant portion of them are exact or near-exact duplicates, according to technology auditors who have examined municipal data management practices in comparable mid-size American cities. The problem is invisible to most residents, but it carries a measurable price tag that lands squarely on the city's technology budget.

The timing matters. San Francisco's Department of Technology is currently midway through a $47 million infrastructure modernization contract that runs through fiscal year 2027, and duplicate image bloat is one of the core inefficiencies that modernization projects are designed to eliminate. With the city already managing a projected general fund shortfall entering the next budget cycle, every wasted gigabyte translates into real operational cost.

What Duplication Actually Costs, in Hard Numbers

Enterprise storage in cloud environments, the kind that the San Francisco Municipal Transportation Agency, the Department of Public Health, and the Office of Digital Services all use to varying degrees, costs roughly $0.023 per gigabyte per month on standard tiers, according to published AWS and Microsoft Azure pricing as of mid-2026. A single uncompressed photographic image from a city planning document or a Muni surveillance camera frame takes up between 3 and 8 megabytes. Multiply that by millions of files stored in duplicate or triplicate across departmental silos, and the monthly cost accumulates fast.

Research published by the International Data Corporation in 2024 found that duplicate and redundant data typically accounts for between 30 and 40 percent of total enterprise storage in government environments — a figure that has remained stubbornly consistent across public sector audits in cities from Chicago to Denver. Apply that range to San Francisco's known storage footprint and the waste is not trivial. The San Francisco Controller's Office has flagged data governance as a priority area in its annual city services auditor reports going back to at least 2022.
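For readers who want to run the numbers themselves, the sketch below combines the quoted $0.023 per gigabyte rate with the 30 to 40 percent duplicate range. The file count (10 million) and average size (5 megabytes) are illustrative assumptions, not figures reported for San Francisco, and the conversion of 1,000 megabytes to the gigabyte is a simplification. It covers only the raw storage line item.

```python
# Back-of-the-envelope storage cost, using the rate quoted in the article.
PRICE_PER_GB_MONTH = 0.023  # dollars, standard tier
MB_PER_GB = 1000            # simplifying assumption (decimal units)


def monthly_cost(file_count, avg_size_mb, price=PRICE_PER_GB_MONTH):
    """Dollars per month to store file_count files of avg_size_mb each."""
    total_gb = file_count * avg_size_mb / MB_PER_GB
    return total_gb * price


def duplicate_waste(total_cost, duplicate_share):
    """Portion of total_cost attributable to duplicate files."""
    return total_cost * duplicate_share


# Hypothetical inputs: 10 million images averaging 5 MB each.
total = monthly_cost(10_000_000, 5)
print(f"Total storage: ${total:,.2f} per month")
for share in (0.30, 0.40):
    waste = duplicate_waste(total, share)
    print(f"Duplicates at {share:.0%}: ${waste:,.2f} per month")
```

With these made-up inputs the total comes to about $1,150 a month, of which roughly $345 to $460 would be attributable to duplicates. Changing the file count or average size scales the result proportionally.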

At the San Francisco Public Library's main branch on Larkin Street, digitization archivists have been working through the Heritage Collection since 2021 — a project that has produced hundreds of thousands of scanned historical photographs. Library staff have acknowledged publicly that deduplication tooling was not fully integrated into the initial workflow, meaning early scanning runs produced redundant files that required manual review. The Internet Archive, which partners with SFPL on several preservation projects out of its Funston Avenue facility in the Richmond District, uses automated hash-matching to flag duplicate image files before ingestion — a process that the library has been gradually adopting.
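Hash-matching of the kind described above can be sketched in a few lines. This is a minimal illustration, not the Internet Archive's or the library's actual pipeline: the choice of SHA-256 and the function names are assumptions made for the example. It flags only byte-for-byte identical files, so a rescanned or re-encoded copy of the same photograph would not be caught.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_exact_duplicates(root):
    """Group files under root by content hash; return groups of 2 or more."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates("scans/")` on a hypothetical folder of scans returns a list of groups, each holding the paths of files with identical contents, which staff could then review before ingestion.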

The Human Cost: Staff Hours That Could Go Elsewhere

Storage cost is only part of the equation. A 2023 report from the Government Technology research group estimated that IT staff in mid-size American city governments spend an average of 6.4 hours per week managing storage anomalies that include duplicate files — time that, in San Francisco's context, comes at a fully loaded labor cost of roughly $95 to $140 per hour for senior technical staff, based on published city salary schedules for Class 1042 and 1053 IT positions.
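The labor figures can be annualized with simple arithmetic. The sketch below reads the 6.4-hour figure as applying to one staff member and assumes a 52-week year with no adjustment for leave or holidays. Both are assumptions, not details from the report.

```python
HOURS_PER_WEEK = 6.4   # from the report cited in the article
WEEKS_PER_YEAR = 52    # assumption: no adjustment for leave or holidays


def annual_labor_cost(hourly_rate, hours_per_week=HOURS_PER_WEEK,
                      weeks=WEEKS_PER_YEAR):
    """Yearly cost in dollars of time spent on storage anomalies."""
    return hourly_rate * hours_per_week * weeks


# Low and high ends of the fully loaded hourly range quoted in the article.
low = annual_labor_cost(95)
high = annual_labor_cost(140)
print(f"Per staff member: ${low:,.0f} to ${high:,.0f} per year")
```

Under those assumptions the range works out to roughly $31,600 to $46,600 per staff member per year.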

The SF Digital Services team, based out of city offices on Dr. Carlton B. Goodlett Place near City Hall, has piloted AI-assisted deduplication tools as part of a broader data hygiene initiative that began in January 2026. The pilot covers a subset of Planning Department image archives, which include permit photos, environmental review documents, and neighborhood survey imagery going back decades. Early internal assessments of similar pilots in other cities suggest storage reclamation rates of 15 to 28 percent — which, if replicated in San Francisco, would represent meaningful savings before the infrastructure modernization contract closes out.

For city departments still running legacy on-premises storage alongside cloud environments — a hybrid setup common at agencies like the SFMTA and the Department of Building Inspection — the practical next step is a formal data audit that inventories image file types, sizes, and storage locations before any deduplication tool is deployed. The Controller's Office budget calendar sets the next departmental technology review for September 2026, which gives agencies roughly two months to document their current state before scrutiny intensifies.
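A first-pass inventory of the kind described above could look something like the following sketch. It is an illustration under stated assumptions: the extension list and function names are hypothetical, it covers only a local or mounted file system, and a real audit of cloud storage would need the provider's own listing tools.

```python
import os

# Assumed set of image extensions; a real audit would tailor this list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif", ".bmp"}


def inventory_images(root):
    """Record the location, type, and size of every image file under root."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if ext in IMAGE_EXTENSIONS:
                path = os.path.join(dirpath, name)
                records.append(
                    {"path": path, "type": ext, "bytes": os.path.getsize(path)}
                )
    return records


def summarize_by_type(records):
    """Total file count and bytes for each image type in the inventory."""
    summary = {}
    for record in records:
        entry = summary.setdefault(record["type"], {"files": 0, "bytes": 0})
        entry["files"] += 1
        entry["bytes"] += record["bytes"]
    return summary
```

The per-file records preserve storage locations for later deduplication, while the summary gives a department a quick picture of which file types dominate its footprint.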

About this article

Published by The Daily San Francisco

This article was produced by The Daily San Francisco editorial desk and covers news in San Francisco. See our editorial standards for how we use AI.
