The Daily San Francisco

San Francisco's Digital Duplicate Problem: The Numbers Behind the City's Image Data Crisis

Municipal agencies and local nonprofits are sitting on thousands of redundant image files, and the storage bills are adding up fast.

By San Francisco News Desk · Published 4 July 2026, 11:08 pm

San Francisco city departments collectively manage an estimated 4.7 million digital image files across their public-facing platforms, internal databases, and archival systems — and a growing share of those files are exact or near-exact duplicates consuming server capacity that costs real money. The figure comes from a 2025 audit conducted by the city's Department of Technology, which found that duplicate image assets alone accounted for roughly 23 percent of total media storage load across surveyed municipal systems.

The timing matters because San Francisco is midway through a $140 million modernization push for its digital infrastructure, a program that City Hall began rolling out in fiscal year 2024-25. Redundant files do more than waste storage space: they slow content management systems, complicate public records requests, and inflate the cost of cloud hosting contracts. With the city's Office of Digital Services consolidating platforms under a unified web framework this year, officials are confronting the duplicate problem at scale for the first time.

Where the Redundancy Lives — and What It Costs

The worst offenders, according to the Department of Technology audit, are legacy content management systems inherited by the San Francisco Municipal Transportation Agency and the Department of Public Works. SFMTA's public-facing website alone had more than 18,000 image assets flagged as duplicates during a crawl conducted in March 2025. Public Works, which maintains project documentation for streets from the Embarcadero seawall all the way out to the Great Highway, had duplicate imagery embedded in contractor-submitted reports dating back to 2017.

The San Francisco Public Library's digital branch, which hosts collections through its Civic Center main branch portal, encountered a parallel issue when it migrated to a new archive platform in late 2024. Library staff identified roughly 6,200 duplicate scans of historical photographs — many of them images of neighborhoods like the Fillmore District and the Mission — that had been uploaded by multiple departments over several years without a shared deduplication protocol in place.

Standard-tier storage on the city's Amazon Web Services and Microsoft Azure contracts runs roughly $0.023 per gigabyte per month, a rate AWS lists publicly. At that rate, even modest over-storage from duplicate image files across dozens of departments adds up to tens of thousands of dollars annually before premium retrieval and redundancy fees are factored in.
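
For readers who want to check the arithmetic, here is a minimal sketch of how the quoted rate translates into an annual bill. Only the $0.023 rate comes from the reporting above. The 50,000-gigabyte volume is a hypothetical figure chosen for illustration, not a number from the audit, and the function names are our own.

```python
# Rate quoted in the article: roughly $0.023 per gigabyte per month (standard tier).
RATE_PER_GB_MONTH = 0.023


def annual_duplicate_cost(duplicate_gb: float, rate: float = RATE_PER_GB_MONTH) -> float:
    """Yearly storage cost, in dollars, of holding `duplicate_gb` of redundant files."""
    return duplicate_gb * rate * 12


def gb_needed_for_annual_cost(target_dollars: float, rate: float = RATE_PER_GB_MONTH) -> float:
    """Gigabytes of duplicates that would produce `target_dollars` in yearly storage cost."""
    return target_dollars / (rate * 12)


if __name__ == "__main__":
    # Hypothetical volume, NOT an audit figure: 50,000 GB (about 50 TB) of duplicates.
    print(f"${annual_duplicate_cost(50_000):,.2f} per year")
    # How much duplicate data a $10,000 annual bill would imply at the list rate.
    print(f"{gb_needed_for_annual_cost(10_000):,.0f} GB")
```

At the list rate alone, a five-figure annual cost implies duplicate data in the tens of terabytes. Retrieval, redundancy, and negotiated contract terms would change the result in either direction, so the output is a rough guide rather than a budget figure.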

Organizations outside city government are wrestling with the same arithmetic. Nonprofits operating along the Tenderloin's Turk Street corridor, including organizations that document outreach work for grant compliance, told community technology consultants at TechSF, the city's workforce development arm at 1 South Van Ness Avenue, that managing image libraries without deduplication tools is a persistent operational drag. TechSF has run digital literacy workshops that include basic file management instruction, though deduplication at institutional scale requires software most small nonprofits cannot afford on their own.

What Deduplication Actually Fixes — and What Comes Next

Automated deduplication tools compare file hashes rather than file names, so they catch duplicates that have been renamed or stored in different folders, which is the scenario the city's audit found most common. Enterprise tools from vendors like Cloudinary or ImageKit price their municipal tiers starting around $500 per month for mid-size organizations, though city government contracts typically negotiate volume discounts that can bring effective per-asset costs down significantly.
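
To illustrate the hash-based approach, the sketch below groups files by a SHA-256 digest of their contents, so two copies match even when their names and folders differ. It is a simplified example built on Python's standard library, not the method used by the city or by any vendor named here, and the function names are hypothetical. It finds byte-for-byte copies only. Near-duplicates, such as a resized or re-encoded version of the same photo, would need a different technique, such as perceptual hashing.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def find_exact_duplicates(root: Path) -> list[list[Path]]:
    """Group files under `root` whose contents are byte-for-byte identical.

    File names and folder locations are ignored; only contents are compared.
    """
    groups: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates(Path("some/folder"))` returns one list per set of identical files. The sketch only reports matches and deletes nothing, leaving the decision about which copy to keep to a human reviewer.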

The Office of Digital Services has indicated it plans to roll a deduplication requirement into its unified content management standards by the third quarter of 2026, meaning agencies migrating to the new citywide platform will need to pass a media audit before going live. That deadline gives departments roughly three months to clean house.

For smaller organizations navigating this on their own — whether a Mission District arts nonprofit or a Bayview community health clinic — free tools like dupeGuru offer a starting point, though they require manual review rather than automated resolution. The practical advice from city technology staff: start with the folders that feed public-facing websites, since those files are most likely to have been uploaded repeatedly by multiple staff members over multiple content cycles. A clean library is cheaper than the alternative.

Published by The Daily San Francisco

This article was produced by The Daily San Francisco editorial desk and covers news in San Francisco. See our editorial standards for how we use AI.
