The Daily San Francisco

San Francisco's Digital Archives Are Riddled With Duplicate Images — and the Numbers Tell a Costly Story

City agencies and nonprofits are sitting on terabytes of redundant visual data, and cleaning it up is proving more expensive than anyone budgeted for.

By San Francisco News Desk · Published 4 July 2026, 11:35 am

Photo: Thompson, Charles L. (Charles Lawrence), 1875- United States. Supreme Court Rose, Walter Malins, 1872-1908 / Public domain (Wikimedia Commons)

San Francisco city agencies collectively store an estimated 40 to 60 percent of their digital image libraries as duplicates or near-duplicates, according to a data audit framework published earlier this year by the City Controller's Office of Civic Innovation. The finding, buried inside a broader report on municipal data hygiene, has started to draw attention from IT managers across departments — and from vendors who see a lucrative cleanup contract on the horizon.

The issue matters now because San Francisco is in the middle of a sweeping digitization push. The Department of Technology's DataSF program, headquartered at City Hall on Dr. Carlton B. Goodlett Place, has been ingesting records from paper archives at a rate that accelerated sharply after 2023. More files mean more duplicate images, and more duplicate images mean inflated storage costs, slower retrieval systems, and, in some cases, botched public records requests in which the same photo is returned dozens of times in a single search result.

What the Numbers Actually Show

Storage is not free. San Francisco's Department of Technology reported cloud storage expenditures exceeding $4.2 million in fiscal year 2024-25, a figure that city budget documents show rising year over year since the department migrated away from on-premises servers beginning in 2021. Analysts who work with municipal data systems say duplicate image files can account for anywhere from 15 to 30 percent of total cloud storage spend in large government environments — a range that, applied conservatively to San Francisco's figures, suggests the city may be paying somewhere north of $600,000 annually just to store images it already has.
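The arithmetic behind that estimate is easy to check. The sketch below simply applies the analysts' 15 to 30 percent range to the reported $4.2 million figure. The function name and the integer-percent inputs are illustrative choices, not part of any city budget tool, and the result assumes the range applies to the full reported spend.

```python
def duplicate_storage_cost(total_spend, low_pct, high_pct):
    """Return the (low, high) annual spend attributable to duplicate images.

    Percentages are whole numbers (15 means 15 percent). This assumes the
    duplicate share applies to the entire reported storage spend.
    """
    low = total_spend * low_pct / 100
    high = total_spend * high_pct / 100
    return low, high


# Figures taken from the article: $4.2 million spend, 15 to 30 percent range.
low, high = duplicate_storage_cost(4_200_000, 15, 30)
print(f"Estimated duplicate storage cost: ${low:,.0f} to ${high:,.0f} per year")
```

The low end of that range is what puts the city's exposure north of $600,000 a year.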

The San Francisco Public Library's digital collections, hosted through its partnership with the Internet Archive and accessible via the library's main branch on Larkin Street, face a version of the same problem at the institutional level. Historical photograph collections digitized through grant-funded projects often pull from overlapping source materials, producing near-identical image pairs that confuse catalog searches and inflate file counts. Librarians have flagged the problem internally, though a full deduplication project has not yet been publicly funded or scheduled.

At the nonprofit level, Tenderloin Housing Clinic and similar organizations that document housing conditions — photographing units, common areas, and code violations — generate hundreds of images per inspection cycle. When staff upload from multiple devices without a centralized asset management system, duplication rates climb fast. One data consultant who has worked with Mission District nonprofits on digital workflow projects estimated, in general terms, that small organizations without dedicated IT staff routinely see duplication rates above 50 percent in their shared drives.

The Tools Exist — So Why Hasn't This Been Fixed?

Deduplication software has been commercially available for years. Perceptual hashing tools, which reduce each image to a compact fingerprint and compare fingerprints by Hamming distance, can scan thousands of images per minute and flag near-identical files for review. Enterprise licenses for these tools typically run between $8,000 and $25,000 annually for mid-size government deployments, according to published pricing from several vendors in the space.
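For readers curious how such matching works, here is a minimal sketch of one common perceptual hash, the average hash, paired with a Hamming distance check. It is not any vendor's implementation. It assumes each image has already been decoded and shrunk to a small grayscale grid (a step real tools handle with an imaging library), and the threshold of 5 differing bits is an illustrative assumption.

```python
def average_hash(pixels):
    """Build an integer fingerprint from a 2D grid of grayscale values (0-255).

    Each pixel contributes one bit: 1 if it is brighter than the grid's mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(pixels_a, pixels_b, max_distance=5):
    """Flag two grids as near-duplicates (max_distance is an assumed threshold)."""
    return hamming_distance(average_hash(pixels_a), average_hash(pixels_b)) <= max_distance


# Toy 8x8 grids standing in for downscaled images.
original = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
brighter = [[min(v + 3, 255) for v in row] for row in original]  # same photo, re-exported
inverted = [[255 - v for v in row] for row in original]          # a different image

print(is_near_duplicate(original, brighter))
print(is_near_duplicate(original, inverted))
```

Because the fingerprint depends on each pixel's brightness relative to the mean rather than on exact bytes, a re-saved or slightly brightened copy still matches, which a plain file checksum would miss.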

The obstacle is not technology — it is workflow. City departments operate in silos. The Department of Public Works on South Van Ness Avenue manages its own image library, separate from the Planning Department on Mission Street, which runs its own. Neither has a shared deduplication protocol with the other. DataSF has pushed for interoperability standards since at least 2022, but implementation has been uneven.

For organizations looking to get ahead of the problem now, the City's own Digital Services team has published a data management checklist through the DataSF portal that recommends quarterly audits of shared drives, standardized file-naming conventions tied to date and project codes, and the use of perceptual hash comparisons before any new batch upload. The checklist is free to download and applies to nonprofit and small-business contexts as much as it does to city agencies.
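The checklist's naming recommendation can be enforced with a few lines of code. The sketch below assumes a hypothetical convention of date, project code, and sequence number, such as 2026-07-04_DPW12_0007.jpg. The article does not specify the checklist's actual format, so the pattern, the allowed extensions, and the function names here are all illustrative.

```python
import re
from datetime import date

# Hypothetical convention: YYYY-MM-DD_PROJECTCODE_NNNN.ext
NAME_PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2})_([A-Z0-9]+)_(\d{4})\.(jpg|png|tif)")


def build_name(taken_on, project_code, sequence, ext="jpg"):
    """Compose a file name from a date, a project code, and a sequence number."""
    return f"{taken_on.isoformat()}_{project_code.upper()}_{sequence:04d}.{ext}"


def is_valid_name(filename):
    """Check that a file name matches the convention and carries a real date."""
    match = NAME_PATTERN.fullmatch(filename)
    if match is None:
        return False
    try:
        date.fromisoformat(match.group(1))
    except ValueError:
        return False  # e.g. month 13 or day 40
    return True


name = build_name(date(2026, 7, 4), "dpw12", 7)
print(name, is_valid_name(name))
print("IMG_4032.jpg", is_valid_name("IMG_4032.jpg"))
```

A check like this could run before each batch upload, so that files arriving from multiple devices with default camera names are renamed or rejected rather than filed alongside copies that already exist.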

San Francisco's July budget cycle, which typically closes by July 31, is the practical window for department heads to request deduplication project funding before fiscal year 2026-27 appropriations are locked. Those who miss it will likely wait another twelve months — and keep paying for files they already own.

About this article

Published by The Daily San Francisco

This article was produced by The Daily San Francisco editorial desk and covers news in San Francisco. See our editorial standards for how we use AI.
