The Daily San Francisco

San Francisco's Digital Archives Are Drowning in Duplicate Images — And the Numbers Tell a Damaging Story

City agencies, nonprofits and cultural institutions are quietly losing thousands of hours and millions in storage costs to redundant image files, and a new push to quantify the problem is forcing a reckoning.

By San Francisco News Desk · Published 4 July 2026, 12:26 pm


San Francisco's public institutions collectively store what is estimated to be tens of millions of digital image files, and a significant portion of them are exact or near-exact duplicates, costing the city real money and real time every year. That's the emerging picture from a data audit effort quietly underway across several municipal departments and cultural organizations this spring.

The timing matters. The city is under acute fiscal pressure heading into fiscal year 2026-27, with the San Francisco Controller's Office projecting a structural budget deficit that has already triggered cuts to department operating budgets citywide. Against that backdrop, the hidden cost of redundant data storage is drawing fresh scrutiny from IT administrators who once treated it as a low-priority housekeeping problem.

What the Numbers Actually Look Like

At the San Francisco Public Library system, which spans 28 locations from the Chinatown branch on Powell Street to the Bayview branch on Third Street, digital collections staff have identified duplicate image rates ranging from 15 percent to as high as 40 percent in certain archival photograph collections, according to internal workflow documents reviewed as part of the library's ongoing digital preservation project. The library's digital collections unit, operating out of the Main Library on Larkin Street in Civic Center, began a deduplication review in March 2026 using open-source image-hashing tools.

The San Francisco Recreation and Parks Department, which maintains photo documentation of its roughly 225 parks and open spaces, flagged a similar issue last year during a migration to a new asset management platform. Staff discovered that standard operating procedures had allowed field photographers to upload event images without checking for prior submissions, leading to storage bloat that department technology staff estimated could affect thousands of files across the city's digital asset repository.

Commercial cloud storage runs approximately $0.023 per gigabyte per month on standard tiers from major providers, a figure that sounds trivial until multiplied across hundreds of thousands of redundant high-resolution files. A single uncompressed RAW photograph from a modern camera can exceed 25 megabytes. Multiply 25 megabytes by even 50,000 redundant files and you are looking at 1.25 terabytes of pure waste, costing institutions roughly $345 annually at standard rates, and that is before factoring in staff time spent searching, sorting and misidentifying files in cluttered digital libraries.
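
The back-of-envelope figure above is easy to reproduce. The sketch below uses only the numbers quoted in this article and assumes decimal units (1 GB = 1,000 MB) and exactly 25 MB per file; the function and variable names are illustrative. Under those assumptions the total comes to 1.25 TB and about $345 a year, while rounding the volume down to 1.2 TB gives roughly $331.

```python
# Illustrative arithmetic only; the inputs are the figures quoted in the article.
PRICE_PER_GB_MONTH = 0.023   # USD, standard-tier cloud storage rate cited above
FILE_SIZE_MB = 25            # assumed size of one RAW photograph
REDUNDANT_FILES = 50_000     # number of duplicate files in the example


def annual_storage_cost(files: int, size_mb: float, price_per_gb_month: float) -> float:
    """Return the yearly cost in USD of storing `files` files of `size_mb` MB each."""
    total_gb = files * size_mb / 1000  # assumption: decimal units, 1 GB = 1000 MB
    return total_gb * price_per_gb_month * 12


total_tb = REDUNDANT_FILES * FILE_SIZE_MB / 1_000_000
cost = annual_storage_cost(REDUNDANT_FILES, FILE_SIZE_MB, PRICE_PER_GB_MONTH)
print(f"{total_tb:.2f} TB of duplicates, about ${cost:.2f} per year")
```

The storage bill scales linearly with file count and file size, so changing either constant shows how quickly, or slowly, the line item grows.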

The real cost, practitioners argue, is labor. Archivists and digital asset managers working at organizations like the San Francisco Museum of Modern Art on Third Street, or the California Historical Society on Mission Street, spend measurable portions of their workweek navigating image libraries bloated with near-duplicates — slightly different crops, varying resolutions, or multiple exports of the same source file. Industry benchmarks from digital asset management consultants suggest that knowledge workers lose an average of 30 minutes per day searching for digital files in poorly organized systems, a figure that compounds quickly across large teams.

What Institutions Are Doing About It

Several San Francisco-based technology nonprofits, including those clustered around the Mid-Market corridor on Market Street, have begun piloting AI-assisted deduplication workflows in 2026. These tools use perceptual hashing and vector embeddings to identify images that are visually identical or near-identical even when file names, metadata and timestamps differ. Earlier checksum-based tools, which match only byte-for-byte identical files, missed those near-duplicates entirely.
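
To see why a checksum misses what a perceptual hash catches, consider a minimal sketch in Python. It is not the method used by any tool or institution named in this story: it implements a bare-bones "average hash" on a hand-written 4x4 grid of grayscale values, whereas real tools typically decode and downscale actual image files first. All names and pixel values here are illustrative assumptions.

```python
import hashlib


def checksum(pixels):
    """Exact-match fingerprint: SHA-256 of the raw pixel bytes."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


def average_hash(pixels):
    """Simple perceptual hash: one bit per pixel, 1 if brighter than the mean."""
    flat = [v for row in pixels for v in row]
    mean = sum(flat) / len(flat)
    return [1 if v > mean else 0 for v in flat]


def hamming(a, b):
    """Number of bit positions where two hashes disagree."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical 4x4 grayscale image: dark left half, bright right half.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 198, 208],
    [11, 21, 202, 212],
]
# A "re-export" of the same picture with every pixel brightened slightly.
reexport = [[min(255, v + 3) for v in row] for row in original]
# A genuinely different picture: a checkerboard.
different = [
    [200, 10, 200, 10],
    [10, 200, 10, 200],
    [200, 10, 200, 10],
    [10, 200, 10, 200],
]

print("checksums match:", checksum(original) == checksum(reexport))
print("hash distance, re-export:", hamming(average_hash(original), average_hash(reexport)))
print("hash distance, different:", hamming(average_hash(original), average_hash(different)))
```

The re-export changes the checksum but leaves the average hash untouched, while the unrelated image lands far away in Hamming distance. In practice a small distance threshold, rather than exact equality, decides what counts as a near-duplicate.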

The San Francisco Digital Services team, housed within the Department of Technology on Seventh Street, is evaluating whether a citywide digital asset management policy could standardize deduplication protocols across departments. No formal policy has been adopted yet, but the evaluation is listed in the department's 2026 technology roadmap as an active workstream.

For smaller nonprofits and cultural institutions working with tighter budgets, free and open-source tools such as dupeGuru and digiKam offer accessible starting points for image library audits. The practical first step recommended by digital archivists is a baseline inventory: count total image files, run a hash comparison, and establish what percentage of the library is redundant before committing to any platform migration. In a city where every budget line is being scrutinized this July, that kind of unglamorous data hygiene work is suddenly looking like real fiscal responsibility.
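
The baseline inventory described above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow of dupeGuru, digiKam or any institution named here: it counts image files under a folder, groups them by SHA-256 checksum, and reports what share are exact byte-for-byte copies. The function names and the list of file extensions are assumptions, and near-duplicates such as crops or resized exports would need perceptual hashing on top.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed set of extensions to treat as images; adjust for the library at hand.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif"}


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def baseline_inventory(root: Path) -> dict:
    """Count image files under `root` and measure how many are exact duplicates."""
    groups = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            groups[file_digest(path)].append(path)

    total = sum(len(paths) for paths in groups.values())
    redundant = total - len(groups)  # every copy beyond the first in each group
    percent = 100 * redundant / total if total else 0.0
    return {"total": total, "redundant": redundant, "percent_redundant": percent}
```

Calling `baseline_inventory(Path("photos"))` on a hypothetical `photos` folder returns the total image count, the number of redundant copies and the redundant percentage, which is the baseline figure to establish before any platform migration.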

About this article

Published by The Daily San Francisco

This article was produced by The Daily San Francisco editorial desk and covers news in San Francisco. See our editorial standards for how we use AI.
