San Francisco's municipal data managers have a problem they rarely talk about at budget hearings: enormous swaths of the city's digital image archives — from SFMTA transit documentation to Department of Public Works infrastructure photos — are clogged with duplicate files that consume server space, slow retrieval systems, and inflate storage contracts that taxpayers ultimately fund. Internal audits reviewed by The Daily San Francisco suggest the problem is systemic and getting more expensive to ignore.
The timing matters. The city is midway through a multi-year digital infrastructure overhaul tied to the San Francisco Digital Services office, which launched a renewed asset-management initiative in late 2024. As agencies migrate legacy records onto centralized cloud platforms, the duplicate-image problem is surfacing in ways it never did when files sat on isolated departmental hard drives in buildings like City Hall Room 495 or the DPW operations center on Cesar Chavez Street. The migration is forcing a reckoning with data hygiene that administrators have deferred for more than a decade.
The San Francisco Public Library's digital collections arm, which manages the San Francisco History Center at the Main Branch on Larkin Street, has grappled publicly with the issue. The History Center's digitization team, working with the Internet Archive on a long-running preservation project, identified that early scanning runs between 2011 and 2016 produced substantial redundancy, with some photographic series duplicated two or three times across different catalog entries. The library declined to provide a precise count for this story, but archivists in the field commonly cite duplication rates between 15 and 45 percent in collections digitized before modern deduplication software became standard practice around 2018.
What the Data Actually Shows
The numbers are striking when you look at the storage economics. Cloud object storage, the kind used by city contractors through agreements with vendors including Amazon Web Services and Microsoft Azure, runs roughly $0.023 per gigabyte per month at standard commercial tiers as of mid-2026. A collection running 10 terabytes with a 40 percent duplication rate holds about 4 terabytes of redundant files and is spending approximately $1,100 a year storing data that adds zero informational value. Multiply that across a dozen city departments with comparable collections and the figure reaches roughly $13,000 a year, before accounting for the labor costs of archivists and IT staff manually reviewing flagged files.
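For readers who want to check the storage math against their own collections, here is a minimal sketch of the calculation. It assumes the flat $0.023 per gigabyte per month rate cited above and decimal units (1,000 gigabytes per terabyte). The function name and parameters are illustrative, not drawn from any city tool.

```python
def annual_duplicate_cost(collection_tb, duplication_rate,
                          price_per_gb_month=0.023, gb_per_tb=1000):
    """Estimate the yearly cost of storing duplicate files.

    Assumes a flat per-gigabyte monthly rate and decimal terabytes;
    real cloud bills also include request and transfer fees.
    """
    duplicate_gb = collection_tb * gb_per_tb * duplication_rate
    return duplicate_gb * price_per_gb_month * 12


# The article's example: 10 TB collection, 40 percent duplicates.
cost = annual_duplicate_cost(10, 0.40)
print(f"${cost:,.0f} per year")
```

For the 10-terabyte example this comes to roughly $1,104 a year. Passing a different collection size or duplication rate shows how the figure scales.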
The San Francisco Arts Commission's public art archive, which documents every piece in the Civic Art Collection, more than 4,000 works spanning murals in the Mission District to sculptures at Civic Center Plaza, completed a deduplication audit in the spring of 2025. Commission staff used open-source perceptual hashing tools, a technique that identifies near-identical images even when the files differ in name, resolution, or compression, and removed more than 6,200 redundant image files from a collection that had grown to roughly 85,000 assets. That represents a duplication rate just above 7 percent, notably lower than industry averages for collections of comparable age, suggesting the Commission's cataloging discipline over the years paid off.
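To give a sense of how perceptual hashing works, here is a simplified sketch of one common variant, the average hash. It is not the Commission's tooling, which the audit does not name. The sketch assumes each image has already been shrunk to a tiny grayscale grid (real tools typically downscale to 8 by 8 pixels first), and the pixel grids and the `max_distance` threshold are made up for illustration.

```python
def average_hash(pixels):
    """Build a bit fingerprint: 1 where a pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bit positions where two fingerprints differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(pixels_a, pixels_b, max_distance=2):
    # max_distance is an illustrative threshold, not a standard value.
    return hamming_distance(average_hash(pixels_a),
                            average_hash(pixels_b)) <= max_distance


# Hypothetical 4x4 grayscale grids (0 = black, 255 = white).
original = [[10, 20, 200, 210],
            [15, 25, 205, 215],
            [10, 20, 200, 210],
            [15, 25, 205, 215]]
# Same picture, slightly brightened, as a re-export might produce.
brightened = [[value + 5 for value in row] for row in original]
# A different picture: dark top half, bright bottom half.
different = [[10, 15, 10, 15],
             [20, 25, 20, 25],
             [200, 205, 200, 205],
             [210, 215, 210, 215]]

print(is_near_duplicate(original, brightened))  # True
print(is_near_duplicate(original, different))   # False
```

Because the fingerprint depends on the pattern of light and dark rather than the exact bytes, the brightened copy matches the original even though no pixel value is identical.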
Local nonprofits face the same headache with fewer resources. Glide Memorial Church on Taylor Street in the Tenderloin, which has documented its community programs photographically for decades, contracted with a Mission District-based digital preservation firm in early 2026 to audit roughly 120,000 image files. The preliminary finding: nearly 18,000 files — about 15 percent — were duplicates or near-duplicates generated by bulk imports from smartphones and event cameras over multiple years.
What Organizations Should Do Now
Archivists and data managers interviewed on background point to three concrete steps that apply whether you're a city agency on Van Ness Avenue or a neighborhood historical society in the Excelsior. First, run a perceptual hash audit before any cloud migration, not after, because retroactive cleanup costs significantly more in staff time. Second, establish a file-naming and import protocol that flags potential duplicates at ingestion rather than leaving them to be discovered later. Third, budget explicitly for deduplication as a line item: the San Francisco Digital Services office recommends allocating roughly 8 to 12 percent of any digitization project budget to data-quality remediation, according to its 2025 project guidelines published on the city's data portal.
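The second step, flagging duplicates at ingestion, can be sketched in a few lines. This example uses a SHA-256 content hash, which catches only byte-for-byte copies; near-duplicates would need a perceptual hash as well. The class name, file names, and file contents are hypothetical.

```python
import hashlib


class IngestIndex:
    """Remember the content hash of every file accepted so far."""

    def __init__(self):
        self._first_seen = {}  # SHA-256 hex digest -> first file name

    def ingest(self, name, data):
        """Return the earlier file's name if data is an exact copy, else None."""
        digest = hashlib.sha256(data).hexdigest()
        earlier = self._first_seen.get(digest)
        if earlier is not None:
            return earlier  # flag: identical bytes already in the archive
        self._first_seen[digest] = name
        return None


index = IngestIndex()
print(index.ingest("event_001.jpg", b"example image bytes"))   # None
print(index.ingest("IMG_4471.jpg", b"example image bytes"))    # event_001.jpg
print(index.ingest("event_002.jpg", b"other image bytes"))     # None
```

Checking at import time means a bulk upload from a phone or event camera is flagged before the copies reach paid storage, which is the point of doing the work at ingestion.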
With the city's broader digital infrastructure contract up for renewal before the Board of Supervisors in the fourth quarter of 2026, the duplicate-image problem is unlikely to stay a back-office footnote for much longer. Every gigabyte counts when the budget is tight and the servers are full.