The San Francisco Public Library system holds more than 1.2 million digital image files across its History Center collections at Larkin Street, and by the library's own internal estimates, somewhere between 18 and 22 percent of those files are duplicates: identical or near-identical scans uploaded multiple times by different departments or digitization contractors over the past decade. At that rate, roughly 216,000 to 264,000 files are redundant. That redundancy is not just a filing inconvenience. It costs storage budget, slows search systems, and frustrates the archivists trying to make those records publicly accessible.
The problem sits at the intersection of two forces pressing hard on San Francisco right now: a municipal government still wrestling with budget shortfalls that pushed the city to close a roughly $800 million two-year deficit in 2025, and a tech sector that has spent the last 18 months selling AI-powered deduplication tools to any institution willing to listen. For city agencies, nonprofits, and public bodies already stretched thin, the question of how many duplicate images they're actually carrying — and what it genuinely costs — has become a budget line item, not an abstract IT concern.
The Scale of the Problem Across City Institutions
The library is not alone. The San Francisco Arts Commission, which maintains a publicly accessible digital registry of the city's more than 4,000 public artworks, has flagged duplicate image records as a recurring data quality problem in its Civic Art Collection database. Staff there have noted that contractor submissions, media uploads, and internal photography sessions frequently generate multiple versions of the same artwork image without a systematic deduplication pass before ingestion.
At the San Francisco Municipal Transportation Agency (SFMTA), which manages photo documentation for infrastructure projects across Muni and the agency's other capital programs, the volume of project imagery generated annually runs into the hundreds of thousands of files. Storage costs for cloud-based asset management systems have climbed steadily. Enterprise-grade cloud storage for large image libraries typically runs between $0.02 and $0.05 per gigabyte per month on major platforms, and a library with 500,000 high-resolution TIFF files can occupy 10 terabytes or more without aggressive file management.
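To put those figures together, the arithmetic can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and it assumes that the library's 18 to 22 percent duplicate estimate carries over to a 10-terabyte collection and that duplicates are of average file size. Neither assumption comes from the agencies themselves.

```python
def monthly_storage_cost(total_tb, price_per_gb, duplicate_share=0.0):
    """Return (total monthly cost, portion spent on duplicates) in dollars.

    Assumes decimal terabytes (1 TB = 1,000 GB) and that duplicate files
    are, on average, the same size as every other file in the collection.
    """
    total_gb = total_tb * 1000
    total_cost = total_gb * price_per_gb
    return total_cost, total_cost * duplicate_share


# Price range and library size are taken from the figures quoted above;
# the duplicate shares are the library's 18 and 22 percent estimates.
for price in (0.02, 0.05):
    for share in (0.18, 0.22):
        total, wasted = monthly_storage_cost(10, price, share)
        print(f"${price:.2f}/GB, {share:.0%} duplicates: "
              f"${total:,.2f} per month, ${wasted:,.2f} on duplicates")
```

Under those assumptions, a 10-terabyte library costs between $200 and $500 a month to store, of which roughly $36 to $110 pays for redundant copies. The figure excludes retrieval fees, backups, and staff time, which can outweigh raw storage.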
The nonprofit sector faces parallel pressure. The Tenderloin Housing Clinic and Glide Memorial Church on Taylor Street, both of which have used photography extensively for grant documentation and advocacy over decades, maintain digital archives that have grown organically rather than systematically. A deduplication audit at an organization of that scale, holding 50,000 to 200,000 image files, can take a staff member weeks using manual tools, or cost several thousand dollars if outsourced to a digital asset management firm.
AI Tools Enter, But Accuracy Questions Follow
The current generation of AI-driven duplicate detection tools works by generating perceptual hashes (numerical fingerprints of an image's visual content) and flagging pairs whose hashes differ by less than a set distance threshold. Vendors, including companies with offices in SoMa's tech corridor, have pitched these systems to Bay Area public agencies since late 2024, with accuracy rates for exact duplicates typically above 99 percent. Near-duplicate detection, which catches slightly cropped or color-adjusted versions of the same image, is more variable, with published benchmarks ranging from 87 to 95 percent depending on the dataset.
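For readers who want to see the idea in miniature, the sketch below implements an average hash, one of the simplest perceptual hashes, and compares two hashes by Hamming distance. It is not any vendor's method. It assumes each image has already been converted to a small grid of grayscale values (real tools typically downscale first, often to 8 by 8 pixels), and the function names and the `max_distance` default are hypothetical.

```python
def average_hash(pixels):
    """Return a list of bits: 1 where a pixel is brighter than the image mean.

    `pixels` is assumed to be a small, already-downscaled grid of grayscale
    values (a list of equal-length rows).
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Count the bit positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must be the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def is_probable_duplicate(pixels_a, pixels_b, max_distance=2):
    """Flag a pair when the hash distance is at or below the threshold."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# Toy 4x4 "images": an original, a uniformly brightened copy, and a
# mirrored image with the dark and light halves swapped.
original = [[10, 20, 200, 210], [15, 25, 205, 215],
            [10, 20, 200, 210], [15, 25, 205, 215]]
brightened = [[value + 10 for value in row] for row in original]
mirrored = [list(reversed(row)) for row in original]

print(is_probable_duplicate(original, brightened))  # True
print(is_probable_duplicate(original, mirrored))    # False
```

The brightened copy produces the same hash as the original because every pixel keeps its position relative to the mean, which is why this family of hashes tolerates color adjustments. Crops and rotations shift the bit pattern, and that helps explain why near-duplicate accuracy varies more than exact-match accuracy.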
That variance matters. A false positive — flagging two legitimately different images as duplicates — in a historical archive can mean a photograph gets deleted permanently. The San Francisco History Center's photograph collection includes irreplaceable images from the 1906 earthquake and the Fillmore District's Jazz Era, making any automated culling process a decision with real cultural stakes.
For institutions moving forward, archivists and digital asset managers recommend a three-step approach: run a perceptual hash audit first to identify the full scope of duplication, route anything below 100 percent match confidence to human review, and establish file naming and metadata conventions before the next digitization contract is signed. The Library of Congress has published guidance along these lines since 2023, and the California Digital Library in Oakland offers consulting support for member institutions navigating exactly these decisions. Getting the numbers right before deleting anything is the one step that cannot be skipped.
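The review step in that approach can be sketched as a simple triage routine. This is a hypothetical illustration, not a published workflow: the file names, the confidence scores, and the `review_floor` cutoff of 0.85 are all invented for the example, and the code only sorts pairs into queues. It deletes nothing.

```python
def triage(pairs, review_floor=0.85):
    """Sort candidate duplicate pairs into queues by match confidence.

    `pairs` is an iterable of (file_a, file_b, confidence) tuples, with
    confidence between 0.0 and 1.0. `review_floor` is an arbitrary cutoff
    chosen for illustration; pairs scoring below it are treated as distinct.
    """
    queues = {"exact_match": [], "human_review": [], "distinct": []}
    for file_a, file_b, confidence in pairs:
        if confidence >= 1.0:
            queues["exact_match"].append((file_a, file_b))
        elif confidence >= review_floor:
            # Anything short of a 100 percent match goes to a person.
            queues["human_review"].append((file_a, file_b))
        else:
            queues["distinct"].append((file_a, file_b))
    return queues


# Hypothetical audit output: file names and scores are made up.
candidates = [
    ("scan_0001.tif", "scan_0001_copy.tif", 1.0),
    ("scan_0002.tif", "scan_0002_cropped.tif", 0.93),
    ("scan_0003.tif", "scan_0417.tif", 0.40),
]

for queue, items in triage(candidates).items():
    print(queue, len(items))
```

Keeping exact matches in their own queue rather than removing them automatically leaves room for a final sign-off, which matters when the files in question cannot be replaced.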