The Daily San Francisco

San Francisco's Digital Archives Are Riddled With Duplicate Images — And the Numbers Are Staggering

A citywide audit of municipal and nonprofit digital records reveals millions of redundant image files draining storage budgets and slowing public-access systems across San Francisco.

By San Francisco News Desk · Published 4 July 2026, 12:12 pm

San Francisco's public agencies are sitting on a digital hoarding problem. An internal review circulated this spring among city department heads found that duplicate image files account for roughly 34 percent of all stored digital assets across municipal databases — a figure that translates, in raw storage terms, to an estimated 18 petabytes of redundant data costing the city several million dollars annually in cloud hosting fees alone.

The timing matters. The city's Department of Technology is midway through a $47 million infrastructure modernization push that includes migrating legacy records to a unified cloud platform. That migration, expected to be complete by March 2027, has exposed how poorly catalogued San Francisco's digital image libraries are, particularly in departments that digitized paper records during the pandemic years without consistent metadata standards.

Where the Problem Concentrates

The worst backlogs sit in the Planning Department's parcel-photo archive on Seventh Street and in the San Francisco Public Library's digital collections hub, which manages the city's historical photograph repository out of the main branch on Larkin Street in Civic Center. Librarians there have flagged that some individual photographs — particularly images of the 1906 earthquake and the Fillmore District jazz era — exist in as many as 40 separate file versions across different digitization projects, each with slightly different file names and cropping but identical underlying content.

The SF Digital Services office, the Mayor's Office unit tasked with improving online public tools, has been working since January 2026 with a contractor called Starling Labs — a Stanford-affiliated digital verification nonprofit with offices on Mission Street — to pilot an automated deduplication protocol across three city departments. The pilot covers Recreation and Parks, the City Attorney's office, and the Office of the Assessor-Recorder. Early results from the Assessor-Recorder phase, completed in April, identified 2.3 million duplicate image files out of 6.8 million total — a 34 percent redundancy rate that matched the citywide estimate almost exactly.

The financial drag is real. Cloud storage prices for municipal contracts typically run between $0.02 and $0.05 per gigabyte per month depending on tier and vendor. At 18 petabytes of purely redundant storage — a conservative estimate that several city IT managers have described as likely understated — the annual cost sits somewhere between $4.3 million and $10.8 million, depending on which tier the data occupies. That range, while wide, represents money that budget analysts say could otherwise fund roughly 30 to 70 additional Muni operator positions.
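The cost range can be reproduced with a few lines of arithmetic. The sketch below is a back-of-the-envelope check, not the city's own budgeting method: it assumes decimal units (1 petabyte = 1,000,000 gigabytes) and a flat per-gigabyte rate for the whole year, and the function name `annual_cost_usd` is purely illustrative.

```python
# Rough check of the storage cost range quoted in the article.
# Assumptions (not from the source): decimal units and a flat monthly rate.
GB_PER_PB = 1_000_000  # 1 PB = 1,000,000 GB in decimal units


def annual_cost_usd(petabytes, usd_per_gb_month):
    """Yearly cost of holding `petabytes` at a flat monthly per-GB price."""
    return petabytes * GB_PER_PB * usd_per_gb_month * 12


redundant_pb = 18  # estimated redundant storage, from the article
low = annual_cost_usd(redundant_pb, 0.02)   # cheapest quoted tier
high = annual_cost_usd(redundant_pb, 0.05)  # most expensive quoted tier

print(f"${low / 1e6:.2f}M to ${high / 1e6:.2f}M per year")
# prints: $4.32M to $10.80M per year
```

Under those assumptions the result lines up with the $4.3 million to $10.8 million range cited above.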

What Deduplication Actually Fixes

Beyond cost, duplicate image bloat slows the public-facing systems San Franciscans actually use. The city's open data portal, DataSF, serves roughly 280,000 page views per month according to figures the platform published for Q1 2026. Database queries that touch image-heavy datasets — zoning maps, building permit photos, neighborhood planning documents — run measurably slower when indexes are cluttered with redundant files pointing to functionally identical content.

Nonprofits aren't immune either. The Internet Archive, headquartered on Funston Avenue in the Inner Richmond, has grappled with the same structural problem at a global scale and has developed hash-based deduplication tools that SF Digital Services is now evaluating for potential city adoption. The approach computes a cryptographic fingerprint, or hash, from each image file's contents at the moment of upload; any later file that produces an identical fingerprint is flagged automatically rather than stored as a new asset.
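The general idea of hash-based deduplication can be sketched in a few lines. This is a minimal illustration, not the Internet Archive's or the city's actual tooling: the choice of SHA-256, the `DedupStore` class, and the file names are all assumptions made for the example.

```python
import hashlib


def fingerprint(data):
    """Return a SHA-256 hex digest of the file's bytes (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class DedupStore:
    """Hypothetical in-memory store that flags byte-identical uploads."""

    def __init__(self):
        self._by_hash = {}   # fingerprint -> name of the first file stored
        self.flagged = []    # (duplicate name, original name) pairs

    def upload(self, name, data):
        """Store a new file, or flag it if its fingerprint is already known."""
        digest = fingerprint(data)
        original = self._by_hash.get(digest)
        if original is not None:
            self.flagged.append((name, original))
            return False  # duplicate: flagged, not stored again
        self._by_hash[digest] = name
        return True


store = DedupStore()
store.upload("quake_1906.tif", b"scan-bytes-A")        # stored
store.upload("quake_1906_copy.tif", b"scan-bytes-A")   # flagged as duplicate
store.upload("quake_1906_crop.tif", b"scan-bytes-B")   # different bytes, stored
print(store.flagged)
```

One limitation worth noting: a cryptographic hash only matches files that are identical byte for byte. A re-cropped or re-encoded copy of the same photograph produces a different fingerprint, so catching those versions would likely require a different technique, such as perceptual hashing.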

For San Francisco residents and businesses that rely on public records for permitting, property research, or historical documentation, the practical upshot is straightforward: expect faster load times and more accurate search results on city platforms as the deduplication project scales beyond its current three-department pilot. SF Digital Services has indicated it plans to expand the program to the Planning Department and the Department of Building Inspection by the end of 2026. The full citywide rollout, if it holds to the current March 2027 timeline, would mark the first comprehensive image-library audit San Francisco has conducted since 2014.

About this article

Published by The Daily San Francisco

This article was produced by The Daily San Francisco editorial desk and covers news in San Francisco. See our editorial standards for how we use AI.
