The Daily San Francisco

How San Francisco's Digital Archives Ended Up Full of the Same Image Twice — And Why Fixing It Took Years

The city's slow reckoning with duplicate image problems in its public-records systems traces back to a decade of rushed digitization, siloed departments, and a tech boom that prioritized speed over data hygiene.

By San Francisco News Desk · Published 4 July 2026, 11:57 am

Photo: GuiGo Lopes on Pexels

San Francisco's Office of Digital Services confirmed this week that a multi-year audit of the city's public-facing image repositories — spanning the Planning Department's permit portal on South Van Ness Avenue, the Recreation and Parks Department's online inventory, and the SF Public Library's digital collections at Larkin Street — had identified more than 340,000 duplicate image files clogging databases that residents and city workers depend on daily. The cleanup effort, funded through a $2.1 million line item in the Fiscal Year 2025–26 budget, is now formally underway.

The problem matters right now because the city is simultaneously trying to digitize decades of paper records under the California Public Records Act modernization push, while also migrating legacy systems to a new cloud infrastructure managed through a contract with the Department of Technology on Grove Street. Pouring new records into a database that is already swollen with redundant files risks compounding errors, slowing search functions, and increasing storage costs at a moment when the city's tech budget is under intense scrutiny.

How the Duplicates Piled Up

The roots of the problem stretch back to roughly 2013, when departments across City Hall began independent digitization sprints with little coordination. The Planning Department scanned permit records at high volume after the post-recession construction boom pushed application numbers past 35,000 per year. Each scanner operator uploaded files to whatever shared drive their supervisor designated. When the city consolidated some of those drives in 2017 under the DataSF program, the migration scripts did not check whether a file already existed at the destination, effectively doubling entire folder trees overnight.
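To make the missing safeguard concrete, here is a minimal sketch of a migration step that checks for existing copies before writing. It is an illustration only, not the city's actual script: the function names `sha256_of` and `migrate_without_duplicates` are hypothetical, and comparing files by SHA-256 content hash is an assumed approach.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_without_duplicates(source: Path, target: Path) -> tuple[int, int]:
    """Copy files from source into target, skipping content already present.

    Hypothetical helper for illustration. Returns (copied, skipped).
    """
    target.mkdir(parents=True, exist_ok=True)
    # Index what the target already holds, keyed by content hash.
    seen = {sha256_of(p) for p in target.rglob("*") if p.is_file()}
    copied = skipped = 0
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        checksum = sha256_of(src)
        if checksum in seen:
            skipped += 1  # identical bytes already exist somewhere in target
            continue
        dest = target / src.relative_to(source)
        if dest.exists():
            # Same name but different content: keep both under distinct names.
            dest = dest.with_name(f"{dest.stem}-{checksum[:8]}{dest.suffix}")
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        seen.add(checksum)
        copied += 1
    return copied, skipped
```

Run a second time over the same folders, a step like this copies nothing, which is the property the 2017 consolidation apparently lacked.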

The Recreation and Parks Department compounded matters during the COVID-19 shutdown in 2020, when staff scanning archival photographs of Golden Gate Park and Dolores Park used a third-party vendor whose software auto-saved both a compressed JPEG and an uncompressed TIFF for every image — standard practice in archival work, but nobody flagged it when those files landed in a shared city portal not designed for dual-format storage. By 2022, internal estimates put duplicate storage waste in that department alone at roughly 18 terabytes.
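A JPEG and a TIFF of the same photograph are not byte-for-byte identical, so a plain checksum comparison would not pair them. One way such dual-format files might be surfaced is by matching base file names within a folder. The sketch below assumes the vendor's software saved both formats under the same base name, which the reporting does not confirm; `find_dual_format_pairs` is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import Path

JPEG_SUFFIXES = {".jpg", ".jpeg"}
TIFF_SUFFIXES = {".tif", ".tiff"}


def find_dual_format_pairs(root: Path) -> list[tuple[Path, Path]]:
    """Return (jpeg, tiff) pairs that share a folder and a base file name.

    Assumption: both formats were saved under the same base name.
    """
    by_stem: dict = defaultdict(dict)
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        key = (path.parent, path.stem.lower())
        if suffix in JPEG_SUFFIXES:
            by_stem[key]["jpeg"] = path
        elif suffix in TIFF_SUFFIXES:
            by_stem[key]["tiff"] = path
    # Keep only the base names for which both formats were found.
    return sorted(
        (found["jpeg"], found["tiff"])
        for found in by_stem.values()
        if "jpeg" in found and "tiff" in found
    )
```

Which copy to keep is a records-policy decision rather than a technical one, since archival practice generally treats the uncompressed file as the master.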

The SF Public Library's digital branch ran into a different version of the same trouble. A 2019 grant from the Institute of Museum and Library Services funded a rapid scan of the San Francisco History Center's photograph collection — more than 50,000 images of the city from the 1850s onward. Tight grant deadlines meant quality checks were minimal. When staff later uploaded a corrected batch, the originals were never deleted. Both sets lived side by side in the catalog, creating search results that returned identical images under different metadata tags, confusing researchers and wasting staff time.

The Audit, and What Comes Next

The current audit was triggered in part by a 2024 report from the Budget and Legislative Analyst's office, which noted that the city was paying for approximately 4.7 petabytes of cloud storage, a figure that external reviewers believed could be cut by at least 20 percent, or roughly 0.94 petabytes, with proper deduplication. At current AWS and Azure enterprise rates, that excess storage costs the city an estimated $180,000 a year in unnecessary fees.
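The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers cited above and assumes decimal storage units (1 petabyte = 1,000,000 gigabytes). The per-gigabyte rate it derives is implied by those figures, not a quoted vendor price.

```python
# Figures reported in the article.
TOTAL_STORAGE_PB = 4.7
REDUCIBLE_FRACTION = 0.20          # "at least 20 percent"
EXCESS_COST_PER_YEAR_USD = 180_000

# Assumption: decimal units (1 PB = 1,000,000 GB).
GB_PER_PB = 1_000_000

excess_pb = TOTAL_STORAGE_PB * REDUCIBLE_FRACTION
excess_gb = excess_pb * GB_PER_PB
# Blended price implied by the reported annual cost.
implied_rate = EXCESS_COST_PER_YEAR_USD / excess_gb / 12

print(f"Excess storage: {excess_pb:.2f} PB ({excess_gb:,.0f} GB)")
print(f"Implied blended rate: ${implied_rate:.4f} per GB per month")
# Excess storage: 0.94 PB (940,000 GB)
# Implied blended rate: $0.0160 per GB per month
```

At about 1.6 cents per gigabyte per month, the estimate is in the range of published cloud object-storage pricing. Because the 20 percent figure is a floor, the actual savings could be higher.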

The Office of Digital Services is now deploying hashdeep, an open-source utility that computes and audits file checksums, to identify byte-identical copies, alongside a custom-built verification layer developed by a small team at the city's Civic Innovation Lab on Market Street. The goal is to complete the first phase of deletion by October 1, 2026, ahead of the next budget cycle. Departments have been told to designate a single records coordinator who must sign off on any new bulk upload of more than 500 files.
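Checksum-based deduplication rests on a simple idea: files with identical contents produce identical hashes. The sketch below shows that idea in plain Python. It is a stand-in for illustration, not the tool's own interface and not the city's verification layer; the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    # Cheap first pass: files of different sizes cannot be identical.
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    # Second pass: hash only the files that share a size with another file.
    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        for path in same_size:
            by_hash[sha256_file(path)].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]
```

A report like this only identifies candidates. Deciding which copy in each group survives, and confirming that no catalog record still points at the others, is the kind of check a separate verification step would need to handle.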

For residents who use the city's online permit tracker or the library's digital archive, the practical effect should eventually be cleaner search results and faster load times. Anyone who has filed a Sunshine Ordinance request and received a response packed with duplicate attachments — a complaint that showed up repeatedly in ombudsman logs as recently as March 2026 — may also see shorter, more accurate document packets. The Office of Digital Services has posted a public dashboard at datasf.org tracking deletion progress by department, updated weekly.

About this article

Published by The Daily San Francisco

This article was produced by The Daily San Francisco editorial desk and covers news in San Francisco. See our editorial standards for how we use AI.
