The Daily San Francisco

San Francisco's Duplicate Image Problem: The Numbers Driving a City-Wide Data Cleanup

From municipal permit databases to Muni route maps, redundant and duplicate imagery is costing the city measurable time and money — and the reckoning is overdue.

By San Francisco News Desk · Published 4 July 2026, 12:51 pm

Photo: Clément Proust on Pexels

San Francisco's Department of Technology has flagged duplicate image files as a growing liability across at least seven city data systems. Internal assessments point to terabytes of storage bloat on platforms maintained by agencies ranging from the Planning Department on Mission Street to the San Francisco Municipal Transportation Agency at 1 South Van Ness. The problem is unglamorous, largely invisible to the public, and getting expensive.

The timing matters because the city is midway through a digital infrastructure overhaul that Mayor Daniel Lurie's office began prioritizing in early 2026, partly in response to pressure from the Budget and Legislative Analyst's office to reduce redundant operating costs. Cloud storage is not cheap. Industry benchmarks from the Cloud Security Alliance put average enterprise cloud storage costs at roughly $0.023 per gigabyte per month, and when city departments are sitting on hundreds of thousands of duplicate image assets (building permit photos, transit signage scans, GIS map tiles), the costs add up quickly.

Where the Redundancy Lives

The San Francisco Planning Department's permit portal, used by contractors and residents across the city, has accumulated visual records since its digitization push began in earnest around 2018. By early 2026, city technology staff had found that a significant share of image files in the permit system existed as two or more identical copies stored under different file names or directory paths, a byproduct of multiple upload pathways and inconsistent file-handling protocols introduced during successive software migrations. The SFMTA's real-time passenger information system, which pulls from camera feeds and static route imagery at hundreds of stops from the Embarcadero to Balboa Park, faces a parallel issue: redundant image assets are cached across both legacy servers and newer cloud nodes.
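For readers curious how identical copies under different names can be found, here is a minimal sketch in Python. It is an illustration only, not the city's tooling: the function names and the choice of SHA-256 are assumptions made for this example. The idea is to fingerprint each file's contents with a hash, ignore the file name and path entirely, and group files that share a fingerprint.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(root: Path) -> list[list[Path]]:
    """Group files under `root` whose bytes are identical.

    Hypothetical helper for illustration: names and paths are ignored,
    only file contents are compared.
    """
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Keep only digests shared by two or more files.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

A check like this catches only byte-for-byte copies. A resized or re-compressed version of the same photo produces a different hash, which is why near-duplicate images usually need a separate technique or manual review.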

The SF Digital Services team, based at City Hall and responsible for consumer-facing municipal web tools, began a systematic deduplication audit in March 2026, targeting the DataSF open data portal first. DataSF hosts more than 600 public datasets, and among those with visual components, the audit found duplication rates ranging from roughly 12 percent to more than 30 percent of total image assets in some older environmental mapping datasets. Even at the low end, that represents thousands of files consuming storage and slowing query response times for researchers, journalists, and developers who rely on the portal.

What Deduplication Actually Costs — and Saves

Running a deduplication process is not free. The city's contract with its primary cloud vendor, terms of which were not publicly detailed in documents reviewed for this article, includes compute costs for scanning and hashing operations. Technology vendors in the municipal space typically quote deduplication projects for mid-size government data environments comparable in scope to San Francisco's footprint at between $80,000 and $250,000 for a one-time remediation, depending on data volume and the degree of manual review required for ambiguous near-duplicate images. The break-even calculation, however, tends to favor action: a 20 percent storage reduction across systems carrying six-figure annual cloud bills returns savings within a single budget cycle.
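The break-even arithmetic can be sketched in a few lines of Python. This is a simplified illustration, not the city's model: the function name is hypothetical, the $500,000 annual bill is an assumed figure chosen only to represent a six-figure budget, and the sketch assumes the bill shrinks in direct proportion to the storage removed.

```python
def payback_months(annual_cloud_bill: float,
                   storage_reduction: float,
                   one_time_cost: float) -> float:
    """Months needed for storage savings to repay a one-time cleanup.

    Assumes (for illustration) that the cloud bill falls in direct
    proportion to the share of storage removed.
    """
    annual_savings = annual_cloud_bill * storage_reduction
    return 12 * one_time_cost / annual_savings


# Hypothetical inputs: a $500,000 annual bill and a 20 percent reduction,
# tested against the low and high ends of the quoted remediation range.
for cost in (80_000, 250_000):
    months = payback_months(500_000, 0.20, cost)
    print(f"${cost:,} remediation: about {months:.1f} months to break even")
```

Under these assumed inputs, the payback period depends heavily on where a project falls in the quoted cost range and on the size of the bill being reduced, so the real figure would have to come from the city's own contract numbers.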

San Francisco Public Library's digital collections team at the main branch on Larkin Street completed a smaller-scale version of this work in late 2024, clearing roughly 40,000 duplicate image files from its digitized historical photograph archive. The process took three staff members approximately six weeks and freed capacity that had previously required a supplemental storage allocation.

For city residents and the businesses that interact with planning and transit portals daily, the practical payoff is faster load times and more accurate search results — fewer cases where the same pothole photo or the same bus stop image surfaces three times in a single results page. For the agencies themselves, cleaner data pipelines reduce the risk of errors that arise when staff or automated systems act on stale duplicate records instead of authoritative ones.

The Department of Technology has not yet announced a citywide deduplication completion date, but the digital services audit is expected to produce a public report by the end of the third quarter of 2026. City departments that have not yet submitted their image inventories for review have been given a deadline of August 15 to comply with the data hygiene directive issued in May.


About this article

Published by The Daily San Francisco

This article was produced by The Daily San Francisco editorial desk and covers news in San Francisco. See our editorial standards for how we use AI.
