San Francisco's Duplicate Image Problem: The Numbers Driving a City-Wide Data Cleanup
From municipal permit databases to Muni route maps, redundant and duplicate imagery is costing the city measurable time and money — and the reckoning is overdue.

San Francisco's Department of Technology has flagged duplicate image files as a growing liability across at least seven city data systems, with internal assessments pointing to storage bloat that runs into the terabytes across platforms maintained by agencies ranging from the Planning Department on Mission Street to the San Francisco Municipal Transportation Agency at 1 South Van Ness. The problem is unglamorous, largely invisible to the public, and getting expensive.
The timing matters because the city is midway through a digital infrastructure overhaul that Mayor Daniel Lurie's office began prioritizing in early 2026, partly in response to pressure from the Budget and Legislative Analyst's office to reduce redundant operating costs. Cloud storage is not cheap. Industry benchmarks from the Cloud Security Alliance put average enterprise cloud storage costs at roughly $0.023 per gigabyte per month, and when city departments are sitting on duplicate image assets numbering in the hundreds of thousands (building permit photos, transit signage scans, GIS map tiles), the arithmetic compounds quickly.
The San Francisco Planning Department's permit portal, accessible to contractors and residents across the city, has accumulated visual records since its digitization push began in earnest around 2018. By early 2026, city technology staff identified that a significant share of image files in the permit system had two or more identical copies stored under different file names or directory paths, a byproduct of multiple upload pathways and inconsistent file-handling protocols introduced during successive software migrations. The SFMTA's real-time passenger information system — which pulls from camera feeds and static route imagery at hundreds of stops from the Embarcadero to Balboa Park — faces a parallel issue, with redundant image assets cached across both legacy servers and newer cloud nodes.
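For readers curious how identical copies stored under different names or paths can be found at all, one common approach is to compare a cryptographic hash of each file's contents rather than its name. The sketch below is illustrative only and is not drawn from the city's tooling: the function names and the choice of SHA-256 are assumptions, and it catches only byte-for-byte duplicates, not resized or re-encoded near-duplicates.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file's bytes in chunks so large images are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group files under `root` by content hash.

    Returns a list of groups; each group is a list of paths whose
    contents are byte-for-byte identical, whatever their names are.
    """
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Calling `find_exact_duplicates("some/image/folder")` (a hypothetical path) would return one group per set of identical files, leaving the decision about which copy is authoritative to a human or a separate rule.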
The SF Digital Services team, based at City Hall and responsible for consumer-facing municipal web tools, began a systematic deduplication audit in March 2026, targeting the DataSF open data portal first. DataSF hosts more than 600 public datasets, and among those with visual components, the audit found duplication rates ranging from roughly 12 percent to, in some older environmental mapping datasets, above 30 percent of total image assets. Even at the low end, that represents thousands of files consuming storage and slowing query response times for researchers, journalists, and developers who rely on the portal.
Running a deduplication process is not free. The city's contract with its primary cloud vendor, the terms of which were not publicly detailed in documents reviewed for this article, includes compute costs for scanning and hashing operations. Technology vendors in the municipal space typically quote deduplication projects for mid-size government data environments, comparable in scope to San Francisco's footprint, at between $80,000 and $250,000 for a one-time remediation, depending on data volume and the degree of manual review required for ambiguous near-duplicate images. The break-even calculation, however, tends to favor action: a 20 percent storage reduction on a high six-figure annual cloud bill can recover a remediation cost at the low end of that range within a single budget cycle.
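The break-even arithmetic can be sketched in a few lines. This is a simplified illustration, not the city's model: the function name is hypothetical, the annual bill figures below are placeholders chosen from within the "six-figure" range, and it assumes the percentage reduction applies to the whole bill.

```python
def payback_months(annual_bill_usd, reduction_fraction, project_cost_usd):
    """Months of savings needed to recover a one-time remediation cost."""
    monthly_savings = annual_bill_usd * reduction_fraction / 12
    if monthly_savings <= 0:
        raise ValueError("savings must be positive to compute a payback period")
    return project_cost_usd / monthly_savings


# Hypothetical scenarios using the article's 20 percent reduction and
# the $80,000 low-end project quote.
high_bill = payback_months(900_000, 0.20, 80_000)  # about 5.3 months
low_bill = payback_months(100_000, 0.20, 80_000)   # about 48 months
```

Under these assumptions the payback period depends heavily on the size of the bill: the same project pays for itself in under half a year against a $900,000 annual bill, but takes about four years against a $100,000 one.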
San Francisco Public Library's digital collections team at the main branch on Larkin Street completed a smaller-scale version of this work in late 2024, clearing roughly 40,000 duplicate image files from its digitized historical photograph archive. The process took three staff members approximately six weeks and freed capacity that had previously required a supplemental storage allocation.
For city residents and the businesses that interact with planning and transit portals daily, the practical payoff is faster load times and more accurate search results — fewer cases where the same pothole photo or the same bus stop image surfaces three times in a single results page. For the agencies themselves, cleaner data pipelines reduce the risk of errors that arise when staff or automated systems act on stale duplicate records instead of authoritative ones.
The Department of Technology has not yet announced a citywide deduplication completion date, but the digital services audit is expected to produce a public report by the end of the third quarter of 2026. City departments that have not yet submitted their image inventories for review have been given a deadline of August 15 to comply with the data hygiene directive issued in May.
Published by The Daily San Francisco