San Francisco's Department of Technology has been quietly wrestling with a sprawling duplicate-image problem embedded across at least a dozen city databases, a situation that took shape over more than a decade of fragmented digitization drives, overlapping grant programs, and agency silos that rarely talked to one another.
The problem matters now because the city is in the middle of the most aggressive push in its history to digitize public records, from Muni maintenance logs to Planning Department permit files. Duplicates are eating into cloud storage budgets and undermining the accuracy of the public-facing transparency portals that San Franciscans use to track everything from building permits in the Mission District to pothole repair requests in the Sunset.
How Decades of Disconnected Systems Created the Mess
The roots of the problem stretch back to the early 2010s, when individual departments began scanning paper records independently. The San Francisco Public Library's digitization program, the Planning Department's permit archive project, and the Department of Public Works' street documentation initiative all launched on separate timelines, used different file-naming conventions, and stored assets on incompatible platforms. By the time the city consolidated much of its infrastructure under a shared cloud environment — a process that accelerated after a 2019 Department of Technology audit flagged inter-agency redundancy — tens of thousands of image files had already been uploaded multiple times under different identifiers.
The San Francisco Public Utilities Commission's infrastructure documentation library, housed partially on servers at the Civic Center campus, became one of the clearest examples of the accumulation. Engineers uploading site photos of the Hetch Hetchy water system over successive years frequently had no way to check whether a nearly identical image already existed in the archive. The result was duplicate sets numbering in the thousands for certain infrastructure corridors alone.
The tech boom that reshaped SoMa and Mid-Market from 2012 onward made the problem structurally worse before it made it better. Startups pitching municipal contracts often provided their own proprietary digital asset management tools as part of pilot deals, seeding yet another layer of incompatible image repositories. When several of those vendors folded during the 2022 and 2023 tech-sector contraction — layoffs that hit companies up and down the Market Street corridor — the city was left holding orphaned databases with no clean migration path.
The Push for Automated Detection and Replacement
The Department of Technology began piloting automated duplicate-detection software in the spring of 2025, initially applied to the city's 311 service-request image database, which had grown to include millions of photographs of street conditions, encampments, and infrastructure damage submitted by residents from neighborhoods including the Tenderloin, Bayview, and Chinatown. Early runs of the detection algorithm identified redundancy rates that internal project documents, reviewed by The Daily San Francisco, described as significant enough to warrant a department-wide rollout — though the city has not released specific figures publicly.
The practical cost is real: cloud storage is not free. San Francisco's city IT budget for fiscal year 2025-26 runs to hundreds of millions of dollars, with a material share allocated to data storage infrastructure. Every duplicate image the city retains adds to a storage bill that recurs for as long as the file is kept. Deduplication and replacement, meaning swapping redundant files for a single canonical version, is the technical answer, but it requires human review to avoid deleting records that look identical but document legally distinct events.
The San Francisco Ethics Commission and the City Attorney's Office have both flagged the records-retention implications. Deleting what turns out to be a substantively unique image, even if it looks like a copy, could create gaps in the evidentiary record for ongoing litigation or public records requests.
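For readers curious what a review-first deduplication pass might look like, here is a minimal sketch in Python. It is an illustration only, not the city's software, which has not been described publicly. The function names and the "needs_human_review" status label are hypothetical. The sketch catches only byte-for-byte identical files by comparing SHA-256 content hashes; near-duplicates such as resized or re-encoded photos would need a different technique that is not shown here. In keeping with the retention concerns above, it deletes nothing and only queues candidates for a person to check.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large images are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(root: Path) -> list[list[Path]]:
    """Group files under root whose bytes are identical, whatever their names."""
    by_hash = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_hash[sha256_of_file(path)].append(path)
    # Only hashes shared by two or more files indicate duplicates.
    return [paths for paths in by_hash.values() if len(paths) > 1]


def build_review_queue(groups: list[list[Path]]) -> list[dict]:
    """Propose a canonical copy per group and flag the rest for human review.

    Hypothetical policy: the first path in sorted order is the proposed
    canonical version. Nothing is deleted or replaced here.
    """
    queue = []
    for paths in groups:
        canonical, *candidates = paths
        queue.append(
            {
                "canonical": canonical,
                "candidates": candidates,
                "status": "needs_human_review",
            }
        )
    return queue
```

Pointing find_duplicate_groups at a folder of scans and passing the result to build_review_queue yields a list a records officer could work through. Two files with identical bytes may still document separate events, so the sign-off step is left outside the code.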
For residents and watchdog groups who rely on DataSF, the city's open data portal, which aggregates records from dozens of agencies, the cleanup has practical implications. A cleaner, deduplicated image database means faster load times, more reliable search results, and a smaller margin for error when journalists or attorneys are pulling visual documentation. The Department of Technology has indicated it expects to complete the first full audit cycle of affected databases before the end of the current fiscal year, with replacement protocols to follow in phases through 2027.