San Francisco's Department of Technology has been grappling for the better part of three years with a problem that sounds almost trivially bureaucratic: thousands of duplicate images embedded across municipal databases, from planning permit scans filed at 49 South Van Ness Avenue to health inspection records maintained by the Department of Public Health's Civic Center offices. The redundancy is not trivial. Storage costs, licensing fees for document-management software, and staff hours spent manually sorting misfiled images have added up to a recurring line item that city budget analysts have flagged in successive fiscal reviews.
The timing matters because the city is midway through an ambitious digitization push tied to its Open Data program, administered through DataSF, the municipal unit housed within the Department of Technology. Officials have staked significant political capital, along with federal grant money, on making San Francisco's public records faster to search and easier for residents to use. Duplicate images clog those systems, return false results in database queries, and force IT contractors to re-index records that should have been clean from the start.
How the Backlog Built Up
The roots of the problem run back to roughly 2019 and 2020, when multiple city departments independently migrated paper records to digital formats under time pressure, often using different scanning vendors with incompatible file-naming conventions. The Planning Department, which processes thousands of permit applications annually for projects across neighborhoods from the Mission District to the Sunset, was scanning documents into one system while the City Attorney's office was ingesting related legal filings into a separate platform. When those repositories were later merged under a unified content-management initiative, duplicate images multiplied rather than consolidated.
The SF Digital Services team, based at City Hall, formally identified the scope of the issue in a 2023 internal audit. At that point, preliminary estimates suggested that between 15 and 20 percent of images stored in certain legacy databases were exact or near-exact duplicates, though those figures were internal working estimates rather than published findings. The audit recommended a phased deduplication protocol, but budget cycles and staff turnover slowed implementation, a delay compounded by the same tech-sector contraction that hammered private employers in SoMa and the Financial District between 2022 and 2024.
The city also leaned heavily on outside contractors during the initial digitization push. Several of those contracts, administered through the Office of Contract Administration, did not include deduplication as a deliverable, meaning vendors were paid to ingest files without ensuring uniqueness. By the time city staff noticed the problem at scale, the contracts had closed and responsibility for cleanup had reverted to in-house teams already stretched thin.
What Deduplication Actually Requires
Fixing the problem is not simply a matter of running a script. Standard hash-based deduplication tools match only files whose contents are identical, so images of the same document scanned at different resolutions or with slightly different timestamps are not recognized as duplicates. Human review is therefore still required for a significant subset of files. The city's Main Library on Larkin Street, which maintains its own digital archive of San Francisco historical photographs, dealt with a comparable challenge when it digitized its collection beginning in 2017. It developed a hybrid automated-plus-manual workflow that took roughly 18 months to complete, and that model has since been cited internally as a template for the broader municipal effort.
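To illustrate why exact hashing misses rescanned documents, here is a minimal sketch in Python. It is not the city's or the library's actual tooling. It compares a cryptographic hash (SHA-256) with a simple "average hash," one common perceptual-hashing technique, on two synthetic grayscale "scans" of the same page at different resolutions. The function names, the synthetic page, and the review threshold of 5 bits are all assumptions made for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Exact hash: changes completely if even one byte differs."""
    return hashlib.sha256(data).hexdigest()


def average_hash(pixels, hash_size=8):
    """Simple perceptual hash over a 2D list of grayscale values (0-255).

    The image is reduced to hash_size x hash_size cells by block averaging,
    and each cell contributes one bit: 1 if brighter than the overall mean.
    Assumes the image is at least hash_size pixels in each dimension.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(hash_size):
        y0, y1 = row * height // hash_size, (row + 1) * height // hash_size
        for col in range(hash_size):
            x0, x1 = col * width // hash_size, (col + 1) * width // hash_size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def make_scan(size):
    """Hypothetical stand-in for a scan: a dark block on a light page."""
    return [
        [40 if (size // 4 <= x < size // 2 and size // 4 <= y < 3 * size // 4) else 220
         for x in range(size)]
        for y in range(size)
    ]


def to_bytes(pixels) -> bytes:
    return bytes(value for row in pixels for value in row)


low = make_scan(64)    # same page, lower resolution
high = make_scan(128)  # same page, higher resolution
inverted = [[260 - value for value in row] for row in low]  # a different image

exact_match = sha256_hex(to_bytes(low)) == sha256_hex(to_bytes(high))
distance = hamming_distance(average_hash(low), average_hash(high))

# Illustrative threshold (an assumption): small distances go to human review.
REVIEW_THRESHOLD = 5
needs_review = distance <= REVIEW_THRESHOLD

print("Exact hashes match:", exact_match)
print("Perceptual distance:", distance)
print("Flag as likely duplicate:", needs_review)
```

In this toy case the exact hashes differ while the perceptual hashes agree, which is the gap a hybrid automated-plus-manual workflow has to cover. Real scans would first need to be decoded with an imaging library, and any threshold would have to be tuned against reviewed samples.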
The Department of Technology is expected to issue a formal request for proposals later in 2026 for a deduplication and image-management overhaul. Based on comparable municipal projects in cities including Chicago and Denver, the contract is likely to run into the low seven figures over a multi-year term. DataSF has already begun tagging affected datasets on its public portal so that developers and researchers pulling city records are at least aware of the duplication risk while cleanup proceeds.
For residents trying to pull permit histories for properties in neighborhoods like the Tenderloin or Dogpatch, the practical advice for now is to cross-reference records against the public portal of the Planning Department, which is based at 49 South Van Ness, and to treat any image-based document returned in duplicate as a known system artifact rather than a substantive discrepancy. The city says it expects the first phase of deduplication, covering Planning and DPH records, to be complete before the end of the current fiscal year, which closes June 30, 2027.