San Francisco's Department of Technology confirmed last month that the city's consolidated open-data portal, DataSF, had accumulated more than 340,000 duplicate image files across municipal databases. The bloat was slowing permit processing at the Planning Department on South Van Ness Avenue and creating redundancy headaches for agencies from the Public Utilities Commission to the Municipal Transportation Agency. The city has begun a remediation program that its IT officials say could recover roughly 12 terabytes of server capacity by the end of fiscal year 2026.
The timing matters. San Francisco, like dozens of other major cities, digitized enormous volumes of paper records during the pandemic years, scanning building permits, zoning maps, infrastructure inspection photos and court documents in bulk. The speed of that conversion — necessary at the time — left behind catalogues riddled with near-duplicate files: multiple scans of the same page, variant-resolution copies of the same street photograph, and archived images re-uploaded each time a department switched software vendors. The problem is not aesthetic. Duplicate image records create legal ambiguity in property disputes, complicate public records requests under the California Public Records Act, and add processing time to already strained housing approval workflows.
How SF's Approach Differs From London and Seoul
The city's current effort, managed through a contract with the Department of Technology and coordinated with the City Administrator's Office at City Hall, uses perceptual hashing, a method that generates a fingerprint for each image based on visual content rather than file name or metadata. When two fingerprints are similar enough to cross a defined threshold, a human reviewer must confirm the match before either file is deleted. That two-step approach separates San Francisco from London, where the Greater London Authority ran an automated deduplication sweep of its planning image archives in 2024 and later acknowledged that several historic Southwark streetscape photos had been incorrectly purged. Seoul's municipal government, which undertook a comparable exercise across its 25 autonomous districts in 2023, reported faster throughput but faced criticism from archivists over the loss of contextual variants: photos taken minutes apart that documented changing conditions at a construction site.
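For readers curious how the two-step check works in principle, the sketch below shows one common form of perceptual hashing, the "average hash," followed by a review queue instead of automatic deletion. It is an illustration only: the city has not said which hashing algorithm, fingerprint size or threshold it uses, so the 8-by-8 grid, the `max_distance` value and every function and file name here are assumptions made for the example.

```python
from itertools import combinations

HASH_SIZE = 8  # assumed: 8x8 grid, giving a 64-bit fingerprint


def shrink(pixels, size=HASH_SIZE):
    """Downsample a grayscale image (rows of 0-255 ints) to size x size by block averaging."""
    height, width = len(pixels), len(pixels[0])
    small = []
    for r in range(size):
        r0 = r * height // size
        r1 = max((r + 1) * height // size, r0 + 1)
        row = []
        for c in range(size):
            c0 = c * width // size
            c1 = max((c + 1) * width // size, c0 + 1)
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels):
    """Fingerprint based on visual content: one bit per cell, set if the cell is brighter than the mean."""
    flat = [v for row in shrink(pixels) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two fingerprints (0 means identical)."""
    return bin(a ^ b).count("1")


def review_queue(images, max_distance=5):
    """Return candidate duplicate pairs for a human reviewer; nothing is deleted here.

    max_distance is a hypothetical similarity threshold chosen for illustration.
    """
    hashes = {name: average_hash(px) for name, px in images.items()}
    queue = []
    for a, b in combinations(sorted(hashes), 2):
        distance = hamming(hashes[a], hashes[b])
        if distance <= max_distance:
            queue.append((a, b, distance))
    return queue


# Hypothetical sample data: a scan, a higher-resolution copy of it, and an unrelated photo.
scan = [[x * 16 for x in range(16)] for _ in range(16)]
scan_hires = [[(x // 2) * 16 for x in range(32)] for _ in range(32)]
photo = [[y * 16 for _ in range(16)] for y in range(16)]

for pair in review_queue({
    "permit_scan.png": scan,
    "permit_scan_hires.png": scan_hires,
    "site_photo.png": photo,
}):
    print(pair)
```

In this toy data set only the scan and its higher-resolution copy land in the queue, because their fingerprints are identical despite the different file names and pixel dimensions. The threshold is where the archivists' concern comes in: set it loosely and photos taken minutes apart at a construction site could be flagged as duplicates, which is why a reviewer, not the script, makes the final call in a two-step design.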
San Francisco's hybrid model is slower. City IT staff estimate the full audit of roughly 2.1 million municipal image assets will take until March 2027 to complete. But advocates for open-records access say the caution is warranted. The San Francisco Public Press, which regularly files CPRA requests for planning and building inspection imagery, has documented cases where duplicates in the system returned conflicting metadata — same image, different stated capture dates — that complicated reporting on Mission District construction disputes.
The San Francisco Public Library's San Francisco History Center at the Civic Center branch is separately managing its own digitized photographic collection, which runs to more than 800,000 images. Librarians there have used a different tool stack, relying on open-source software maintained by a coalition of North American public libraries, and say the duplicate rate in their collection ran to about 8 percent before a 2025 cleanup effort — lower than the roughly 16 percent rate found in the city's operational permit databases, according to a DataSF program update published in May 2026.
What Comes Next for Residents and Businesses
For property owners filing permits through the San Francisco Planning Department's online portal, the practical effect should eventually be faster document retrieval and fewer instances of applications flagged for missing attachments that actually exist somewhere in the archive under a duplicate file name. The Planning Department has said the first phase of deduplication — covering permit images filed between 2018 and 2022 — is expected to wrap by September 2026.
Businesses on the Mid-Market corridor and in the Dogpatch neighborhood, where building permit activity has been heavy amid ongoing commercial-to-residential conversion projects, stand to see the most immediate administrative benefit once that phase closes. The city is also exploring whether the hashing methodology can be extended to scanned PDF documents, not just images — a step that would address an adjacent and arguably larger redundancy problem in the permit record system.
For now, anyone who has filed a building or planning application in San Francisco since 2018 and is waiting on document confirmation should check the DataSF portal directly. The city's 311 service line remains the fastest route for flagging discrepancies in specific records while the audit is ongoing.