San Francisco's Office of the City Clerk is sitting on an estimated 47,000 duplicate image files spread across its digital records archive, a backlog that has grown steadily since the city's mass digitization push began in earnest in 2019. The redundant files — scanned permits, planning documents, meeting minutes, and ordinance attachments — are consuming roughly 2.3 terabytes of storage on servers maintained through the city's Department of Technology on Grove Street, according to a records management review completed this past March.
The timing matters. San Francisco is midway through a $14.2 million modernization contract with Tyler Technologies to overhaul how municipal departments store and retrieve public records. Bloated archives stuffed with duplicate images directly undermine that investment by driving up indexing times, slowing search results for residents and attorneys pulling Planning Commission filings, and making it harder for city IT staff to verify which version of a scanned document is authoritative. With the Tyler system scheduled to go live for the Planning Department and the Department of Building Inspection by January 2027, the city has roughly six months to clean house.
The Scale of the Problem, By the Numbers
Duplicate image replacement, the process of identifying redundant scans, designating one canonical file, and purging or redirecting the rest, sounds unglamorous. The data behind it is not. A standard duplicate image in the city's archive runs between 800 kilobytes and 4 megabytes depending on scan resolution. At the low end, 47,000 duplicates represent roughly 37 gigabytes of dead weight. At the high end, the figure is closer to 188 gigabytes. City IT staff pegged the fully loaded storage cost at approximately $0.023 per gigabyte per month on the city's hybrid cloud arrangement, a small per-unit figure that adds up across tens of thousands of files over years.
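For readers who want to check the arithmetic, the short Python sketch below reproduces the low and high estimates from the figures quoted above and converts them to a yearly storage bill. It assumes decimal units (1 gigabyte = 1,000 megabytes) and a flat rate with no tiering or retrieval fees; the function names are illustrative and do not come from the city's review.

```python
# Figures quoted in the article
DUPLICATE_COUNT = 47_000
LOW_FILE_KB = 800           # smallest typical duplicate, in kilobytes
HIGH_FILE_MB = 4            # largest typical duplicate, in megabytes
COST_PER_GB_MONTH = 0.023   # dollars per gigabyte per month


def total_gb(count, size_per_file, units_per_gb):
    """Total size in gigabytes (assumption: decimal units)."""
    return count * size_per_file / units_per_gb


def annual_cost(gigabytes, rate_per_gb_month=COST_PER_GB_MONTH):
    """Yearly cost in dollars at a flat monthly per-gigabyte rate (assumed)."""
    return gigabytes * rate_per_gb_month * 12


low_gb = total_gb(DUPLICATE_COUNT, LOW_FILE_KB, 1_000_000)   # kilobytes -> GB
high_gb = total_gb(DUPLICATE_COUNT, HIGH_FILE_MB, 1_000)     # megabytes -> GB

print(f"Low estimate:  {low_gb:.1f} GB, ${annual_cost(low_gb):.2f} per year")
print(f"High estimate: {high_gb:.1f} GB, ${annual_cost(high_gb):.2f} per year")
```

Under those assumptions the range works out to about 37.6 to 188 gigabytes, or roughly $10 to $52 a year in raw storage charges at the quoted rate.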
Storage cost alone does not tell the full story. The more significant drag is on retrieval latency. When residents or attorneys submit California Public Records Act requests at the City Hall Clerk's counter on Dr. Carlton B. Goodlett Place, staff must manually confirm which version of a flagged document is the correct one before releasing it. That verification step, per the March review, adds an average of 11 minutes per request where duplicates are involved. The Clerk's Office logged 6,800 CPRA requests in fiscal year 2024-25. If even 20 percent of those touched a duplicate-flagged file, that is 1,360 requests at 11 minutes apiece, or roughly 250 staff-hours a year spent on a problem that better data hygiene would largely eliminate.
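The staff-time estimate can be reproduced the same way. The sketch below uses the request count and per-request delay cited above; the 20 percent share is the article's hypothetical, not a measured figure, and the function name is illustrative.

```python
def verification_hours(requests_per_year, share_with_duplicates, minutes_per_request):
    """Staff-hours per year spent confirming which duplicate is authoritative."""
    affected_requests = requests_per_year * share_with_duplicates
    return affected_requests * minutes_per_request / 60


# 6,800 CPRA requests, 20 percent assumed to touch a duplicate, 11 minutes each
hours = verification_hours(6_800, 0.20, 11)
print(f"{hours:.0f} staff-hours per year")
```

Changing the share shows how sensitive the total is: at 10 percent the same formula gives about 125 hours, and at 40 percent nearly 500.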
The San Francisco Public Library's San Francisco History Center at the Main Branch on Larkin Street faces a parallel version of the issue. Digitized photograph collections uploaded to the Online Archive of California have accumulated duplicate entries from multiple scanning campaigns — one in 2014, another in 2018, and a third ongoing project tied to a $380,000 National Endowment for the Humanities grant awarded in 2023. Library staff are currently using an open-source perceptual hashing tool to flag near-duplicate images before the NEH grant period closes in September 2026.
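Perceptual hashing works by reducing an image to a short fingerprint that stays nearly the same when the picture is rescanned, resized, or slightly lightened or darkened. The library has not named the tool it uses, so the Python sketch below is only an illustration of one common variant, a difference hash, run on a synthetic grayscale grid rather than a real photograph. The helper names and the distance threshold of 5 are assumptions chosen for the example.

```python
def shrink(pixels, width, height):
    """Nearest-neighbour downsample of a 2D grayscale grid (a list of rows)."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]


def dhash(pixels, hash_size=8):
    """Difference hash: one bit per left-versus-right comparison on a small grid."""
    small = shrink(pixels, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


# Synthetic 72x64 "scan": brightness falls from left to right
scan = [[255 - 3 * c for c in range(72)] for _ in range(64)]
# A second scan of the same image, uniformly darker
darker = [[max(0, v - 10) for v in row] for row in scan]
# A different image: the mirror of the first
mirrored = [list(reversed(row)) for row in scan]

THRESHOLD = 5  # assumed cut-off for "near-duplicate"
print(hamming(dhash(scan), dhash(darker)) <= THRESHOLD)    # same picture
print(hamming(dhash(scan), dhash(mirrored)) <= THRESHOLD)  # different picture
```

In this toy case the darker copy produces an identical fingerprint while the mirrored image differs in all 64 bits. Real collections fall between those extremes, which is why flagged pairs typically still need a human look before anything is removed.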
What Happens When the Cleanup Stalls
The consequences of inaction compound fast. Duplicate records create divergent metadata trails (two files tagged with different upload dates, different staff initials, different compression settings) that can produce conflicting search results in public-facing portals. For SFGovTV recordings of Board of Supervisors hearings posted on the city's archive portal, even a 12-hour delay in resolving which file is correct can leave a video link returning a broken thumbnail, effectively removing a public meeting from accessible records until a technician manually repairs the index.
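One way to keep links from breaking during a cleanup is to decide on a canonical copy first and redirect every duplicate to it, rather than deleting files outright. The Python sketch below shows that idea for byte-identical copies only. The record fields and the "earliest upload wins" rule are hypothetical and are not drawn from the city's systems, and scans saved with different compression settings would not match under this approach.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScanRecord:  # hypothetical record shape, not the city's schema
    file_id: str
    content: bytes
    uploaded: date


def plan_redirects(records):
    """Map each redundant file_id to the canonical file_id with identical bytes."""
    groups = defaultdict(list)
    for rec in records:
        groups[hashlib.sha256(rec.content).hexdigest()].append(rec)

    redirects = {}
    for group in groups.values():
        # Assumed rule: the earliest upload is canonical; ties broken by file_id
        canonical = min(group, key=lambda r: (r.uploaded, r.file_id))
        for rec in group:
            if rec.file_id != canonical.file_id:
                redirects[rec.file_id] = canonical.file_id
    return redirects


records = [
    ScanRecord("A", b"permit scan", date(2019, 5, 1)),
    ScanRecord("B", b"permit scan", date(2021, 3, 2)),
    ScanRecord("C", b"meeting minutes", date(2020, 1, 1)),
]
print(plan_redirects(records))
```

Here file B is pointed at file A, and file C, which has no twin, is left alone. Because the output is a plan rather than a deletion, an index can be updated and checked before any file is purged.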
For residents navigating housing permit appeals in the Tenderloin or Mission District, where SRO owners and tenants both rely on building inspection records, the stakes of an inaccessible or mislabeled file are not abstract. A permit history pulled from a duplicate-contaminated database can show a gap in inspection records that does not actually exist — or worse, surface an outdated document as the most recent filing.
The Department of Technology expects to complete Phase 1 of its deduplication sweep, covering Building Inspection and Planning records, by October 31, 2026. Residents who need city records in the meantime can submit CPRA requests directly through the City Clerk's online portal or in person at City Hall, Room 168. They should flag any documents they receive with mismatched dates or reference numbers, which may indicate a duplicate-related retrieval error.