Distill.

Why your CRM dedup isn't dedup

HubSpot's native dedup is a rule-based system. It runs over your contacts looking for exact-match collisions on email or first+last+domain. When those rules hit, it surfaces a clean candidate cluster. When they don't, and the interesting duplicates almost always don't, it shrugs.
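To make that limitation concrete, here is a minimal sketch of exact-match rules of the kind described above. It is not HubSpot's implementation: the record fields, the sample contacts, and the helpers `exact_match_keys` and `exact_match_clusters` are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

def exact_match_keys(rec):
    """Return the exact-match keys for one contact (hypothetical field names)."""
    email = rec["email"].strip().lower()
    domain = email.split("@")[-1]
    return [
        ("email", email),
        ("name+domain", rec["first"].lower(), rec["last"].lower(), domain),
    ]

def exact_match_clusters(records):
    """Group record ids that share at least one exact-match key."""
    buckets = defaultdict(set)
    for rec in records:
        for key in exact_match_keys(rec):
            buckets[key].add(rec["id"])
    return [ids for ids in buckets.values() if len(ids) > 1]

# Invented sample data: three records for what is plausibly one person.
contacts = [
    {"id": 1, "first": "Katherine", "last": "Oyelaran", "email": "k.oyelaran@acme.com"},
    {"id": 2, "first": "Kate", "last": "Oyelaran", "email": "kate@acme.com"},
    {"id": 3, "first": "Katherine", "last": "Oyelaran", "email": "katherine.oyelaran@acme.com"},
]

clusters = exact_match_clusters(contacts)
```

In this sketch records 1 and 3 collide on first+last+domain, so they surface as a cluster. Record 2 uses a nickname and a different email local-part, so no rule fires and it is silently left out.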

Distill is the answer to the question: what if dedup understood that two records belonged to the same person, even when their email local-parts were different, their phone numbers were formatted differently, and their company names had different punctuation?

This is what the pipeline looks like:

  1. Blocking — cheap predicate groupings (same email domain, normalized phone, first-3-chars-of-last-name). The whole point is to avoid an O(N²) all-pairs comparison.
  2. Embeddings — OpenAI text-embedding-3-small on "first last | email | company | phone". 1536-dimensional vectors stored in pgvector.
  3. Pairwise scoring — cosine similarity on embeddings, plus rapidfuzz edit-distance on (name, email local-part, company), weighted.
  4. HDBSCAN — density-based clustering over the candidate pairs. Robust to noise; doesn't require you to pre-pick a cluster count.
  5. Human review — side-by-side cluster card with column-wise diff, association rollup, keyboard shortcuts. Rejected pairs are remembered.
  6. Reversible write — HubSpot merge API, full pre-merge snapshot persisted. Undo recreates the merged records and surfaces the association-recreation caveat.
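Steps 1, 3, and 4 can be sketched in a few dozen lines of standard-library Python. This is a rough illustration under loud assumptions, not Distill's code: `difflib.SequenceMatcher` stands in for rapidfuzz, the embedding term is omitted entirely, and a plain union-find over above-threshold pairs stands in for HDBSCAN. The weights, the 0.7 threshold, the field names, and the sample contacts are all invented for the example.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def normalize_phone(phone):
    """Keep digits only, then the last 10, so formatting differences vanish."""
    return re.sub(r"\D", "", phone)[-10:]

def blocking_keys(rec):
    """Cheap predicates; records sharing any key become candidate pairs."""
    keys = [("domain", rec["email"].lower().split("@")[-1]),
            ("last3", rec["last"].lower()[:3])]
    phone = normalize_phone(rec["phone"])
    if phone:
        keys.append(("phone", phone))
    return keys

def candidate_pairs(records):
    """Step 1: only compare records that land in the same block."""
    blocks = defaultdict(list)
    for rec in records:
        for key in blocking_keys(rec):
            blocks[key].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def fuzzy(a, b):
    """Similarity in [0, 1]; a stdlib stand-in for a rapidfuzz ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Assumed weights; the source only says the signals are "weighted".
WEIGHTS = {"name": 0.4, "local": 0.3, "company": 0.3}

def pair_score(a, b):
    """Step 3 without the embedding term: weighted fuzzy similarity."""
    name = fuzzy(f"{a['first']} {a['last']}", f"{b['first']} {b['last']}")
    local = fuzzy(a["email"].split("@")[0], b["email"].split("@")[0])
    company = fuzzy(a["company"], b["company"])
    return (WEIGHTS["name"] * name + WEIGHTS["local"] * local
            + WEIGHTS["company"] * company)

def cluster(records, threshold=0.7):
    """Step 4 stand-in: connected components over pairs above the threshold."""
    by_id = {r["id"]: r for r in records}
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in candidate_pairs(records):
        if pair_score(by_id[i], by_id[j]) >= threshold:
            parent[find(i)] = find(j)
    groups = defaultdict(set)
    for rid in parent:
        groups[find(rid)].add(rid)
    return [g for g in groups.values() if len(g) > 1]

contacts = [
    {"id": 1, "first": "Katherine", "last": "Oyelaran", "email": "k.oyelaran@acme.com",
     "company": "Acme, Inc.", "phone": "(415) 555-0134"},
    {"id": 2, "first": "Kate", "last": "Oyelaran", "email": "kate.oyelaran@acme.com",
     "company": "ACME Inc", "phone": "415.555.0134"},
    {"id": 3, "first": "Marcus", "last": "Feld", "email": "marcus@acme.com",
     "company": "Acme Inc", "phone": "+1 212 555 0199"},
]

duplicates = cluster(contacts)
```

With this sample data all three contacts share the acme.com block, so each pair gets scored, but only records 1 and 2 clear the threshold. Connected components will chain weak links together in a way a density-based method is designed to resist, which is one reason the stand-in should not be read as equivalent to HDBSCAN.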