Event clustering and deduplication in a news API
The biggest source of noise in a news feed isn't irrelevant stories — it's the same story, forty times. Event clustering collapses that duplication into one signal, and turns spread into a metric you can use.
The duplication problem
A single strike, ruling or rate decision is reported by dozens of outlets within minutes. Naive de-duplication on exact title or URL fails immediately, because every outlet writes its own headline and publishes under its own URL. You need to group by event, not by string.
cluster_id and cluster_size
Every article carries a cluster_id shared by all items covering the same event, plus a cluster_size for how many outlets have picked it up. De-duplicating is then trivial — keep one item per cluster:
curl -H "X-API-Key: YOUR_KEY" \
  "https://api.newsagentdata.com/v1/feed?min_score=6&days=1"
# then, client-side: keep the first item per cluster_id
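The client-side step might look like the following Python sketch. It assumes the feed response has already been parsed into a list of dicts that each carry a cluster_id key; the function name and the sample items are illustrative, not part of the API.

```python
from typing import Any, Iterable


def dedupe_by_cluster(items: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first item seen for each cluster_id, preserving feed order."""
    seen: set[Any] = set()
    unique: list[dict[str, Any]] = []
    for item in items:
        cid = item.get("cluster_id")
        if cid is None:
            # Assumption: an item without a cluster_id is kept, not dropped.
            unique.append(item)
            continue
        if cid in seen:
            continue
        seen.add(cid)
        unique.append(item)
    return unique


# Hypothetical sample items; only cluster_id and cluster_size come from the article.
feed = [
    {"title": "Central bank holds rates", "cluster_id": "c1", "cluster_size": 12},
    {"title": "Rates left unchanged", "cluster_id": "c1", "cluster_size": 12},
    {"title": "Port strike begins", "cluster_id": "c2", "cluster_size": 3},
]
unique_items = dedupe_by_cluster(feed)
```

Keeping the first item relies on whatever order the feed returns. If you would rather keep a specific representative per cluster, sort the items by your own preference before de-duplicating.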
Spread as a signal
Because cluster_size grows as a story propagates, it's a live importance signal in its own right. A cluster jumping from 3 to 40 outlets in an hour is a breaking event even before you read a word — useful for ranking, alerting and detecting what's going viral.
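One way to act on this is to compare cluster_size between two polls of the feed. The sketch below assumes you store a mapping of cluster_id to cluster_size at each poll; the function name and both thresholds are illustrative choices, not values the API defines.

```python
def spreading_clusters(
    previous: dict[str, int],
    current: dict[str, int],
    min_size: int = 20,
    min_ratio: float = 3.0,
) -> list[str]:
    """Return cluster_ids whose size is large now and grew sharply since the last poll."""
    flagged: list[str] = []
    for cid, size in current.items():
        # A cluster not seen in the previous poll is treated as size 1
        # so that the ratio is always defined.
        before = max(previous.get(cid, 0), 1)
        if size >= min_size and size / before >= min_ratio:
            flagged.append(cid)
    return flagged


# Hypothetical snapshots taken an hour apart.
an_hour_ago = {"c1": 3, "c2": 25}
now = {"c1": 40, "c2": 27, "c3": 5}
breaking = spreading_clusters(an_hour_ago, now)
```

Here c1 is flagged (3 to 40 outlets), c2 is large but barely moving, and c3 is new but still small. Requiring both a minimum size and a minimum growth ratio avoids flagging a cluster that merely went from 1 to 3 outlets.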
Compare framing within a cluster
Pull a whole cluster and you have the same event across state, independent and Western sources — contrast the political_lean distribution to see how camps frame one story. That's a far richer output than a single de-duplicated headline. See urgency and lean scoring.
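A rough sketch of that comparison, assuming political_lean arrives as a categorical label on each item. The label values used here are placeholders; check the schema for the real format, and bucket the values first if the field turns out to be numeric.

```python
from collections import Counter
from typing import Any, Iterable


def lean_distribution(items: Iterable[dict[str, Any]], cluster_id: str) -> dict[str, float]:
    """Share of each political_lean label among the items in one cluster."""
    counts = Counter(
        item.get("political_lean", "unknown")
        for item in items
        if item.get("cluster_id") == cluster_id
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lean: n / total for lean, n in counts.items()}


# Hypothetical items; the lean labels are placeholders.
cluster_items = [
    {"cluster_id": "c1", "political_lean": "left"},
    {"cluster_id": "c1", "political_lean": "left"},
    {"cluster_id": "c1", "political_lean": "center"},
    {"cluster_id": "c1", "political_lean": "right"},
    {"cluster_id": "c2", "political_lean": "right"},
]
shares = lean_distribution(cluster_items, "c1")
```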
Alert once, not forty times
In any alerting pipeline, group incoming items by cluster_id before you notify: send one notification per event, with cluster_size attached for context. Combine with urgency thresholds from the breaking-news guide. Full schema in the docs.
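A minimal sketch of that grouping, assuming items arrive in batches as dicts with title, cluster_id and cluster_size keys. The ClusterAlerter class and the notify callback are hypothetical; in practice the set of already-alerted clusters would probably live in a shared store rather than in memory.

```python
from typing import Any, Callable, Iterable


class ClusterAlerter:
    """Send at most one notification per cluster_id across all batches."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self._notify = notify
        self._alerted: set[Any] = set()

    def process(self, items: Iterable[dict[str, Any]]) -> int:
        """Notify for clusters not seen before; return how many alerts were sent."""
        sent = 0
        for item in items:
            cid = item.get("cluster_id")
            if cid is None or cid in self._alerted:
                continue
            self._alerted.add(cid)
            title = item.get("title", "(untitled)")
            size = item.get("cluster_size", 1)
            self._notify(f"{title} [{size} outlets]")
            sent += 1
        return sent


# Hypothetical usage: collect messages in a list instead of sending them anywhere.
messages: list[str] = []
alerter = ClusterAlerter(messages.append)
batch = [
    {"title": "Port strike begins", "cluster_id": "c1", "cluster_size": 3},
    {"title": "Dockworkers walk out", "cluster_id": "c1", "cluster_size": 3},
    {"title": "Court blocks merger", "cluster_id": "c2", "cluster_size": 8},
]
first_pass = alerter.process(batch)   # one alert per new cluster
second_pass = alerter.process(batch)  # nothing new, so no alerts
```

Unlike a per-request de-duplication, the alerter remembers clusters between batches, so a story that keeps appearing in later polls does not trigger a second notification.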