Event clustering and deduplication in a news API

Guide · July 1, 2026 · 3 min read

The biggest source of noise in a news feed isn't irrelevant stories — it's the same story, forty times. Event clustering collapses that duplication into one signal, and turns spread into a metric you can use.

The duplication problem

A single strike, ruling or rate decision is reported by dozens of outlets within minutes. Naive de-duplication on exact title or URL fails immediately: every outlet publishes under its own URL and phrases the headline differently. You need to group by event, not by string.

cluster_id and cluster_size

Every article carries a cluster_id shared by all items covering the same event, plus a cluster_size for how many outlets have picked it up. De-duplicating is then trivial — keep one item per cluster:

curl -H "X-API-Key: YOUR_KEY" \
  "https://api.newsagentdata.com/v1/feed?min_score=6&days=1"
# then, client-side: keep the first item per cluster_id
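
The client-side step can be sketched in a few lines of Python. This is a minimal sketch, assuming the feed response has already been parsed into a list of dicts that each carry a cluster_id. The helper name dedupe_by_cluster and the sample items are hypothetical, and the real response shape may differ, so check the schema in the docs.

```python
from typing import Any, Iterable


def dedupe_by_cluster(items: Iterable[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep the first item seen for each cluster_id, preserving feed order."""
    seen: set[Any] = set()
    unique: list[dict[str, Any]] = []
    for item in items:
        cluster_id = item.get("cluster_id")
        if cluster_id is None:
            # Assumption: an item without a cluster is kept rather than dropped.
            unique.append(item)
            continue
        if cluster_id not in seen:
            seen.add(cluster_id)
            unique.append(item)
    return unique


# Hypothetical items; only cluster_id and cluster_size are named in this guide.
feed = [
    {"title": "Central bank holds rates", "cluster_id": "c1", "cluster_size": 40},
    {"title": "Rates unchanged, bank says", "cluster_id": "c1", "cluster_size": 40},
    {"title": "Court issues ruling", "cluster_id": "c2", "cluster_size": 3},
]
deduped = dedupe_by_cluster(feed)
```

Keeping the first item relies on the order the feed returns. If you would rather keep a particular outlet or the longest article per cluster, sort the items before passing them in.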

Spread as a signal

Because cluster_size grows as a story propagates, it's a live importance signal in its own right. A cluster jumping from 3 to 40 outlets in an hour is almost certainly a breaking event, and you can tell before you read a word. That makes it useful for ranking, alerting and detecting what's going viral.
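
One way to turn spread into a number is to compare cluster_size between two polls of the feed. The sketch below assumes you have stored a mapping of cluster_id to cluster_size from an earlier poll. The function name fastest_growing and the min_growth threshold are illustrative choices, not part of the API.

```python
def fastest_growing(
    previous: dict[str, int],
    current: dict[str, int],
    min_growth: int = 10,
) -> list[tuple[str, int, int]]:
    """Return (cluster_id, old_size, new_size) for clusters that gained at
    least min_growth outlets between two polls, largest gain first."""
    growing = []
    for cluster_id, new_size in current.items():
        # A cluster that was absent from the earlier poll counts as size 0.
        old_size = previous.get(cluster_id, 0)
        if new_size - old_size >= min_growth:
            growing.append((cluster_id, old_size, new_size))
    growing.sort(key=lambda row: row[2] - row[1], reverse=True)
    return growing


# Hypothetical snapshots of cluster_size taken an hour apart.
hour_ago = {"c1": 3, "c2": 12}
now = {"c1": 40, "c2": 14, "c3": 5}
surging = fastest_growing(hour_ago, now)
```

Here only c1 clears the threshold, having gone from 3 to 40 outlets. An absolute gain is the simplest measure. A ratio or a per-minute rate may suit feeds where cluster sizes vary widely.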

Compare framing within a cluster

Pull a whole cluster and you have the same event across state, independent and Western sources. Contrast the political_lean distribution across those items to see how each camp frames one story. That's a far richer output than a single de-duplicated headline. See the urgency and lean scoring guide.
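
As a rough illustration, the lean distribution for one cluster can be computed with a counter. This sketch assumes political_lean is a categorical label on each item. The label values used here ("left", "center", "right") and the helper name lean_distribution are hypothetical, so check the docs for the actual field type.

```python
from collections import Counter
from typing import Any


def lean_distribution(items: list[dict[str, Any]], cluster_id: str) -> dict[str, float]:
    """Share of each political_lean label among the items in one cluster."""
    counts = Counter(
        item.get("political_lean", "unknown")
        for item in items
        if item.get("cluster_id") == cluster_id
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lean: count / total for lean, count in counts.items()}


# Hypothetical items; the lean labels are placeholders, not documented values.
items = [
    {"cluster_id": "c1", "political_lean": "left"},
    {"cluster_id": "c1", "political_lean": "left"},
    {"cluster_id": "c1", "political_lean": "center"},
    {"cluster_id": "c1", "political_lean": "right"},
    {"cluster_id": "c2", "political_lean": "right"},
]
shares = lean_distribution(items, "c1")
```

If political_lean turns out to be numeric rather than a label, bucket it into ranges before counting.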

Alert once, not forty times

In any alerting pipeline, group incoming items by cluster_id before you notify: one notification per event, with cluster_size attached for context. Combine this with the urgency thresholds from the breaking-news guide. The full schema is in the docs.
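
A minimal sketch of that grouping, assuming items arrive in batches and each carries cluster_id, cluster_size and a title. The ClusterAlerter class, the title field and the notify callback are hypothetical names for illustration. A real pipeline would persist the set of alerted clusters rather than hold it in memory.

```python
from typing import Any, Callable, Iterable


class ClusterAlerter:
    """Send at most one notification per cluster_id across all batches."""

    def __init__(self, notify: Callable[[str], None]) -> None:
        self._notify = notify
        self._alerted: set[Any] = set()

    def process(self, items: Iterable[dict[str, Any]]) -> int:
        """Notify for clusters not seen before; return how many alerts were sent."""
        sent = 0
        for item in items:
            cluster_id = item.get("cluster_id")
            if cluster_id is None or cluster_id in self._alerted:
                continue
            self._alerted.add(cluster_id)
            title = item.get("title", "(untitled)")
            size = item.get("cluster_size", 1)
            self._notify(f"{title} [{size} outlets]")
            sent += 1
        return sent


# Hypothetical batches; messages.append stands in for a real pager or webhook.
messages: list[str] = []
alerter = ClusterAlerter(messages.append)
sent_first = alerter.process([
    {"title": "Strike announced", "cluster_id": "c1", "cluster_size": 3},
    {"title": "Union calls strike", "cluster_id": "c1", "cluster_size": 3},
])
sent_second = alerter.process([
    {"title": "Strike spreads", "cluster_id": "c1", "cluster_size": 40},
    {"title": "Court issues ruling", "cluster_id": "c2", "cluster_size": 5},
])
```

The second batch sends only the court ruling, because cluster c1 has already been alerted on. An urgency filter would slot in as one more condition before the notify call.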

Try it free

Grab a free API key — no card — and query live data in under a minute.
