News data for academic research: a corpus you can cite

Guide · July 2, 2026 · 5 min read

Academic work on news needs three things a typical news API doesn't provide: reproducibility, labels you can trust, and a methodology you can cite. Here's how to build a research corpus from NewsAgent Data — and where the honest limits are.

What research needs (that most APIs skip)

For a paper or a shared dataset, raw headlines aren't enough. You need stable timestamps, consistent labels, and a scoring method you can describe in a methods section. Black-box sentiment models that drift over time make results impossible to reproduce.

Reproducible, labeled records

Every article carries a UTC fetched_at timestamp plus urgency (0–10), a 9-category political_lean, topic_tags, country_tags, language, and an event cluster_id. Historical rows carry the same fields as live ones. Because urgency comes from a deterministic rules engine rather than a learned model, a score computed today reproduces next year. That reproducibility is what makes the score citable; the full scheme is documented in the methodology.
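Before records go into a shared dataset, it can help to check that each one actually has the fields listed above. The following is a minimal sketch, not part of the NewsAgent Data tooling: the field names come from this article, but the JSON types (an ISO 8601 string for fetched_at, a number for urgency) and the helper names parse_utc and record_problems are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Field names are taken from the article; their JSON types are ASSUMED.
REQUIRED_FIELDS = (
    "fetched_at", "urgency", "political_lean",
    "topic_tags", "country_tags", "language", "cluster_id",
)


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp and normalize it to UTC."""
    # Older Python versions reject a trailing "Z", so spell out the offset.
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if dt.tzinfo is None:
        # ASSUMPTION: a timestamp without an offset is already UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


def record_problems(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record passes."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if name not in record
    ]
    if "fetched_at" in record:
        try:
            parse_utc(record["fetched_at"])
        except (TypeError, ValueError, AttributeError):
            problems.append("fetched_at is not an ISO 8601 timestamp")
    urgency = record.get("urgency")
    if urgency is not None:
        in_range = isinstance(urgency, (int, float)) and 0 <= urgency <= 10
        if not in_range:
            problems.append("urgency outside 0-10")
    return problems
```

Logging the output of record_problems alongside the corpus gives you a concrete exclusion rule to describe in a methods section.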

Building your corpus

curl -H "X-API-Key: YOUR_KEY" \
  "https://api.newsagentdata.com/v1/feed?country=ru&topic=defense&days=30"

Filter by country, language, topic, political lean, and date window to carve out the exact slice you need, then paginate through the results and store them as JSONL or CSV. Because coverage spans Russian state media, independent outlets, Telegram, and Western wires, you can run comparative media-framing studies that English-only corpora can't support. See the historical data guide for details.
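The paginate-and-store step could look roughly like the sketch below. It uses only the Python standard library and reuses the endpoint, query parameters, and X-API-Key header from the curl example. Two things are assumptions, because this article does not specify them: the pagination parameter (a hypothetical page) and the response shape (a hypothetical top-level "articles" list). Check both against the API reference before relying on this.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterable, Iterator

BASE_URL = "https://api.newsagentdata.com/v1/feed"  # from the curl example


def build_url(params: dict, page: int) -> str:
    # ASSUMPTION: "page" is a hypothetical pagination parameter.
    query = urllib.parse.urlencode({**params, "page": page})
    return f"{BASE_URL}?{query}"


def http_get_json(url: str, api_key: str) -> dict:
    """Fetch one page, sending the key in the X-API-Key header."""
    request = urllib.request.Request(url, headers={"X-API-Key": api_key})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def iter_articles(
    params: dict,
    fetch_page: Callable[[str], dict],
    max_pages: int = 50,
) -> Iterator[dict]:
    """Yield articles page by page until an empty page or max_pages."""
    for page in range(1, max_pages + 1):
        payload = fetch_page(build_url(params, page))
        # ASSUMPTION: records live under a top-level "articles" key.
        articles = payload.get("articles", [])
        if not articles:
            break
        yield from articles


def jsonl_line(record: dict) -> str:
    # sort_keys keeps the serialized form stable across runs;
    # ensure_ascii=False keeps Cyrillic headlines readable.
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def write_jsonl(records: Iterable[dict], path: str) -> int:
    """Write one JSON object per line and return the number written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(jsonl_line(record) + "\n")
            count += 1
    return count
```

A call might look like write_jsonl(iter_articles({"country": "ru", "topic": "defense", "days": 30}, lambda url: http_get_json(url, "YOUR_KEY")), "corpus.jsonl"). The fetcher is passed in as a function so the paging logic can be tested without network access, and the default max_pages of 50 keeps a single run inside the free tier's 100 requests per day.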

What people study with it

Computational social science — agenda-setting, framing, salience over time.
Media-bias research — lean distribution per event (group by cluster_id).
Crisis & disaster timing — urgency spikes and how fast coverage spreads.
NLP corpora — labeled multilingual headlines for classification/summarization.
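As one illustration of the media-bias item above, here is a rough sketch of computing the lean distribution per event by grouping on cluster_id. It assumes records are plain dicts loaded from your stored JSONL, with political_lean as a string label; the function name lean_distribution is made up for this example.

```python
from collections import Counter, defaultdict
from typing import Iterable


def lean_distribution(records: Iterable[dict]) -> dict:
    """Map each cluster_id to the share of its articles per political_lean."""
    counts = defaultdict(Counter)
    for record in records:
        cluster = record.get("cluster_id")
        lean = record.get("political_lean")
        # Skip records without an event cluster or a lean label.
        if cluster is None or lean is None:
            continue
        counts[cluster][lean] += 1
    return {
        cluster: {
            lean: n / sum(lean_counts.values())
            for lean, n in lean_counts.items()
        }
        for cluster, lean_counts in counts.items()
    }
```

Shares rather than raw counts make clusters of different sizes comparable. If you also report the number of articles per cluster, readers can judge how much weight each distribution deserves.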

Ethics & honest scope

Only public sources (RSS and public Telegram) are included: no paywalled, private, or licensed content. Deep enrichment covers Russian and English; Spanish and Portuguese are scored; other languages are coverage-tagged. State that in your methods section rather than over-claiming. The free tier (100 requests/day, full schema, no card) is enough to pilot a study before you commit.

Try it free

Grab a free API key — no card — and query live data in under a minute.
