News data for academic research: a corpus you can cite
Academic work on news needs three things a typical news API doesn't provide: reproducibility, labels you can trust, and a methodology you can cite. Here's how to build a research corpus from NewsAgent Data — and where the honest limits are.
What research needs (that most APIs skip)
For a paper or a shared dataset, raw headlines aren't enough. You need stable timestamps, consistent labels, and a scoring method you can describe in a methods section. Black-box sentiment models that drift over time make results impossible to reproduce.
Reproducible, labeled records
Every article carries a UTC fetched_at timestamp plus urgency (0–10), a 9-category political_lean, topic_tags, country_tags, language and an event cluster_id, and historical rows have the same fields as live ones. Because urgency is computed by a deterministic rules engine rather than a model, the same article scored today gets the same score next year. That reproducibility is what makes it citable; the full scheme is documented in the methodology.
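A minimal sketch of how a project might validate records before they enter a corpus. The field names come from the list above; the Python types, the ISO 8601 wire format for fetched_at, and the `Article` / `parse_article` names are assumptions made for illustration, not part of the documented API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Article:
    """One labeled record. Field names follow the article; types are assumed."""
    fetched_at: datetime
    urgency: float
    political_lean: str
    topic_tags: tuple
    country_tags: tuple
    language: str
    cluster_id: str


def parse_article(raw: dict) -> Article:
    """Validate one raw record and normalize its timestamp to UTC."""
    # Assumption: fetched_at arrives as an ISO 8601 string such as
    # "2024-05-01T12:00:00Z". Older Python versions reject a trailing "Z".
    ts = datetime.fromisoformat(raw["fetched_at"].replace("Z", "+00:00"))
    if ts.tzinfo is None:
        # Assumption: a timestamp without an offset is already UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    ts = ts.astimezone(timezone.utc)

    urgency = float(raw["urgency"])
    if not 0 <= urgency <= 10:
        raise ValueError(f"urgency out of the documented 0-10 range: {urgency}")

    return Article(
        fetched_at=ts,
        urgency=urgency,
        political_lean=raw["political_lean"],
        topic_tags=tuple(raw["topic_tags"]),
        country_tags=tuple(raw["country_tags"]),
        language=raw["language"],
        cluster_id=str(raw["cluster_id"]),
    )
```

Rejecting out-of-range rows at ingest time, rather than during analysis, keeps the stored corpus consistent with what the methods section describes.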
Building your corpus
curl -H "X-API-Key: YOUR_KEY" \
  "https://api.newsagentdata.com/v1/feed?country=ru&topic=defense&days=30"
Filter by country, language, topic, lean and date window to carve the exact slice, then paginate and store as JSONL/CSV. Because coverage spans Russian state media, independent outlets, Telegram and Western wires, you can run comparative media-framing studies that English-only corpora can't support — see the historical data guide.
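The paginate-and-store step could be sketched as follows. The endpoint, the X-API-Key header and the country/topic/days filters are taken from the curl example; the pagination mechanism is not described here, so the `page` query parameter and the `articles` response key are hypothetical placeholders to be replaced with whatever the API reference specifies.

```python
import json
from typing import Callable, Iterable, Iterator
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://api.newsagentdata.com/v1/feed"


def http_fetch(params: dict, api_key: str) -> dict:
    """Fetch one page of results and decode the JSON body."""
    req = Request(f"{BASE_URL}?{urlencode(params)}",
                  headers={"X-API-Key": api_key})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


def iter_articles(fetch_page: Callable[[dict], dict], params: dict,
                  max_pages: int = 50) -> Iterator[dict]:
    """Yield articles page by page until an empty page or max_pages."""
    for page in range(1, max_pages + 1):
        # Assumption: pages are selected with a "page" query parameter.
        payload = fetch_page({**params, "page": page})
        # Assumption: each response holds its rows under an "articles" key.
        articles = payload.get("articles", [])
        if not articles:
            return
        yield from articles


def write_jsonl(articles: Iterable[dict], path: str) -> int:
    """Write one JSON object per line and return the number of rows."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for article in articles:
            # sort_keys gives byte-stable output; ensure_ascii=False keeps
            # Cyrillic headlines readable in the stored file.
            fh.write(json.dumps(article, ensure_ascii=False, sort_keys=True) + "\n")
            count += 1
    return count
```

A call might look like `write_jsonl(iter_articles(lambda p: http_fetch(p, "YOUR_KEY"), {"country": "ru", "topic": "defense", "days": 30}), "corpus.jsonl")`. Passing the fetch function in as an argument lets the paging logic be tested offline, and `max_pages` keeps a pilot run inside a daily request budget.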
What people study with it
- Computational social science: agenda-setting, framing, salience over time.
- Media-bias research: lean distribution per event (group by cluster_id).
- Crisis & disaster timing: urgency spikes and how fast coverage spreads.
- NLP corpora: labeled multilingual headlines for classification/summarization.
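As an illustration of the media-bias item, grouping by cluster_id to get a lean distribution per event can be sketched in a few lines. This assumes each article is a dict with cluster_id and political_lean keys, as in the field list earlier; the function name is hypothetical.

```python
from collections import Counter, defaultdict


def lean_distribution(articles):
    """Return {cluster_id: {lean: share}} where shares sum to 1 per cluster."""
    counts = defaultdict(Counter)
    for article in articles:
        counts[article["cluster_id"]][article["political_lean"]] += 1

    distribution = {}
    for cluster_id, lean_counts in counts.items():
        total = sum(lean_counts.values())
        distribution[cluster_id] = {
            lean: n / total for lean, n in lean_counts.items()
        }
    return distribution
```

Reporting shares rather than raw counts makes events with very different coverage volumes comparable, though small clusters will still give noisy distributions.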
Ethics & honest scope
Only public sources are included (RSS + public Telegram): no paywalled, private, or licensed content. Deep enrichment covers Russian and English; Spanish and Portuguese are scored; other languages are coverage-tagged. State that in your methods rather than over-claiming. The free tier (100 requests/day, full schema, no card) is enough to pilot a study before you commit.