Data Collection
Every article in the NewsAgent Data feed comes from one of two source types: RSS feeds and Telegram channels. All sources are publicly accessible — no paywalled, private, or licensed content is included.
RSS Feeds (800+ active feeds)
The RSS crawler runs continuously on a tiered schedule: high-priority breaking-news feeds are polled roughly every 60 seconds, while lower-volume feeds are refreshed on a rolling cycle to balance freshness against load. On each pass it fetches the latest entries, deduplicates by URL (unique index on the link field), runs urgency scoring, and writes new articles to the database.
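One way to picture the tiered schedule is a priority queue keyed by each feed's next-due time. The sketch below is illustrative only and simulates time rather than fetching anything: the feed URLs, the 900-second interval for the low-volume tier, and the names `ScheduledFeed` and `simulate_polls` are all assumptions, not the crawler's actual configuration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class ScheduledFeed:
    next_due: float                          # seconds since the simulation started
    url: str = field(compare=False)
    interval: float = field(compare=False)   # polling interval in seconds


def simulate_polls(feeds, horizon):
    """Return (time, url) for every poll that falls within `horizon` seconds.

    `feeds` maps a feed URL to its polling interval in seconds.
    """
    queue = [ScheduledFeed(0.0, url, interval) for url, interval in feeds.items()]
    heapq.heapify(queue)
    polls = []
    while queue and queue[0].next_due <= horizon:
        feed = heapq.heappop(queue)
        polls.append((feed.next_due, feed.url))
        # Re-queue the feed for its next pass.
        heapq.heappush(
            queue,
            ScheduledFeed(feed.next_due + feed.interval, feed.url, feed.interval),
        )
    return polls


# Hypothetical feeds: one priority feed (~60 s) and one low-volume feed (assumed 900 s).
FEEDS = {
    "https://example.com/breaking.rss": 60.0,
    "https://example.com/weekly.rss": 900.0,
}
polls = simulate_polls(FEEDS, horizon=900.0)
```

Over a 15-minute window the priority feed is polled far more often than the low-volume feed, which is the freshness-versus-load trade-off the schedule is meant to express.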
Sources span Russian state media, Russian independent media, CIS regional press, and international wire services and think tanks. The full source list is available via GET /v1/sources.
Telegram Channels (2,600+ public channels)
The Telegram collector runs continuously, monitoring thousands of public channels through a direct API integration (not web scraping). Each new post is deduplicated by URL, scored, and inserted. For Telegram posts, fetched_at stores the original post timestamp, not the time the collector retrieved the post.
Deduplication
Primary deduplication is by exact URL match. The link field has a unique index, so an attempt to insert the same URL twice silently skips the duplicate. In addition, the Telegram bot's sending log applies title-level deduplication so that the same headline is not pushed to subscribers more than once within a 24-hour window.
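The URL-level behavior can be reproduced with any database that supports unique indexes. The sketch below uses SQLite purely as an assumption (the page does not name the production database); the table name, the `title` column, and `insert_article` are hypothetical, and only the unique index on `link` comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, link TEXT NOT NULL, title TEXT)"
)
# The unique index on `link` is what enforces exact-URL deduplication.
conn.execute("CREATE UNIQUE INDEX idx_articles_link ON articles (link)")


def insert_article(conn, link, title):
    """Insert an article; return True if stored, False if the URL already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (link, title) VALUES (?, ?)",
        (link, title),
    )
    # An ignored insert changes no rows, so rowcount is 0 for duplicates.
    return cur.rowcount == 1


first = insert_article(conn, "https://example.com/story-1", "Story")
second = insert_article(conn, "https://example.com/story-1", "Story (re-crawled)")
stored = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```

Because the duplicate is skipped rather than raised as an error, the caller can keep processing the rest of a feed without special handling.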
| Metric | Value (as of June 2026) |
|---|---|
| Total RSS sources | 800+ active feeds |
| Total Telegram channels | 2,600+ public channels via direct API integration |
| Database size | 484,000+ articles (grows ~25,000/day) |
| Archive start date | May 9, 2026 |
| RSS crawl cycle | Continuous, tiered (~60s for priority feeds) |
| Telegram poll | Continuous via direct API integration |
| Breaking news latency | ≤60 seconds from ingestion to webhook push |
Urgency Scoring (0–10)
Every article receives an integer urgency score from 0 (routine content) to 10 (maximum urgency). The score is computed at ingestion time by a deterministic multi-signal engine — not a probabilistic ML model. This means scores are fully reproducible and auditable: given the same article text and source, the score will always be the same.
The engine runs 19 sequential steps on the combined title and content text of each article.
Score interpretation
| Score | Label | Meaning | API behavior |
|---|---|---|---|
| 0–4 | Routine | Background news, science, entertainment, routine politics | In /v1/feed only |
| 5–6 | Notable | Significant developments worth monitoring | Included in /v1/feed and /v1/breaking (24h window) |
| 7–8 | Breaking | Confirmed breaking event: military action, major policy, mass casualty | Triggers webhook push; included in /v1/breaking |
| 9–10 | Critical | High-confidence major event corroborated across multiple sources | Triggers webhook push; prioritized in breaking feeds |
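For client code, the bands in the table above can be turned into a small lookup. This is a minimal sketch based only on the table; the function names `urgency_label` and `triggers_webhook` are hypothetical and are not part of the API.

```python
def urgency_label(score: int) -> str:
    """Map an integer urgency score (0-10) to the label used in the table above."""
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError(f"urgency score must be an integer from 0 to 10, got {score!r}")
    if score <= 4:
        return "Routine"
    if score <= 6:
        return "Notable"
    if score <= 8:
        return "Breaking"
    return "Critical"


def triggers_webhook(score: int) -> bool:
    """Per the table, only Breaking (7-8) and Critical (9-10) trigger a webhook push."""
    return urgency_label(score) in {"Breaking", "Critical"}
```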
Event Clustering
Articles covering the same event are grouped into clusters. Each cluster is identified by a cluster_id integer. The cluster_size field on each article indicates how many articles share that cluster ID — a direct measure of how widely an event was covered.
Clustering algorithm
On each ingestion cycle, newly collected articles are compared against each other and against recent articles in the database using TF-IDF cosine similarity on headline text. Articles with similarity above the configured threshold are assigned to the same cluster.
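The general technique can be sketched in plain Python. The sketch rests on several assumptions: the tokenizer, the smoothed IDF formula, the 0.5 threshold, and the union-find grouping are illustrative choices, not the production configuration. It also compares only the headlines in one batch, whereas the pipeline additionally compares against recent articles in the database.

```python
import math
import re
from collections import Counter


def tfidf_vectors(headlines):
    """Build one sparse TF-IDF vector (dict of term -> weight) per headline."""
    docs = [Counter(re.findall(r"\w+", h.lower())) for h in headlines]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF so terms present in every headline keep a positive weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_headlines(headlines, threshold=0.5):
    """Return one group label per headline; similar headlines share a label."""
    vectors = tfidf_vectors(headlines)
    parent = list(range(len(headlines)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(headlines))]


labels = cluster_headlines([
    "Explosion reported at oil depot in Bryansk",
    "Oil depot explosion reported in Bryansk region",
    "Central bank holds key rate at 16 percent",
])
```

In this toy batch the two depot headlines end up with the same label and the central bank headline stays on its own.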
Cluster IDs are timestamp-based (Unix timestamp of the cycle start, e.g. 1780434861). Each cycle generates a new base ID, ensuring cluster IDs are globally unique across all cycles. This means articles from different crawl cycles that cover the same long-running story will have different cluster IDs — clustering reflects co-coverage within a single ingestion cycle, not long-term story tracking.
Clusters with cluster_id > 1,000,000 use the timestamp-based scheme and have reliable cluster_size values.
Using cluster data
To find all articles covering the same event: GET /v1/feed?cluster_id=<id>. To filter by narrative angle on the same event: add political_lean=state or political_lean=opposition.
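A small helper for building these requests might look like the sketch below. The host in `BASE_URL` is a placeholder and the function names are hypothetical. The snippet only constructs URLs, so any authentication the API requires is left out.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # placeholder host, not the real API endpoint


def cluster_feed_url(cluster_id, political_lean=None):
    """Build the /v1/feed URL for one cluster, optionally filtered by political_lean."""
    params = {"cluster_id": cluster_id}
    if political_lean is not None:
        params["political_lean"] = political_lean
    return f"{BASE_URL}/v1/feed?{urlencode(params)}"


def has_reliable_cluster_size(cluster_id):
    """Timestamp-based cluster IDs (> 1,000,000) carry reliable cluster_size values."""
    return cluster_id > 1_000_000


state_url = cluster_feed_url(1780434861, political_lean="state")
opposition_url = cluster_feed_url(1780434861, political_lean="opposition")
```

Fetching both URLs for the same cluster_id gives two views of one event that can be compared side by side.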
Political Lean Classification
Each article carries a political_lean field. This is a per-source classification, not a per-article text analysis. It reflects the editorial position of the outlet, not the content of the specific article.
| Value | Meaning | Example sources |
|---|---|---|
| state | State-controlled or state-aligned outlet | TASS, РИА Новости, RT, BELTA, Первый канал |
| official | Government and institutional press services | Ministry, agency & regulator channels |
| centrist | Mainstream centrist / business outlets | Interfax, Kommersant, RBC, Forbes |
| liberal | Independent outlets with a liberal or centrist Western orientation | BBC Russia, DW Russisch, Meduza, Novaya Gazeta |
| conservative | Conservative/right-of-center Western outlets | WSJ Opinion, Fox News, National Review |
| nationalist | Russian nationalist / pro-war commentary channels | Military-correspondent & "Z" commentary channels |
| opposition | Active opposition to current Russian government | NEXTA, Current Time, iStories |
| tabloid | Tabloid / sensationalist outlets (urgency-capped) | mk.ru, Life.ru, PressTV |
| neutral | Wire services and non-partisan international outlets | Reuters, AP, Bloomberg, AFP, Al Jazeera, ISW |
Classifications were assigned by the NewsAgent Data editorial team based on: country of registration, ownership structure, editorial guidelines, funding sources, and documented editorial positions. Classification is static per source and updated when editorial position verifiably changes.
political_lean describes the source's orientation, not the ideological content of the individual article. A TASS article about weather carries political_lean=state regardless of its content. Use this field to filter or compare narrative angles across outlets covering the same story, not to classify individual article content.
Topic Tagging
Each article may carry one or more topic_tags from a fixed 27-category taxonomy. Unlike event_type (which is a broad category), topic tags are specific sub-categories aligned to the needs of the API's primary audience segments.
Tagging method
Topic tags are assigned by a keyword pattern matching engine at ingestion time. The engine scans the article title and content against a language-specific pattern dictionary for each category. Articles may match zero, one, or multiple categories.
The sanctions tag uses specific multi-word phrases only (sanctions against, us sanctions, eu sanctions, western sanctions, economic sanctions, financial sanctions, sdn list) to avoid false positives from non-geopolitical uses of the word "sanction."
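The phrase-only rule can be illustrated with a short matcher. This is a sketch covering the English phrases listed above; the word-boundary regex and the name `matches_sanctions_tag` are assumptions, and the production engine uses language-specific dictionaries that are not shown here.

```python
import re

# Phrases taken from the list above.
SANCTIONS_PHRASES = (
    "sanctions against",
    "us sanctions",
    "eu sanctions",
    "western sanctions",
    "economic sanctions",
    "financial sanctions",
    "sdn list",
)

# Word boundaries keep "us sanctions" from matching inside "previous sanctions".
_SANCTIONS_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in SANCTIONS_PHRASES) + r")\b",
    re.IGNORECASE,
)


def matches_sanctions_tag(title, content=""):
    """Return True if the combined title and content contain a listed phrase."""
    return _SANCTIONS_RE.search(f"{title} {content}") is not None
```

A headline such as "The committee voted to sanction the new budget" does not match, which is the false positive the phrase list is designed to avoid.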
Country & Region Tags
The country_tags field contains one or more ISO-2 country codes associated with the article's subject matter. Tags are assigned per-article by a keyword/entity detection engine, not per-source. Coverage currently spans 67 countries and territories.
The region field is a coarser two-value field: rucis for Russian-language/CIS-focused content, int for international. The int value in country_tags is used for articles with global or multi-country scope.
Audience Tags
The audience_tags field is a comma-separated list of one or more audience categories the article is relevant to. These are used to power the /v1/audience/{audience} dedicated feeds.
Audience tags are assigned primarily at the source level (a defense/military channel is tagged security; a markets publication is tagged trading) with supplemental article-level keyword rules for multi-audience sources. A single article can carry multiple audience tags.
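Since audience_tags arrives as a comma-separated string, clients will usually split it before use. The sketch below assumes that each tag value can be used directly as the {audience} path segment; the helper names are hypothetical.

```python
def parse_audience_tags(raw):
    """Split a comma-separated audience_tags value into a clean list of tags."""
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


def audience_feed_paths(raw):
    """Return the dedicated feed path for each audience tag on an article."""
    return [f"/v1/audience/{tag}" for tag in parse_audience_tags(raw)]


tags = parse_audience_tags("security, trading")
paths = audience_feed_paths("security, trading")
```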
Known Limitations
Pipeline Updates
The scoring engine, topic pattern library, and source list are updated as the service evolves. Major changes are logged below.
| Date | Change |
|---|---|
| June 2026 | Source expansion: RSS grown to 800+ active feeds and Telegram coverage scaled to 2,600+ public channels. Political-lean taxonomy expanded to 9 categories (added official, centrist, tabloid). Live source counts now published at /public/stats. |
| June 2026 | Steps 16–19 added to urgency scorer: transport accident cap, domestic scandal cap, domestic shooting cap, protest cap. Sanctions topic tag false positives fixed. Timestamp-based cluster IDs deployed. |
| June 2026 | Telegram ingestion moved to a direct API integration (no web scraping) for improved reliability and coverage. |
| May 2026 | 18-category topic tagging system launched. country_tags and political_lean backfilled across full archive. |
| May 2026 | Initial collection began. RSS crawler + Telegram parser deployed on Contabo VPS. |
Questions about methodology? Email [email protected].