Data Methodology

How we collect, score, cluster, and label every article in the NewsAgent Data feed. This document covers the complete pipeline from ingestion to API response.

Last updated: June 2026 · v1.1
Contents
  1. Data Collection
  2. Urgency Scoring (0–10)
  3. Event Clustering
  4. Political Lean Classification
  5. Topic Tagging
  6. Country & Region Tags
  7. Audience Tags
  8. Known Limitations
  9. Pipeline Updates

Data Collection

Every article in the NewsAgent Data feed comes from one of two source types: RSS feeds and Telegram channels. All sources are publicly accessible — no paywalled, private, or licensed content is included.

RSS Feeds (800+ active feeds)

The RSS crawler runs continuously on a tiered schedule: high-priority breaking-news feeds are polled roughly every 60 seconds, while lower-volume feeds are refreshed on a rolling cycle to balance freshness against load. On each pass it fetches the latest entries, deduplicates by URL (unique index on the link field), runs urgency scoring, and writes new articles to the database.

Sources span Russian state media, Russian independent media, CIS regional press, and international wire services and think tanks. The full source list is available via GET /v1/sources.

Telegram Channels (2,600+ public channels)

The Telegram collector runs continuously, monitoring thousands of public channels through a direct API integration (not web scraping). Each new post is deduplicated by URL, scored, and inserted with the original Telegram post timestamp as fetched_at (not the fetch time).

Note: Only publicly accessible channels are monitored. Newly added channels are onboarded gradually. Channels that fail consistently (40+ consecutive fetch failures) are automatically disabled. Channels that have been deleted or have changed their username are detected and permanently disabled.
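The auto-disable rule in the note above can be sketched as a consecutive-failure counter. This is a minimal illustration, not the production collector: the class name ChannelHealth, the method record_fetch, and the reading of "40+" as "the streak reaches 40" are all assumptions.

```python
from dataclasses import dataclass

FAILURE_LIMIT = 40  # from the note above: 40+ consecutive fetch failures


@dataclass
class ChannelHealth:
    """Hypothetical per-channel state; not the production schema."""
    username: str
    consecutive_failures: int = 0
    enabled: bool = True

    def record_fetch(self, ok: bool) -> None:
        """Update the failure streak after one fetch attempt."""
        if not self.enabled:
            return  # disabled channels are no longer polled
        if ok:
            # Any success resets the streak, so only consecutive failures count.
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_LIMIT:
            self.enabled = False
```

A channel that fails 39 times, succeeds once, and then fails again starts a new streak from zero under this sketch.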

Deduplication

Primary deduplication is by exact URL match. The link field has a unique index, so attempting to insert the same URL twice silently skips the duplicate. In addition, the Telegram bot's sending log applies title-level deduplication so that the same story headline is not pushed to subscribers more than once within a 24-hour window.
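As an illustration of URL-level deduplication through a unique index, here is a minimal sketch using SQLite. The database engine, the table name articles, and every column other than link are assumptions; the document only states that the link field carries a unique index.

```python
import sqlite3

# In-memory database for illustration only; the production store is not specified.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles ("
    "id INTEGER PRIMARY KEY, link TEXT NOT NULL UNIQUE, title TEXT)"
)


def insert_article(link: str, title: str) -> bool:
    """Insert an article; return False if the URL was already stored."""
    cur = conn.execute(
        # OR IGNORE skips the row silently when the unique index on link is violated.
        "INSERT OR IGNORE INTO articles (link, title) VALUES (?, ?)",
        (link, title),
    )
    conn.commit()
    return cur.rowcount == 1
```

Letting the index reject duplicates keeps the check atomic, so two crawler passes that see the same URL cannot both insert it.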

Metric | Value (as of June 2026)
Total RSS sources | 800+ active feeds
Total Telegram channels | 2,600+ public channels via direct API integration
Database size | 484,000+ articles (grows ~25,000/day)
Archive start date | May 9, 2026
RSS crawl cycle | Continuous, tiered (~60s for priority feeds)
Telegram poll | Continuous via direct API integration
Breaking news latency | ≤60 seconds from ingestion to webhook push

Urgency Scoring (0–10)

Every article receives an integer urgency score from 0 (routine content) to 10 (maximum urgency). The score is computed at ingestion time by a deterministic multi-signal engine — not a probabilistic ML model. This means scores are fully reproducible and auditable: given the same article text and source, the score will always be the same.

The engine runs 19 sequential steps on the combined title + content text of each article:

Step 1. Source authority pre-score
Wire services (Reuters, AP, TASS, RIA) and first-party government sources start with a small positive weight. Science, entertainment, and tabloid aggregators receive a hard cap (max 6) that overrides all subsequent steps.

Steps 2–6. Keyword signal accumulation
The article text is scanned against a bilingual keyword library of 300+ weighted trigger patterns in Russian and English. Patterns cover conflict terms (missile, обстрел, strikes), geopolitical events (sanctions, treaty, withdrawal), casualty language (killed, погибших, wounded), and economic shocks (default, collapse, emergency). Each match adds to the raw score.

Step 7. Casualty detection
Articles with explicit casualty language (killed, dead, погибших, убитых, раненых, wound, kills) have their score amplified. High body counts mentioned explicitly push the score toward 8–10.

Steps 8–10. Russian morphology expansion
Russian keyword matching covers inflected forms (ракет, ракетами, обстрелов, дронов, погибших, раненых) in addition to nominative forms. This prevents score drops for naturally inflected Russian sentences.

Steps 11–12. Cross-source corroboration boost
If 3 or more distinct sources report on the same event within a crawl cycle (detected via clustering), the urgency score receives a corroboration bonus. A single fringe source cannot push a story to a score of 9–10 on its own.

Steps 13–15. Event-type caps
Scores are capped by event type before domain-specific exceptions are applied. Science/discovery articles are capped at 6. Routine economic news (rate releases, earnings, indices) is capped at 7.

Step 16. Transport accident cap (max 7)
Articles matching transport accident patterns (truck crash, bus crash, train derailment, road accident, building/house fire with victims) are capped at 7 unless military context words are present (attack, strike, missile, теракт). This prevents an NDTV-style headline such as "bus crash kills 18" from scoring 10.

Step 17. Domestic scandal/crime cap (max 7)
Articles matching scandal/arrest patterns (скандал, controversy, arrested, задержан) with single-casualty language and no mass-event indicators are capped at 7.

Step 18. Domestic shooting/family violence cap (max 8)
Articles matching domestic violence shooting patterns (gunman kills own family, kills his/her/own) are capped at 8. The cap is not applied if conflict context words are present (ukraine, gaza, war, terror).

Step 19. Protest/civil unrest cap (max 7)
Articles matching protest-clash patterns (protesters clash, clashes with police, riots over) are capped at 7 unless active conflict context is detected.
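
To make the interplay of accumulation and caps concrete, the following is a heavily simplified sketch of three of the steps above: keyword accumulation, the transport accident cap, and the source-type hard cap. The weights, the pattern lists, the source_type parameter, and the function name are all hypothetical; the real library has 300+ patterns and 19 steps that are not reproduced here.

```python
import re

# Hypothetical weights for illustration; the production values are not published here.
KEYWORD_WEIGHTS = {
    r"\bmissile": 4,
    r"обстрел": 4,
    r"\bsanctions\b": 2,
    r"\bkill(?:s|ed)?\b": 3,
    r"погибш": 3,
    r"\bdead\b": 3,
    r"\bwounded\b": 2,
}
TRANSPORT_ACCIDENT = re.compile(r"bus crash|truck crash|train derailment|road accident")
MILITARY_CONTEXT = re.compile(r"attack|strike|missile|теракт")
CAPPED_SOURCE_TYPES = {"science", "entertainment", "tabloid"}  # hard cap of 6 (step 1)


def urgency_score(title: str, content: str, source_type: str) -> int:
    """Deterministic score in 0..10 from title + content and a source type."""
    text = f"{title} {content}".lower()
    # Steps 2-6 (simplified): each matching pattern adds its weight once.
    raw = sum(w for pattern, w in KEYWORD_WEIGHTS.items() if re.search(pattern, text))
    score = min(raw, 10)
    # Step 16: transport accidents are capped at 7 unless military context appears.
    if TRANSPORT_ACCIDENT.search(text) and not MILITARY_CONTEXT.search(text):
        score = min(score, 7)
    # Step 1: the source-type hard cap overrides every other step.
    if source_type in CAPPED_SOURCE_TYPES:
        score = min(score, 6)
    return max(score, 0)
```

Because the function depends only on its inputs and fixed tables, the same text and source always yield the same score, which is the reproducibility property described above.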

Score interpretation

Score | Label | Meaning | API behavior
0–4 | Routine | Background news, science, entertainment, routine politics | In /v1/feed only
5–6 | Notable | Significant developments worth monitoring | Included in /v1/feed; /v1/breaking (24h window)
7–8 | Breaking | Confirmed breaking event: military action, major policy, mass casualty | Triggers webhook push; included in /v1/breaking
9–10 | Critical | High-confidence major event corroborated across multiple sources | Triggers webhook push; prioritized in breaking feeds
Design choice: The engine is deterministic (rule-based), not probabilistic. This is intentional — it provides full reproducibility and auditability. The trade-off is that novel event types not covered by the keyword library may be underscored. We update the keyword patterns regularly.
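
For client code, the score bands in the table above can be mapped to labels with a small helper. The function names are hypothetical; the thresholds come directly from the table.

```python
def urgency_label(score: int) -> str:
    """Map an integer urgency score (0-10) to its documented label."""
    if not 0 <= score <= 10:
        raise ValueError(f"urgency score out of range: {score}")
    if score <= 4:
        return "Routine"
    if score <= 6:
        return "Notable"
    if score <= 8:
        return "Breaking"
    return "Critical"


def triggers_webhook(score: int) -> bool:
    """Per the table, scores of 7 and above trigger a webhook push."""
    return score >= 7
```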

Event Clustering

Articles covering the same event are grouped into clusters. Each cluster is identified by a cluster_id integer. The cluster_size field on each article indicates how many articles share that cluster ID — a direct measure of how widely an event was covered.

Clustering algorithm

On each ingestion cycle, newly collected articles are compared against each other and against recent articles in the database using TF-IDF cosine similarity on headline text. Articles with similarity above the configured threshold are assigned to the same cluster.
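
The idea can be sketched in pure Python as follows. This is an assumption-laden illustration, not the production algorithm: the tokenizer, the smoothed IDF formula, the threshold value of 0.5, and the single-link merging strategy are all stand-ins, since the document only states that TF-IDF cosine similarity on headlines is compared against a configured threshold.

```python
import math
import re
from collections import Counter


def tfidf_vectors(headlines):
    """Build sparse TF-IDF vectors (term -> weight) for a list of headlines."""
    docs = [Counter(re.findall(r"\w+", h.lower())) for h in headlines]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_headlines(headlines, threshold=0.5):
    """Return one cluster label per headline; similar headlines share a label."""
    vecs = tfidf_vectors(headlines)
    labels = list(range(len(headlines)))  # each headline starts in its own cluster
    for i in range(len(vecs)):
        for j in range(i + 1, len(vecs)):
            if cosine(vecs[i], vecs[j]) > threshold:
                old, new = labels[j], labels[i]
                # Merge the whole cluster of j into the cluster of i.
                labels = [new if lab == old else lab for lab in labels]
    return labels
```

Two headlines that use the same words in a different order end up in one cluster under this sketch, while a headline with no shared terms stays on its own.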

Cluster IDs are timestamp-based (Unix timestamp of the cycle start, e.g. 1780434861). Each cycle generates a new base ID, ensuring cluster IDs are globally unique across all cycles. This means articles from different crawl cycles that cover the same long-running story will have different cluster IDs — clustering reflects co-coverage within a single ingestion cycle, not long-term story tracking.

Historical note: Articles ingested before a June 2026 pipeline update used small sequential cluster IDs (1, 2, 3…) that accumulated across cycles. Those historical cluster IDs are not meaningful for cross-source analysis. Only articles with cluster_id > 1,000,000 use the timestamp-based scheme and have reliable cluster_size values.
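
A client can apply the rule above with a one-line check before trusting cluster_size. The helper names and the constant name are hypothetical, and how individual IDs within a cycle are derived from the base ID is not specified, so only the base ID is sketched.

```python
import time

LEGACY_ID_CEILING = 1_000_000  # IDs at or below this use the old sequential scheme


def new_cluster_base_id(cycle_start=None) -> int:
    """Base ID for a cycle: the Unix timestamp of the cycle start, in seconds."""
    return int(time.time() if cycle_start is None else cycle_start)


def has_reliable_cluster_size(cluster_id) -> bool:
    """True only for timestamp-based cluster IDs (cluster_id > 1,000,000)."""
    return cluster_id is not None and cluster_id > LEGACY_ID_CEILING
```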

Using cluster data

To find all articles covering the same event: GET /v1/feed?cluster_id=<id>. To filter by narrative angle on the same event: add political_lean=state or political_lean=opposition.

Political Lean Classification

Each article carries a political_lean field. This is a per-source classification, not a per-article text analysis. It reflects the editorial position of the outlet, not the content of the specific article.

Value | Meaning | Example sources
state | State-controlled or state-aligned outlet | TASS, РИА Новости, RT, BELTA, Первый канал
official | Government and institutional press services | Ministry, agency & regulator channels
centrist | Mainstream centrist / business outlets | Interfax, Kommersant, RBC, Forbes
liberal | Independent liberal/centrist Western orientation | BBC Russia, DW Russisch, Meduza, Novaya Gazeta
conservative | Conservative/right-of-center Western outlets | WSJ Opinion, Fox News, National Review
nationalist | Russian nationalist / pro-war commentary channels | Military-correspondent & "Z" commentary channels
opposition | Active opposition to current Russian government | NEXTA, Current Time, iStories
tabloid | Tabloid / sensationalist outlets (urgency-capped) | mk.ru, Life.ru, PressTV
neutral | Wire services and non-partisan international outlets | Reuters, AP, Bloomberg, AFP, Al Jazeera, ISW

Classifications were assigned by the NewsAgent Data editorial team based on: country of registration, ownership structure, editorial guidelines, funding sources, and documented editorial positions. Classification is static per source and updated when editorial position verifiably changes.

Important: political_lean describes the source's orientation, not the ideological content of the individual article. A TASS article about weather carries political_lean=state regardless of its content. Use this field to filter or compare narrative angles across outlets covering the same story, not to classify individual article content.

Topic Tagging

Each article may carry one or more topic_tags from a fixed 27-category taxonomy. Unlike event_type (which is a broad category), topic tags are specific sub-categories aligned to the needs of the API's primary audience segments.

ukraine, middle_east, terrorism, cyber, nuclear, diplomacy, elections, sanctions, markets, crypto, oil_gas, rates, mergers, ai, chips, space, regulation, pandemic

Tagging method

Topic tags are assigned by a keyword pattern matching engine at ingestion time. The engine scans the article title and content against a language-specific pattern dictionary for each category. Articles may match zero, one, or multiple categories.

The sanctions tag uses specific multi-word phrases only (sanctions against, us sanctions, eu sanctions, western sanctions, economic sanctions, financial sanctions, sdn list) to avoid false positives from non-geopolitical uses of the word "sanction."
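
A minimal sketch of phrase-only matching for this tag is shown below. The phrase list is taken from the paragraph above; the use of word boundaries, the lowercasing, and the function name are assumptions about how such a matcher might be written.

```python
import re

SANCTIONS_PHRASES = [
    "sanctions against", "us sanctions", "eu sanctions", "western sanctions",
    "economic sanctions", "financial sanctions", "sdn list",
]
# Word boundaries keep "us sanctions" from matching inside "campus sanctions".
SANCTIONS_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in SANCTIONS_PHRASES) + r")\b"
)


def has_sanctions_tag(title: str, content: str = "") -> bool:
    """True if any of the multi-word sanctions phrases occurs in the text."""
    return SANCTIONS_RE.search(f"{title} {content}".lower()) is not None
```

The bare verb in a headline like "Board votes to sanction the merger" does not match, which is the false positive the phrase list is meant to avoid. A spelling such as "U.S. sanctions" would also be missed by this simple version.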

Coverage: As of June 2026, approximately 26% of the archive (~208k of 800k+ articles) has at least one topic tag. Coverage is higher for recent articles (ingested after topic tagging was added) and for English-language sources. Coverage improves continuously as new articles are ingested with the full pattern library applied.

Country & Region Tags

The country_tags field contains one or more ISO-2 country codes associated with the article's subject matter. Tags are assigned per-article by a keyword/entity detection engine, not per-source. Coverage currently spans 67 countries and territories.

ru, us, uk, eu, cn, ua, il, tr, de, fr, in, int

The region field is a coarser two-value field: rucis for Russian-language/CIS-focused content, int for international. The int value in country_tags is used for articles with global or multi-country scope.

Audience Tags

The audience_tags field is a comma-separated list of one or more audience categories the article is relevant to. These are used to power the /v1/audience/{audience} dedicated feeds.

trading, media, academic, security, tech, politics

Audience tags are assigned primarily at the source level (a defense/military channel is tagged security; a markets publication is tagged trading) with supplemental article-level keyword rules for multi-audience sources. A single article can carry multiple audience tags.

Known Limitations

Archive depth: starts May 9, 2026
The NewsAgent Data database began continuous collection on May 9, 2026. There is no historical data before this date. The archive currently grows by roughly 25,000 articles per day as Telegram coverage expands. Tier history limits (7 days, 90 days, or unlimited) are access windows within this archive.
Telegram channel coverage: thousands of public channels
Telegram content is collected from public channels through a direct API integration — not web scraping. Thousands of channels are monitored; newly added channels are onboarded gradually. Channels that fail 40+ consecutive fetches are auto-disabled.
Topic tag coverage: ~26% of archive
Articles ingested before topic tagging was introduced, and articles with non-standard language patterns, may have empty topic_tags. Coverage is higher for recent articles and English-language sources. We are continuously expanding the pattern library.
Urgency scoring: rule-based, not ML
Scores are deterministic and reproducible but may miss novel event types not covered by the current keyword library. The engine is tuned against Russian-English geopolitical and economic news — it is not a general-purpose news classifier. No training set, precision/recall figures, or confusion matrix are available because no probabilistic model is used.
Political lean: source-level, not article-level
political_lean reflects the outlet's orientation, not the individual article's content. A neutral wire service article about the Kremlin carries political_lean=neutral. Do not use this field as an article-level sentiment or framing score.
Cluster IDs: per-cycle, not story-level
Clustering detects same-story co-coverage within a single ingestion cycle. Articles from different cycles covering the same long-running story (e.g. an ongoing conflict) will not share a cluster_id. Only articles with cluster_id > 1,000,000 have reliable cluster_size values.

Pipeline Updates

The scoring engine, topic pattern library, and source list are updated as the service evolves. Major changes are logged below.

Date | Change
June 2026 | Source expansion: RSS grown to 800+ active feeds and Telegram coverage scaled to 2,600+ public channels. Political-lean taxonomy expanded to 9 categories (added official, centrist, tabloid). Live source counts now published at /public/stats.
June 2026 | Steps 16–19 added to urgency scorer: transport accident cap, domestic scandal cap, domestic shooting cap, protest cap. Sanctions topic tag false positives fixed. Timestamp-based cluster IDs deployed.
June 2026 | Telegram ingestion moved to a direct API integration (no web scraping) for improved reliability and coverage.
May 2026 | 18-category topic tagging system launched. country_tags and political_lean backfilled across full archive.
May 2026 | Initial collection began. RSS crawler + Telegram parser deployed on Contabo VPS.

Questions about methodology? Email [email protected].