Limitations and ethics

Technical challenges

  • Anti-scraping measures: Many platforms implement rate limits, CAPTCHAs, dynamic rendering, and bot detection, and frequent DOM and API changes break automation. Design collection processes that are resilient, log selector versions, and maintain fallbacks such as official APIs or web archives.
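One part of resilience against rate limits and transient blocks is retrying with jittered exponential backoff rather than hammering the platform. A minimal sketch (the function names `backoff_delay` and `resilient_fetch` are illustrative, not from any specific tool):

```python
import random
import time
import urllib.error
import urllib.request

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: a random delay in [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def resilient_fetch(url: str, max_attempts: int = 4) -> bytes:
    """Fetch a URL, retrying transient network failures with jittered backoff."""
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except urllib.error.URLError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            time.sleep(backoff_delay(attempt))
    raise RuntimeError("unreachable")
```

The jitter spreads retries from concurrent collectors over time, which both reduces load on the target and makes the traffic pattern less bot-like.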

  • Information overload and noise: The volume of data and duplication make it difficult to identify relevant signals. Effective handling requires applying filters, deduplication, sampling, and prioritization aligned with Intelligence Requirements (IR).
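Deduplication can often be approximated by hashing a normalized form of each item, so trivially reposted copies (extra whitespace, different casing) collapse into one. A minimal sketch with hypothetical helper names:

```python
import hashlib
import re

def content_key(text: str) -> str:
    """Collapse whitespace and case, then hash, so near-identical reposts collide."""
    norm = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()

def deduplicate(items: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized item, preserving order."""
    seen: set[str] = set()
    unique = []
    for item in items:
        key = content_key(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

For fuzzier duplicates (paraphrases, crops), this exact-hash approach would need to be replaced with similarity hashing such as MinHash or SimHash.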

  • Synthetic content (deepfakes/generated text): Images, audio, and text generated by AI are increasingly common. Analysts must verify origin (when available, using metadata or provenance standards such as C2PA), perform basic forensic checks (visual, spatial, or temporal inconsistencies), and corroborate information using methods like the Rule of Two or temporal/geospatial triangulation.
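The Rule of Two mentioned above can be made operational by requiring that a claim be backed by reports from at least two independent origins, not merely two accounts syndicating the same upstream source. A minimal sketch (the `Report` structure and `corroborated` function are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Report:
    claim: str
    source: str  # the account or outlet that published the report
    origin: str  # the upstream origin, to detect syndicated copies

def corroborated(reports: list[Report], claim: str, minimum: int = 2) -> bool:
    """Rule of Two: the claim must come from at least `minimum` independent origins."""
    origins = {r.origin for r in reports if r.claim == claim}
    return len(origins) >= minimum
```

Tracking the origin separately from the publishing account is the key design choice: ten accounts repeating one wire story still count as a single origin.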

  • Volatility and disappearance of historical data: Retention policies, content deletion, and ephemeral formats (stories, streams) reduce traceability. Timestamp captures, hash the evidence, consult archives (the Wayback Machine or institutional archives), and follow chain-of-custody procedures.
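Timestamping and hashing a capture can be as simple as recording the UTC time and a SHA-256 digest of the unaltered bytes at the moment of collection. A minimal sketch (the `register_capture` name and record fields are illustrative):

```python
import hashlib
from datetime import datetime, timezone

def register_capture(payload: bytes, source_url: str) -> dict:
    """Build an evidence record: source, UTC timestamp, and SHA-256 of the raw capture."""
    return {
        "source_url": source_url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
```

The digest is computed over the original bytes before any processing, so the record can later prove that an archived copy has not been altered.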

Privacy, legality, and compliance

  • Legal basis and minimization: Collect only what is necessary for a legitimate purpose (principles of lawfulness, purpose limitation, and data minimization). Avoid special categories of data unless clearly permitted by law.

  • Terms of service and unauthorized access: Respect platform ToS (terms of service) and do not bypass technical controls (no bypassing of paywalls, gating, or authentication).

  • Jurisdiction and international transfer: Consider where data is stored or processed and the applicable regulatory regimes (e.g., GDPR/NIS2 in the EU). Document transfer bases when relevant.

  • Third-party rights: Avoid doxxing, harassment, or unnecessary exposure of non-target individuals. Evaluate proportionality before including personal data in intelligence products.

Analyst operations security

  • Identity separation: Use isolated accounts and research environments (virtual machines, depersonalized browsers), controlling metadata leaks (time, language, fingerprinting).

  • Attack surface: Prevent malvertising, malicious downloads, and tracking; employ blocklists, sandboxing, and up-to-date security tools.

  • Minimal interaction: Observe without participating; avoid likes, comments, or messages that could alter the environment or reveal the analyst’s presence.

Quality and verification

  • Multilevel corroboration: Combine internal verification (consistency, metadata, integrity) with external verification (other independent sources). Apply credibility-reliability matrices and label evidence by confidence level.
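One widely used credibility-reliability matrix is the NATO Admiralty code, which grades source reliability A-F and information credibility 1-6; whether your organization uses this exact scheme is an assumption here. A minimal sketch of labeling evidence with it:

```python
# Admiralty (NATO) grading scales.
RELIABILITY = {
    "A": "completely reliable", "B": "usually reliable", "C": "fairly reliable",
    "D": "not usually reliable", "E": "unreliable", "F": "cannot be judged",
}
CREDIBILITY = {
    1: "confirmed by other sources", 2: "probably true", 3: "possibly true",
    4: "doubtful", 5: "improbable", 6: "cannot be judged",
}

def grade(reliability: str, credibility: int) -> str:
    """Combine the two scales into a label such as 'B2', rejecting invalid grades."""
    if reliability not in RELIABILITY or credibility not in CREDIBILITY:
        raise ValueError(f"invalid grade: {reliability}{credibility}")
    return f"{reliability}{credibility}"
```

Attaching a grade like "B2" to each piece of evidence makes the confidence level explicit and auditable in the final intelligence product.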

  • Chain of custody: Record who obtained what, when, and how; preserve unaltered originals alongside hashes and derived versions for analysis.
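The "who, what, when, how" record can be strengthened by chaining each custody entry to the previous one through a hash, so any later tampering with the log becomes evident. A minimal sketch (the `append_custody_entry` name and entry fields are illustrative assumptions):

```python
import hashlib
import json
from datetime import datetime, timezone

def append_custody_entry(log: list, analyst: str, action: str, item_sha256: str) -> dict:
    """Append a custody entry linked to the previous entry's hash (tamper-evident log)."""
    prev = log[-1]["entry_hash"] if log else "0" * 64
    entry = {
        "analyst": analyst,
        "action": action,            # e.g. "collected", "exported for analysis"
        "item_sha256": item_sha256,  # hash of the unaltered original
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prev_hash": prev,
    }
    # Hash a canonical serialization of the entry itself to chain the next one to it.
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode("utf-8")
    ).hexdigest()
    log.append(entry)
    return entry
```

Because each entry embeds the hash of its predecessor, rewriting any past entry breaks the chain for every entry after it.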

  • Human-in-the-loop: Maintain human oversight to review automated outputs, especially for high-impact findings.