Section 3: Data Sources and Collection

This section describes our data collection methodology, sources, ethical constraints, and bias considerations.

Data Collection Methodology

We collect data from multiple platforms using a combination of methods designed to be scalable, reproducible, and ethical.

Primary Data Sources

Platform Sampling

Each platform is listed with its access method, sampling strategy, and update frequency:

  • YouTube: public API + web sampling; stratified by category, geography, and channel size; continuous
  • TikTok: web sampling + trend monitoring; trending content + category sampling; daily
  • Instagram: public profile sampling; hashtag-based + account-type stratified; weekly
  • Facebook: public page and group sampling; category-based with geographic stratification; weekly
  • X/Twitter: public API + web sampling; keyword-based + trending topic sampling; continuous
  • Pinterest: web sampling; category boards + trending pins; weekly
  • Reddit: public API; subreddit-based stratified sampling; daily
  • Amazon: product listing sampling; category-based, review-focused; weekly
  • Google Search: SERP sampling; query-based across categories; weekly

Website Crawling

For AI-generated blog networks and content farms, we use:

  • Seed list of known AI content domains
  • Expansion through link analysis and ad network correlation
  • New domain discovery through search result monitoring
  • Domain registration pattern monitoring
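The expansion step above can be sketched as simple link analysis: a domain becomes a candidate when enough known seed domains link to it. This is a minimal sketch; the threshold and data shapes are hypothetical, and the real pipeline also uses ad network correlation and registration patterns.

```python
from collections import Counter

def expand_seed_list(seed_domains, outbound_links, min_shared_seeds=2):
    """Expand a seed list of suspected AI-content domains via link analysis.

    outbound_links maps each seed domain to the set of domains it links to.
    A candidate is flagged when at least `min_shared_seeds` distinct seeds
    link to it (an illustrative threshold, not the production value).
    """
    counts = Counter()
    for seed in seed_domains:
        for target in outbound_links.get(seed, set()):
            if target not in seed_domains:
                counts[target] += 1
    return {domain for domain, n in counts.items() if n >= min_shared_seeds}
```

Candidates found this way would then feed back into the seed list after review, so the network map grows iteratively.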

Sampling Methodology

Stratified Random Sampling

Content is sampled using stratified random sampling across:

  • Content categories (entertainment, education, health, finance, etc.)
  • Geographic targeting (US, EU, Asia-Pacific, Latin America)
  • Time periods (distributed across days of week and hours)
  • Source types (established accounts, new accounts, trending content)
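The sampling design above can be illustrated with a small sketch: partition content by a stratum key (e.g. category and region), then draw a fixed random quota from each stratum. The field names and quota are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(items, strata_key, per_stratum, seed=0):
    """Draw up to `per_stratum` items at random from each stratum.

    `strata_key` maps an item to its stratum, e.g. (category, region).
    A fixed seed keeps the draw reproducible.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[strata_key(item)].append(item)
    sample = []
    for key in sorted(strata):           # deterministic stratum order
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

In practice the stratum key would combine all four dimensions listed above (category, geography, time period, source type).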

Sample Sizes

  • Monthly platform samples: 10,000-50,000 content pieces per platform
  • Search result samples: 5,000 queries per month
  • Website samples: 5,000-10,000 domains per month

API Limitations

Key limitations per platform, and how we mitigate them:

  • YouTube: rate limits and quota restrictions. Mitigation: distributed sampling, priority queuing.
  • TikTok: limited official API access. Mitigation: web-based sampling with ethical constraints.
  • Instagram: restricted API post-2024. Mitigation: public profile sampling only.
  • X/Twitter: paid API tiers required. Mitigation: targeted sampling within budget constraints.
  • Reddit: API rate limits. Mitigation: efficient query design, caching.
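Rate-limit mitigation of this kind typically comes down to enforcing a minimum interval between requests per platform. A minimal sketch, assuming a simple fixed-rate policy (real collection would also honor platform-published quotas and backoff signals):

```python
import time

class RespectfulRateLimiter:
    """Enforce a minimum interval between requests to one platform.

    The rate is a hypothetical parameter; each platform would get its own
    limiter tuned to its published limits.
    """

    def __init__(self, requests_per_second):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0.0

    def wait(self):
        """Block until the next request is allowed, then record it."""
        elapsed = time.monotonic() - self.last_request
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_request = time.monotonic()
```

Calling `limiter.wait()` before each fetch keeps the collector within the configured rate regardless of how fast the surrounding code runs.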

Ethical Constraints

Our data collection adheres to strict ethical guidelines:

  1. No personal data collection — we analyze content, not individuals
  2. Public content only — no access to private posts, messages, or protected accounts
  3. Respectful rate limiting — we do not overwhelm platform infrastructure
  4. No deception — we do not create fake accounts or misrepresent our identity
  5. Proportionality — we collect only what is necessary for our analysis
  6. Data minimization — we retain analyzed signals, not raw content, where possible
  7. Responsible disclosure — when we discover specific harmful content, we follow responsible disclosure practices

Bias Considerations

Known Biases

  1. English-language bias — our detection methods are most developed for English content
  2. Platform access bias — platforms with better API access are more thoroughly sampled
  3. Detection bias — our methods detect content from current-generation AI models more reliably than content from older or novel generation methods
  4. Category bias — some content categories are more testable than others
  5. Geographic bias — US and European content is overrepresented in our samples

Mitigation Strategies

  • Regular bias audits comparing sample demographics to platform demographics
  • Investment in multilingual detection capabilities
  • Cross-validation with external researchers and datasets
  • Transparent reporting of known biases in all publications

Human Labeling Workflows

Human reviewers are used for:

  • Training data generation — labeled examples for detection model training
  • Validation — spot-checking automated scoring accuracy
  • Edge cases — reviewing content in the ambiguous 35-55 Slop Score range
  • Calibration — regular reviews to ensure scoring drift is detected
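The routing logic implied by this workflow can be sketched in a few lines: the ambiguous 35-55 Slop Score band always goes to a human, and a small random fraction of automated decisions is spot-checked. The 2% spot-check rate is a hypothetical parameter; the band thresholds come from the section above.

```python
import random

def route_for_review(slop_score, rng, spot_check_rate=0.02,
                     ambiguous_range=(35, 55)):
    """Decide whether a scored item goes to human review."""
    low, high = ambiguous_range
    if low <= slop_score <= high:
        return "human_review"        # ambiguous band: always reviewed
    if rng.random() < spot_check_rate:
        return "human_review"        # random spot-check of automated scoring
    return "automated"
```

Seeding the `rng` per batch would make the spot-check selection reproducible for audits.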

Reviewer Training

All human reviewers complete:

  • 8 hours of initial training on AI content identification
  • Monthly calibration exercises
  • Regular inter-rater reliability assessments
  • Specialization tracks for platform-specific content types
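Inter-rater reliability assessments like those above are commonly scored with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch for two reviewers labeling the same items (label values are illustrative):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers' labels on the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both reviewers labeled the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each reviewer's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / n**2
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate strong agreement; a calibration exercise that drops below a chosen threshold would trigger retraining.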

Dataset Validation

Published datasets undergo:

  1. Automated consistency checks
  2. Human sample validation (minimum 5% of dataset)
  3. Cross-method verification (multiple detection approaches on same content)
  4. External expert review for major publications
  5. Versioning and change documentation
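Steps 1 and 2 of the checklist can be sketched concretely: automated checks that every record is well-formed, plus a reproducible draw of the minimum 5% human-validation sample. Field names and the score bounds are illustrative assumptions.

```python
import math
import random

def consistency_errors(records, required=("id", "platform", "score")):
    """Automated consistency checks: required fields present and scores
    within [0, 100] (field names and bounds are illustrative)."""
    errors = []
    for i, rec in enumerate(records):
        missing = [f for f in required if f not in rec]
        if missing:
            errors.append((i, "missing fields: %s" % missing))
        elif not 0 <= rec["score"] <= 100:
            errors.append((i, "score out of range"))
    return errors

def validation_sample(records, min_fraction=0.05, seed=0):
    """Select the human-validation sample: at least 5% of the dataset,
    drawn reproducibly with a fixed seed."""
    k = max(1, math.ceil(len(records) * min_fraction))
    return random.Random(seed).sample(records, k)
```

A release would only proceed once `consistency_errors` returns an empty list and the drawn sample has passed human review.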