# Section 3: Data Sources and Collection
This section describes our data collection methodology, data sources, ethical constraints, and bias considerations.
## Data Collection Methodology
We collect data from multiple platforms using a mix of methods designed to be scalable, reproducible, and ethical.
## Primary Data Sources

### Platform Sampling
| Platform | Access Method | Sampling Strategy | Update Frequency |
|---|---|---|---|
| YouTube | Public API + web sampling | Stratified by category, geography, channel size | Continuous |
| TikTok | Web sampling + trend monitoring | Trending content + category sampling | Daily |
| Instagram | Public profile sampling | Hashtag-based + account-type stratified | Weekly |
| Facebook | Public page and group sampling | Category-based, geographic stratification | Weekly |
| X/Twitter | Public API + web sampling | Keyword-based + trending topic sampling | Continuous |
| Pinterest | Web sampling | Category boards + trending pins | Weekly |
| Reddit | Public API | Subreddit-based stratified sampling | Daily |
| Amazon | Product listing sampling | Category-based, review-focused | Weekly |
| Google Search | SERP sampling | Query-based across categories | Weekly |
### Website Crawling
For AI-generated blog networks and content farms:
- Seed list of known AI content domains
- Expansion through link analysis and ad network correlation (sketched after this list)
- New domain discovery through search result monitoring
- Domain registration pattern monitoring
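
As one illustration of the expansion step, the sketch below fetches a seed page, collects outbound domains, and records ad-network publisher IDs; Google AdSense `ca-pub-` tokens are one real-world signal of shared monetization. The `expand_seed` helper, user-agent string, and use of the `requests` library are assumptions for illustration, not a description of our production crawler.

```python
import re
from urllib.parse import urlparse

import requests  # assumed HTTP client; any would do

# One illustrative network signal: Google AdSense publisher IDs.
# Domains that embed the same "ca-pub-..." token are likely monetized
# by the same operator and can be grouped into a candidate network.
HREF_RE = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)
ADSENSE_RE = re.compile(r"ca-pub-\d{10,}")

def expand_seed(seed_url: str, timeout: float = 10.0):
    """Return (outbound_domains, ad_publisher_ids) for one seed page."""
    resp = requests.get(
        seed_url,
        timeout=timeout,
        headers={"User-Agent": "research-crawler (contact: research@example.org)"},
    )
    domains = {urlparse(link).netloc for link in HREF_RE.findall(resp.text)}
    domains.discard(urlparse(seed_url).netloc)  # keep outbound links only
    return domains, set(ADSENSE_RE.findall(resp.text))
```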
## Sampling Methodology

### Stratified Random Sampling
Content is sampled using stratified random sampling across the following dimensions (see the sketch after this list):
- Content categories (entertainment, education, health, finance, etc.)
- Geographic targeting (US, EU, Asia-Pacific, Latin America)
- Time periods (distributed across days of week and hours)
- Source types (established accounts, new accounts, trending content)
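
A minimal sketch of the per-stratum draw, assuming records can be keyed by a stratum tuple and that a fixed seed is acceptable for reproducibility; the `stratified_sample` name and its parameters are illustrative:

```python
import random
from collections import defaultdict

def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each stratum.

    `key` maps a record to its stratum, e.g. a tuple of
    (category, region, hour_bucket, source_type).
    """
    rng = random.Random(seed)  # fixed seed keeps samples reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# e.g. stratified_sample(records,
#                        key=lambda r: (r["category"], r["region"]),
#                        per_stratum=100)
```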
### Sample Sizes
- Monthly platform samples: 10,000-50,000 content pieces per platform
- Search result samples: 5,000 queries per month
- Website samples: 5,000-10,000 domains per month
## API Limitations
| Platform | Key Limitations | Mitigation |
|---|---|---|
| YouTube | Rate limits, quota restrictions | Distributed sampling, priority queuing |
| TikTok | Limited official API access | Web-based sampling with ethical constraints |
| Instagram | Restricted API post-2024 | Public profile sampling only |
| X/Twitter | Paid API tiers required | Targeted sampling within budget constraints |
| Reddit | API rate limits | Efficient query design, caching |
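
To make the priority-queuing mitigation concrete, here is a minimal sketch of a quota-aware dispatcher. The `QuotaQueue` class and its quota model are assumptions; the idea mirrors APIs such as the YouTube Data API, which charges different quota units per endpoint.

```python
import heapq

class QuotaQueue:
    """Priority queue that stops dispatching once a daily quota is spent.

    Lower `priority` values are served first; `cost` is the quota units
    a request consumes (APIs often price endpoints differently).
    """
    def __init__(self, daily_quota: int):
        self.remaining = daily_quota
        self._heap = []
        self._seq = 0  # tie-breaker so heapq never compares payloads

    def put(self, priority: int, cost: int, request) -> None:
        heapq.heappush(self._heap, (priority, self._seq, cost, request))
        self._seq += 1

    def next_request(self):
        """Pop the best request if its cost still fits; else return None."""
        if not self._heap:
            return None
        _, _, cost, request = self._heap[0]
        if cost > self.remaining:
            return None  # defer until the quota resets
        heapq.heappop(self._heap)
        self.remaining -= cost
        return request
```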
## Ethical Constraints
Our data collection adheres to strict ethical guidelines:
- No personal data collection — we analyze content, not individuals
- Public content only — no access to private posts, messages, or protected accounts
- Respectful rate limiting — we do not overwhelm platform infrastructure (see the sketch after this list)
- No deception — we do not create fake accounts or misrepresent our identity
- Proportionality — we collect only what is necessary for our analysis
- Data minimization — we retain analyzed signals, not raw content, where possible
- Responsible disclosure — when we discover specific harmful content, we follow responsible disclosure practices
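
One concrete form of respectful rate limiting is honoring each site's robots.txt rules and declared Crawl-delay, as in this sketch built on Python's standard `urllib.robotparser`; the `polite_fetch_plan` name, user-agent string, and five-second default delay are illustrative assumptions.

```python
from urllib import robotparser

def polite_fetch_plan(base_url: str, paths, default_delay: float = 5.0):
    """Yield (url, wait_seconds) pairs that honor robots.txt.

    Disallowed paths are skipped; the site's declared Crawl-delay is
    used when present, otherwise a conservative default.
    """
    agent = "research-crawler"
    rp = robotparser.RobotFileParser(base_url.rstrip("/") + "/robots.txt")
    rp.read()
    delay = rp.crawl_delay(agent) or default_delay
    for path in paths:
        url = base_url.rstrip("/") + path
        if rp.can_fetch(agent, url):
            yield url, delay
```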
## Bias Considerations

### Known Biases
- English-language bias — our detection methods are most developed for English content
- Platform access bias — platforms with better API access are more thoroughly sampled
- Detection bias — our methods are better at detecting current-generation AI content than older or novel generation methods
- Category bias — some content categories are more testable than others
- Geographic bias — US and European content is overrepresented in our samples
### Mitigation Strategies
- Regular bias audits comparing sample demographics to platform demographics (audit sketch after this list)
- Investment in multilingual detection capabilities
- Cross-validation with external researchers and datasets
- Transparent reporting of known biases in all publications
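
A bias audit can start with a simple distribution comparison. The sketch below measures total variation distance between a sample's category mix and a reference distribution; the function name and choice of metric are assumptions, and the reference shares would come from platform-reported or external statistics.

```python
from collections import Counter

def total_variation(sample_labels, reference_shares):
    """Total variation distance between a sample's category mix and a
    reference distribution; 0 means an exact match, 1 maximal skew.

    `reference_shares` maps category -> expected share (summing to 1).
    """
    counts = Counter(sample_labels)
    n = sum(counts.values())
    cats = set(counts) | set(reference_shares)
    return 0.5 * sum(
        abs(counts.get(c, 0) / n - reference_shares.get(c, 0.0)) for c in cats
    )

# total_variation(["news", "news", "music"], {"news": 0.5, "music": 0.5})
# -> 0.5 * (|2/3 - 1/2| + |1/3 - 1/2|) ≈ 0.17
```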
## Human Labeling Workflows
Human reviewers are used for:
- Training data generation — labeled examples for detection model training
- Validation — spot-checking automated scoring accuracy
- Edge cases — reviewing content in the ambiguous 35-55 Slop Score range (routing sketched after this list)
- Calibration — regular reviews to ensure scoring drift is detected
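
A minimal sketch of that routing rule, using the 35-55 band named above; the function name and the two automatic labels are hypothetical placeholders.

```python
def route_for_review(slop_score: float) -> str:
    """Send the ambiguous 35-55 Slop Score band to human review."""
    if 35 <= slop_score <= 55:
        return "human_review"             # ambiguous: humans decide
    if slop_score < 35:
        return "auto_label_likely_human"  # placeholder label
    return "auto_label_likely_ai"         # placeholder label
```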
### Reviewer Training
All human reviewers complete:
- 8 hours of initial training on AI content identification
- Monthly calibration exercises
- Regular inter-rater reliability assessments (see the kappa sketch after this list)
- Specialization tracks for platform-specific content types
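
Inter-rater reliability between two reviewers is commonly summarized with Cohen's kappa. Whether our assessments use kappa specifically is an assumption here; the computation itself is standard.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance. Undefined when both
    reviewers always give the same single label (p_e == 1).
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two reviewers agreeing on 4 of 5 binary labels:
# cohens_kappa(["ai", "ai", "human", "ai", "human"],
#              ["ai", "ai", "human", "human", "human"])  # ≈ 0.62
```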
## Dataset Validation
Published datasets undergo:
- Automated consistency checks
- Human sample validation (minimum 5% of dataset; sampling sketched after this list)
- Cross-method verification (multiple detection approaches on same content)
- External expert review for major publications
- Versioning and change documentation
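
A sketch of drawing the minimum 5% human-validation slice; the `draw_validation_sample` name and fixed-seed convention are assumptions, chosen so the same slice can be reproduced across dataset versions.

```python
import math
import random

def draw_validation_sample(dataset, min_fraction=0.05, seed=0):
    """Select at least 5% of records for human validation.

    A fixed seed makes the slice reproducible, supporting the
    versioning and change-documentation requirement above.
    """
    k = max(1, math.ceil(min_fraction * len(dataset)))
    return random.Random(seed).sample(dataset, k)
```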