# Section 3: Data Sources and Collection
This section describes our data collection methodology, data sources, ethical constraints, and bias considerations.
## Data Collection Methodology
We collect data from multiple platforms using a mix of methods designed to be scalable, reproducible, and ethical.
## Primary Data Sources

### Platform Sampling
| Platform | Access Method | Sampling Strategy | Update Frequency |
|---|---|---|---|
| YouTube | Public API + web sampling | Stratified by category, geography, channel size | Continuous |
| TikTok | Web sampling + trend monitoring | Trending content + category sampling | Daily |
| Instagram | Public profile sampling | Hashtag-based + account-type stratified | Weekly |
| Facebook | Public page and group sampling | Category-based, geographic stratification | Weekly |
| X/Twitter | Public API + web sampling | Keyword-based + trending topic sampling | Continuous |
| Pinterest | Web sampling | Category boards + trending pins | Weekly |
| Reddit | Public API | Subreddit-based stratified sampling | Daily |
| Amazon | Product listing sampling | Category-based, review-focused | Weekly |
| Google Search | SERP sampling | Query-based across categories | Weekly |
### Website Crawling
For AI-generated blog networks and content farms:
- Seed list of known AI content domains
- Expansion through link analysis and ad network correlation (sketched after this list)
- New domain discovery through search result monitoring
- Domain registration pattern monitoring
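
As one illustration of the expansion step, the sketch below fetches a seed page, collects outbound domains, and records ad-network publisher IDs; Google AdSense `ca-pub-` tokens are one real-world signal of shared monetization. The `expand_seed` helper, user-agent string, and use of the `requests` library are assumptions for illustration, not a description of our production crawler.

```python
import re
from urllib.parse import urlparse

import requests  # assumed HTTP client; any would do

# One illustrative network signal: Google AdSense publisher IDs.
# Domains that embed the same "ca-pub-..." token are likely monetized
# by the same operator and can be grouped into a candidate network.
HREF_RE = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)
ADSENSE_RE = re.compile(r"ca-pub-\d{10,}")

def expand_seed(seed_url: str, timeout: float = 10.0):
    """Return (outbound_domains, ad_publisher_ids) for one seed page."""
    resp = requests.get(
        seed_url,
        timeout=timeout,
        headers={"User-Agent": "research-crawler (contact: research@example.org)"},
    )
    domains = {urlparse(link).netloc for link in HREF_RE.findall(resp.text)}
    domains.discard(urlparse(seed_url).netloc)  # keep outbound links only
    return domains, set(ADSENSE_RE.findall(resp.text))
```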
## Sampling Methodology

### Stratified Random Sampling
Content is sampled using stratified random sampling across the following dimensions (see the sketch after this list):
- Content categories (entertainment, education, health, finance, etc.)
- Geographic targeting (US, EU, Asia-Pacific, Latin America)
- Time periods (distributed across days of week and hours)
- Source types (established accounts, new accounts, trending content)
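
A minimal sketch of the per-stratum draw, assuming records can be keyed by a stratum tuple and that a fixed seed is acceptable for reproducibility; the `stratified_sample` name and its parameters are illustrative:

```python
import random
from collections import defaultdict

def stratified_sample(items, key, per_stratum, seed=0):
    """Draw up to `per_stratum` items from each stratum.

    `key` maps a record to its stratum, e.g. a tuple of
    (category, region, hour_bucket, source_type).
    """
    rng = random.Random(seed)  # fixed seed keeps samples reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

# e.g. stratified_sample(records,
#                        key=lambda r: (r["category"], r["region"]),
#                        per_stratum=100)
```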
### Sample Sizes
- Monthly platform samples: 10,000-50,000 content pieces per platform
- Search result samples: 5,000 queries per month
- Website samples: 5,000-10,000 domains per month
## API Limitations
| Platform | Key Limitations | Mitigation |
|---|---|---|
| YouTube | Rate limits, quota restrictions | Distributed sampling, priority queuing |
| TikTok | Limited official API access | Web-based sampling with ethical constraints |
| Instagram | Restricted API post-2024 | Public profile sampling only |
| X/Twitter | Paid API tiers required | Targeted sampling within budget constraints |
| Reddit | API rate limits | Efficient query design, caching |
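
To make the priority-queuing mitigation concrete, here is a minimal sketch of a quota-aware dispatcher. The `QuotaQueue` class and its quota model are assumptions; the idea mirrors APIs such as the YouTube Data API, which charges different quota units per endpoint.

```python
import heapq

class QuotaQueue:
    """Priority queue that stops dispatching once a daily quota is spent.

    Lower `priority` values are served first; `cost` is the quota units
    a request consumes (APIs often price endpoints differently).
    """
    def __init__(self, daily_quota: int):
        self.remaining = daily_quota
        self._heap = []
        self._seq = 0  # tie-breaker so heapq never compares payloads

    def put(self, priority: int, cost: int, request) -> None:
        heapq.heappush(self._heap, (priority, self._seq, cost, request))
        self._seq += 1

    def next_request(self):
        """Pop the best request if its cost still fits; else return None."""
        if not self._heap:
            return None
        _, _, cost, request = self._heap[0]
        if cost > self.remaining:
            return None  # defer until the quota resets
        heapq.heappop(self._heap)
        self.remaining -= cost
        return request
```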
## Ethical Constraints
Our data collection adheres to strict ethical guidelines:
- No personal data collection — we analyze content, not individuals
- Public content only — no access to private posts, messages, or protected accounts
- Respectful rate limiting — we do not overwhelm platform infrastructure (see the sketch after this list)
- No deception — we do not create fake accounts or misrepresent our identity
- Proportionality — we collect only what is necessary for our analysis
- Data minimization — we retain analyzed signals, not raw content, where possible
- Responsible disclosure — when we discover specific harmful content, we follow responsible disclosure practices
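
One concrete form of respectful rate limiting is honoring each site's robots.txt rules and declared Crawl-delay, as in this sketch built on Python's standard `urllib.robotparser`; the `polite_fetch_plan` name, user-agent string, and five-second default delay are illustrative assumptions.

```python
from urllib import robotparser

def polite_fetch_plan(base_url: str, paths, default_delay: float = 5.0):
    """Yield (url, wait_seconds) pairs that honor robots.txt.

    Disallowed paths are skipped; the site's declared Crawl-delay is
    used when present, otherwise a conservative default.
    """
    agent = "research-crawler"
    rp = robotparser.RobotFileParser(base_url.rstrip("/") + "/robots.txt")
    rp.read()
    delay = rp.crawl_delay(agent) or default_delay
    for path in paths:
        url = base_url.rstrip("/") + path
        if rp.can_fetch(agent, url):
            yield url, delay
```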
## Bias Considerations

### Known Biases
- English-language bias — our detection methods are most developed for English content
- Platform access bias — platforms with better API access are more thoroughly sampled
- Detection bias — our methods are better at detecting current-generation AI content than older or novel generation methods
- Category bias — some content categories are more testable than others
- Geographic bias — US and European content is overrepresented in our samples
### Mitigation Strategies
- Regular bias audits comparing sample demographics to platform demographics (audit sketch after this list)
- Investment in multilingual detection capabilities
- Cross-validation with external researchers and datasets
- Transparent reporting of known biases in all publications
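
A bias audit can start with a simple distribution comparison. The sketch below measures total variation distance between a sample's category mix and a reference distribution; the function name and choice of metric are assumptions, and the reference shares would come from platform-reported or external statistics.

```python
from collections import Counter

def total_variation(sample_labels, reference_shares):
    """Total variation distance between a sample's category mix and a
    reference distribution; 0 means an exact match, 1 maximal skew.

    `reference_shares` maps category -> expected share (summing to 1).
    """
    counts = Counter(sample_labels)
    n = sum(counts.values())
    cats = set(counts) | set(reference_shares)
    return 0.5 * sum(
        abs(counts.get(c, 0) / n - reference_shares.get(c, 0.0)) for c in cats
    )

# total_variation(["news", "news", "music"], {"news": 0.5, "music": 0.5})
# -> 0.5 * (|2/3 - 1/2| + |1/3 - 1/2|) ≈ 0.17
```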
## Human Labeling Workflows
Human reviewers are used for:
- Training data generation — labeled examples for detection model training
- Validation — spot-checking automated scoring accuracy
- Edge cases — reviewing content in the ambiguous 35-55 Slop Score range (routing sketched after this list)
- Calibration — regular reviews to ensure scoring drift is detected
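
A minimal sketch of that routing rule, using the 35-55 band named above; the function name and the two automatic labels are hypothetical placeholders.

```python
def route_for_review(slop_score: float) -> str:
    """Send the ambiguous 35-55 Slop Score band to human review."""
    if 35 <= slop_score <= 55:
        return "human_review"             # ambiguous: humans decide
    if slop_score < 35:
        return "auto_label_likely_human"  # placeholder label
    return "auto_label_likely_ai"         # placeholder label
```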
### Reviewer Training
All human reviewers complete:
- 8 hours of initial training on AI content identification
- Monthly calibration exercises
- Regular inter-rater reliability assessments (see the kappa sketch after this list)
- Specialization tracks for platform-specific content types
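
Inter-rater reliability between two reviewers is commonly summarized with Cohen's kappa. Whether our assessments use kappa specifically is an assumption here; the computation itself is standard.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance. Undefined when both
    reviewers always give the same single label (p_e == 1).
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two reviewers agreeing on 4 of 5 binary labels:
# cohens_kappa(["ai", "ai", "human", "ai", "human"],
#              ["ai", "ai", "human", "human", "human"])  # ≈ 0.62
```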
## Dataset Validation
Published datasets undergo:
- Automated consistency checks
- Human sample validation (minimum 5% of dataset; sampling sketched after this list)
- Cross-method verification (multiple detection approaches on same content)
- External expert review for major publications
- Versioning and change documentation
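
A sketch of drawing the minimum 5% human-validation slice; the `draw_validation_sample` name and fixed-seed convention are assumptions, chosen so the same slice can be reproduced across dataset versions.

```python
import math
import random

def draw_validation_sample(dataset, min_fraction=0.05, seed=0):
    """Select at least 5% of records for human validation.

    A fixed seed makes the slice reproducible, supporting the
    versioning and change-documentation requirement above.
    """
    k = max(1, math.ceil(min_fraction * len(dataset)))
    return random.Random(seed).sample(dataset, k)
```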