Discussion about this post

Andrew Searls:

The core point about epistemic concentration lands: when a billion people are getting framed answers from a handful of companies, the "who decides" question matters enormously.

One nuance I'd add from the practitioner side: the major labs aren't training directly on raw internet text. There are extensive curation and filtering pipelines between scraping and training: deduplication, toxicity classifiers, quality weighting. That doesn't dissolve the concern, but it does shift where the concern should sit. The question isn't really "did the model read every toxic Reddit thread"; it's who designed the filter, what they were optimizing for, and who's auditing the decisions they made. Which is arguably harder to see and harder to challenge.
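For concreteness, here's a toy sketch of the kind of curation pass I mean. The `toxicity_score` and `quality_score` callables, the thresholds, and the hash-based dedup are all illustrative placeholders, not any lab's actual pipeline:

```python
import hashlib

def curate(docs, toxicity_score, quality_score,
           tox_threshold=0.8, quality_floor=0.3):
    """Filter and weight raw documents before training (toy version)."""
    seen = set()
    kept = []
    for doc in docs:
        # Exact dedup via content hash; real pipelines also use fuzzy
        # methods (e.g. MinHash) to catch near-duplicates.
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h in seen:
            continue
        seen.add(h)
        # Classifier-based filtering: both classifiers and both
        # thresholds encode editorial choices someone had to make.
        if toxicity_score(doc) > tox_threshold:
            continue
        q = quality_score(doc)
        if q < quality_floor:
            continue
        # Quality weighting: higher-scoring documents get sampled more
        # often during training, rather than merely being included.
        kept.append({"text": doc, "sample_weight": q})
    return kept
```

Every default in that signature is a policy decision made before training starts, which is exactly the point: whoever writes the filter decides what the model never sees.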

The deeper constraint is that there may not be enough high-quality human-generated text to train at the scale these models require, even with aggressive filtering. Until models can learn from real-world interaction rather than static corpora, "just use better data" is less of an answer than it sounds.
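To put rough numbers on the scale problem: a back-of-envelope using the ~20-tokens-per-parameter heuristic from the Chinchilla scaling result (Hoffmann et al., 2022). The parameter count and the corpus-stock figure below are assumptions for illustration, not measured values:

```python
# Back-of-envelope on the data wall. The ~20 tokens-per-parameter ratio
# comes from Hoffmann et al. (2022); the parameter count and corpus
# stock are illustrative assumptions.
PARAMS = 1e12                # hypothetical 1T-parameter model
TOKENS_NEEDED = 20 * PARAMS  # Chinchilla-optimal training budget
CURATED_STOCK = 1e13         # assumed supply of high-quality text, in tokens

print(f"tokens needed:  {TOKENS_NEEDED:.0e}")  # 2e+13
print(f"tokens on hand: {CURATED_STOCK:.0e}")  # 1e+13
print(f"shortfall:      {TOKENS_NEEDED / CURATED_STOCK:.0f}x")
```

Under those assumptions the budget already outruns the supply, and filtering more aggressively only shrinks the supply further.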

Curious whether you see the transparency question as primarily about training-data disclosure or about the filtering and alignment decisions that happen downstream; those seem like very different policy levers.

Dominika Michalska:

The knowledge monopoly idea becomes sharper when you see how AI shapes not just access to information, but what gets surfaced, prioritized, and acted on. The real shift is from controlling knowledge to structuring decisions around it.
