This study examines the ethical, legal, and copyright challenges of training generative AI on large-scale text datasets, using Books3 as a case study. The dataset, used to train foundation models such as GPT, BERT, Meta’s Llama, and StableLM, includes pirated works by nearly 200,000 authors from various countries, raising concerns about intellectual property rights, dataset integrity, and transparency. Our analysis of the initial 99 ISBNs reveals significant biases, including linguistic imbalance, genre skew, and temporal limitations. AI similarity analysis shows that AI-generated text closely mirrors human-written content, suggesting that AI reconstructs word patterns rather than copying verbatim. However, parts of the analysis also indicate that AI outputs frequently paraphrase existing content rather than generate wholly independent text, complicating copyright compliance and economic compensation for authors and publishers. These findings highlight the need for improved dataset transparency, ethical safeguards, and legal protections in generative AI training. We propose a scalable hybrid governance framework that integrates technical, design-based solutions with regulatory and institutional strategies to ensure responsible AI development. This approach advances AI governance by addressing dataset integrity, source attribution, and the evolving ethical, legal, and economic challenges of an increasingly AI-driven society.
