From Edgy Parrot, 1 Month ago, written in Plain Text.
# Dataset Card for AO3
### Dataset Summary
This dataset contains approximately 12.6 million publicly available works from Archive of Our Own (AO3). It was created by processing work IDs from 1 to 63,200,000 and keeping the works that are publicly accessible. Each entry contains the full text of the work along with metadata including title, author, fandom, relationships, characters, tags, warnings, and other classification information.

### Languages
The dataset is multilingual; English is predominant.

## Dataset Structure
### Data Files
The dataset is stored as Zstandard-compressed JSONL files (`.jsonl.zst`), each covering a block of 100,000 sequential work IDs. For example, `ao3_40500001-40600000.jsonl.zst` contains the works with IDs in that range.
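
To find the file that should hold a given work, the ID can be mapped to its 100,000-wide block. The sketch below assumes every file follows the pattern of the single example name above (`ao3_<start>-<end>.jsonl.zst`, no zero padding); `archive_name` and `ARCHIVE_SPAN` are hypothetical names used only for illustration.

```python
ARCHIVE_SPAN = 100_000  # IDs per file, per the description above


def archive_name(work_id: int) -> str:
    """Return the file name expected to hold `work_id` (naming pattern assumed)."""
    if work_id < 1:
        raise ValueError("work IDs start at 1")
    # Blocks run 1-100000, 100001-200000, ..., so shift by one before dividing.
    start = (work_id - 1) // ARCHIVE_SPAN * ARCHIVE_SPAN + 1
    end = start + ARCHIVE_SPAN - 1
    return f"ao3_{start}-{end}.jsonl.zst"


print(archive_name(40_512_345))  # ao3_40500001-40600000.jsonl.zst
```

Since only publicly accessible works are included, a file may hold far fewer than 100,000 records, and a given ID may be absent from its block.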

### Data Fields
This dataset includes the following fields:
- `id`: Unique identifier for the work (string)
- `title`: Title of the work (string)
- `metadata`: Dictionary containing:
  - `Archive Warning`: Content warnings for the work
  - `Category`: Relationship categories (e.g., F/M, M/M, F/F)
  - `Characters`: List of characters appearing in the work
  - `Fandom`: Fandom(s) the work belongs to
  - `Language`: Language of the work
  - `Rating`: Content rating (e.g., General Audiences, Teen And Up, Mature, Explicit)
  - `Relationship`: Specific relationship pairings featured
  - `Series`: Series the work belongs to, if applicable
  - `author`: Username of the creator
  - `chapters`: Chapter structure information (e.g., "1/1" for a completed one-shot)
  - `completed`: Whether the work is completed
  - `published`: Publication date
  - `words`: Word count
- `text`: Main content of the work (string)
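
A rough sketch of reading records with this layout follows. It assumes one JSON object per line and uses the third-party `zstandard` package for decompression. The helper names (`iter_records`, `open_archive`) are hypothetical, the sample record is made up to mirror the field list, and the exact types and values inside `metadata` (for instance, whether `Language` is spelled "English") are assumptions to verify against real files.

```python
import io
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed work per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def open_archive(path: str) -> io.TextIOWrapper:
    """Open a .jsonl.zst file as a text stream (needs `pip install zstandard`)."""
    import zstandard  # third-party; imported lazily so the rest runs without it

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")


# Made-up sample line shaped like the field list above; not real data.
sample_lines = [
    json.dumps({
        "id": "40500001",
        "title": "Example Title",
        "metadata": {"Language": "English", "Rating": "General Audiences",
                     "chapters": "1/1", "author": "example_user"},
        "text": "Example body text.",
    })
]

records = list(iter_records(sample_lines))
# Use .get() because not every metadata key is guaranteed to be present.
english = [r for r in records if r["metadata"].get("Language") == "English"]
print(english[0]["id"], english[0]["title"])
```

With a downloaded file, `iter_records(open_archive("ao3_40500001-40600000.jsonl.zst"))` would stream works one at a time instead of loading the whole archive into memory.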

### Data Splits
All examples are in a single split.

### Download
magnet:?xt=urn:btih:51c21fd1ae2896d6d5307347960da059236e6bd9&dn=%5BDataset%5D%20nyuuzyou%2Fao3%20%282025-04-25%29&tr=udp%3A%2F%2Ftracker.ducks.party%3A1984%2Fannounce&tr=udp%3A%2F%2Fexplodie.org%3A6969%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce