this post was submitted on 02 Feb 2024
95 points (97.0% liked)
[Outdated, please look at pinned post] Casual Conversation
Last time I looked into downloading Wikipedia it said it was 50 GB for English text and 100 GB with images. How'd you get it for half the space?
It's only the raw text in JSON Lines files: no media and no markup. I think I downloaded a compressed dump, then used wikiextractor to extract the text.
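For anyone curious what working with that kind of output looks like, here's a rough sketch, not my exact setup. It assumes the extracted files sit under some directory with one JSON object per line, and that each object has `title` and `text` fields; the function names and the directory path are made up for illustration.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_articles(root: Path) -> Iterator[dict]:
    """Yield one dict per article from every file under root.

    Assumes JSON Lines: each non-empty line is one JSON object.
    """
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)


def total_text_bytes(root: Path) -> int:
    """Sum the UTF-8 size of every article's text (assumed 'text' field)."""
    return sum(len(a.get("text", "").encode("utf-8")) for a in iter_articles(root))


if __name__ == "__main__":
    # Hypothetical output directory; point this at wherever the text was extracted.
    root = Path("extracted")
    if root.is_dir():
        print(f"{total_text_bytes(root) / 1e9:.2f} GB of raw article text")
```

Streaming line by line means the whole dump never has to fit in memory, which matters at this size.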
Does it include each article’s edit history, talk page, etc?
The dictionaries for Aard2 are 21 GB in compressed .slob format (text only).
No idea what that means. But thank you for adding more info.
OK, some supporting info: Aard2 is an offline Wikipedia app that uses small compressed data files in .slob format.
Slob compression is best visualized as putting a sleeping bag into a stuff sack, except it’s all your possessions and you’re stuffing them into an old Chevy Metro.