this post was submitted on 02 Feb 2024
95 points (97.0% liked)

[Outdated, please look at pinned post] Casual Conversation


Share a story, ask a question, or start a conversation about (almost) anything you desire. Maybe you'll make some friends in the process.



Well, not quite, but close. I'm holding a hard disk that has ALL of Wikipedia's text in 10 different languages.

Yes, you can download all of Wikipedia, and yes, it easily fits on a hard drive. Isn't that amazing? Text is incredibly dense compared to images and video: around 22 GiB for English Wikipedia alone and 56 GiB for the 10 languages I downloaded.

I also have all of Wiktionary on the same hard drive. It's around 16.4 GiB.

top 14 comments
[–] henfredemars@infosec.pub 27 points 9 months ago (1 children)

It also connects you to a huge swath of humanity and the editors that brought that content to you.

[–] droning_in_my_ears@lemmy.world 19 points 9 months ago (1 children)

Yeah it's pretty incredible. Wikimedia is the kind of project that almost feels like a small glimpse into a better world. What the internet could have been. It's got some problems of course but it's still a huge success.

[–] intensely_human@lemm.ee 1 points 7 months ago

Uh, Wikipedia is what the internet is.

Wikipedia’s not a glimpse of a better world, it’s a glimpse of our current, existing world. Because Wikipedia exists.

It’s not like that hard drive came through a portal from another universe.

[–] penquin@lemm.ee 12 points 9 months ago

You're going to be the savior of humanity after the apocalypse

[–] trolololol@lemmy.world 10 points 9 months ago

Not the sum. The summary.

[–] Masterblaster@kbin.social 7 points 9 months ago

There's still so much valuable academic information that never sees the light of day, or gets erased as the internet serpent eats its own tail.

[–] WarmSoda@lemm.ee 6 points 9 months ago (2 children)

Last time I looked into downloading Wikipedia, it said it was 50 GB for English text and 100 GB with images. How'd you get it in half the space?

[–] droning_in_my_ears@lemmy.world 9 points 9 months ago (1 children)

It's only the raw text in JSON Lines files. No media and no markup. I think I downloaded a compressed dump, then used wikiextractor to extract the text.
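For anyone curious what working with that kind of output looks like, here's a rough Python sketch of reading JSON Lines records (one JSON object per line) and adding up how much raw text they hold. The field names `title` and `text`, the two-record sample, and the helper names `iter_articles` and `text_size_bytes` are assumptions for illustration, not necessarily what the actual dump contains.

```python
import io
import json
from typing import Iterable, Iterator


def iter_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON Lines record."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


def text_size_bytes(articles: Iterable[dict]) -> int:
    """Total UTF-8 size of the 'text' field across all articles."""
    return sum(len(a.get("text", "").encode("utf-8")) for a in articles)


# Hypothetical sample standing in for one extracted file.
sample = io.StringIO(
    '{"id": "1", "title": "Alpha", "text": "First article."}\n'
    '\n'
    '{"id": "2", "title": "Beta", "text": "Second one."}\n'
)

articles = list(iter_articles(sample))
titles = [a["title"] for a in articles]
total = text_size_bytes(articles)
print(titles, total)
```

To run it against real files, you'd pass an open file handle instead of the `io.StringIO` sample, since iterating a text file also yields one line at a time.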

[–] AbouBenAdhem@lemmy.world 2 points 9 months ago

Does it include each article’s edit history, talk page, etc.?

[–] ace_garp@lemmy.world 2 points 9 months ago* (last edited 9 months ago) (1 children)

The dictionaries for Aard2 are 21 GB in compressed .slob format (text only).

[–] WarmSoda@lemm.ee 2 points 9 months ago (2 children)

No idea what that means. But thank you for adding more info.

[–] ace_garp@lemmy.world 3 points 9 months ago

OK, yes, some supporting info: Aard2 is an offline Wikipedia app that uses small compressed data files in .slob format.

[–] intensely_human@lemm.ee 1 points 7 months ago

Slob compression is best visualized as putting a sleeping bag into a stuff sack, except it’s all your possessions and you’re stuffing them into an old Chevy Metro.

[–] user1234@lemmynsfw.com 5 points 9 months ago