The quality of search engines has gone down so much for technical questions.

I'm looking for a way to index sites like Stack Exchange, Reddit, and Quora, as well as research papers. Would it be possible to do this locally with metadata?

top 17 comments
[-] ptz@dubvee.org 52 points 1 year ago* (last edited 1 year ago)

I mean, you can easily self-host a meta-search engine like Searx, SearXNG, Whoogle, etc. I run SearXNG, and it sends your queries to multiple engines and aggregates the results for you.

To host your own search engine, you'd need to crawl and index every site. It's certainly doable, but it would take a lot of time and effort.
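To give a concrete sense of the "index" half of that work, here's a toy inverted index in plain Python (the crawling half, ranking, stemming, and persistence are all left out; the documents are made up for illustration):

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_index(docs):
    """docs: {doc_id: text}. Returns term -> set of doc_ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND-search: return the doc_ids containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

docs = {
    "q1": "How do I self-host a search engine?",
    "q2": "Best meta search engine to self host",
    "q3": "Kiwix lets you read Wikipedia offline",
}
idx = build_index(docs)
print(search(idx, "search engine"))  # matches q1 and q2
```

Everything beyond this sketch, like ranking results and keeping the index fresh, is where the real time and effort goes.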

[-] PlexSheep@feddit.de 18 points 1 year ago

I agree. Self-hosting a true search engine is way too much work and infeasible for individuals. Meta-search engines, however, are very feasible and a great option.

[-] Mubelotix@jlai.lu 23 points 1 year ago* (last edited 1 year ago)

I'm glad you asked! I have been working on a peer-to-peer search engine. The goal is to index websites that are on IPFS. It's an MVP, but you can already try the demo and run your own node to make your data searchable by the whole network (you'd need to generate HTML files from your data and put them on IPFS first).

[-] cll7793@lemmy.world 2 points 1 year ago

Thank you so much for your answer!

[-] meldrik@lemmy.wtf 8 points 1 year ago
[-] u202307011927@feddit.de 5 points 1 year ago

I wish so much that the installation of this were easier. It's such an amazing concept and idea.

[-] anzo@programming.dev 8 points 1 year ago

https://docs.searxng.org/ is a meta-search engine, but a good one 👍

[-] Anafroj@sh.itjust.works 6 points 1 year ago* (last edited 1 year ago)

StackExchange dumps are available for Kiwix, the project that lets you use a local dump of Wikipedia. You can find all the available dumps there, including the StackExchange ones. You can even build your own search engine with libraries that read those ZIM files (the dumps), if you want.

[-] jcolag@lemmy.sdf.org 6 points 1 year ago* (last edited 1 year ago)

In addition to YaCy and the varieties of Searx (both of which perform better for me than any of the commercial search engines), it's not even out of the question to do this yourself, if you're willing to start with the most recent Common Crawl dump and do some spidering in between releases. I don't recommend it, unless you want to learn for yourself why search engines often give such miserable results, but it's possible.

However, that's the issue here. Can you self-host a search engine? Sure, if you're willing to maintain the storage to back it. That depends on how deep your pockets go...
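The "spidering in between releases" part above boils down to fetching pages and extracting their links. A minimal sketch of the link-extraction core, using only the Python standard library (fetching, robots.txt politeness, and Common Crawl parsing are all omitted; the URLs are made up):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute hrefs from <a> tags -- the heart of any spider."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

page = '<a href="/wiki/Search">search</a> <a href="https://example.org/x">x</a>'
print(extract_links("https://example.com", page))
```

A real crawler would feed these links back into a frontier queue, deduplicate them, and rate-limit per host, which is where the misery starts.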

[-] radiated@lemm.ee 6 points 1 year ago

Well, Google does self-host their own search engine.

[-] Untitled_Pribor@kbin.social 8 points 1 year ago* (last edited 1 year ago)

"Hey @SundarPichai , how do I set up one of those googley thingies of yours?"

[-] PainInTheAES@lemmy.world 10 points 1 year ago
[-] AES@lemmy.ronsmans.eu 4 points 1 year ago

Great username.

[-] Vendetta9076@sh.itjust.works 7 points 1 year ago

I know you're getting downvoted, but you're also technically correct. Which I appreciate.

[-] BrightCandle@lemmy.world 5 points 1 year ago* (last edited 1 year ago)

Even the main search engines don't index the entire internet these days, and their databases are already truly massive. Writing a basic web crawler to produce a search index isn't all that hard (I used to give it as a programming exercise for applicants), but dealing with the volume of data on the entire internet, and storing it to produce a worthwhile search engine, just isn't feasible on home hardware; it would be terabytes at least. It wouldn't be just a little worse, it would be dramatically worse, unless you put substantial resources into it, including enormous amounts of network bandwidth that would have your ISP questioning your "unlimited 1 Gbps fibre" contract. It would probably take years to get decent, and it would always be many months out of date at best.

It doesn't seem practical to try to self-host, given the need to download and index every single page of the internet; it's a truly massive-scale problem.
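A back-of-envelope calculation supports the "terabytes at least" point. All the numbers below are rough assumptions chosen only to show the order of magnitude, not measurements:

```python
# Illustrative only; page count, page size, and overhead are assumptions.
pages = 1_000_000_000      # assume indexing "only" a billion pages
avg_page_kb = 75           # assume ~75 KB of stored HTML per page
index_overhead = 0.25      # assume the inverted index adds ~25% on top

raw_tb = pages * avg_page_kb / 1024**3   # KB -> TB
total_tb = raw_tb * (1 + index_overhead)
print(f"raw: {raw_tb:.0f} TB, with index: {total_tb:.0f} TB")
```

Even this deliberately small corpus lands around 70-90 TB before you account for re-crawls, and a billion pages is a tiny slice of the web.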

[-] TheOhNoNotAgain@lemmy.world 3 points 1 year ago

Solr is a great search engine. It won't help you with the crawling, but if you manage to get the data into Solr, you've come far.
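For anyone curious what "getting the data into Solr" looks like: Solr's update handler accepts a JSON array of documents over HTTP. The sketch below only builds the payload (the documents and the `title_t`/`body_t` field names are made-up examples; a live Solr core on the default port is assumed for the commented request):

```python
import json

# Hypothetical documents; field names here are illustrative, not a required schema.
docs = [
    {"id": "1", "title_t": "Self-hosting a search engine",
     "body_t": "crawl, index, serve"},
    {"id": "2", "title_t": "Meta search with SearXNG",
     "body_t": "aggregate upstream engines"},
]

payload = json.dumps(docs)
# You would POST this payload with Content-Type: application/json to, e.g.:
#   http://localhost:8983/solr/<core>/update?commit=true
print(payload)
```

After a commit, the documents become searchable through Solr's `/select` handler.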

[-] elscallr@lemmy.world 1 points 1 year ago

Solr

triggered

this post was submitted on 31 Jul 2023
74 points (97.4% liked)

Selfhosted