this post was submitted on 22 Aug 2023
Fediverse
A community dedicated to fediverse news and discussion.
Fediverse is a portmanteau of "federation" and "universe".
Getting started on Fediverse;
- What is the fediverse?
- Fediverse Platforms
- How to run your own community
founded 5 years ago
It’s not
It's really odd how many people around here think the server crashes are perfectly normal and are glad to see newcomers driven away.
They are perfectly normal. Unlike giant corporations, the people who run Lemmy don't have the money to support a fleet of failover servers that take over when the main server goes offline. That's basically the only reason you don't see lots of downtime from major corporations: investment in redundancy, so when something breaks, a perfect copy takes over. Server crashes happen all the time for major corporations, you just never see them due to investment in redundancy.
That's the difference between a community and a company. One takes actual investment from the community as a whole, and the other ruthlessly exploits for profit.
That has nothing to do with the issue I'm talking about. Every server with this amount of data in it would fail. It doesn't matter if you had 100 servers on standby.
The Rust logic for database access and the PostgreSQL logic in Lemmy are unoptimized, and there is a serious lack of Diesel programming skills. The site_aggregates table had a mistake where 1500 rows were updated for every single new comment and post, and it only got noticed when lemmy.ca was crashing so hard that they made a complete copy of the data and studied what was going on.
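To make the site_aggregates problem concrete, here is a minimal sketch using Python's built-in sqlite3 instead of PostgreSQL. The table layout, the site_id column, and the idea that the bug amounted to a missing WHERE clause are assumptions for illustration, not Lemmy's actual schema or trigger code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical layout: one aggregate row per known site/instance.
conn.execute(
    "CREATE TABLE site_aggregates ("
    " site_id INTEGER PRIMARY KEY,"
    " comments INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO site_aggregates (site_id) VALUES (?)",
    [(i,) for i in range(1, 1501)],
)

# Unscoped update: every row is rewritten for one new comment.
cur = conn.execute("UPDATE site_aggregates SET comments = comments + 1")
rows_touched_unscoped = cur.rowcount

# Scoped update: only the local site's row is rewritten.
cur = conn.execute(
    "UPDATE site_aggregates SET comments = comments + 1 WHERE site_id = ?",
    (1,),
)
rows_touched_scoped = cur.rowcount

print(rows_touched_unscoped, rows_touched_scoped)
```

With 1500 rows in the table, the first statement rewrites all of them and the second rewrites one, which is the difference that matters when it runs on every comment and post.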
Throwing hardware at it, as you describe, has been the other thing... massive numbers of CPU cores. What's needed is to learn what Reddit did before 2010 with PostgreSQL.... as Reddit also used PostgreSQL (and is open source).
Downtime because you avoid using Redis or Memcached caching at all costs in your project isn't common to see in major corporations. But Lemmy avoids caching any data from PostgreSQL at all costs. Been that way for several years. May 17, 2010: "Lesson 5: Memcache;"
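For anyone unsure what caching would buy here, this is a rough read-through cache sketch in Python. An in-process dict stands in for Memcached or Redis, and TTLCache, load_front_page, and the 30-second lifetime are all made up for illustration. Nothing here is Lemmy code.

```python
import time


class TTLCache:
    """Tiny read-through cache; a dict stands in for Memcached/Redis."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: the database is not touched
        value = loader()  # cache miss: run the expensive query once
        self._store[key] = (now + self.ttl, value)
        return value


db_calls = {"n": 0}


def load_front_page():
    # Hypothetical stand-in for the expensive post-listing query.
    db_calls["n"] += 1
    return ["post 3", "post 2", "post 1"]


cache = TTLCache(ttl_seconds=30)
for _ in range(100):  # 100 page refreshes inside the TTL window
    page = cache.get_or_load("front_page", load_front_page)

print(db_calls["n"])
```

The trade-off is staleness: readers may see a listing up to 30 seconds old, in exchange for the database answering one query instead of a hundred.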
As I said in my very first comment, server crashing as a way to scale is a very interesting approach.
EDIT: Freudian slip, "memecached" instead of Memcached
That's a much more... coherent explanation than your original one, friend. I wouldn't have argued this point if you had started here.
If anyone bothered to actually look at the SQL SELECT that Lemmy uses to list posts every time you hit refresh, it would be blindingly obvious how convoluted it is. Yet the community does not talk about the programming issues and instead keeps raising money for 64-core hardware upgrades, without recognizing just how tiny Lemmy's database really is and that 57K users is not a large number at all!
I mentioned "ORM" right in my first comment.
Damn, so many joins :/
How could this monster be optimized though?
The first optimization is to prune the field list instead of fetching every field. For example, it gets the public key and private key for every community and user account, then does nothing with them. That's just pushing data between Rust and PostgreSQL for no reason. That kind of thing is pretty obvious: just look at the huge number of things listed after "SELECT".
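A small sketch of the column-pruning point, again with Python's sqlite3 rather than PostgreSQL and Diesel. The community table, its four columns, and the key sizes are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: two small columns and two bulky key columns.
conn.execute(
    "CREATE TABLE community ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT,"
    " public_key TEXT,"
    " private_key TEXT)"
)
conn.executemany(
    "INSERT INTO community (name, public_key, private_key) VALUES (?, ?, ?)",
    [(f"community_{i}", "P" * 400, "S" * 1700) for i in range(50)],
)


def payload_chars(rows):
    """Rough size of a result set: total characters across all values."""
    return sum(len(str(value)) for row in rows for value in row)


# Fetch everything, including keys the listing never uses.
wide = conn.execute("SELECT * FROM community").fetchall()

# Fetch only what a listing page would display.
narrow = conn.execute("SELECT id, name FROM community").fetchall()

print(payload_chars(wide), payload_chars(narrow))
```

Both queries return the same 50 rows, but the narrow one moves a small fraction of the data between the database and the application.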
The whole approach is what I recently described as: make a JOIN fusion implosion bomb, then wait for null columns to fall out
There are short-term and long-term solutions. Right now there is already a new feature that will add one more JOIN that is pending merge.... "instance blocking" by each single user.
Based on the server overloads and resulting crashes, I think some obvious solutions would be to remove post_aggregates table entirely and just throw more columns on the post table... I've seen people do stuff like that. But really you have to have a concept of core foundation.
To me the core foundation of Lemmy data is that people want fresh meat, when world events get into a frenzy, they want to F5 and get the LATEST post and the LATEST comments. Data should have a big wall between the most recent 5 days and everything else. It's the heart of the beast of human events and a platform like this.
From that perspective, where fresh posts and fresh comments mean everything, you can optimize by doing an inner SELECT before any JOIN... or partition the database table into recent and non-recent, or run some out-of-band steps to prepare recent data before this SELECT even comes up from an API call... and not let PostgreSQL do so much heavy lifting on each page refresh.
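The "inner SELECT before any JOIN" idea can be sketched like this, using Python's sqlite3 and a made-up two-table schema (post and person). It is not Lemmy's real query, which joins many more tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE post (
        id INTEGER PRIMARY KEY,
        creator_id INTEGER NOT NULL REFERENCES person(id),
        title TEXT NOT NULL,
        published INTEGER NOT NULL
    );
    CREATE INDEX idx_post_published ON post (published DESC);
    """
)
conn.executemany(
    "INSERT INTO person (id, name) VALUES (?, ?)",
    [(i, f"user_{i}") for i in range(1, 11)],
)
conn.executemany(
    "INSERT INTO post (creator_id, title, published) VALUES (?, ?, ?)",
    [((i % 10) + 1, f"post {i}", i) for i in range(1, 1001)],
)

# Join every post to its creator, then sort and cut to 20.
join_then_limit = conn.execute(
    """
    SELECT post.id, post.title, person.name
    FROM post
    JOIN person ON person.id = post.creator_id
    ORDER BY post.published DESC
    LIMIT 20
    """
).fetchall()

# Pick the 20 newest posts first, then join only those rows.
limit_then_join = conn.execute(
    """
    SELECT recent.id, recent.title, person.name
    FROM (
        SELECT id, creator_id, title, published
        FROM post
        ORDER BY published DESC
        LIMIT 20
    ) AS recent
    JOIN person ON person.id = recent.creator_id
    ORDER BY recent.published DESC
    """
).fetchall()

print(join_then_limit == limit_then_join)
```

The two forms only return the same rows here because every post has a creator and nothing is filtered out by the join. Once joins can drop rows (blocked users, removed communities), the inner SELECT would need to apply those filters itself or fetch extra rows as headroom.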
If I remember, I'm gonna look into that tomorrow when I'm not on a phone screen. Not that I could contribute anything, but this seems like a good opportunity to learn some advanced stuff. Thanks for your answer!