Calls it "Deepsneak", failing to make it clear that the reason people love Deepseek is that you can download and it run it securely on any of your own private devices or servers - unlike most of the competing SOTA AIs.

I can't speak for Proton, but the last couple weeks are showing some very clear biases coming out.

you are viewing a single comment's thread
view the rest of the comments

[–] lily33@lemm.ee 34 points 4 months ago (1 children)

To be fair, most people can't actually self-host Deepseek, but there already are other providers offering API access to it.

[–] halcyoncmdr@lemmy.world 31 points 4 months ago (3 children)

There are plenty of step-by-step guides to run Deepseek locally. Hell, someone even had it running on a Raspberry Pi. It seems to be much more efficient than other current alternatives.

That's about as openly available to self host as you can get without a 1-button installer.

[–] tekato@lemmy.world 17 points 4 months ago (2 children)

You can run an imitation of the DeepSeek R1 model, but not the actual one unless you literally buy a dozen of whatever NVIDIA’s top GPU is at the moment.

[–] lily33@lemm.ee 10 points 4 months ago

A server grade CPU with a lot of RAM and memory bandwidth would work reasonable well, and cost "only" ~$10k rather than 100k+...

[–] alcoholicorn@lemmy.ml 1 points 4 months ago (1 children)

I saw posts about people running it well enough for testing purposes on an NVMe.

[–] Aria@lemmygrad.ml 1 points 4 months ago (1 children)

Can you link that post?

[–] alcoholicorn@lemmy.ml 1 points 4 months ago (2 children)

https://old.reddit.com/r/LocalLLaMA/comments/1idseqb/deepseek_r1_671b_over_2_toksec_without_gpu_on/

[–] Aria@lemmygrad.ml 2 points 4 months ago

Thanks!

[–] Aria@lemmygrad.ml 2 points 4 months ago

That's cool! I'm really interested to know how many tokens per second you can get with a really good U.2. My gut is that it won't actually be better than the 24VRAM+96RAM cache setup this user already tested with though.

[–] Dyf_Tfh@lemmy.sdf.org 6 points 4 months ago* (last edited 4 months ago) (2 children)

Those are not deepseek R1. They are unrelated models like llama3 from Meta or Qwen from Alibaba "distilled" by deepseek.

This is a common method to smarten a smaller model from a larger one.

Ollama should have never labelled them deepseek:8B/32B. Way too many people misunderstood that.

[–] yogthos@lemmy.ml 5 points 4 months ago (2 children)

I'm running deepseek-r1:14b-qwen-distill-fp16 locally and it produces really good results I find. Like yeah it's a reduced version of the online one, but it's still far better than anything else I've tried running locally.

[–] stink@lemmygrad.ml 3 points 4 months ago

Its so cute when chinese is sprinkled in randomly hehe my little bilingual robot in my pc

[–] morrowind@lemmy.ml 1 points 4 months ago* (last edited 4 months ago) (1 children)

Have you compared it with the regular qwen? It was also very good

[–] yogthos@lemmy.ml 1 points 4 months ago (1 children)

The main difference is speed and memory usage. Qwen is a full-sized, high-parameter model while qwen-distill is a smaller model created using knowledge distillation to mimic qwen's outputs. If you have the resources to run qwen fast then I'd just go with that.

[–] morrowind@lemmy.ml 1 points 4 months ago (1 children)

I think you're confusing the two. I'm talking about the regular qwen before it was finetuned by deep seek, not the regular deepseek

[–] yogthos@lemmy.ml 1 points 4 months ago (1 children)

I haven't actually used that one, but doesn't the same point apply here too? The whole point of DeepSeek is in distillation that makes runtime requirements smaller.

[–] morrowind@lemmy.ml 1 points 4 months ago (1 children)

No cause I was already running regular (non-deepseek) qwen 14B, admittedly a heavily quantized and uncensored version, so I was just curious if it would be any better

[–] yogthos@lemmy.ml 1 points 4 months ago

The big difference with R1 at least is that it is trained using reinforcement learning, and it gives you an explanation of how it arrived at a solution. I find the thinking bit is really fascinating cause it really reads like a train of thought. Here's an example of what it outputs when I ask it to write a react component with a text input and a button:

First, I'll start by setting up the basic structure of a React component. I'll import React and useState since we'll need state management for the input value.

Next, I'll create a functional component called TextInputButton. Inside this component, I'll use the useState hook to manage the text input's value. The initial state can be an empty string.

Then, I'll define a handler function, handleChange, that updates the state whenever the user types into the input. This function will take the event object and set the new value using setTextInputValue.

I also need a function to handle the button click. For now, I'll use handleClick with an alert message as a placeholder. The actual functionality can be added later based on requirements.

In the return statement, I'll structure the JSX. I'll include an input element of type text, binding its value to the state variable and attaching the handleChange function to the onChange event. This ensures that the state always reflects the current input value.

Below the input, I'll add a button element. On clicking this button, it will trigger handleClick. I'll also give it a class for styling purposes, like 'button' which can be styled with CSS if needed.

Finally, I'll export the component so it can be used elsewhere in the application. I'll make sure to include comments explaining each part for clarity.

Here's a simple React component with a text input and a button:

import React from 'react';
import { useState } from 'react';

function TextInputButton() {
  const [text, setText] = useState('');

  const handleChange = (event) => {
    setText(event.target.value);
  };

  const handleClick = () => {
    // Add your click handler logic here
    alert('Button clicked!');
  };

  return (
    <div>
      <input
        type="text"
        value={text}
        onChange={handleChange}
        placeholder="Enter text..."
      />
      <button onClick={handleClick}>Submit</button>
    </div>
  );
}

export default TextInputButton;

This component:

Uses useState to manage the input value
Has a handleChange function that updates the state when the user types
Has a handleClick function that currently just shows an alert (you can add your own logic here)
Renders an input field and a submit button

You can use this component in your application like:

<TextInputButton />

The input value is controlled through the text state variable, and you can access it using text. When the button is clicked, the handleClick function will execute whatever logic you need.

[–] pcalau12i@lemmygrad.ml 5 points 4 months ago* (last edited 4 months ago) (1 children)

The 1.5B/7B/8B/13B/32B/70B models are all officially DeepSeek R1 models, that is what DeepSeek themselves refer to those models as. It is DeepSeek themselves who produced those models and released them to the public and gave them their names. And their names are correct, it is just factually false to say they are not DeepSeek R1 models. They are.

The "R1" in the name means "reasoning version one" because it does not just spit out an answer but reasons through it with an internal monologue. For example, here is a simple query I asked DeepSeek R1 13B:

Me: can all the planets in the solar system fit between the earth and the moon?

DeepSeek: Yes, all eight planets could theoretically be lined up along the line connecting Earth and the Moon without overlapping. The combined length of their diameters (approximately 379,011 km) is slightly less than the average Earth-Moon distance (about 384,400 km), allowing them to fit if placed consecutively with no required spacing.

However, on top of its answer, I can expand an option to see its internal monologue it went through before generating the answer, which you can find the internal monologue here because it's too long to paste.

What makes these consumer-oriented models different is that that rather than being trained on raw data, they are trained on synthetic data from pre-existing models. That's what the "Qwen" or "Llama" parts mean in the name. The 7B model is trained on synthetic data produced by Qwen, so it is effectively a compressed version of Qen. However, neither Qwen nor Llama can "reason," they do not have an internal monologue.

This is why it is just incorrect to claim that something like DeepSeek R1 7B Qwen Distill has no relevance to DeepSeek R1 but is just a Qwen model. If it's supposedly a Qwen model, why is it that it can do something that Qwen cannot do but only DeepSeek R1 can? It's because, again, it is a DeepSeek R1 model, they add the R1 reasoning to it during the distillation process as part of its training. They basically use synthetic data generated from DeepSeek R1 to fine-tune readjust its parameters so it adopts a similar reasoning style. It is objectively a new model because it performs better on reasoning tasks than just a normal Qwen model. It cannot be considered solely a Qwen model nor an R1 model because its parameters contain information from both.

[–] lily33@lemm.ee 3 points 4 months ago* (last edited 4 months ago) (1 children)

What makes these consumer-oriented models different is that that rather than being trained on raw data, they are trained on synthetic data from pre-existing models. That’s what the “Qwen” or “Llama” parts mean in the name. The 7B model is trained on synthetic data produced by Qwen, so it is effectively a compressed version of Qen. However, neither Qwen nor Llama can “reason,” they do not have an internal monologue.

You got that backwards. They're other models - qwen or llama - fine-tuned on synthetic data generated by Deepseek-R1. Specifically, reasoning data, so that they can learn some of its reasoning ability.

But the base model - and so the base capability there - is that of the corresponding qwen or llama model. Calling them "Deepseek-R1-something" doesn't change what they fundamentally are, it's just marketing.

[–] pcalau12i@lemmygrad.ml 0 points 4 months ago* (last edited 4 months ago)

There is no "fundamentally" here, you are referring to some abstraction that doesn't exist. The models are modified during the fine-tuning process, and the process trains them to learn to adopt DeepSeek R1's reasoning technique. You are acting like there is some "essence" underlying the model which is the same between the original Qwen and this model. There isn't. It is a hybrid and its own thing. There is no such thing as "base capability," the model is not two separate pieces that can be judged independently. You can only evaluate the model as a whole. Your comment is just incredibly bizarre to respond to because you are referring to non-existent abstractions and not actually speaking of anything concretely real.

The model is neither Qwen nor DeepSeek R1, it is DeepSeek R1 Qwen Distill as the name says. it would be like saying it's false advertising to say a mule is a hybrid of a donkey and a horse because the "base capabilities" is a donkey and so it has nothing to do with horses, and it's really just a donkey at the end of the day. The statement is so bizarre I just do not even know how to address it. It is a hybrid, it's its own distinct third thing that is a hybrid of them both. The model's capabilities can only be judged as it exists, and its capabilities differ from Qwen and the original DeepSeek R1 as actually scored by various metrics.

Do you not know what fine-tuning is? It refers to actually adjusting the weights in the model, and it is the weights that define the model. And this fine-tuning is being done alongside DeepSeek R1, meaning it is being adjusted to take on capabilities of R1 within the model. It gains R1 capabilities at the expense of Qwen capabilities as DeepSeek R1 Qwen Distill performs better on reasoning tasks but actually not as well as baseline models on non-reasoning tasks. The weights literally have information both of Qwen and R1 within them at the same time.

Speaking of its "base capabilities" is a meaningless floating abstraction which cannot be empirically measured and doesn't refer to anything concretely real. It only has its real concrete capabilities, not some hypothetical imagined capabilities. You accuse them of "marketing" even though it is literally free. All DeepSeek sells is compute to run models, but you can pay any company to run these distill models. They have no financial benefit for misleading people about the distill models.

You genuinely are not making any coherent sense at all, you are insisting a hybrid model which is objectively different and objectively scores and performs differently should be given the exact same name, for reasons you cannot seem to actually articulate. It clearly needs a different name, and since it was created utilizing the DeepSeek R1 model's distillation process to fine-tune it, it seems to make sense to call it DeepSeek R1 Qwen Distill. Yet for some reason you insist this is lying and misrepresenting it and it actually has literally nothing to do with DeepSeek R1 at all and it should just be called Qwen and we should pretend it is literally the same model despite it not being the same model as its training weights are different (you can do a "diff" on the two model files if you don't believe me!) and it performs differently on the same metrics.

There is simply no rational reason to intentionally want to mislabel the model as just being Qwen and having no relevance to DeepSeek R1. You yourself admitted that the weights are trained on R1 data so they necessarily contain some R1 capabilities. If DeepSeek was lying and trying to hide that the distill models are based on Qwen and Llama, they wouldn't have literally put that in the name to let everyone know, and released a paper explaining exactly how those were produced.

It is clear to me that you and your other friends here have some sort of alternative agenda that makes you not want to label it correctly. DeepSeek is open about the distill models using Qwen and Llama, but you want them to be closed and not reveal that they also used DeepSeek R1. The current name for it is perfectly fine and pretending it is just a Qwen model (or Llama, for the other distilled versioned) is straight-up misinformation, and anyone who downloads the models and runs them themselves will clearly see immediately that they perform differently. It is a hybrid model correctly called what they are: DeepSeek R1 Qwen Distill and DeepSeek R1 Llama Distill.

[–] Aria@lemmygrad.ml 1 points 4 months ago* (last edited 4 months ago) (1 children)

Running R1 locally isn't realistic. But you can rent a server and run it privately on someone else's computer. It costs about 10 per hour to run. You can run it on CPU for a little less. You need about 2TB of RAM.

If you want to run it at home, even quantized in 4 bit, you need 20 4090s. And since you can only have 4 per computer for normal desktop mainboards, that's 5 whole extra computers too, and you need to figure out networking between them. A more realistic setup is probably running it on CPU, with some layers offloaded to 4 GPUs. In that case you'll need 4 4090s and 512GB of system RAM. Absolutely not cheap or what most people have, but technically still within the top top top end of what you might have on your home computer. And remember this is still the dumb 4 bit configuration.

Edit: I double-checked and 512GB of RAM is unrealistic. In fact anything higher than 192 is unrealistic. (High-end) AM5 mainboards support up to 256GB, but 64GB RAM sticks are much more expensive than 48GB ones. Most people will probably opt for 48GB or lower sticks. You need a Threadripper to be able to use 512GB. Very unlikely for your home computer, but maybe it makes sense with something else you do professionally. In which case you might also have 8 RAM slots. And such a person might then think it's reasonable to spend 3000 Euro on RAM. If you spent 15K Euro on your home computer, you might be able to run a reduced version of R1 very slowly.

[–] krolden@lemmy.ml 2 points 4 months ago* (last edited 4 months ago) (1 children)

You don't need that much ram to run this

[–] Aria@lemmygrad.ml 2 points 4 months ago

How much do you need? Show your maths. I looked it up online for my post, and the website said 1747GB, which is completely in-line with other models.

https://apxml.com/posts/gpu-requirements-deepseek-r1