LocalLLaMA

1

6

Beginner questions thread (sh.itjust.works)

submitted 1 year ago by noneabove1182@sh.itjust.works to c/localllama@sh.itjust.works

11 comments

Trying something new, going to pin this thread as a place for beginners to ask what may or may not be stupid questions, to encourage both the asking and answering.

Depending on activity level I'll either make a new one once in a while or I'll just leave this one up forever to be a place to learn and ask.

When asking a question, try to make clear what your current knowledge level is and where you may have gaps; this should help people provide more useful, concise answers!

2

16

Trying out old GPUs with Vulkan (discuss.tchncs.de)

submitted 8 hours ago by OpticalMoose@discuss.tchncs.de to c/localllama@sh.itjust.works

11 comments

Yesterday I got bored and decided to try out my old GPUs with Vulkan. I had an HD 5830, GTX 460 and GTX 770 4GB lying around so I figured "Why not".

Long story short - Vulkan didn't recognize them, hell, Linux didn't even recognize them. They didn't show up in nvtop, nvidia-smi or anything. I didn't think to check dmesg.

Honestly, I thought the 770 would work; it hasn't been in legacy status that long. It might work with an older Nvidia driver version (I'm on 550 now) but I'm not messing with that stuff just because I'm bored.

So for now the oldest GPUs I can get running are a Ryzen 5700G APU and a 1080 Ti. Both Vega and Pascal came out in early 2017 according to Wikipedia. Those people disappointed that their RX 500 and RX 5000 cards don't work in Ollama should give llama.cpp with Vulkan a shot. Kobold has a Vulkan option too.

The 5700G works fine alongside Nvidia GPUs in Vulkan. The performance is what you'd expect from an APU, but at least it works. Now I'm tempted to buy a 7600 XT just to see how it does.

Has anyone else out there tried Vulkan?

3

39

AMD denies rumors of Radeon RX 9070 XT with 32GB memory (videocardz.com)

submitted 4 days ago by OpticalMoose@discuss.tchncs.de to c/localllama@sh.itjust.works

9 comments

Well, it was nice ... having hope, I mean. That was a good feeling.

4

10

Models not loading into RAM (lemmy.ml)

submitted 3 days ago by corvus@lemmy.ml to c/localllama@sh.itjust.works

9 comments

I didn't expect that an 8B-F16 model taking 16GB on disk could run on my laptop with only 16GB of RAM and an integrated GPU. It was painfully slow, around 0.3 t/s, but it ran. Then I learned that you can effectively run a model from storage without loading it into memory, and I checked that this was exactly what was happening: memory usage stayed constant at around 20% with and without the model running. The problem is that gpt4all-chat runs every model larger than 1.5B this way, and the difference is huge, as the 1.5B model runs at 20 t/s. Even a distilled 6.7B_Q8 model taking roughly 7GB on disk, which has plenty of room (12GB of RAM free), didn't move the memory usage and was also very slow (3 t/s). I'm pretty new to this field so I'm probably missing something basic, but I just followed the instructions for downloading and compiling it.
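
One possible explanation for the flat memory reading, assuming the backend memory-maps the model file (llama.cpp does this by default): mapped pages are pulled from disk on demand and may be reported as cache rather than as the application's own memory. The Python sketch below only demonstrates the mechanism on a small dummy file; the file, its size, and the function names are made up for illustration.

```python
import mmap
import os
import tempfile


def make_dummy_weights(size_bytes: int) -> str:
    """Write a throwaway file standing in for a model file (hypothetical data)."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(i % 256 for i in range(size_bytes)))
    return path


def read_slice_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Read a few bytes through a memory map instead of loading the whole file."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; pages are only fetched when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


dummy_path = make_dummy_weights(1 << 20)  # 1 MiB stand-in for a multi-GB model
chunk = read_slice_via_mmap(dummy_path, offset=4096, length=8)
os.remove(dummy_path)
print(chunk)
```

If this is what is happening, the low speed would come from weights being re-read from storage whenever they do not stay cached, not from the model "failing to load".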

5

56

AMD reportedly working on gaming Radeon RX 9070 XT GPU with 32GB memory (videocardz.com)

submitted 6 days ago* (last edited 6 days ago) by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

18 comments

One might question why an RX 9070 card would need so much memory, but increased capacity can serve purposes beyond gaming, such as Large Language Model (LLM) support for AI workloads. Additionally, it’s worth noting that RX 9070 cards will use 20 Gbps memory, much slower than the RTX 50 series, which features 28-30 Gbps GDDR7 variants. So, while capacity may increase, bandwidth likely won’t.
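
To make the bandwidth point concrete: peak memory bandwidth is the per-pin data rate multiplied by the bus width, converted from bits to bytes. The sketch below assumes a 256-bit bus for every row purely so the rates can be compared; the article quoted above does not state a bus width, so treat that number as a placeholder.

```python
def peak_bandwidth_gb_s(pin_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s = per-pin rate (Gbit/s) * bus width (bits) / 8."""
    return pin_rate_gbps * bus_width_bits / 8


# Assumption for illustration only; not taken from the article.
ASSUMED_BUS_WIDTH_BITS = 256

for rate in (20, 28, 30):
    bandwidth = peak_bandwidth_gb_s(rate, ASSUMED_BUS_WIDTH_BITS)
    print(f"{rate} Gbps x {ASSUMED_BUS_WIDTH_BITS}-bit = {bandwidth:.0f} GB/s")
```

Capacity does not appear in the formula, which is why going from 16GB to 32GB would not by itself make token generation faster.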

6

7

Recommend models for GTX 1660 Super (6GB) (lemmy.sdf.org)

submitted 5 days ago by Disonantezko@lemmy.sdf.org to c/localllama@sh.itjust.works

6 comments

I have a GTX 1660 Super (6 GB).

Right now I have ollama with:

deepseek-r1:8b
qwen2.5-coder:7b

Do you recommend any other local models to play with my GPU?

7

12

The Anthropic Economic Index - an initiative aimed at understanding AI's effects on labor markets and the economy over time. (www.anthropic.com)

submitted 1 week ago by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

3 comments

8

4

AI Action Summit in Paris (www.youtube.com)

submitted 1 week ago by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

3 comments

Closing session, with speeches by Modi, JD Vance, and Ursula von der Leyen

9

6

French President Emmanuel Macron announces €100 billion investments in AI (www.france24.com)

submitted 1 week ago by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

0 comments

10

8

"Flash Answers" Cerebras brings instant inference to Mistral Le Chat (cerebras.ai)

submitted 1 week ago by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

1 comment

Sorry I keep posting about Mistral but if you check: https://chat.mistral.ai/chat

I dunno how they do it but some of these answers are lightning fast:

Fast inference dramatically improves the user experience for chat and code generation – two of the most popular use-cases today. In the example above, Mistral Le Chat completes a coding prompt instantly while other popular AI assistants take up to 50 seconds to finish.

For this initial release, Cerebras will focus on serving text-based queries for the Mistral Large 2 model. When using Cerebras Inference, Le Chat will display a “Flash Answer ⚡” icon on the bottom left of the chat interface.

11

7

Hibiki by kyutai, a simultaneous speech-to-speech translation model, currently supporting FR to EN (aussie.zone)

submitted 1 week ago* (last edited 1 week ago) by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

1 comment

Example of it working in action: https://streamable.com/ueh3sj

Paper: https://arxiv.org/abs/2502.03382

Samples: https://hf.co/spaces/kyutai/hibiki-samples

Inference code: https://github.com/kyutai-labs/hibiki

Models: https://huggingface.co/kyutai

From kyutai on X: Meet Hibiki, our simultaneous speech-to-speech translation model, currently supporting FR to EN.

Hibiki produces spoken and text translations of the input speech in real-time, while preserving the speaker’s voice and optimally adapting its pace based on the semantic content of the source speech.

Based on objective and human evaluations, Hibiki outperforms previous systems for quality, naturalness and speaker similarity and approaches human interpreters.

https://x.com/kyutai_labs/status/1887495488997404732

Neil Zeghidour on X: https://x.com/neilzegh/status/1887498102455869775

12

16

DeepSeek gives Europe's tech firms a chance to catch up in global AI race (www.reuters.com)

submitted 2 weeks ago by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

6 comments

13

34

How to run LLaMA (and other LLMs) on Android. (lemmy.dbzer0.com)

submitted 2 weeks ago* (last edited 2 weeks ago) by llama@lemmy.dbzer0.com to c/localllama@sh.itjust.works

17 comments

Hello, everyone! I wanted to share my experience of successfully running LLaMA on an Android device. The model that performed the best for me was llama3.2:1b on a mid-range phone with around 8 GB of RAM. I was also able to get it up and running on a lower-end phone with 4 GB RAM. However, I also tested several other models that worked quite well, including qwen2.5:0.5b, qwen2.5:1.5b, qwen2.5:3b, smallthinker, tinyllama, deepseek-r1:1.5b, and gemma2:2b. I hope this helps anyone looking to experiment with these models on mobile devices!

Step 1: Install Termux

Download and install Termux from the Google Play Store or F-Droid.

Step 2: Set Up proot-distro and Install Debian

Open Termux and update the package list:
```
pkg update && pkg upgrade
```
Install proot-distro:
```
pkg install proot-distro
```
Install Debian using proot-distro:
```
proot-distro install debian
```
Log in to the Debian environment:
```
proot-distro login debian
```
You will need to log in every time you want to run Ollama. Each time you want to run a model, repeat this step and all the steps below (excluding step 3 and the first half of step 4).

Step 3: Install Dependencies

Update the package list in Debian:
```
apt update && apt upgrade
```
Install curl:
```
apt install curl
```

Step 4: Install Ollama

Run the following command to download and install Ollama:
```
curl -fsSL https://ollama.com/install.sh | sh
```
Start the Ollama server:
```
ollama serve &
```
After you run this command, press Ctrl+C to get your prompt back; the server will continue to run in the background.

Step 5: Download and run the Llama3.2:1B Model

Use the following command to download and run the Llama3.2:1B model:
```
ollama run llama3.2:1b
```
This step fetches and runs the lightweight 1-billion-parameter version of the Llama 3.2 model.
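
Once `ollama serve` is running, the model can also be queried from a script instead of the interactive prompt. The sketch below is one way to do that with only the Python standard library, assuming Python is installed in the Debian environment (for example via `apt install python3`, which is not part of the steps above) and that Ollama is listening on its default local port. The helper names `build_generate_request` and `ask` are made up for this example.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3.2:1b") -> urllib.request.Request:
    """Build a POST request asking Ollama for a single, non-streamed reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.2:1b", timeout: float = 300.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, `print(ask("Why is the sky blue?"))` should print the model's answer; the generous timeout is there because phones can take a while to respond.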

Running LLaMA and other similar models on Android devices is definitely achievable, even with mid-range hardware. The performance varies depending on the model size and your device's specifications, but with some experimentation, you can find a setup that works well for your needs. I’ll make sure to keep this post updated if there are any new developments or additional tips that could help improve the experience. If you have any questions or suggestions, feel free to share them below!

– llama

14

15

What is a good model that runs on 6GB Vram? (discuss.online)

submitted 2 weeks ago by OmegaLemmy@discuss.online to c/localllama@sh.itjust.works

10 comments

Should be good at conversations and creative, it'll be for worldbuilding

Best if uncensored as I prefer that over it kicking in when I least want it

I'm fine with those roleplaying models as long as they can actually give me ideas and talk to me logically

15

12

Has anyone applied tree of thought prompting to r1 yet? (programming.dev)

submitted 2 weeks ago by artificialfish@programming.dev to c/localllama@sh.itjust.works

9 comments

Generate 5 thoughts, prune 3, branch, repeat. I think that’s what o1 pro and o3 do
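
A rough sketch of that loop, under one reading of it: every surviving thought proposes 5 continuations, the pool is cut down to the 2 best (5 minus the 3 pruned), and the survivors branch again. The `propose` and `score` callables would be LLM calls in practice; here they are toy stand-ins so the control flow can be run, and nothing below reflects how o1 pro or o3 actually work.

```python
from typing import Callable, List


def tree_of_thought(
    root: str,
    propose: Callable[[str, int], List[str]],
    score: Callable[[str], float],
    width: int = 5,
    prune: int = 3,
    depth: int = 3,
) -> str:
    """Generate `width` thoughts per branch, keep the best, branch, repeat."""
    keep = max(1, width - prune)
    frontier = [root]
    for _ in range(depth):
        # Branch: every surviving thought proposes `width` continuations.
        candidates = [c for state in frontier for c in propose(state, width)]
        # Prune: keep only the highest-scoring candidates.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:keep]
    return frontier[0]


def toy_propose(state: str, n: int) -> List[str]:
    """Stand-in for 'ask the model for n next thoughts': append a digit."""
    return [state + str(d) for d in range(n)]


def toy_score(state: str) -> float:
    """Stand-in for 'ask the model to rate a thought': sum of the digits."""
    return float(sum(int(ch) for ch in state))


best = tree_of_thought("", toy_propose, toy_score)
print(best)
```

With real model calls, the cost grows with width times depth, so the extra tokens are the main trade-off to watch.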

16

24

Mistral Small 3 (24B) released (mistral.ai)

submitted 2 weeks ago by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

1 comment

17

15

Did DeepSeek R1 just pop nvidias bubble? (www.youtube.com)

submitted 3 weeks ago by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works

8 comments

Changed title because no need for youtube clickbait here

18

28

Why LLMs are surprisingly good at math, and what it means to process language. (lemmy.world)

submitted 3 weeks ago* (last edited 3 weeks ago) by Smokeydope@lemmy.world to c/localllama@sh.itjust.works

20 comments

Someone asked about how LLMs can be so good at math operations. My response comment kind of turned into a five-paragraph essay, as they tend to do sometimes. Thought I would offer it here and add some references. Maybe spark some discussion?

What do language models do?

LLMs are trained to recognize, process, and construct patterns of language data into high-dimensional manifold plots.

Meaning their job is to structure and compartmentalize the patterns of language into a map where each word and its particular meaning live as pairs of points on a geometric surface. Each point is placed near closely related points in space, connected by related concepts or properties of the word.

You can explore such a map for vision models here!

Then they use that map to statistically navigate through the sea of ways words can be associated into sentences to find coherent paths.

What does language really mean?

Language data isn't just words and syntax; it's underlying abstract concepts, context, and how humans choose to compartmentalize or represent universal ideas given our subjective reference point.

Language data extends to everything humans can construct thoughts about, including mathematics, philosophy, science, storytelling, music theory, programming, etc.

Language is universal because it's a fundamental way we construct and organize concepts. The first important cognitive milestone for babies is the association of concepts to words and constructing sentences with them.

Even the universe speaks its own language. Physical reality and logical abstractions speak the same underlying universal patterns hidden in formalized truths and dynamical operation. Information and matter are two sides of a coin; their structure is intrinsically connected.

Math and conceptual vectors

Math is a symbolic representation of combinatoric logic. Logic is generally a formalized language used to represent ideas related to truth as well as how truth can be built on through axioms.

Numbers and math are cleanly structured and formalized patterns of language data. They're rigorously described and their axioms are well defined. So it's relatively easy to train a model to recognize and internalize patterns inherent to basic arithmetic and linear algebra and how they manipulate or process the data points representing numbers.

You can imagine the LLM's data manifold having a section for math and logic processing. The concept of one lives somewhere as a point of data on the manifold. By moving the point representing the concept of one along a vector dimension that represents the process of 'addition by one', you find the data point representing two.
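
The "move a point along a vector" picture can be made concrete with a toy lookup table. The coordinates below are invented by hand purely for illustration; real models learn embeddings with thousands of dimensions, and arithmetic inside them is far messier than this.

```python
import math

# Hand-made 2D "embeddings" (hypothetical values, not learned by any model).
embeddings = {
    "one": (1.0, 0.5),
    "two": (2.0, 0.5),
    "three": (3.0, 0.5),
    "cat": (-4.0, 3.0),
}

# A direction that, in this toy space, stands for "addition by one".
add_one = (1.0, 0.0)


def shift(word: str, direction: tuple) -> tuple:
    """Move a word's point along a direction vector."""
    x, y = embeddings[word]
    dx, dy = direction
    return (x + dx, y + dy)


def nearest(point: tuple) -> str:
    """Return the word whose embedding lies closest to the given point."""
    return min(embeddings, key=lambda word: math.dist(point, embeddings[word]))


result = nearest(shift("one", add_one))
print(result)
```

Applying the same shift to "two" lands on "three", which is the sense in which an operation can live in the space as a direction.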

Not a calculator though

However, an LLM can never be a true calculator due to the statistical nature of token sampling. It always has a chance of giving the wrong answer. In the infinite multitude of tokens it can pick any number of wrong numbers. We can get the statistical chance of failure down though.
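
A small sketch of where that chance of failure comes from, assuming the model picks its next token by sampling from a softmax distribution over scores. The scores for the prompt "2 + 2 =" below are made up; the point is only that the wrong tokens keep a nonzero probability, and that lowering the sampling temperature shrinks it.

```python
import math


def softmax(logits: list, temperature: float = 1.0) -> list:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token scores after the prompt "2 + 2 =".
candidates = ["4", "5", "3", "22"]
logits = [6.0, 2.0, 1.5, 0.5]


def p_correct(temperature: float) -> float:
    """Probability that sampling picks the right token, "4"."""
    return softmax(logits, temperature)[candidates.index("4")]


for t in (1.0, 0.5):
    print(f"temperature {t}: P(wrong) = {1 - p_correct(t):.4f}")
```

Greedy decoding (always taking the top token) removes the sampling randomness, but the model can still be wrong whenever its top-scored token is not the right answer.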

It's interesting how LLMs can still give accurate answers for arithmetic despite having no built-in calculation function. Through training alone they are learning how to apply simple arithmetic.

Hidden structures of information

There are hidden or intrinsic patterns to most structures of information. Usually you can find the fractal hyperstructures the patterns are geometrically baked into in higher dimensions once you go plotting out their phase space/holomorphic parameter maps. We can kind of visualize these fractals with vision model activation parameter maps. Welch Labs on YouTube has a great video about it.

Modern language models have so many parameters, and the manifold expands into so many dimensions, that it's impossible to visualize. So they are basically mystery black boxes that somehow understand these crazy fractal structures of complex information and navigate the topological manifolds language data creates.

Conclusion

This is my understanding of how LLMs do their thing. I hope you enjoyed reading! Secretly I just wanted to show you the cool chart :)

19

25

Thoughts on new deepseek R1 distill models (lemmy.world)

submitted 3 weeks ago* (last edited 3 weeks ago) by Smokeydope@lemmy.world to c/localllama@sh.itjust.works

7 comments

I've been playing around with the DeepSeek R1 distills. Qwen 14b and 32b specifically.

So far it's very cool to see models really going after this current CoT meta by mimicking internal thinking monologues. Seeing a model go "but wait..." "Hold on, let me check again..." "Aha! So.." kind of makes it feel more natural in its eventual conclusions.

I don't like how it can get caught in looping thought processes, and I'm not sure how much all the extra tokens spent really go towards a "better" answer/solution.

What really needs to be ironed out is reading comprehension, which seems to be lower than average: it misses small details in tricky questions and makes assumptions about what you're trying to ask, like wanting a recipe for coconut oil cookies but only seeing "coconut" and giving a coconut cookie recipe with regular butter.

It's exciting to see models operate in a kind of new way.

20

13

unsure on how to quantize model (feddit.it)

submitted 1 month ago by brokenlcd@feddit.it to c/localllama@sh.itjust.works

5 comments

I was experimenting with oobabooga, trying to run this model, but due to its size it wasn't going to fit in RAM, so I tried to quantize it using llama.cpp. That worked, but due to the GGUF format it was only running on the CPU. Searching for ways to quantize the model while keeping it in safetensors returned nothing, so is there any way to do that?

I'm sorry if this is a stupid question, I still know almost nothing about this field.
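
For readers wondering what quantizing actually does to the weights, regardless of whether they end up in GGUF or safetensors, here is a minimal sketch of symmetric 8-bit "absmax" quantization in plain Python. It is a teaching example with made-up weights, not what llama.cpp or any particular tool does internally; real schemes work on blocks of weights and often use fewer bits.

```python
def quantize_int8(weights: list) -> tuple:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127.0 if largest > 0 else 1.0  # avoid dividing by zero
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized: list, scale: float) -> list:
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]  # hypothetical weights for illustration
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(q, restored)
```

Each weight now needs one byte instead of two or four, at the cost of a rounding error of at most half the scale factor per weight.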

21

12

How much gpu do i need to run a 90b model (lemm.ee)

submitted 1 month ago by muntedcrocodile@lemm.ee to c/localllama@sh.itjust.works

16 comments

Do I need industry-grade GPUs, or can I scrape by getting decent tps with a consumer-level GPU?
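
A back-of-the-envelope way to size this: the weights alone need roughly parameters times bits per weight, divided by 8. The sketch below covers weights only; the context cache and runtime overhead come on top, and real quantized files usually carry some extra overhead beyond the nominal bit width.

```python
def weight_memory_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_90B = 90_000_000_000

for label, bits in (("FP16", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label}: about {weight_memory_gb(PARAMS_90B, bits):.0f} GB")
```

Even at 4 bits that is around 45 GB for the weights, which is more than a single 24 GB consumer card holds, so the realistic options are several cards, partial offload to system RAM at lower speed, or a smaller model.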

22

7

Nvidia Digits AI Supercomputer just announced (lemmy.world)

submitted 1 month ago by Smokeydope@lemmy.world to c/localllama@sh.itjust.works

0 comments

I am excited to see how this performs when it drops around May.

23

14

Go toolchain error - Does anyone know what's going on here? (lemmy.world)

submitted 1 month ago by ItsYourBoyHalo@lemmy.world to c/localllama@sh.itjust.works

10 comments

24

13

(New) papers by Meta: Large Concept Models and BLT (palaver.p3x.de)

submitted 1 month ago* (last edited 1 month ago) by hendrik@palaver.p3x.de to c/localllama@sh.itjust.works

2 comments

Seems Meta have been doing some research lately, to replace the current tokenizers with new/different representations:

Large Concept Models: Language Modeling in a Sentence Representation Space [Github] (December 11, 2024)
Byte Latent Transformer: Patches Scale Better Than Tokens [Github] (December 12, 2024)

25

35

New open-weight 🐋 DeepSeek V3. 685B MoE. Beats Claude 3.5 Sonnet on Aider coding benchmark (huggingface.co)

submitted 1 month ago* (last edited 1 month ago) by BB84@mander.xyz to c/localllama@sh.itjust.works

2 comments

Absolutely humongous model. Mixture of 256 experts with 8 activated each time.
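
To illustrate what "256 experts with 8 activated" means for each token, here is a sketch of generic top-k gating: a router scores every expert, only the 8 best are run, and their outputs are mixed with normalized weights. The router scores are random placeholders, and this is the textbook form of the idea, not DeepSeek's exact routing scheme.

```python
import heapq
import math
import random


def route_top_k(router_scores: list, k: int = 8) -> dict:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = heapq.nlargest(k, range(len(router_scores)), key=router_scores.__getitem__)
    peak = max(router_scores[i] for i in top)
    exps = {i: math.exp(router_scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


rng = random.Random(0)  # fixed seed so the example is repeatable
scores = [rng.gauss(0.0, 1.0) for _ in range(256)]  # placeholder router output

gates = route_top_k(scores, k=8)
active_fraction = len(gates) / len(scores)
print(f"experts used for this token: {sorted(gates)}")
print(f"fraction of experts active: {active_fraction:.4f}")
```

Only about 3% of the experts run per token, which is why such a model can be far cheaper to run per token than its total parameter count suggests, while still needing enough memory to hold all of the experts.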

Aider leaderboard: The only model above 🐋 v3 here is ~~Open~~AI o1. DeepSeek is known to make amazing models and Aider rotates their benchmark over time, so it is unlikely that this is a train-on-benchmark situation.

Some more benchmarks: on Reddit.