this post was submitted on 27 Jul 2023
Programming
Edit: definitely read the other responses, because apparently there are some techniques I wasn't aware of and don't understand nearly as well as the underlying AI technology, and I'm only an enthusiast layman.
I don't think there is any way of doing that. The AI is like a huge matrix of probabilities that says 'if (' is followed by:
' x': 60%
' foo': 19%
' person': 9%
Etc.
Then it randomly selects one of those tokens, weighted by probability, and does it all over again for the next token, saying 'if ( person' is followed by:
'.id': 30%
'.name': 27%
Etc.
So even a simple 'if person.name.startsWith("foo") {' is the aggregate result of thousands of contributors: really, pretty much every author of every code snippet ingested from the training material.
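To make the idea concrete, here is a rough sketch of that sampling loop. The table only holds the illustrative percentages from above, the `<other>` bucket is something I made up to make each row sum to 1, and a real model computes these probabilities with a neural network rather than looking them up, so treat this as a toy and not how any actual product works.

```python
import random

# Toy next-token table using the illustrative percentages from the comment.
# "<other>" is a made-up bucket for the remaining probability mass.
NEXT_TOKEN = {
    "if (": {" x": 0.60, " foo": 0.19, " person": 0.09, "<other>": 0.12},
    "if ( person": {".id": 0.30, ".name": 0.27, "<other>": 0.43},
}


def sample_next(context, rng):
    """Pick one next token at random, weighted by its probability."""
    dist = NEXT_TOKEN[context]
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(context, rng, max_steps=5):
    """Repeatedly append a sampled token until the toy table runs out."""
    for _ in range(max_steps):
        if context not in NEXT_TOKEN:
            break
        context += sample_next(context, rng)
    return context


if __name__ == "__main__":
    print(generate("if (", random.Random()))
```

Note that nothing in the table records who wrote which snippet. Only the aggregated frequencies survive, which is the whole point.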
There is no single author, even if the code matches existing code token for token. The only exception would be code so esoteric that only a single author has ever written code that does that particular thing. But even in that case, there is nothing in the probability matrix to indicate that a particular sequence of tokens is unique to a certain author. The best you could do is run a full-text search on a line of code to see whether it matches anything in the training data, and whether there is a very small set of authors to whom credit might be assigned. That might be possible, but it would be an add-on (and a significant performance hit) to the actual AI itself. Sort of like how browser-integrated AI just runs a search and feeds the results into the context to make the output more likely to contain information from the top results.
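As a rough sketch of what that add-on lookup could look like, assuming you even have the training data on hand together with author metadata (a big assumption): the corpus layout, the function names, and the `max_authors` cutoff below are all hypothetical, and a real system would need a proper search index rather than an in-memory dict.

```python
from collections import defaultdict


def normalize(line):
    """Collapse whitespace so formatting differences do not block a match."""
    return " ".join(line.split())


def build_index(corpus):
    """Map each normalized line of code to the set of authors who wrote it.

    `corpus` is a hypothetical list of (author, source_code) pairs.
    """
    index = defaultdict(set)
    for author, source in corpus:
        for line in source.splitlines():
            key = normalize(line)
            if key:
                index[key].add(author)
    return index


def candidate_authors(index, generated_line, max_authors=3):
    """Return authors to credit only when the exact line is rare enough."""
    authors = index.get(normalize(generated_line), set())
    if 0 < len(authors) <= max_authors:
        return sorted(authors)
    return []  # no match, or too common to attribute to anyone
```

A common line like `if (x) {` would match thousands of authors and get no attribution, while a genuinely unusual line might map back to one or two people.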
It depends. For the base model, sure, you can't really figure out what percentage of it came from which data source, since there are just too many data sources and that information is lost along the way. They're likely not using the entirety of SO to generate answers, though. Retraining LLMs is ungodly expensive, so they can't retrain every time a new Q or A is created, and even retraining on a regular basis would be impractical.
Instead, without knowing exactly how they're doing it of course, my guess is they're pulling relevant Q&As from their database, then using those results to improve the response (for example by providing them as context). If you're interested, look into retrieval-augmented generation.
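Here is a minimal sketch of the retrieval half of that idea, to show why attribution becomes possible there. To be clear, this is a guess at the general shape, not how SO actually does it: the record fields (`question`, `answer`, `source`), the word-overlap scoring, and the prompt wording are all made up for illustration, and real systems typically use embeddings and a vector index instead.

```python
import re


def words(text):
    """Lowercased set of word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, qa_pairs, k=2):
    """Return the k stored Q&As sharing the most words with the question."""
    q = words(question)
    ranked = sorted(
        qa_pairs,
        key=lambda qa: len(q & words(qa["question"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, retrieved):
    """Put the retrieved Q&As, with their sources, into the model's context."""
    context = "\n\n".join(
        f"Q: {qa['question']}\nA: {qa['answer']}\nSource: {qa['source']}"
        for qa in retrieved
    )
    return (
        "Answer using the Q&As below and cite their sources.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Because the retrieved posts are ordinary records sitting in the prompt, you know exactly which ones fed into a given answer, unlike whatever got baked into the base model.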
I am interested, thank you!