this post was submitted on 26 Sep 2023
142 points (90.3% liked)

Technology

[–] Zaktor@sopuli.xyz 1 points 1 year ago (1 children)

What do you think happens to data when it's scraped? Copying the data is a fundamental requirement for using it in training. These models are trained in big datacenters where the original work is split up and tokenized and used over and over again.

The difference between you training a model and you reading a book (put online by its author in clear text, to avoid the obvious issue of actual piracy for human use) is that you reading on a website is the intention of the copyright holder, and you as a person have a fundamental right to remember things and be inspired. You don't, however, have a right to copy and use the text for other purposes, whether that's making a t-shirt with a memorable line, printing it out to give to someone else, or tokenizing it to train a computer algorithm.

[–] treadful@lemmy.zip 1 points 1 year ago (1 children)

> What do you think happens to data when it’s scraped? Copying the data is a fundamental requirement for using it in training. These models are trained in big datacenters where the original work is split up and tokenized and used over and over again.

Tokenizing and calculating vectors or whatever is not the same thing as distributing copies of said work.

> The difference between you training a model and you reading a book (put online by its author in clear text, to avoid the obvious issue of actual piracy for human use) is that you reading on a website is the intention of the copyright holder and you as a person have a fundamental right to remember things and be inspired.

Copyright holders can't say what I do with their work, nor what I do with the knowledge of their book. They can only say how I copy and distribute it. I don't need consent to burn an author's book, create fan art around it, or quote characters in my blog. I do need their consent to copy and distribute their works directly.

> You don’t however have a right to copy and use the text for other purposes, whether that’s making a t-shirt with a memorable line, printing it out to give to someone else, or tokenizing it to train a computer algorithm.

And at some point the resolution of said words is so fine that the pieces become uncopyrightable. You can't copyright most phrases or individual words.

[–] Zaktor@sopuli.xyz 1 points 1 year ago (1 children)

> Tokenizing and calculating vectors or whatever is not the same thing as distributing copies of said work.

It very much is. You can't just run a cipher on a copyrighted work and say "it's not the same, so I didn't copy it". Tokenization is reversible to the original text. And "distributing" is separate from violating copyright. It's not distriburight, it's copyright. Copying a work without authorization for private use is still violating copyright.
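
To make the reversibility point concrete, here is a minimal sketch using a toy byte-level tokenizer. The `encode` and `decode` functions are hypothetical helpers written for this example, not any real model's tokenizer; production tokenizers use learned subword vocabularies, and while many are designed to round-trip, some normalize the text (lowercasing, for instance) and are not perfectly reversible. The sketch only concerns the tokenized copy of the text, not whatever a model retains after training.

```python
# Toy byte-level tokenizer: an illustrative assumption, NOT a real LLM tokenizer.
# Each token is simply the integer value of one UTF-8 byte.

def encode(text: str) -> list[int]:
    """Turn text into a list of integer tokens (one per UTF-8 byte)."""
    return list(text.encode("utf-8"))

def decode(tokens: list[int]) -> str:
    """Turn the integer tokens back into the original text."""
    return bytes(tokens).decode("utf-8")

original = "It was the best of times, it was the worst of times."
tokens = encode(original)
restored = decode(tokens)

# The token list looks nothing like the sentence, yet it decodes back exactly.
print(restored == original)
```

Under this scheme the token sequence carries exactly the same information as the text it came from, which is the sense in which the tokenized form is compared to a cipher above.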

[–] treadful@lemmy.zip 0 points 1 year ago (1 children)

> You can’t just run a cipher on a copyrighted work and say “it’s not the same, so I didn’t copy it”.

Yes I can. I can download a Web page, encrypt it on my machine, and I'm not distributing said work.

> And “distributing” is separate from violating copyright. It’s not distriburight, it’s copyright. Copying a work without authorization for private use is still violating copyright.

That's just false.

[–] Zaktor@sopuli.xyz 0 points 1 year ago

You absolutely do not know what you're talking about. This is just basic copyright law, but there's a weird internet mythology that if you can access something on the net you can take it as long as you don't share it further. The reason the mass-sharers tended to get prosecuted is that they were easier and more valuable targets, not that the people they were sharing with weren't also breaking the law.