Content classification and search (beehaw.org)

submitted 1 year ago by astromd@beehaw.org to c/foss@beehaw.org


My small non-profit team produces a lot of content: blog posts, presentations, graphics, and MP3 and MP4 files. We are looking for a tool that can classify this content and let us search it for information relevant to a given topic. The goal is to get the most out of the IP we've already developed. Are any of you using #foss tools to do this? Bonus points if it supports natural-language querying or generative AI.


TheHobbyist@lemmy.zip 3 points 1 year ago

I suppose you can split your content into three categories:

text
audio
image

For text, you can use LangChain, which lets you generate embeddings from text that you can then use for similarity search (read more here: https://js.langchain.com/docs/modules/data_connection/text_embedding/).

For images, you can use CLIP, an open-source model from OpenAI. You can read more about it here: https://github.com/openai/CLIP

For audio, I don't know of anything off the top of my head, but you are likely to find something similar to the tools above, possibly even open source.
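To give a rough idea of what searching over embeddings looks like, here is a minimal sketch in Python. It is not LangChain or CLIP: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the file names and text in `docs` are made up for illustration. The shape is the same, though: turn each item into a vector, turn the query into a vector, and rank by cosine similarity.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs, top_k=3):
    """Rank documents (name -> text) by similarity to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), name) for name, text in docs.items()]
    scored.sort(reverse=True)
    # Drop documents that share no terms with the query.
    return [(name, round(score, 3)) for score, name in scored[:top_k] if score > 0]


# Hypothetical sample content; names and text are made up for illustration.
docs = {
    "blog-grants.txt": "How to write a grant application for a small non-profit",
    "talk-telescope.txt": "Slides on building a backyard telescope",
    "podcast-ep3-transcript.txt": "Transcript: tips on grant funding and donor outreach",
}

results = search("grant funding", docs)
for name, score in results:
    print(name, score)
```

With a real embedding model in place of `embed`, the query would also match documents that use different words for the same idea, which is the main reason to prefer embeddings over plain keyword matching.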

astromd@beehaw.org 1 point 1 year ago

Thanks for the suggestions. I have audio transcripts of all the mp3s.

Skedaddle@beehaw.org 1 point 1 year ago

An internal wiki like DokuWiki or Wiki.js might suit your needs. Although they won't automatically categorize or classify anything, a wiki could be a useful searchable repository (especially if you can train your team to standardize descriptions, tags, categories, etc.).
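To illustrate why standardized tags matter, here is a small Python sketch, not tied to either wiki. The `normalize_tag` rule (lowercase, trim, hyphenate) and the sample items are assumptions for illustration; the point is that once variants of a tag collapse to one form, a simple lookup finds everything filed under it.

```python
from collections import defaultdict


def normalize_tag(tag):
    """Lowercase, trim, and hyphenate a tag so variants collapse to one form."""
    return "-".join(tag.strip().lower().replace("_", " ").split())


def build_index(items):
    """Map each normalized tag to the set of item names carrying it."""
    index = defaultdict(set)
    for name, tags in items.items():
        for tag in tags:
            index[normalize_tag(tag)].add(name)
    return index


# Hypothetical items tagged inconsistently by different team members.
items = {
    "intro-deck.pptx": ["Fund Raising", "outreach"],
    "ep3.mp3": ["fund_raising ", "Podcast"],
}

index = build_index(items)
matches = sorted(index[normalize_tag("fund raising")])
print(matches)
```

Without the normalization step, "Fund Raising" and "fund_raising " would land under separate tags and a search for either would miss the other item.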

astromd@beehaw.org 1 point 1 year ago

Interesting suggestion. I’ll see if there are any existing workflows along these lines.

this post was submitted on 01 Sep 2023

9 points (100.0% liked)
