I suppose you can split your content into three categories:
- text
- audio
- image
For text, you can use LangChain, which lets you get embeddings from text (read more here: https://js.langchain.com/docs/modules/data_connection/text_embedding/).
For images, you can use CLIP (an open-source model from OpenAI). You can read more about it here: https://github.com/openai/CLIP
For audio, I don't know of anything off the top of my head, but you are likely to find something similar to the tools above, possibly even open source.
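
To make the split concrete, here is a rough sketch of how you might route each file to the embedder for its category. Everything in it is my own assumption for illustration: the extension-to-category table, the names `categorize` and `embed_file`, and the idea that each embedder is just a function taking a path and returning a list of floats. It does not call LangChain or CLIP itself; you would plug those in yourself.

```python
from pathlib import Path
from typing import Callable, Dict, List

Vector = List[float]
Embedder = Callable[[Path], Vector]

# Assumed mapping from file extension to category; adjust it to your data.
CATEGORY_BY_SUFFIX: Dict[str, str] = {
    ".txt": "text",
    ".md": "text",
    ".wav": "audio",
    ".mp3": "audio",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
}


def categorize(path: Path) -> str:
    """Return 'text', 'audio' or 'image' based on the file extension."""
    suffix = path.suffix.lower()
    if suffix not in CATEGORY_BY_SUFFIX:
        raise ValueError(f"unsupported file type: {path.name}")
    return CATEGORY_BY_SUFFIX[suffix]


def embed_file(path: Path, embedders: Dict[str, Embedder]) -> Vector:
    """Look up the embedder registered for the file's category and call it."""
    return embedders[categorize(path)](path)


if __name__ == "__main__":
    # Hypothetical stand-ins: replace each lambda with a real model call
    # (a text embedding model, CLIP for images, an audio model of your choice).
    placeholder_embedders: Dict[str, Embedder] = {
        "text": lambda p: [0.0],
        "audio": lambda p: [0.0],
        "image": lambda p: [0.0],
    }
    print(embed_file(Path("photo.jpg"), placeholder_embedders))
```

The embedders are passed in as a dictionary so you can swap one model for another (or add the audio one later) without touching the routing code.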