
[help] Are there document-manipulation tools that can report the approximate size of components (text included)? (lemmy.world)

submitted 1 year ago* (last edited 1 year ago) by Red1C3@lemmy.world to c/programming@programming.dev


Long story short, I want to build a system that reorders some components in a document file (be it a docx or odt, I don't have a hard constraint atm).

So the input to my problem is a document file, and I need to approximate the number of pages it takes up. I also need the height of individual components (like a single paragraph or a table), so that I have the data to rearrange them and make the document use fewer pages.

I don't have a hard constraint on the programming language of the tool either (Python preferred), but I'd prefer not to embed LibreOffice in my system.

Also I'm willing to hear other solutions (maybe my input is not the optimal thing I can use for this problem).

Thanks in advance!


[–] Turun@feddit.de 2 points 1 year ago* (last edited 1 year ago)

How about generating LaTeX source code, compiling it, and reading the page count of the generated PDF? Reorder your set of questions and check whether the result is better or worse. Optionally, search in a smart way to reduce the number of PDF compilations you need (simulated annealing comes to mind, for example).
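A rough sketch of that search loop in Python, in case it helps. Everything here is an assumption for illustration: `anneal_order`, its parameters, and the swap-two-items move are made up, and `toy_page_count` is only a stand-in for the real "compile and count pages" step so the example runs without LaTeX installed. The cache is what keeps the number of compilations down, since each distinct ordering is evaluated once.

```python
import math
import random
from typing import Callable, Dict, Sequence, Tuple


def anneal_order(
    items: Sequence[str],
    count_pages: Callable[[Tuple[str, ...]], int],
    steps: int = 200,
    start_temp: float = 1.0,
    seed: int = 0,
) -> Tuple[str, ...]:
    """Simulated annealing over orderings of `items`, minimising `count_pages`."""
    rng = random.Random(seed)
    cache: Dict[Tuple[str, ...], int] = {}

    def cost(order: Tuple[str, ...]) -> int:
        # In the real system this would be a LaTeX compilation,
        # so each distinct ordering is evaluated at most once.
        if order not in cache:
            cache[order] = count_pages(order)
        return cache[order]

    current = tuple(items)
    best = current
    if len(current) < 2:
        return best

    for step in range(steps):
        # Linear cooling schedule; the small offset avoids division by zero.
        temp = start_temp * (1 - step / steps) + 1e-9
        # Propose a neighbour by swapping two positions.
        i, j = rng.sample(range(len(current)), 2)
        swapped = list(current)
        swapped[i], swapped[j] = swapped[j], swapped[i]
        candidate = tuple(swapped)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse orderings with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
            if cost(current) < cost(best):
                best = current
    return best


def toy_page_count(order, heights, page_height=10):
    """Stand-in for compiling: pack items top to bottom, no splitting."""
    pages, used = 1, 0
    for name in order:
        if used + heights[name] > page_height:
            pages += 1
            used = 0
        used += heights[name]
    return pages


if __name__ == "__main__":
    heights = {"a": 6, "b": 5, "c": 4, "d": 5}  # made-up component heights
    order = anneal_order(list(heights), lambda o: toy_page_count(o, heights))
    print(order, toy_page_count(order, heights))
```

For the real thing you would swap `toy_page_count` for a function that writes the ordering to a .tex file, compiles it, and returns the page count.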

I think it would be easier to find a library that locates the last line on a PDF page than to parse unzipped odt files and basically write a layout engine that does the same as LibreOffice just to get the number of pages.

Maybe you can even get TeX to write it to the log during compilation. That would be the most convenient option and seems reasonable to achieve.
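As far as I know, pdfTeX already ends a successful run with a log line of the form "Output written on main.pdf (3 pages, 41234 bytes).", so a regex over the .log file may be enough. A minimal sketch, assuming `pdflatex` is on the PATH and that the line is not wrapped by a very long file name; `pages_from_log` and `compile_and_count` are hypothetical helper names, not part of any library.

```python
import re
import subprocess
from pathlib import Path
from typing import Optional

# Matches e.g. "Output written on main.pdf (3 pages, 41234 bytes)."
# and the singular form "(1 page, ...)".
_PAGES_RE = re.compile(r"Output written on .+ \((\d+) pages?\b")


def pages_from_log(log_text: str) -> Optional[int]:
    """Return the page count reported in a TeX log, or None if absent."""
    match = _PAGES_RE.search(log_text)
    return int(match.group(1)) if match else None


def compile_and_count(tex_path: str) -> Optional[int]:
    """Compile a .tex file with pdflatex and read the page count from its log."""
    tex = Path(tex_path)
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", tex.name],
        cwd=tex.parent,              # the .log is written next to the source
        stdout=subprocess.DEVNULL,
        check=False,                 # a failed run simply yields no page count
    )
    log = tex.with_suffix(".log")
    if not log.exists():
        return None
    return pages_from_log(log.read_text(errors="replace"))
```

Calling `compile_and_count("exam.tex")` (file name made up) would then give you the number to feed into whatever search you run over the orderings.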