Re: [mat-dev] Welcome to the "mat-dev" mailing list

Author: mail-login+mat
Date:  
To: mat-dev@boum.org
Subject: Re: [mat-dev] Welcome to the "mat-dev" mailing list
Hi,

I recently stumbled over mat2 while searching for a way to anonymize PDFs (after removing some content by drawing a black/white bar over it). I think this is a fairly common use case for mat2. My special requirement is that I still want to be able to search the PDF afterwards.

This is not a trivial task: simply keeping the text from the original PDF won't work, since that text might still contain the content that was covered by a black/white bar. Therefore, rasterization is a **must** to ensure deleted text is really deleted (which is in fact what mat2 does).

To make the resulting PDF searchable again, I thought: why not run some OCR afterwards (in my case I tried ocrmypdf [1])? These tools typically perform the actual recognition and put the resulting text into the PDF as an additional invisible layer.

So my procedure is something like this:
rasterize (e.g. with mat2) -> OCR
But this still leaves some metadata in the resulting PDF (e.g. creator=ocrmypdf/tesseract or producer=pikepdf). So I tried running mat2 again (in lightweight mode, of course, to keep the text) to remove that metadata as well. The issue is that after applying mat2, the hidden text layer also seems to be gone, which makes the whole process useless.
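
For reference, the three steps described above can be written down as a small Python sketch that only builds the command lines and does not run anything. The file names are hypothetical, the helper names (cleaned_name, build_pipeline) are made up for illustration, and the assumption that mat2 writes its result next to the input as "<name>.cleaned.<ext>" should be double-checked against the installed version.

```python
import shlex
from pathlib import Path


def cleaned_name(path: Path) -> Path:
    # Assumption: mat2 writes its output next to the input as "<stem>.cleaned<suffix>".
    return path.with_name(f"{path.stem}.cleaned{path.suffix}")


def build_pipeline(source: Path) -> list[list[str]]:
    """Return the command lines for: rasterize -> OCR -> lightweight cleanup."""
    rasterized = cleaned_name(source)
    # Hypothetical name for the OCR result; ocrmypdf takes input and output paths.
    ocr_output = rasterized.with_name(f"{rasterized.stem}.ocr{rasterized.suffix}")
    return [
        ["mat2", str(source)],                           # full cleaning (rasterizes the PDF)
        ["ocrmypdf", str(rasterized), str(ocr_output)],  # adds the invisible text layer
        ["mat2", "--lightweight", str(ocr_output)],      # the step that drops the text layer
    ]


if __name__ == "__main__":
    for command in build_pipeline(Path("report.pdf")):
        print(shlex.join(command))
```

Each list could be passed to subprocess.run(command, check=True) to actually execute the step; the last command is the one after which the text layer is missing.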

I already had a look at the code and found that this must have something to do with the poppler/cairo rendering in __remove_all_lightweight: if I only run __remove_superficial_meta, the hidden layer is still present afterwards. But at that point I'm currently stuck, since I don't have much experience with the whole gi/cairo/poppler stuff.
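
To narrow down which step loses the text, a check along the following lines could be run on the PDF after each stage. This is only a sketch: it assumes the pdftotext tool from poppler-utils is installed, and the function names (run_pdftotext, has_text_layer) are made up for illustration. The extractor is passed in as a parameter so the check itself can be tried without any PDF at hand.

```python
import subprocess
from typing import Callable


def run_pdftotext(pdf_path: str) -> str:
    # "pdftotext FILE -" writes the extracted text to stdout (poppler-utils).
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def has_text_layer(pdf_path: str, extract: Callable[[str], str] = run_pdftotext) -> bool:
    # pdftotext separates pages with form feeds; strip() treats them as
    # whitespace, so a PDF without any text yields an empty string here.
    return bool(extract(pdf_path).strip())
```

Calling has_text_layer() on the OCR output and again on the file produced by lightweight mode would show whether the invisible layer survived that step.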

Any idea how we might be able to fix this issue?

Maybe this issue [2] is related to this one.


Best regards


[1]: https://github.com/ocrmypdf/OCRmyPDF
[2]: https://0xacab.org/jvoisin/mat2/-/issues/166