[mat-dev] mat2 on OCRed pdf

Author: mail-login+mat
Date:
To: mat-dev@boum.org
Subject: [mat-dev] mat2 on OCRed pdf
I'm really sorry, I forgot to set a proper subject in my last message. To avoid confusion, I am resending it with the right subject (otherwise it will look odd later in the conversation view).

And a small addition to my message:
Also note that after applying only the __remove_superficial_meta function, the pdf still contains the xmpmeta tags internally; the data is just no longer printed by pdfinfo or pdftk dump_data. So this isn't an ideal solution either (in the worst case we could fix it with some additional regex processing, but I don't think that would be a very nice solution).
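
A rough way to check for such leftovers is to search the raw file for the opening tag of an XMP packet. The following is only a minimal sketch: the function name contains_xmp_packet is made up for illustration, and it assumes the metadata stream is stored uncompressed (a packet inside a compressed stream would not be found this way).

```python
import re

# Opening tag of an XMP packet as it would appear in an uncompressed stream.
# Assumption: the metadata stream is not compressed or encrypted.
XMP_OPEN = re.compile(rb"<x:xmpmeta\b")


def contains_xmp_packet(pdf_bytes: bytes) -> bool:
    """Return True if an XMP opening tag is visible in the raw bytes."""
    return XMP_OPEN.search(pdf_bytes) is not None
```

It could be applied to a file by reading it in binary mode and passing the bytes to the function. A negative result does not prove that the metadata is gone, only that it is not visible in plain form.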

------- Forwarded Message -------
From: mail-login+mat--- via mat-dev <mat-dev@???>
Date: On Tuesday, January 30th, 2024 at 00:04
Subject: Re: [mat-dev] Welcome to the "mat-dev" mailing list
To: mat-dev@??? <mat-dev@???>
CC: mail-login+mat@??? <mail-login+mat@???>

>
>
> Hi,
>
> I quite recently stumbled upon mat2 while searching for a way to anonymize pdfs (after removing some content by drawing a black/white bar over it). So far, I think this is a fairly common use case for mat2. My special requirement is that I still want to be able to search the pdf afterwards.
>
> This is not a trivial task: simply keeping the text from the original pdf won't work, since that text might contain what was covered by a black/white bar. Therefore, rasterization is a must to ensure deleted text is really deleted (which is in fact what mat2 does).
>
> To still make the resulting pdf searchable, I thought: why not run some OCR afterwards (in my case I tried ocrmypdf [1])? These tools typically perform the actual recognition and add the resulting text to the pdf as an additional invisible layer.
>
> So my procedure is something like this:
> rasterize (e.g. with mat2) -> OCR
>
> But this still leaves some metadata in the resulting pdf (e.g. creator=ocrmypdf/tesseract or producer=pikepdf). So I tried using mat2 (in lightweight mode, of course, to keep the text) to remove this metadata as well. The issue is that after applying mat2, the hidden text layer also seems to be gone, which makes the whole process useless.
>
> I already had a look at the code and found that this must have something to do with the poppler/cairo rendering in __remove_all_lightweight: if I run only __remove_superficial_meta, the hidden layer is still present afterwards. But that is where I'm currently stuck, since I don't have much experience with the whole gi/cairo/poppler stack.
>
> Any idea how we might be able to fix this issue?
>
> Issue [2] may be related to this one.
>
>
> Best regards
>
>
> [1]: https://github.com/ocrmypdf/OCRmyPDF
> [2]: https://0xacab.org/jvoisin/mat2/-/issues/166
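
For reference, the procedure from the forwarded message (rasterize -> OCR -> lightweight metadata removal) can be sketched as a list of commands. This is only an illustration under a few assumptions: the helper build_pipeline and the ".ocr" file name are made up, mat2 is assumed to write its result next to the input as "<name>.cleaned.<ext>", and ocrmypdf is assumed to be called as "ocrmypdf input output". The sketch only builds the command lines and does not run anything.

```python
from pathlib import Path


def build_pipeline(source: Path) -> list[list[str]]:
    """Build the three command lines of the procedure for one pdf."""
    # Assumption: mat2 writes "<name>.cleaned.<ext>" next to the input.
    cleaned = source.with_name(f"{source.stem}.cleaned{source.suffix}")
    # Hypothetical name for the OCR output, chosen for this sketch only.
    ocred = source.with_name(f"{source.stem}.ocr{source.suffix}")
    return [
        # Step 1: full cleaning, which rasterizes the pages.
        ["mat2", str(source)],
        # Step 2: OCR adds the invisible text layer (and new metadata).
        ["ocrmypdf", str(cleaned), str(ocred)],
        # Step 3: lightweight cleaning; this is the step after which
        # the text layer is reported to be missing.
        ["mat2", "--lightweight", str(ocred)],
    ]
```

Each entry could be passed to subprocess.run(cmd, check=True) in order. The third step is the one that currently defeats the purpose, as described above.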