The Portable Document Format, or PDF, is one of the most popular file formats. PDFs are everywhere.
PDF has become a de facto global standard for more secure and dependable information exchange since Adobe published the complete PDF specification in 1993. Both government and private industry have come to rely on PDF for the volumes of electronic records that need to be more securely and reliably shared, managed, and in some cases preserved for generations.
— An excerpt from the ISO 32000-1 standard
As you can imagine, if you work with text, sooner or later you will need to deal with PDFs. However, it is still a format that causes headaches for people trying to add knowledge to PDFs or extract knowledge from them. That's why we have built a PDF annotation tool integrated within tagtog. Our goal is to facilitate the processing of PDFs for entity and relation extraction, document classification, manual annotation, and other tasks.
How does it look?
It is the same web interface as in tagtog, but you annotate directly over the native PDF.
You can annotate any text in the PDF: captions, text in images, figures, tables, etc. Just make sure the text is stored as text in the PDF and is not itself an image.
You can navigate between pages by clicking the arrows in the top-left corner of the document view. To go to a specific page, type the page number in the page text box and press Enter ↵.
How does it help you?
Processing PDF files is painful and full of tears. To solve this, we have built a PDF annotation tool. This is how it can help you:
Annotate over the PDF. Annotate directly over the native PDF layout. PDF files contain figures, pictures, tables, etc. If you strip out only the text, you destroy part of the original context and reduce the overall understanding of the document.
Train your own models easily. Annotate native PDFs, then use them to train your ML models as easily as if they were plain text! See below how to do it, and forget about processing PDFs yourself.
Fully integrated. The PDF annotation tool is a web component fully integrated with tagtog. Annotate relations, document labels, entity labels, etc.
Annotate any text. Sometimes image captions are important, text in tables is critical, etc. Don't miss this important source of knowledge: you can now annotate these and any other text that you find in the PDF.
Training your model to process PDF files
1. Annotate using the PDF annotation tool. Import a PDF and annotate it using the native layout.
2. Download the plain text and annotations. Use the API or user interface to download both the ann.json file with the annotations and the plain.html file with the plain text. The offsets in the annotations refer to the plain text.
3. Train your model. Use the annotations and the plain text to train your model.
4. Import new PDFs with the annotations from your model. Import the PDFs to tagtog and download the plain text using the API or user interface, then use this easy-to-digest text as input for your model. Now push the resulting annotations to tagtog. You can automate this process using webhooks: each time a PDF is imported, you automatically receive the plain.html file, so the annotations can be generated right away and pushed to tagtog.
5. Continuously train your model. Annotators review the annotations from your model over the native PDF layout and correct them. You can use the new ann.json files to update your model and increase the accuracy of your predictions over time.
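To make the link between the two downloaded files more concrete, here is a minimal Python sketch of how the offsets in ann.json could be lined up with the text in plain.html to produce labeled spans for training. It rests on assumptions that you should check against your own files: that ann.json contains an entities list whose items have classId, part, and offsets (each offset with start and text), that every part name matches the id attribute of an element in plain.html, and that start is a character offset within that element's text. The sample data, the class name PartTextExtractor, and the helper names extract_parts and entity_spans are made up for illustration.

```python
import json
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PartTextExtractor(HTMLParser):
    """Collect the text of every element that carries an id attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = {}   # element id -> accumulated text
        self._open = []   # stack of (tag, element id or None)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        part_id = dict(attrs).get("id")
        self._open.append((tag, part_id))
        if part_id is not None:
            self.parts[part_id] = ""

    def handle_endtag(self, tag):
        # Close the most recently opened element with this tag name.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Text belongs to every open element that has an id.
        for _, part_id in self._open:
            if part_id is not None:
                self.parts[part_id] += data


def extract_parts(html_text):
    """Return a dict mapping element id to its plain text."""
    parser = PartTextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.parts


def entity_spans(ann, parts):
    """Turn assumed ann.json entities into (part, label, start, end) spans."""
    spans = []
    for entity in ann.get("entities", []):
        part_text = parts.get(entity["part"], "")
        for offset in entity["offsets"]:
            start = offset["start"]
            end = start + len(offset["text"])
            spans.append({
                "part": entity["part"],
                "label": entity["classId"],
                "start": start,
                "end": end,
                "text": offset["text"],
                # True when the offset really points at the annotated text.
                "aligned": part_text[start:end] == offset["text"],
            })
    return spans


# Made-up sample data standing in for the downloaded files.
SAMPLE_HTML = (
    '<html><body><div>'
    '<p id="s1p1">Aspirin reduces fever.</p>'
    '<p id="s1p2">See Table 2.</p>'
    '</div></body></html>'
)
SAMPLE_ANN = json.loads(
    '{"entities": ['
    '{"classId": "e_1", "part": "s1p1",'
    ' "offsets": [{"start": 0, "text": "Aspirin"}]},'
    '{"classId": "e_2", "part": "s1p2",'
    ' "offsets": [{"start": 4, "text": "Table 2"}]}'
    ']}'
)

for span in entity_spans(SAMPLE_ANN, extract_parts(SAMPLE_HTML)):
    print(span)
```

The aligned flag is a cheap sanity check: if it is False for a span, the text or the annotations were probably transformed after download (for example, whitespace was normalized), and the offsets can no longer be trusted for training.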
How to activate this feature?
By default, this feature is turned off. You can change this setting independently for each project. To activate this feature, follow these instructions.
You cannot create an annotation that spans two pages, with one piece on one page and the other piece on the next. The main constraint is that the PDF footer interferes when creating an annotation across two pages.
Currently, pre-annotations are not available with this new layout.