Inputs & Outputs

Input types

These are the types of content you can import into tagtog.

Input type Description Default format
Text Plain text. verbatim
File See below for the supported file types. See below for the default formats for each file type. You can import one or more files in a single request.
URL Web address pointing to any website (e.g. http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000245.v1.p1) or resource (e.g. https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf).

See below for the default format associated with the type of file the URL points to.

E.g. if the URL points to a text file or a PDF file, that file is imported into tagtog and its default format is used. If the URL points to an HTML file, the text is extracted from the HTML content and imported into tagtog, using the default format for the html file type.
PMID PubMed is a free online database of references on life sciences. Each record in the PubMed database is assigned a unique identifying number, the PMID. A PMID is just a number, e.g. 12781165; the prefixed form PMID12781165 is also accepted. You can enter a comma-separated list of PMIDs and each document will be uploaded, e.g. 25821226,12781165. You can find this ID at the bottom of the document's page on PubMed. Bio XML format
PMCID PubMed Central® (PMC) is a free archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine (NIH/NLM). Each record in the PubMed Central database is assigned a unique identifying number, the PMCID. A PMCID is a number with the PMC prefix, e.g. PMC165443. You can enter a comma-separated list of PMCIDs and each document will be uploaded, e.g. PMC165443,PMC165213. You can usually find this ID at the top of the document's page on PubMed Central. This feature relies on the availability of the PubMed provider. Bio XML format
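
The identifier rules described above (a PMID is a bare number, optionally written with a PMID prefix; a PMCID is PMC plus a number; lists are comma-separated) can be checked on the client side before submitting. The following is a minimal sketch under those assumptions only. The function names are hypothetical and are not part of tagtog.

```python
import re

# Patterns derived from the descriptions above, not from a tagtog specification.
_PMID_RE = re.compile(r"(?:PMID)?([0-9]+)")
_PMCID_RE = re.compile(r"PMC[0-9]+")


def split_ids(raw: str) -> list[str]:
    """Split a comma-separated list and drop surrounding whitespace."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def normalize_pmids(raw: str) -> list[str]:
    """Return bare PMID numbers; accepts '12781165' and 'PMID12781165'."""
    result = []
    for item in split_ids(raw):
        match = _PMID_RE.fullmatch(item)
        if match is None:
            raise ValueError(f"not a PMID: {item!r}")
        result.append(match.group(1))
    return result


def validate_pmcids(raw: str) -> list[str]:
    """Return the PMCIDs unchanged, rejecting anything without the PMC prefix."""
    items = split_ids(raw)
    for item in items:
        if _PMCID_RE.fullmatch(item) is None:
            raise ValueError(f"not a PMCID: {item!r}")
    return items
```

For instance, `normalize_pmids("PMID12781165")` yields the bare number, while `validate_pmcids("165443")` raises a `ValueError` because the PMC prefix is missing.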

Files

You can import files into tagtog. The following file types are supported.

File extension Description Default format
txt Any plain text file verbatim
md (Markdown)

Any Markdown file, supporting a subset of the CommonMark spec. Go to documentation.

Using Markdown you can also use tagtog blocks to build a customized annotation layout for your project! E.g. question answering datasets, chatbot training, tweets, etc.

markdown
pdf Two variants are possible: Native PDF (supported on Cloud-Large and On-Premises ML only) to annotate directly on top of the PDF, and Simple PDF to annotate on a stripped-out plain text representation of the PDF.

Native PDF format if native PDF is activated

Simple PDF format if native PDF is not activated

html Sections are not recognized. Currently, the text content is just stripped out. HTML format
csv and tsv Go to documentation. CSV format or TSV format
source code files Supported programming language extensions include: .c, .coffee, .cpp, .cs, .css, .diff, .go, .h, .java, .js, .jsx, .less, .log, .m, .matlab, .mm, .patch, .php, .pl, .py, .python, .r, .rb, .sass, .scala, .sh, .shell, .sql, .swift, .ts, .tsx, .vb markdown
xml

NCBI Journal Publishing Tag Set (versions JATS 1.0 and NLM 2.x and 3.0). This includes all PLOS journals or F1000Research articles.

BioMed Central format. This includes all articles in BioMed Central, ChemistryCentral, or SpringerOpen, among others.

Bio XML format
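
To make the table above easier to apply in a script, here is a rough sketch that maps a file name to the default format listed for its extension. The function name and the `native_pdf` flag are hypothetical. Only verbatim and markdown appear above as literal format names; the other return values are the descriptive labels from the table, not confirmed API identifiers.

```python
from pathlib import PurePath

# Source code extensions listed in the table above; all default to markdown.
SOURCE_CODE_EXTENSIONS = {
    "c", "coffee", "cpp", "cs", "css", "diff", "go", "h", "java", "js",
    "jsx", "less", "log", "m", "matlab", "mm", "patch", "php", "pl", "py",
    "python", "r", "rb", "sass", "scala", "sh", "shell", "sql", "swift",
    "ts", "tsx", "vb",
}

# Extensions with a single default format, as listed in the table.
FIXED_DEFAULTS = {
    "txt": "verbatim",
    "md": "markdown",
    "html": "HTML format",
    "csv": "CSV format",
    "tsv": "TSV format",
    "xml": "Bio XML format",
}


def default_format(filename: str, native_pdf: bool = False) -> str:
    """Look up the default format for a file, based on its extension only."""
    ext = PurePath(filename).suffix.lower().lstrip(".")
    if ext == "pdf":
        # The PDF default depends on whether native PDF is activated.
        return "Native PDF format" if native_pdf else "Simple PDF format"
    if ext in SOURCE_CODE_EXTENSIONS:
        return "markdown"
    if ext in FIXED_DEFAULTS:
        return FIXED_DEFAULTS[ext]
    raise ValueError(f"unsupported file type: {filename!r}")
```

The lookup is case-insensitive, so `default_format("paper.PDF", native_pdf=True)` resolves to the native PDF label.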

Bundle files

File extension Description
tar.gz Gzipped tarball. Bundle of files in accepted formats. Coming soon.
zip Zip file. Bundle of files in accepted formats. Coming soon.

Input formats

If there is no format specified, the default format for the content imported is used.

In the API, use the format parameter to set the format. In the GUI, open the Advanced options under the upload panel to select a format. Either way, you explicitly "force" tagtog to represent the content using the selected format.

The available formats are listed below. Some are used when you import only content; others should be used when you import pre-annotated content. The latter are useful if you want to import documents that were annotated outside tagtog (for example, by your own machine learning model) or if you want to update the annotations of a specific document.

Content Format Description
Only content verbatim Parsed as already pre-formatted. No transformation at all is applied to the given content. This is ideal, for instance, for files that contain arbitrary indentation or white space. It creates a single block with the whole content. It is the simplest option if you are dealing with plain text or simple text files. Example.
markdown The content is expected to follow the markdown syntax. The content will be formatted and visualized as markdown (e.g. you can include images, different sections, lists, code blocks, etc.).
formatted

The content is formatted and cleaned. For example, one content part is created for each paragraph. Ideal if your content has different discourse units, for example chatbot conversations.

Up to tagtog version 3.2020-W30.1, this was the default mode when a user imported plain text. If you are using the API and want to pre-annotate with annotations created during that period, please use formatted-plus-annjson. See below.

Pre-annotated content default-plus-annjson

Use it if you are importing pre-annotated documents (content + ann.json) and you want the content to be recognized using the default format.

For example to import pre-annotated PDFs, plain text or markdown files.

Choose this option if you are not sure which format to use when sending pre-annotated documents.

Example

formatted-plus-annjson

Analogous to default-plus-annjson, and complementary to the formatted format. Use it if you are sending pre-annotated content (content + ann.json) and you want the content to be recognized with the formatted format (instead of the default).

Example

nativepdfv1-plus-annjson

Analogous to default-plus-annjson, and complementary to the nativepdfv1 format. Use this format when you have old annotations that you want to import into tagtog and that were created using the native PDF editor.

You can verify which format was originally used in the plain.html file. If you don't have access to this information, assume that any native PDF annotations generated up to the 3.2020-W28.2 version should use this format.

anndoc Use the anndoc format to import a pre-annotated plain.html (plain.html + ann.json). Example.
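
As an illustration of forcing a format through the API, the sketch below builds a query string that carries the format parameter and rejects names that are not in the table above. Only the format parameter and the format names come from this page. The `project`, `owner`, and `output` parameter names, and the function itself, are assumptions made for the example; check them against the API reference before use.

```python
from typing import Optional
from urllib.parse import urlencode

# Format names taken from the table above.
KNOWN_FORMATS = {
    "verbatim",
    "markdown",
    "formatted",
    "default-plus-annjson",
    "formatted-plus-annjson",
    "nativepdfv1-plus-annjson",
    "anndoc",
}


def build_import_query(project: str, owner: str,
                       fmt: Optional[str] = None,
                       output: str = "null") -> str:
    """Build the query string for an import request (hypothetical helper).

    When fmt is None the format parameter is omitted, so the default
    format for the imported content applies.
    """
    # "project", "owner" and "output" are assumed parameter names.
    params = {"project": project, "owner": owner, "output": output}
    if fmt is not None:
        if fmt not in KNOWN_FORMATS:
            raise ValueError(f"unknown format: {fmt!r}")
        params["format"] = fmt
    return urlencode(params)
```

Leaving `fmt` unset mirrors the documented behavior: with no format specified, the default format for the content is used.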

Output formats

Find below the available output formats in tagtog. Some of these outputs are available through the GUI and others through the API.

Format Description Type GUI API
ann.json This is the official format for annotations. It supports all the annotation tasks in tagtog. Documentation. Only annotations
entitiestsv Tabulated annotation format, with both plain content and annotations. It closely resembles the output by the Stanford NER tool. Documentation. Content + annotations
entitiesonlyclassestsv Tabulated annotation format, with both partial plain content and annotations. Similar to entitiestsv. The non-labeled text is not included, and it supports overlapping entities. Documentation. Only annotations
plain.html, html, xml This is the official representation of the content imported. Any piece of text/document you import to tagtog is converted to plain content and the annotation offsets always refer back to this format. No annotations provided within this format, only content. Documentation. Only content
txt Plain text. No annotations provided within this format, only content. Only content
orig, original The originally submitted file (e.g. the original html or pdf document that was imported to tagtog). Only content
visualize This is the default value. If the User Agent is a recognized browser, the document is returned directly as a web page: web when tagtog project information was given, or web-editor-only when no project was given. Otherwise (typically when the User Agent is a command-line program), the weburl is returned. Visualization
web Visual representation of the document and its annotations on the tagtog web interface (HTML page). Visualization
web-editor-only Analogous to web, but without the information of a tagtog project, i.e., only the document editor layout. Useful if you want to create iFrames in your web app. Visualization
weburl URL of the annotated document at tagtog web interface. Visualization
csv List of the project's documents and the status of their master (ground truth) annotation version. Currently, it only works with a search query (i.e. with the API parameter search). Search
null Special output to signify that no document output is desired. A JSON response of the request will be returned instead. For example, when importing a document:
{
  "ok": 1,       // number of documents successfully changed
  "errors": 0,   // number of documents with errors
  "items": [     // list of documents changed
    {
      "origid": "text",
      "names": ["text.txt"],
      "tagtogID": "aOM6EFIvULWc6J.7MAYQB3V2sF84-text",
      "result": "created"
    }
  ],
  "warnings": []
}

You can use this output, for example, if you need the API to return the ID of each imported document.

Operation result
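
A small sketch of how the JSON response for the null output might be consumed to recover the ID of each imported document. It assumes the response body has the shape shown in the example above; the explanatory comments from that listing are omitted here because standard JSON has no comment syntax. The function name and the sample constant are hypothetical.

```python
import json

# Sample body mirroring the example response above, without the comments.
SAMPLE_RESPONSE = """
{
  "ok": 1,
  "errors": 0,
  "items": [
    {
      "origid": "text",
      "names": ["text.txt"],
      "tagtogID": "aOM6EFIvULWc6J.7MAYQB3V2sF84-text",
      "result": "created"
    }
  ],
  "warnings": []
}
"""


def imported_ids(response_text: str) -> dict[str, str]:
    """Map each original id to the tagtogID assigned on import."""
    body = json.loads(response_text)
    return {item["origid"]: item["tagtogID"] for item in body.get("items", [])}


def had_errors(response_text: str) -> bool:
    """Report whether the response counts any documents with errors."""
    return json.loads(response_text).get("errors", 0) > 0
```

Keeping the mapping keyed by `origid` makes it straightforward to match the returned IDs back to the documents you submitted.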

All output formats are returned in their latest format versions. The output format versions cannot be chosen.

Other formats?

We are currently experimenting with other formats to ease your work. Stay tuned.