Train your own models


You can combine the tagtog interface with the API to train your own models in an Active Learning (AL) fashion. The workflow is simple:

1. Train a seed version of your in-house model (e.g. using scikit-learn). Have this model annotate new documents and upload them to tagtog through the API.

2. Review the newly annotated and uploaded documents with your team in the tagtog interface. The human reviewers, typically subject-matter experts (SMEs), go through the predicted annotations and accept, reject, or change them as they see fit. You will likely want your team to review documents selected in an AL fashion, i.e. the documents the model is least sure about.

3. Download the reviewed documents using the API, and use them to re-train your model.


You can repeat this workflow as many times as needed to continuously re-train your models.
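One common way to select documents "in an AL fashion" is least-confidence sampling: send to review the documents whose most likely predicted class has the lowest probability. The sketch below is only an illustration of that idea, not part of tagtog. The function name `least_confident`, the document ids, and the probabilities are all hypothetical; in practice the probabilities could come from something like a scikit-learn classifier's `predict_proba`.

```python
from typing import Dict, List, Sequence


def least_confident(doc_probs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k documents whose top predicted class has the lowest probability."""
    # Confidence of a document = probability of its most likely class.
    confidence = {doc_id: max(probs) for doc_id, probs in doc_probs.items()}
    # Sort ascending by confidence, breaking ties by id for deterministic output.
    ranked = sorted(confidence, key=lambda doc_id: (confidence[doc_id], doc_id))
    return ranked[:k]


# Hypothetical class probabilities for three documents.
doc_probs = {
    "doc_a": [0.98, 0.02],
    "doc_b": [0.55, 0.45],
    "doc_c": [0.70, 0.30],
}
to_review = least_confident(doc_probs, k=2)
print(to_review)  # ['doc_b', 'doc_c']
```

The selected documents are the ones you would annotate with your model and upload for human review first; confidently predicted documents can wait for a later round.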

Training flow.

How to upload annotated documents?

Use the anndoc format to upload both a document's content and its annotations. To do this, a common workflow is the following:

1. Upload your document(s) using any supported format.

2. Download the content of the document(s) in plain.html format. Have your model read the text content of the HTML and generate the annotations in ann.json.

3. Upload both the plain.html and the ann.json of the document (i.e., 2 files) to tagtog in the same API request. Both files must have the same name, except for the final extension. That way, tagtog understands that both files represent the same document. You can also send multiple annotated documents at the same time, which means you always upload an even number of files.
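The naming rule in step 3 can be enforced with a small helper that builds the multipart `files` list for several annotated documents at once. This is a minimal sketch: `build_anndoc_files` and the document names and contents are hypothetical, and the tuple layout simply mirrors the `('file', (name, content))` form that the requests library accepts for its `files` argument.

```python
from typing import Dict, List, Tuple

FilePart = Tuple[str, Tuple[str, str]]


def build_anndoc_files(docs: Dict[str, Tuple[str, str]]) -> List[FilePart]:
    """Build the `files` argument for requests.post from {base_name: (plain_html, ann_json)}."""
    files: List[FilePart] = []
    for base_name, (plain_html, ann_json) in docs.items():
        # Same base name for both files; only the final extension differs.
        files.append(("file", (base_name + ".plain.html", plain_html)))
        files.append(("file", (base_name + ".ann.json", ann_json)))
    return files


# Placeholder contents; real ones come from tagtog (plain.html) and your model (ann.json).
docs = {
    "report1": ("<html>...</html>", "{}"),
    "report2": ("<html>...</html>", "{}"),
}
files = build_anndoc_files(docs)
print([name for _, (name, _) in files])
```

Because every document contributes exactly two entries, the resulting list always has an even number of files.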


This example shows how to upload a document, download it in plain.html format, annotate it with your model to generate an ann.json, and upload the whole annotated document (plain.html + ann.json) to tagtog.

  import requests

  tagtogAPIUrl = "https://www.tagtog.net/-api/documents/v1"

  auth = requests.auth.HTTPBasicAuth(username='yourUsername', password='yourPassword')
  params = {'project':'yourProjectName', 'owner': 'yourUsername', 'output':'html'}
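  # Upload the text file; with output=html, the response body is the document in plain.html format
  with open('text.txt', 'rb') as f:
      files = [('file', f)]
      response = requests.post(tagtogAPIUrl, params=params, auth=auth, files=files)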

  # The plain.html, returned as a string in the response. Your model will have to parse the text out of the html
  plain_html = response.text
  # Annotate the html with your model. We assume it returns the ann.json as a string too; you could write it to a file if desired
  ann_json = my_model_annotate(plain_html)
  # ann_json = '{"annotatable":{"parts":["s1p1","s1p2"]},"anncomplete":false,"sources":[],"metas":{},"entities":[{"classId":"e_1","part":"s1p1","offsets":[{"start":12,"text":"first sentence of the first paragrap"}],"confidence":{"state":"pre-added","who":["user:yourUsername"],"prob":1},"fields":{},"normalizations":{}}],"relations":[]}'

  # Submit both the plain.html + ann.json in the same request. The important thing: they must have the same name, except for the file extensions
  files = [('file', ('text.plain.html', plain_html)), ('file', ('text.ann.json', ann_json))]
  params['format'] = 'anndoc'
  params['output'] = 'null'
  response = requests.post(tagtogAPIUrl, params=params, auth=auth, files=files)

  print(response.text)
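The example above leaves `my_model_annotate` undefined. The following is a toy stand-in, not a real model: it tags every occurrence of a keyword as an entity and emits a JSON string with the same fields as the commented ann.json sample above. Several things here are assumptions rather than documented behavior: that each annotatable part is an HTML element whose `id` looks like `s1p1`, that `start` is the character offset within that part's text, and the `keyword`, `class_id`, and `who` defaults. Check these against your own plain.html files before relying on it.

```python
import json
import re
from html.parser import HTMLParser

PART_ID = re.compile(r"^s\d+[a-z]+\d+$")  # assumed part id pattern, e.g. "s1p1"


class PartExtractor(HTMLParser):
    """Collect the text of every element whose id looks like a part id."""

    def __init__(self):
        super().__init__()
        self.parts = {}       # part id -> text, in document order
        self._current = None  # (part id, tag name) of the part being read

    def handle_starttag(self, tag, attrs):
        part_id = dict(attrs).get("id")
        if self._current is None and part_id and PART_ID.match(part_id):
            self._current = (part_id, tag)
            self.parts[part_id] = ""

    def handle_endtag(self, tag):
        # Simplification: the part ends at the first closing tag of the same name.
        if self._current and self._current[1] == tag:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.parts[self._current[0]] += data


def my_model_annotate(plain_html, keyword="tagtog", class_id="e_1", who="user:yourUsername"):
    """Toy stand-in for a model: tag every occurrence of `keyword` as entity `class_id`."""
    extractor = PartExtractor()
    extractor.feed(plain_html)
    entities = []
    for part_id, text in extractor.parts.items():
        for match in re.finditer(re.escape(keyword), text):
            entities.append({
                "classId": class_id,
                "part": part_id,
                "offsets": [{"start": match.start(), "text": match.group()}],
                "confidence": {"state": "pre-added", "who": [who], "prob": 1},
                "fields": {},
                "normalizations": {},
            })
    return json.dumps({
        "annotatable": {"parts": list(extractor.parts)},
        "anncomplete": False,
        "sources": [],
        "metas": {},
        "entities": entities,
        "relations": [],
    })


sample_html = '<p id="s1p1">I use tagtog daily.</p><p id="s1p2">No match here.</p>'
print(my_model_annotate(sample_html))
```

A real model would replace the keyword search with its own predictions and set `prob` to the model's confidence, while keeping the surrounding JSON structure unchanged.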