Doccano JSONL to spaCy v2.0 JSON format

Data exported from Doccano is usually in JSONL format and references the entities in the text differently from spaCy v2.0. So far I haven't found a way to convert Doccano-formatted data directly to a format ready for use with spaCy v3.0, so converting to v2.0 as an intermediate step will have to do for now.

IMPORTANT!!!: DO NOT EVER WRITE TO evaluation_for_v3.1+.json - OVERWRITING THE EVALUATION SET WILL LIKELY BRING IN DATASETS MODEL V3.1 HAS SEEN BEFORE, RENDERING THE EVALUATION SET BIASED TOWARDS MODEL v3.1 AND USELESS FOR EVALUATION OF MODEL V3.1

Import the json and random modules

In [ ]:

import json
import random

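Use json.loads() to parse each line of the JSONL file. Then, for each line, reorganise the text and labels into the spaCy [text, {"entities": label}] format. There are two versions of the code with the same functionality: the first handles multiple files, and the second (deprecated and commented out) handles a single file.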

In [ ]:

results = []

In [ ]:

list_of_doccano_annotation_files = [
    "../../data/doccano_annotated_data/edited_annotations.jsonl",
    "../../data/doccano_annotated_data/v3.1_add_data.jsonl",
]

In [ ]:

for doccano_file in list_of_doccano_annotation_files:
    with open(doccano_file, encoding="utf-8") as jsonl_annotations:
        for line in jsonl_annotations:
            annotation_line = json.loads(line)
            # The raw text is stored under either "data" or "text"
            if "data" in annotation_line:
                text = annotation_line["data"]
            elif "text" in annotation_line:
                text = annotation_line["text"]
            else:
                # Skip records with no text instead of re-appending the previous result
                continue
            results.append([text, {"entities": annotation_line["label"]}])

In [ ]:

# Show the index and content of the last converted annotation
print(len(results) - 1, results[-1])
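
To make the reshaping concrete, here is a minimal sketch of the same conversion applied to a single record. The sample line and the helper name `doccano_to_spacy_v2` are made up for illustration. The `[start, end, label]` layout of the `label` field is an assumption about the Doccano export, based on it being passed straight through as the spaCy entities above.

```python
import json

def doccano_to_spacy_v2(jsonl_line):
    """Reshape one Doccano JSONL line into [text, {"entities": label}]."""
    record = json.loads(jsonl_line)
    # Prefer the "data" key and fall back to "text", as in the loop above.
    text = record["data"] if "data" in record else record["text"]
    return [text, {"entities": record["label"]}]

# Hypothetical sample line, not taken from the real annotation files.
sample_line = '{"text": "Alice lives in Paris", "label": [[0, 5, "PERSON"], [15, 20, "GPE"]]}'
converted = doccano_to_spacy_v2(sample_line)
print(converted)
```

Slicing the text with the character offsets, for example `converted[0][15:20]`, is a quick way to check that each span covers the intended entity.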

In [ ]:

# DEPRECATED: single-file conversion of Doccano-format annotations to spaCy v2 format; superseded by the version above, which handles multiple files

# with open("../../data/doccano_annotated_data/edited_annotations.jsonl") as annotations_in_jsonl:
#    for line in annotations_in_jsonl:
#       j_line=json.loads(line)
#       # Reorganise data to spaCy's [text, {"entities":label}] format
#       line_results = [j_line['data'], {"entities":j_line['label']}]
#       results.append(line_results)

Shuffle the annotations. The Training : Validation : Evaluation split is 4 : 1 : 1 in this case.

In [ ]:

random.shuffle(results)

In [ ]:

evaluation_start = int(len(results) / 6 * 5)
evaluation_end = len(results)

training_end = int(len(results) / 6 * 4)
validation_start = training_end
validation_end = evaluation_start

In [ ]:

training_set = results[0:training_end]
validation_set = results[validation_start:validation_end]
evaluation_set = results[evaluation_start:evaluation_end]

In [ ]:

print(len(training_set), len(validation_set), len(evaluation_set))
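
The shuffle above uses the unseeded module-level generator, so every run produces a different split. If a repeatable split is wanted, one option is a seeded `random.Random` instance. The sketch below keeps the same 4 : 1 : 1 proportions but computes the boundaries with integer arithmetic. The function name `split_4_1_1` and the seed value are hypothetical and not part of the original notebook.

```python
import random

def split_4_1_1(items, seed=0):
    """Shuffle a copy of items with a fixed seed, then split it 4 : 1 : 1."""
    shuffled = list(items)
    # A dedicated Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    training_end = n * 4 // 6
    evaluation_start = n * 5 // 6
    return (
        shuffled[:training_end],
        shuffled[training_end:evaluation_start],
        shuffled[evaluation_start:],
    )

train, val, evaluation = split_4_1_1(range(12), seed=42)
print(len(train), len(val), len(evaluation))  # 8 2 2
```

Calling the function again with the same input and seed returns the same three lists, which keeps the evaluation set stable between runs.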

Save to JSON file for further conversion to spaCy v3.0 format.

NOTE: The write locations have not been updated, to protect the existing datasets from being overwritten. Modify the locations that the data will be written to before continuing.

In [ ]:

save_data_path = "../../data/training_datasets/"

def save_data(file, data):
    with open(save_data_path + file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

save_data("full_train_data_v3.1.json", results)
save_data("training_set_v3.1.json", training_set)
save_data("validation_set_v3.1.json", validation_set)
save_data("golden_set.json", evaluation_set)
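
Given the warnings above about overwriting existing datasets, one possible safeguard is to open files in exclusive-creation mode (`"x"`), which raises `FileExistsError` if the target is already there. This is only a sketch: the name `save_data_no_overwrite`, its `directory` parameter, and the demo file name are hypothetical, and the demonstration writes to a temporary directory rather than the real dataset folder.

```python
import json
import os
import tempfile

def save_data_no_overwrite(directory, file, data):
    """Write data as JSON, but fail if the target file already exists."""
    path = os.path.join(directory, file)
    # Mode "x" creates the file and raises FileExistsError if it already exists.
    with open(path, "x", encoding="utf-8") as f:
        json.dump(data, f, indent=4)
    return path

# Demonstration in a throwaway directory.
demo_dir = tempfile.mkdtemp()
save_data_no_overwrite(demo_dir, "demo_set.json", [["some text", {"entities": []}]])

try:
    # A second write to the same name should be refused.
    save_data_no_overwrite(demo_dir, "demo_set.json", [])
    second_write_blocked = False
except FileExistsError:
    second_write_blocked = True

print(second_write_blocked)  # True
```

With this variant, an accidental rerun of the save cell stops with an error instead of silently replacing an evaluation set.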
