GitHub Repository: drgnfrts/Singapore-Locations-NER
Path: blob/main/training_scripts/doccano_base_training/doccano_to_spacy2.ipynb
Kernel: Python 3.8.9 ('venv': venv)

Doccano JSONL to spaCy v2.0 JSON format

Data exported from Doccano is usually in JSONL format and references the entities in the text differently from the spaCy v2.0 format. So far I haven't found a way to convert Doccano-formatted data directly into a format ready for use with spaCy v3.0, so converting to v2.0 as an intermediate step will have to do for now.

IMPORTANT!!!: DO NOT EVER WRITE TO evaluation_for_v3.1+.json - OVERWRITING THE EVALUATION SET WILL LIKELY BRING IN DATASETS MODEL V3.1 HAS SEEN BEFORE, RENDERING THE EVALUATION SET BIASED TOWARDS MODEL v3.1 AND USELESS FOR EVALUATION OF MODEL V3.1

Import JSON and Random

import json
import random

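Use json.loads() to parse each line of the JSONL file. Then, for each line, reorganise the data and labels into the spaCy [text, {"entities": label}] format. There are two versions of the code with the same functionality: the first handles multiple input files, and the second is the deprecated single-file version, kept commented out for reference.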

results = []
list_of_doccano_annotation_files = [
    "../../data/doccano_annotated_data/edited_annotations.jsonl",
    "../../data/doccano_annotated_data/v3.1_add_data.jsonl",
]
for doccano_file in list_of_doccano_annotation_files:
    with open(doccano_file) as jsonl_annotations:
        for line in jsonl_annotations:
            annotation_line = json.loads(line)
            # The annotated text is stored under either "data" or "text"
            if "data" in annotation_line:
                line_results = [annotation_line['data'], {"entities": annotation_line['label']}]
            elif "text" in annotation_line:
                line_results = [annotation_line['text'], {"entities": annotation_line['label']}]
            else:
                # Skip records with no text so a stale line_results is not appended again
                continue
            results.append(line_results)
# Print the index and contents of the last converted annotation
print(len(results) - 1, results[len(results) - 1])
# DEPRECATED: Conversion of Doccano-format annotations to spaCy2-format updated above to enable multiple files
# with open("../../data/doccano_annotated_data/edited_annotations.jsonl") as annotations_in_jsonl:
#     for line in annotations_in_jsonl:
#         j_line = json.loads(line)
#         # Reorganise data to spaCy's [text, {"entities":label}] format
#         line_results = [j_line['data'], {"entities": j_line['label']}]
#         results.append(line_results)
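To make the reshaping concrete, here is a minimal sketch of the same conversion applied to a single in-memory line. The sample sentence, the "LOC" label name and the helper doccano_line_to_spacy are made up for illustration and are not taken from the actual dataset; the sketch assumes each Doccano label is a [start, end, label] list.

```python
import json


def doccano_line_to_spacy(line):
    """Convert one Doccano JSONL line to the spaCy v2 [text, {"entities": ...}] shape.

    Hypothetical helper for illustration only.
    """
    record = json.loads(line)
    # The text may sit under "data" or "text", depending on the export
    text = record.get("data", record.get("text"))
    if text is None:
        return None  # nothing to convert
    return [text, {"entities": record.get("label", [])}]


# Made-up sample line; the label name "LOC" is an assumption
sample = '{"id": 1, "data": "I live in Bishan.", "label": [[10, 16, "LOC"]]}'
converted = doccano_line_to_spacy(sample)
print(converted)

# The character offsets should slice out the annotated span
start, end, label = converted[1]["entities"][0]
print(converted[0][start:end], label)
```

Slicing the text with the offsets like this is a quick way to spot-check that the annotations still line up after conversion.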

Shuffle the annotations, then split them into training, validation and evaluation sets in a 4 : 1 : 1 ratio.

random.shuffle(results)
evaluation_start = int(len(results) / 6 * 5)
evaluation_end = int(len(results))
training_end = int(len(results) / 6 * 4)
validation_start = training_end
validation_end = evaluation_start
training_set = results[0:training_end]
validation_set = results[validation_start:validation_end]
evaluation_set = results[evaluation_start:evaluation_end]
print(len(training_set), len(validation_set), len(evaluation_set))
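The split can also be wrapped in a small function, which makes it easier to check that the three sets are disjoint and together cover every annotation. This is only a sketch: the function name split_4_1_1 and the optional seed are my own additions (the cell above shuffles without a seed), and it uses integer division rather than int() on a float.

```python
import random


def split_4_1_1(items, seed=None):
    """Shuffle a copy of items and split it 4 : 1 : 1 (hypothetical helper)."""
    shuffled = list(items)
    # A seeded Random instance makes the shuffle repeatable; seed=None stays random
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    training_end = n * 4 // 6
    evaluation_start = n * 5 // 6
    training = shuffled[:training_end]
    validation = shuffled[training_end:evaluation_start]
    evaluation = shuffled[evaluation_start:]
    return training, validation, evaluation


# Stand-in data: 60 integers instead of real annotations
train, val, evaluation = split_4_1_1(range(60), seed=0)
print(len(train), len(val), len(evaluation))  # 40 10 10
```

Fixing the seed would let the same split be regenerated later, which may or may not be wanted when new annotation files are added.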

Save to JSON file for further conversion to spaCy v3.0 format.

NOTE: The write locations have not been updated, to protect the datasets as they exist now from being overwritten. Modify the locations that the data will be written to before continuing.

IMPORTANT!!!: DO NOT EVER WRITE TO evaluation_for_v3.1+.json - OVERWRITING THE EVALUATION SET WILL LIKELY BRING IN DATASETS MODEL V3.1 HAS SEEN BEFORE, RENDERING THE EVALUATION SET BIASED TOWARDS IT AND USELESS

save_data_path = "../../data/training_datasets/"

def save_data(file, data):
    with open(save_data_path + file, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4)

save_data("full_train_data_v3.1.json", results)
save_data("training_set_v3.1.json", training_set)
save_data("validation_set_v3.1.json", validation_set)
save_data("golden_set.json", evaluation_set)
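Given the warnings about overwriting existing datasets, one possible safeguard is to open the output file in exclusive-creation mode ("x"), which raises FileExistsError rather than replacing a file that is already there. The sketch below is not what this notebook does; save_data_no_overwrite is a hypothetical variant of save_data, and it writes to a temporary directory so it can be run without touching the real data folders.

```python
import json
import os
import tempfile


def save_data_no_overwrite(path, data):
    """Write data as JSON, refusing to replace an existing file (hypothetical helper)."""
    # Mode "x" creates the file and raises FileExistsError if it already exists
    with open(path, "x", encoding="utf-8") as f:
        json.dump(data, f, indent=4)


# Demo in a throwaway directory; the file name is made up
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_set.json")
save_data_no_overwrite(demo_path, [["some text", {"entities": []}]])

try:
    # A second write to the same path should be rejected
    save_data_no_overwrite(demo_path, [])
    overwritten = True
except FileExistsError:
    overwritten = False

print(overwritten)  # False
```

With this approach, regenerating a dataset would require deleting or renaming the old file on purpose, which is a deliberate extra step.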