Doccano JSONL to spaCy v2.0 JSON format
Data exported from Doccano is usually in JSONL format and references the entities in the text differently from spaCy v2.0. So far I haven't found a way to convert Doccano-formatted data directly to a format ready for use with spaCy v3.0, so converting to v2.0 as an intermediate step will have to do for now.
IMPORTANT!!!: DO NOT EVER WRITE TO evaluation_for_v3.1+.json. OVERWRITING THE EVALUATION SET WILL LIKELY BRING IN DATA THAT MODEL V3.1 HAS SEEN BEFORE, BIASING THE EVALUATION SET TOWARDS MODEL V3.1 AND MAKING IT USELESS FOR EVALUATING MODEL V3.1.
Import the json and random modules.
Use json.loads() to parse each line of the JSONL file. Then, for each line, reorganise the text and labels into the spaCy [text, {"entities": label}] format. There are two sets of code with the same functionality.
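As a minimal sketch of this conversion step (not the notebook's own code), the function below parses each JSONL line and rebuilds it as a [text, {"entities": ...}] pair. It assumes each Doccano record stores the text under "text" and the spans under "label" as [start, end, label] triples; the key name varies between Doccano exports, so `label_key` is a hypothetical parameter to adjust. The sample record is made up for illustration.

```python
import json


def doccano_to_spacy2(jsonl_lines, label_key="label"):
    """Convert Doccano JSONL lines to spaCy v2-style [text, {"entities": [...]}] pairs.

    Assumption: each record looks like
    {"text": "...", "<label_key>": [[start, end, "LABEL"], ...]}.
    """
    annotations = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline at end of file
        record = json.loads(line)
        # Keep each span as [start, end, label]; records without spans get an empty list.
        entities = [[start, end, label] for start, end, label in record.get(label_key, [])]
        annotations.append([record["text"], {"entities": entities}])
    return annotations


# Made-up sample line standing in for a real Doccano export.
sample = [
    '{"text": "Apple is based in Cupertino.", "label": [[0, 5, "ORG"], [18, 27, "GPE"]]}',
]
converted = doccano_to_spacy2(sample)
```

With a real export, the lines would come from iterating over the opened JSONL file instead of the `sample` list.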
Shuffle the annotations. The training : validation : evaluation split is 4 : 1 : 1 in this case.
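One way to sketch the shuffle and the 4 : 1 : 1 split is shown here. The function name, the fixed seed and the integer-division cut points are assumptions for illustration; the notebook may split differently.

```python
import random


def split_4_1_1(annotations, seed=0):
    """Shuffle a copy of the annotations and split it 4 : 1 : 1.

    Returns (training, validation, evaluation). The seed is an assumption,
    added only so the example is reproducible.
    """
    shuffled = list(annotations)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = n * 4 // 6  # first four sixths go to training
    valid_end = n * 5 // 6  # next sixth goes to validation, the rest to evaluation
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]


# Twelve placeholder items stand in for converted annotations.
items = list(range(12))
train, valid, evaluation = split_4_1_1(items)
```

When the number of annotations is not a multiple of six, integer division decides which split absorbs the remainder, so the ratio is only approximate.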
Save each split to a JSON file for further conversion to the spaCy v3.0 format.
NOTE: The write locations have not been updated, to protect the datasets as they exist now from being overwritten. Modify the locations that the data will be written to before continuing.
IMPORTANT!!!: DO NOT EVER WRITE TO evaluation_for_v3.1+.json. OVERWRITING THE EVALUATION SET WILL LIKELY BRING IN DATA THAT MODEL V3.1 HAS SEEN BEFORE, BIASING THE EVALUATION SET TOWARDS IT AND MAKING IT USELESS.
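The save step, with a guard reflecting the warning above, might look like the following sketch. The helper name, the `PROTECTED` set and the output file name are hypothetical; only evaluation_for_v3.1+.json comes from this notebook. Opening the file in "x" mode is a design choice made here so that any existing file, not just the protected one, is never silently overwritten.

```python
import json
import os
import tempfile

# File names that must never be written to (taken from the warning above).
PROTECTED = {"evaluation_for_v3.1+.json"}


def save_annotations(annotations, path):
    """Write annotations to a new JSON file, refusing protected or existing targets."""
    if os.path.basename(path) in PROTECTED:
        raise ValueError(f"Refusing to write to protected file: {path}")
    # Mode "x" raises FileExistsError instead of overwriting an existing file.
    with open(path, "x", encoding="utf-8") as handle:
        json.dump(annotations, handle, ensure_ascii=False, indent=2)


# Demonstration in a temporary directory; the file name is a placeholder.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "training_example.json")
example = [["Apple is based in Cupertino.", {"entities": [[0, 5, "ORG"]]}]]
save_annotations(example, out_path)
```

Calling `save_annotations` once per split, each with its own new path, keeps the training, validation and evaluation files separate.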