CHANGELOG
NOTE: This project is complete, and further updates are unlikely.
Version History
[1.1.1] - 🗓️ 10/06/2022
Added
New folders to organise the Training and Validation Annotation Datasets
Changed
Updated scripts for converting annotation datasets from Doccano to spaCy v2 format, and from spaCy v2 to spaCy v3 format
Added a notice in the scripts warning against overwriting existing datasets, especially the Evaluation Dataset
Updated the source of training and validation data in the config files used to train Models v2.0, v3.0 and v3.1
Deprecated
Geocoding Plan - project no longer continuing, placed in archives folder
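For readers unfamiliar with the Doccano to spaCy v2 conversion mentioned under this release, the following is a minimal sketch of the general idea, not the project's actual script. It assumes a Doccano JSONL export in which each line has a "text" field and a span list of [start, end, label] triples under a "label" key (the key name varies between Doccano versions, so it is a parameter here). The function name and sample sentence are hypothetical.

```python
import json


def doccano_to_spacy_v2(jsonl_text, label_key="label"):
    """Convert Doccano JSONL export text into spaCy v2-style training tuples.

    Assumption: each record looks like
    {"text": "...", "<label_key>": [[start_char, end_char, "LABEL"], ...]}.
    """
    examples = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        # Normalise each span to a (start, end, label) tuple.
        entities = [
            (int(start), int(end), str(label))
            for start, end, label in record.get(label_key, [])
        ]
        # spaCy v2 training data is a list of (text, {"entities": [...]}) pairs.
        examples.append((record["text"], {"entities": entities}))
    return examples


# Hypothetical one-line export used purely for illustration.
sample = '{"text": "Lunch at Jurong East today", "label": [[9, 20, "LOC"]]}'
train_data = doccano_to_spacy_v2(sample)
print(train_data)
```

The character offsets are kept unchanged, so each (start, end) pair can be checked by slicing the original text before training.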
[1.1.0] - 🗓️ 08/06/2022
Added
New dataset of locations text & annotations (suffixed with _v3.1)
model_v3.1, a new Enhanced NER-Centric Model trained on a larger dataset (increased from 1598 to 2180)
"Gold Standard" Annotated Evaluation Dataset (golden_set.spacy) & Evaluation Script
Changed
Updated Streamlit Mini-App script to enable users to run Model 3.1 on Streamlit
Updated Documentation for Model 3.1 + how to evaluate with the "Gold Standard" set
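As a rough illustration of what evaluating against a "Gold Standard" set involves, the sketch below computes exact-match precision, recall and F1 over entity spans in plain Python. It is not the project's Evaluation Script; spaCy reports comparable entity scores (ents_p, ents_r, ents_f) when it evaluates a model. The function name and the sample spans are hypothetical.

```python
def span_prf(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold = set(gold_spans)
    pred = set(predicted_spans)
    # A prediction counts only if start, end and label all match a gold span.
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical spans: the second prediction has the wrong end offset.
gold = [(9, 20, "LOC"), (30, 38, "LOC")]
pred = [(9, 20, "LOC"), (30, 36, "LOC")]
precision, recall, f1 = span_prf(gold, pred)
print(precision, recall, f1)
```

Exact matching is strict: a prediction that overlaps a gold span but has a different boundary counts as both a false positive and a missed entity.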
[1.0.4] - 🗓️ 11/05/2022
Added
Script to run POST requests for NER with Model v3.0, built with FastAPI
[1.0.3] - 🗓️ 04/05/2022
Added
Test Jupyter Notebook and a Plan Outline for Geocoding Locations phase of Project
Proper CHANGELOG.md
Changed
Streamlined readme.md
[1.0.2] - 🗓️ 28/04/2022
Added
[1.0.1] - 🗓️ 24/04/2022
Changed
Finalised first version of documentation.md
Updated readme.md
Reorganised certain files
Fixed
Optimised Training Scripts and Streamlit Mini-App Script
Removed
Redundant file data/training_datasets/train_data_er.json
[1.0.0] - 🗓️ 14/04/2022
Added
documentation subfolder and documentation.md
Changed
Took down and reuploaded repo to remove residual Git LFS files from model_v1.0
Fixed
requirements.txt file, following the switchover to a standard virtualenv environment. The file previously lacked the newspaper3k, wikipediaapi and doccano packages needed for the Streamlit mini-app to function.
Streamlit mini-app now fully functional
[0.4.0] - 🗓️ 29/03/2022
Added
model_v3.0, an Enhanced NER-Centric Model
Scripts used for model training of model_v3.0, including scripts used to handle and convert training data from Doccano
Training datasets for model_v3.0. File names end with the _doccano suffix before the file extension
Script to run Streamlit mini-app
model_v1.1, a re-trained version of model_v1.0 without the excessively large vector file. Still fundamentally useless, but much smaller in size now.
Changed
Reorganised data subfolder and its internal subfolders
Files containing training datasets for the Dictionary-Centric Model renamed with an _er suffix before the file extension
Removed
model_v1.0 - excessively large vector file was clogging the repo
[0.3.0] - 🗓️ 10/03/2022
Added
model_v2.1, the Third (final) Dictionary-Centric NER Model. Model uses model_v2.0 as a base, with an EntityRuler pipe added to function as a "Dictionary of Locations" that finds and matches locations in text
Changed
First Model and Second Model renamed to model_v1.0 and model_v2.0.
Scripts for data cleaning and model training shifted to new training_scripts subfolder.
Configuration files for model training shifted to new training_config subfolder and renamed by model version number.
[0.2.0] - 🗓️ 03/03/2022
Added
Second Dictionary-Centric NER Model custom-trained for Singapore Locations. Utilises tok2vec, tagger & parser from spaCy's pre-built en_core_web_md, and a custom-trained NER pipe.
requirements.txt file. An Anaconda environment was used for development of the project; future updates will further refine the file.
config.cfg and base_config.cfg for Second Model.
[0.1.0] - 🗓️ 24/02/2022
Added
First iteration of scripts used to clean data for Dictionary-Centric Model
Locations Data
First Dictionary-Centric NER Model. Consists of tok2vec and ner pipes
config.cfg and base_config.cfg for First Model