Path: blob/main/training_scripts/doccano_base_training/data_to_doccano.ipynb
Preparing Data for Doccano Annotations
Doccano accepts input text in the form of JSON, text, or textline files. For plain, unedited text, the textline format seems to be the easiest way to import a large number of datasets into Doccano in one go. A textline file is essentially a plain text file in which each new line is processed by Doccano as a separate dataset.
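For example, a textline file with three lines (the sentences below are placeholders) would be imported as three separate datasets:

```text
This is the first sentence, imported as dataset 1.
This is the second sentence, imported as dataset 2.
This is the third sentence, imported as dataset 3.
```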
This notebook scrapes text from Wikipedia and news articles, breaks it up into sentences, and sends each sentence to a new line in the text file.
Import the following modules: WikipediaAPI, newspaper3k with its Article class, and spaCy's en_core_web_sm model (to parse text into sentences).
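A minimal import cell, assuming the Wikipedia-API package (imported as wikipediaapi) and that the en_core_web_sm model has already been downloaded via `python -m spacy download en_core_web_sm`:

```python
import wikipediaapi              # Wikipedia-API package, for fetching page text
from newspaper import Article    # newspaper3k's Article class, for news scraping
import spacy

# Load the small English model; used later for sentence segmentation.
nlp = spacy.load("en_core_web_sm")
```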
Set up the Wikipedia API as below.
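A sketch of the setup, with a caveat: recent Wikipedia-API releases require a descriptive user agent, and the arguments below (user agent string, language, extract format) are assumptions rather than the notebook's verbatim configuration:

```python
# The user agent string is a placeholder, not the notebook's actual value.
wiki = wikipediaapi.Wikipedia(
    user_agent="doccano-data-prep (example@example.com)",
    language="en",
    extract_format=wikipediaapi.ExtractFormat.WIKI,  # plain text, no HTML markup
)
```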
List of articles to be processed for Doccano. This is a small sample collection; placing the article names/URLs into lists leaves room for them to be stored in a separate JSON file later, to avoid clogging up the main ipynb.
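A hypothetical pair of lists; the titles and URL below are illustrative placeholders, not the notebook's actual sample collection:

```python
# Wikipedia page titles and news article URLs to scrape (placeholder values).
wiki_titles = ["Singapore", "Natural language processing"]
news_urls = ["https://www.example.com/sample-news-article"]
```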
Set up the newspaper3k API as below.
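A minimal sketch of the standard newspaper3k workflow (download, parse, then read the text attribute); the helper name get_article_text is an assumption for illustration:

```python
def get_article_text(url):
    """Fetch a news article and return its plain text body."""
    article = Article(url)
    article.download()  # fetch the raw HTML
    article.parse()     # extract the main article body
    return article.text
```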
Below are the main functions used to scrape the text and break it into sentences.
The primary function is concat_all_sentences(), which contains a nested function articles_to_sentences(), which in turn calls breaks_into_sentences().
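The notebook's actual function bodies are not reproduced here; the following is a minimal sketch of that call chain under the assumptions above (the wiki, nlp, get_article_text, wiki_titles, and news_urls names come from the earlier sketches):

```python
def breaks_into_sentences(text):
    # Use spaCy's sentence segmenter to split raw text into sentences.
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

def concat_all_sentences(out_path="doccano_input.txt"):
    def articles_to_sentences():
        # Collect sentences from every Wikipedia page and news article.
        sentences = []
        for title in wiki_titles:
            page = wiki.page(title)
            if page.exists():
                sentences.extend(breaks_into_sentences(page.text))
        for url in news_urls:
            sentences.extend(breaks_into_sentences(get_article_text(url)))
        return sentences

    # Write one sentence per line, so Doccano imports each line as a dataset.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(articles_to_sentences()))

concat_all_sentences()
```

The resulting text file can then be uploaded directly through Doccano's dataset import dialog using the textline format.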