Path: blob/main/training_scripts/scripts_for_v4.0/news_scraper.ipynb
Preparing Data for Doccano Annotations
Doccano accepts input text in the form of JSON, text, or textline files. For plain, unedited text, the textline format seems to be the easiest way to import a large number of datasets into Doccano in one go. A textline file is essentially a plain text file in which each line is treated by Doccano as a separate dataset entry.
This notebook scrapes text from Wikipedia and news articles, breaks it up into sentences, and writes each sentence to a new line in the text file.
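As a minimal sketch of the target output format (the file name doccano_input.txt and the sample sentences are illustrative, not from the notebook):

```python
# Writing sentences one per line produces a textline file Doccano can import,
# with each line becoming a separate dataset entry.
sentences = [
    "Doccano treats each line of a textline file as a separate dataset entry.",
    "So every scraped sentence goes on its own line.",
]

with open("doccano_input.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))
```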
Import the following modules: WikipediaAPI, newspaper3k with its Article class, and spaCy's en_core_web_sm model (to parse text into sentences).
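A sketch of those imports, assuming the packages are installed as wikipedia-api, newspaper3k, and spacy with the en_core_web_sm model downloaded:

```python
import wikipediaapi            # Wikipedia-API wrapper
from newspaper import Article  # newspaper3k article downloader/parser
import spacy

# Load the small English pipeline; used here only for sentence segmentation.
nlp = spacy.load("en_core_web_sm")
```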
Set up the newspaper3k API as shown below.
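One common way to set it up, shown here as an assumption about what the notebook does rather than its exact code, is to create a newspaper Config with a browser user agent and a request timeout, then pass it to each Article:

```python
from newspaper import Article, Config

# Hypothetical configuration; the actual notebook may use different settings.
config = Config()
config.browser_user_agent = "Mozilla/5.0"  # some sites block the default agent
config.request_timeout = 10                # seconds

def fetch_article_text(url: str) -> str:
    """Download and parse a single news article, returning its body text."""
    article = Article(url, config=config)
    article.download()
    article.parse()
    return article.text
```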
Below are the main functions used to scrape and break the text into sentences.
The primary function is concat_all_sentences(), which calls a nested function articles_to_sentences(), which in turn calls breaks_into_sentences().
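A rough, flattened sketch of how those three functions might fit together, reusing the nlp pipeline and the hypothetical fetch_article_text() helper from above (the urls and out_path parameters are illustrative; the notebook's actual implementation nests the helpers and may differ in detail):

```python
def breaks_into_sentences(text: str) -> list[str]:
    """Split a block of text into individual sentences with spaCy."""
    doc = nlp(text)
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]

def articles_to_sentences(urls: list[str]) -> list[str]:
    """Scrape each article and flatten the results into one list of sentences."""
    sentences = []
    for url in urls:
        sentences.extend(breaks_into_sentences(fetch_article_text(url)))
    return sentences

def concat_all_sentences(urls: list[str], out_path: str = "doccano_input.txt") -> None:
    """Write every scraped sentence to its own line, ready for textline import."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(articles_to_sentences(urls)))
```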