GitHub Repository: drgnfrts/Singapore-Locations-NER
Path: blob/main/training_scripts/scripts_for_v4.0/news_scraper.ipynb
Kernel: Python 3.8.9 ('venv': venv)

Preparing Data for Doccano Annotations

Doccano accepts input text as JSON, text, or textline files. For plain, unedited text, the textline format is the easiest way to import a large number of datasets into Doccano in one go. A textline file is simply a text file in which each line is processed by Doccano as a separate dataset.
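For example, a textline file containing three datasets would look like this (the sentences here are illustrative, not from the scraped corpus):

Raffles Place is in the Downtown Core.
The MRT station opened in 1987.
It serves the central business district.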

This notebook scrapes text from Wikipedia and news articles, breaks it up into sentences, and writes each sentence to a new line in a text file.

Import the following modules: Wikipedia-API, newspaper3k with its Article class, and spaCy's en_core_web_sm model (to parse text into sentences).

import newspaper
from newspaper import Article
import wikipediaapi
import spacy

nlp = spacy.load("en_core_web_sm")

Set up the newspaper3k API as below.

# Function to extract news articles. The article is downloaded and parsed before its text is returned.
def extract_news_articles(article_url):
    news_article = Article(article_url)
    news_article.download()
    news_article.parse()
    return news_article.text
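article_to_sentences() below also calls an extract_wikipedia() helper that is not shown in this excerpt. A minimal sketch, assuming the Wikipedia-API package imported above (only the function name comes from the call below; the body and the user-agent string are assumptions):

# Hypothetical sketch of extract_wikipedia(), which is called below but not defined in this excerpt.
# Recent versions of Wikipedia-API require a user_agent string identifying the client.
wiki = wikipediaapi.Wikipedia(user_agent="news_scraper notebook", language="en")

def extract_wikipedia(article_title):
    wiki_page = wiki.page(article_title)
    return wiki_page.text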

Below are the main functions used to scrape and break the text into sentences.

The primary function is concat_all_sentences(), which calls article_to_sentences(), which in turn calls break_into_sentences().

# Set up the NLP pipeline to break an article into sentences
def break_into_sentences(article):
    corpus = []
    doc = nlp(article)
    for sent in doc.sents:
        corpus.append(sent.text)
    return corpus
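As a quick check of the sentence splitter (the input sentence is illustrative; the output is what en_core_web_sm would typically produce):

break_into_sentences("Changi Airport is in the east of Singapore. Jurong is in the west.")
# -> ['Changi Airport is in the east of Singapore.', 'Jurong is in the west.']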
# Check what type the input article is, then use the relevant API to scrape it
# and call break_into_sentences() to return a list of sentences
def article_to_sentences(article, article_type):
    if article_type == "Wikipedia Article":
        extracted_article = extract_wikipedia(article)
    elif article_type == "News Article":
        extracted_article = extract_news_articles(article)
    cleaned_text = break_into_sentences(extracted_article)
    return cleaned_text
# This is the primary function: it scrapes articles from a given list, breaks them up
# into sentences, and concatenates the sentences into a list for transfer to a text file.
# It takes two parameters: a list of articles and the article type.
sentences_for_doccano = []

def concat_all_sentences(list_of_articles, article_type):
    for article in list_of_articles:
        list_of_sentences = article_to_sentences(article, article_type)
        for sentence in list_of_sentences:
            sentences_for_doccano.append(sentence)
list_of_newspaper_urls = [
    "https://www.channelnewsasia.com/sustainability/ocbc-sustainability-energy-efficient-technology-25-million-investment-2705961"
]
concat_all_sentences(list_of_newspaper_urls, "News Article")
# Write each sentence to its own line; the with statement closes the file automatically
with open("../../data/text_data/news_and_wiki_data.txt", "w") as f:
    for each_sentence in sentences_for_doccano:
        f.write(each_sentence + "\n")