GitHub Repository: drgnfrts/Singapore-Locations-NER
Path: blob/main/training_scripts/doccano_base_training/data_to_doccano.ipynb
Kernel: Python 3.8.12 ('spacyenv')

Preparing Data for Doccano Annotations

Doccano accepts input text in the form of JSON, text or textline files. For plain, unedited text, the textline format is the easiest way to import a large number of datasets into Doccano in one go. A textline file is essentially a plain text file in which each new line is treated by Doccano as a separate dataset.
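For illustration, a textline file for Doccano simply holds one sentence per line; the example lines below are hypothetical and not taken from the scraped data:

The station is located along the North South MRT line.
Heavy rain caused flash floods in several parts of Singapore.
The new community hub in Punggol will include a 700-seat hawker centre.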

This notebook scrapes text from Wikipedia and news articles, breaks it up into sentences, and writes each sentence to a new line in the text file.

Import the following modules: wikipediaapi (Wikipedia-API), newspaper3k with its Article class, and spaCy's en_core_web_sm model (to parse text into sentences).

import wikipediaapi
import newspaper
from newspaper import Article
import spacy

nlp = spacy.load("en_core_web_sm")

Set up the Wikipedia API as below.

# Calling the Wikipedia API and inputting language and format
wiki_wiki = wikipediaapi.Wikipedia(
    language='en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)
# Function to extract only the text portion from the article, and clean up any syntax issues.
def extract_wikipedia(article):
    p_wiki = wiki_wiki.page(article)
    wikitext = p_wiki.text
    if "\n" in wikitext:
        wikitext = wikitext.replace("\n", "")
    return wikitext
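As a quick sanity check (not part of the original notebook), one of the Wikipedia articles listed further down can be previewed; the 300-character slice is an arbitrary choice:

# Hypothetical check: preview the start of one extracted Wikipedia article
sample_text = extract_wikipedia("Geography of Singapore")
print(sample_text[:300])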

List of articles to be processed for Doccano. This is a small sample collection; placing the article names/URLs into lists makes provision for them to be stored in a separate JSON file in the future (a sketch of this follows the lists below), to avoid clogging up the main ipynb.

list_of_newspaper_urls = [
    "https://www.channelnewsasia.com/singapore/police-arrest-man-clementi-police-centre-gunshot-2505181",
    "https://www.channelnewsasia.com/singapore/flood-heavy-rain-weather-warnings-fallen-trees-vehicles-punggol-sungei-gedong-2509816",
    "https://www.straitstimes.com/singapore/community/integrated-community-hub-one-punggol-to-open-from-mid-2022-700-seat-hawker",
    "https://www.channelnewsasia.com/singapore/fire-breaks-out-jurong-east-condominium-1-taken-hospital-2388701",
    "https://mothership.sg/2022/01/jurong-east-choked-garbage-chute/",
    "https://www.straitstimes.com/singapore/housing/hougang-bto-flats-draw-more-than-10000-applicants-all-seven-projects",
    "https://www.channelnewsasia.com/singapore/former-certis-cisco-officer-fire-gun-public-toilet-harbourfront-centre-2557076",
    "https://theindependent.sg/alien-ufo-sighting-over-bugis-singapore/",
    "https://www.channelnewsasia.com/singapore/ang-mo-kio-murder-case-court-psychiatric-observation-david-brian-chow-kwok-hun-isabel-francis-2501166",
]
list_of_wikipedia_articles = [
    "North South MRT line",
    "East West MRT line",
    "Downtown MRT line",
    "Mass Rapid Transit (Singapore)",
    "Geography of Singapore",
    "Transport in Singapore",
    "Pan Island Expressway",
    "Urban planning in Singapore",
    "Future developments in Singapore",
    "North East MRT line",
    "Chinatown MRT station",
    "Circle MRT line",
]
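A minimal sketch of the JSON idea mentioned above, assuming a hypothetical file name and key names that are not in the original notebook:

# Assumption: the two lists live in a separate JSON file, e.g. "article_sources.json",
# with keys "wikipedia_articles" and "newspaper_urls" (both hypothetical).
import json

with open("article_sources.json") as f:
    sources = json.load(f)

list_of_wikipedia_articles = sources["wikipedia_articles"]
list_of_newspaper_urls = sources["newspaper_urls"]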

Set up the newspaper3k scraper as below.

# Function to extract news articles. The articles are downloaded and parsed before the text is returned.
def extract_news_articles(article_url):
    news_article = Article(article_url)
    news_article.download()
    news_article.parse()
    return news_article.text
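As with the Wikipedia helper, a brief check (not in the original notebook) can confirm the scrape works; the URL is one of those listed above:

# Hypothetical check: download one listed article and preview its text
sample_news = extract_news_articles(
    "https://mothership.sg/2022/01/jurong-east-choked-garbage-chute/"
)
print(len(sample_news), "characters scraped")
print(sample_news[:200])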

Below are the main functions used to scrape and break the text into sentences.

The primary function is concat_all_sentences(), which calls article_to_sentences() for each article, which in turn calls break_into_sentences().

# Setup NLP pipeline to break the article into sentences
def break_into_sentences(article):
    corpus = []
    doc = nlp(article)
    for sent in doc.sents:
        corpus.append(sent.text)
    return corpus
# Check what type the input article is, use the relevant API to scrape it,
# then call break_into_sentences() to return a list of sentences
def article_to_sentences(article, article_type):
    if article_type == "Wikipedia Article":
        extracted_article = extract_wikipedia(article)
    elif article_type == "News Article":
        extracted_article = extract_news_articles(article)
    cleaned_text = break_into_sentences(extracted_article)
    return cleaned_text
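To illustrate the chain of calls described above (a hypothetical check, not part of the original notebook), the helper can be run on one of the listed Wikipedia articles:

# Hypothetical check: break one Wikipedia article into sentences and preview the first three
sample_sentences = article_to_sentences("Chinatown MRT station", "Wikipedia Article")
print(len(sample_sentences), "sentences extracted")
print(sample_sentences[:3])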
# This is the primary function to scrape articles from a given list, break them up into sentences,
# and concatenate the sentences into a list for transfer to a text file.
# The function requires two parameters - a list of articles and the article type.
sentences_for_doccano = []

def concat_all_sentences(list_of_articles, article_type):
    for article in list_of_articles:
        list_of_sentences = article_to_sentences(article, article_type)
        for sentence in list_of_sentences:
            sentences_for_doccano.append(sentence)
concat_all_sentences(list_of_wikipedia_articles, "Wikipedia Article")
concat_all_sentences(list_of_newspaper_urls, "News Article")
with open("../../data/text_data/news_and_wiki_data.txt", "w") as f:
    for each_sentence in sentences_for_doccano:
        f.write(each_sentence + "\n")
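As a final check (an assumption, not part of the original notebook), the textline file can be read back to confirm that each sentence landed on its own line:

# Hypothetical check: count the lines written to the textline file
with open("../../data/text_data/news_and_wiki_data.txt") as f:
    lines = f.read().splitlines()
print(len(lines), "sentences ready for Doccano")
print(lines[:3])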