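Natural Language Processing

Ekaterina Chernyak

Department of Data Analysis and Artificial Intelligence (ДАДИИ), Faculty of Computer Science (ФКН), HSE University (НИУ ВШЭ)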

[email protected]

Introduction

A brief history

7 January 1954: the Georgetown experiment in machine translation from Russian into English;
1957: Noam Chomsky introduces "universal grammar";
1961: compilation of the Brown Corpus begins;
mid-1960s: ELIZA, a program that conducts psychotherapy-style conversations;
1975: Salton introduces the vector space model (VSM);
before the 1980s: rule-based methods;
after the 1980s: methods based on machine learning and corpus linguistics;
1998: Ponte and Croft introduce the language model (LM) for information retrieval;
late 1990s: probabilistic topic models (LSI, pLSI, LDA, etc.);
1999: Manning and Schütze publish the textbook "Foundations of Statistical Natural Language Processing";
2009: Bird, Klein, and Loper publish the textbook "Natural Language Processing with Python";
2013: Mikolov, Tomas, et al., "Efficient estimation of word representations in vector space".

Main tasks

Machine translation
Text classification
- Spam filtering
- By sentiment
- By topic or genre
Text clustering
Information extraction
- Facts and events
- Named entities
Question answering
Text summarization
Text generation
Speech recognition
Spell checking
Optical character recognition
User studies and evaluation of the accuracy and quality of methods

Main techniques

Character level:
- Tokenization: splitting text into words
- Sentence segmentation
Word level – morphology:
- Part-of-speech tagging
- Morphological disambiguation
Sentence level – syntax:
- Extraction of noun or verb phrases (chunking)
- Semantic role labeling
- Constituency and dependency trees
Meaning level – semantics and discourse:
- Coreference resolution
- Discourse relation analysis
- Synonym detection
- Argumentation analysis

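Main problems

Ambiguity
- Lexical ambiguity: орган (body organ / pipe organ), парить (to soar / to steam), рожки (small horns / horn-shaped pasta), атлас (atlas / satin)
- Morphological ambiguity: Хранение денег в банке ("банке" is a form of both банк, "bank", and банка, "jar"). Что делают белки в клетке? ("белки": squirrels or proteins; "клетке": cage or cell)
- Syntactic ambiguity: Мужу изменять нельзя ("one must not cheat on one's husband" or "a husband must not cheat"). Его удивил простой солдат ("простой" as the adjective "ordinary" or as the noun "idleness").
Neologisms: печеньки, заинстаграммить, репостнуть, расшарить, затащить, килорубли
Spelling variants: Россия, Российская Федерация, РФ
Non-standard spelling: каг дила?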

Syntactic ambiguity

[Figure: syntactic ambiguity]

Five readings of one ambiguous sentence, evidently the classic "I saw the man on the hill with a telescope":

I saw the man. The man was on the hill. I was using a telescope.
I saw the man. I was on the hill. I was using a telescope.
I saw the man. The man was on the hill. The hill had a telescope.
I saw the man. I was on the hill. The hill had a telescope.
I saw the man. The man was on the hill. I saw him using a telescope.

Outline

Morphology. Syntax. Keyword and key phrase extraction.
The vector space model of documents and information retrieval. The vector model of words and distributional semantics. Dimensionality reduction methods. Topic modeling, word2vec, GloVe.
Document classification and sequence classification. Convolutional neural networks, conditional random fields.
Language models. Neural language models. Recurrent neural networks. Named entity recognition.

Tokenization and word counting

How many words are there in this sentence?

На дворе трава, на траве дрова, не руби дрова на траве двора.

**12 tokens**: На, дворе, трава, на, траве, дрова, не, руби, дрова, на, траве, двора

**8-9 types**: Н/на, дворе, трава, траве, дрова, не, руби, двора (9 if "На" and "на" are counted as different types)

**6 lexemes**: на, не, двор, трава, дрова, рубить

Token and type

**Type** – a unique word form occurring in the text

**Token** – an occurrence of a type at a particular position in the text
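
The difference between tokens and types can be checked directly on the tongue twister above. This is a minimal sketch; the regular expression and the variable names are illustrative choices, not part of the original notebook.

```python
import re

sentence = "На дворе трава, на траве дрова, не руби дрова на траве двора."

# Tokens: every occurrence of a word, in text order.
tokens = re.findall(r"[А-Яа-яЁё]+", sentence)

# Types: distinct word forms, with and without case folding.
types_cased = set(tokens)
types_folded = {t.lower() for t in tokens}

print(len(tokens), len(types_cased), len(types_folded))
```

Case folding merges "На" and "на", which is why the type count is given as a range.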

Notation

$N$ – the number of tokens

$V$ – the vocabulary (the set of all types)

$|V|$ – the number of types in the vocabulary

**How are $N$ and $|V|$ related?**

Zipf's law

In any sufficiently large text, the frequency of a type is inversely proportional to its rank: $f = \frac{a}{r}$

$f$ – the (relative) frequency of the type, $r$ – its rank in the list of types sorted by decreasing frequency, $a$ – a parameter, about 0.07 for Slavic languages
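
A rank-frequency table is all that is needed to check the law on real data. The sketch below assumes that $f$ is a relative frequency; the helper `rank_frequency` and the toy token list are hypothetical and far too small for the law to hold closely.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, relative frequency) pairs, most frequent type first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    return [(rank, count / total) for rank, count in enumerate(ordered, start=1)]

# Toy corpus (hypothetical), used only to show the shape of the computation.
toy_tokens = "a a a a a a b b b c c d".split()
for rank, freq in rank_frequency(toy_tokens):
    # Under f = a / r the product r * f would stay roughly constant.
    print(rank, round(freq, 3), round(rank * freq, 3))
```

On a large corpus, the product `rank * freq` gives a rough estimate of the parameter $a$.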

Heaps' law

As the length of a text (the number of tokens) grows, the number of types grows according to the law $|V| = K \cdot N^b$

$N$ – the number of tokens, $|V|$ – the number of types in the vocabulary, $K, b$ – parameters, typically $K \in [10, 100]$, $b \in [0.4, 0.6]$
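
Taking logarithms turns the law into a straight line, $\log |V| = \log K + b \log N$, so $K$ and $b$ can be estimated with an ordinary least-squares fit. The sketch below is illustrative: the function names and the default values `k=50.0`, `b=0.5` are assumptions picked from the typical ranges above, and the "observations" are generated from the formula itself.

```python
import math

def heaps_vocabulary(n_tokens, k=50.0, b=0.5):
    """Predicted number of types |V| = K * N^b."""
    return k * n_tokens ** b

def fit_heaps(n_tokens, n_types):
    """Least-squares fit of log|V| = log K + b * log N; returns (K, b)."""
    xs = [math.log(n) for n in n_tokens]
    ys = [math.log(v) for v in n_types]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    k = math.exp(mean_y - b * mean_x)
    return k, b

sizes = [1_000, 10_000, 100_000, 1_000_000]
predicted = [heaps_vocabulary(n) for n in sizes]
print(predicted)
print(fit_heaps(sizes, predicted))  # recovers K and b up to rounding
```

The same `fit_heaps` call can be applied to running token and type counts collected from a real corpus.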

Analysis of news articles

Consider a collection of news articles from the first half of 2017. For each article we know:

its headline and text
its publication date
the event it covers
its section

In [1]:

import pandas as pd

df = pd.read_csv('../data/news.csv')
df.head()

Out[1]:

Preliminary analysis of the collection

Average text length

In [6]:

len_data = df.text.apply(len)
len_data.describe()

Out[6]:

count      1930.000000
mean       3798.322798
std        7865.936695
min          31.000000
25%        1215.250000
50%        1918.000000
75%        4044.000000
max      185698.000000
Name: text, dtype: float64

Number of texts per event

In [2]:

from bokeh.io import output_notebook, show
from bokeh.plotting import figure
import math
output_notebook()

counts = df.event.value_counts()

# bokeh.charts was removed from Bokeh, so the bar chart is drawn with bokeh.plotting
# (plot_width/plot_height were renamed to width/height in Bokeh 3)
events = [str(e) for e in counts.index]
bar = figure(x_range=events, plot_width=1000, plot_height=600)
bar.vbar(x=events, top=list(counts.values), width=0.8)
bar.xaxis.major_label_orientation = math.pi/2-0.3
show(bar)

Out[2]:

Text lengths (in characters)

In [100]:

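import numpy as np
from bokeh.plotting import figure, show
# bokeh.charts.Histogram no longer exists: bin with NumPy, draw the bars with quad()
hist_counts, edges = np.histogram(len_data[len_data < 10000], bins=30)
hist = figure(plot_width=500, plot_height=300)
hist.quad(top=hist_counts, bottom=0, left=edges[:-1], right=edges[1:])
show(hist)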

Out[100]:

WARNING: Some output was deleted.

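Tokenization

We use regular expressions to split the texts into words. Note that the character class [А-Яа-я] does not include ё/Ё, so words containing these letters are split at them.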

In [3]:

import re
regex = re.compile("[А-Яа-я]+")

def words_only(text, regex=regex):
    return " ".join(regex.findall(text))


df.text = df.text.str.lower()
df.text = df.text.apply(words_only)

df.text.iloc[0]

Out[3]:

'в петербурге прошел митинг против передачи исаакиевского собора рпц в санкт петербурге люди устроили акцию протеста против передачи исаакиевского собора в безвозмездное пользование рпц жители петербурга собрались на исаакиевской площади чтобы высказаться против передачи исаакиевского собора в безвозмездное пользование рпц передает тасс акция проходит в формате встречи с депутатами законодательного собрания города и не требует согласования с властями участники акции не используют какую либо символику и плакаты а также мегафоны или средства звукоусиления по словам депутата алексея ковалева на исаакиевскую площадь пришло примерно тысяча человек перед участниками протеста выступили депутаты местного парламента борис вишневский и максим резник которые заявили о том что потребуют отмены решения смольного вишневский сообщил что акция будет проходить в виде встречи депутатов с избирателями закон санкт петербурга предоставляет нам право встречаться с избирателями такую встречу мы и проведем расскажем как защищаем их интересы при передаче собора сказал парламентарий в свою очередь директор музея исаакиевский собор николай буров проинформировал что собор в пятницу будет закрыт намного раньше в связи с акцией протеста он подчеркнул что необходимо избежать стычек между сторонниками передачи собора и ее противниками ранее стало известно что собор передадут в безвозмездное пользование на лет русской православной церкви в лице московского патриархата при этом он останется в собственности петербурга тем временем в петербурге продолжается сбор подписей под петицией об отмене данного решения под документом уже поставили подписи более тысяч человек комментарии другие интересные статьи'

Most frequent words

In [4]:

from nltk import FreqDist

n_types = []   # vocabulary size after each document (used for Heaps' law)
n_tokens = []  # running number of tokens after each document
fd = FreqDist()
for index, row in df.iterrows():
    tokens = row['text'].split()
    fd.update(tokens)
    n_types.append(len(fd))
    n_tokens.append(fd.N())  # total number of tokens counted so far
for i in fd.most_common(10):
    print(i)

Out[4]:

('в', 43571)
('и', 25182)
('на', 19120)
('что', 13617)
('не', 11953)
('с', 10868)
('по', 9080)
('о', 5035)
('это', 4955)
('он', 4761)

Zipf's law

In [11]:

from bokeh.plotting import figure
# frequencies of all types, most frequent first
freqs = sorted(fd.values(), reverse = True)


p = figure(plot_width=400, plot_height=200)

# x: rank (1..300), y: frequency of the type with that rank
p.line(list(range(1, 301)), freqs[:300])
show(p)

Out[11]:

Heaps' law

In [12]:

from bokeh.plotting import figure

p = figure(plot_width=700, plot_height=200)

# x: number of tokens read so far, y: number of distinct types seen so far
p.line(n_tokens, n_types)
show(p)

Out[12]:

WARNING: Some output was deleted.

Regular expressions in more detail

Character classes:

[A-Z] – uppercase letters (Latin)

[a-z] – lowercase letters (Latin)

[А-Я] – uppercase letters (Cyrillic, without Ё)

[а-я] – lowercase letters (Cyrillic, without ё)

[0-9] or \d – a digit

[^0-9] or \D – any character except a digit

. – any character (except a newline by default)

Special characters:

\t – tab

\s – any whitespace character

\S – any non-whitespace character

\n – newline

^ – start of the string

$ – end of the string

\ – escaping

Operators:

? - the preceding character/group is optional (0 or 1 occurrence)

+ - the preceding character/group occurs 1 or more times

* - the preceding character/group occurs 0 or more times

{n,m} - the preceding character/group occurs from n to m times, inclusive

{n,} - the preceding character/group occurs n or more times

{,m} - the preceding character/group occurs at most m times

{n} - the preceding character/group occurs exactly n times

Inside a character class [...], the operators ., +, and * lose their special meaning and match literally; everywhere else, escape them with a backslash to match the literal character: \. \+ \*
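
The quantifiers and the escaping rule can be tried on a short string. This is a minimal illustration; the sample text is made up for the purpose.

```python
import re

text = "a ab abb abbb 3.14 x+y"

print(re.findall(r"ab?", text))      # 'b' is optional
print(re.findall(r"ab+", text))      # one or more 'b'
print(re.findall(r"ab{2,3}", text))  # two or three 'b'
print(re.findall(r"[.+]", text))     # '.' and '+' are literal inside a class
print(re.findall(r"\d\.\d+", text))  # '.' must be escaped outside a class
```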

Methods:

re.match(pattern, string) - looks for pattern at the beginning of string

In [14]:

import re

m = re.match(r'рыбак', 'рыбак рыбака видит издалека')

print(m)
print(m.group(0))
print(m.start(), m.end())

l = re.match(r'видит', 'рыбак рыбака видит издалека')
print(l)

Out[14]:

<_sre.SRE_Match object; span=(0, 5), match='рыбак'>
рыбак
0 5
None

re.search(pattern, string) - like match, but searches the whole string rather than only its beginning (and still returns only the first occurrence!)

In [16]:

m = re.search(r'издалека', 'рыбак рыбака видит издалека')

print(m)
print(m.group(0))
print(m.start(), m.end())

l = re.search(r'прорубь', 'рыбак рыбака видит издалека')
print(l)

Out[16]:

<_sre.SRE_Match object; span=(19, 27), match='издалека'>
издалека
19 27
None

re.findall(pattern, string) - returns all occurrences of pattern in string as a list

In [19]:

m = re.findall(r'рыбак', 'рыбак рыбака видит издалека')

print(m)

l = re.findall(r'прорубь', 'рыбак рыбака видит издалека')
print(l)

Out[19]:

['рыбак', 'рыбак']
[]

re.split(pattern, string, [maxsplit=0]) - splits string at the occurrences of pattern; the maxsplit parameter limits the number of splits (0 means no limit).

In [20]:

m = re.split(r'видит', 'рыбак рыбака видит издалека')

print(m)

l = re.split(r'рыбак', 'рыбак рыбака видит издалека')
print(l, len(l))

l1 = re.split(r'рыбак', 'рыбак рыбака видит издалека',maxsplit=1)
print(l1, len(l1))

Out[20]:

['рыбак рыбака ', ' издалека']
['', ' ', 'а видит издалека'] 3
['', ' рыбака видит издалека'] 2

re.sub(pattern, string2, string1) - replaces all occurrences of pattern in string1 with string2

In [21]:

m = re.sub(r'рыбак', 'Рыбак', 'рыбак рыбака видит издалека')

print(m)

Out[21]:

Рыбак Рыбака видит издалека

re.compile(pattern) - creates a compiled pattern object for later searches

In [22]:

prog = re.compile(r'рыбак')

m = prog.findall('рыбак рыбака видит издалека')

print(m)

Out[22]:

['рыбак', 'рыбак']

In [23]:

prog = re.compile('[А-Я]') # find all uppercase (Cyrillic) letters in the string

m = prog.findall('Рыбак рыбака видит издалека. Всегда!')

print(m)

Out[23]:

['Р', 'В']

In [25]:

prog = re.compile('[авекмнорстух]{1}[0-9]{3}[авекмнорстух]{2}') # regular expression for vehicle registration plates
                                                                # (Cyrillic letters that have Latin look-alikes)

s = 'у456ао, ы234ег, 99авто443'
print(s)
res = prog.findall(s)

print(*res)

Out[25]:

у456ао, ы234ег, 99авто443
у456ао

In [26]:

# an example of greedy vs. lazy quantifiers: looking for cats

s = 'кот котик компот'
res1 = re.findall(r'к.*т', s)
print(res1)

res2 = re.findall(r'к.*?т', s)
print(res2)

res3 = re.findall(r'к[\S]*?т', s)
print(res3)

res4 = re.findall(r'кот.*\s', s)
print(res4)

Out[26]:

['кот котик компот']
['кот', 'кот', 'к компот']
['кот', 'кот', 'компот']
['кот котик ']

Task 1

Find all phone numbers in the text; the text is in the file 'task1.txt'. Pay attention to the different formats in which the numbers may be written.

In [30]:

import re

with open('../data/task1.txt') as f:
    phones = f.read()

print(phones)

# your code here

Out[30]:

Гарантируется, что в номере 8 цифр и он отделен пробелом, но форматы написания могут отличаться:


89268659970	Анна
8(495)3451212	Алексей Иванин
Автомастерская	+7(234)456-78-90
8(956)234-23-23	соседка 125 квартира
Офис, 5 этаж	85679962312 
Игорь		+7-845-344-23-65

In [31]:

# solution

# a run of at least 8 characters, each a digit, "+", "-", "(" or ")"
prog1 = re.compile(r'[+0-9()\-]{8,}')
res = prog1.findall(phones)
print(*res, sep='\n')

Out[31]:

89268659970
8(495)3451212
+7(234)456-78-90
8(956)234-23-23
85679962312
+7-845-344-23-65

Sentence segmentation

"?" and "!" are usually unambiguous. The problems arise with ".".

A binary classifier for sentence segmentation: for each period ".", decide whether it ends a sentence or not.

In [1]:

from nltk.tokenize import sent_tokenize

text = 'Первое предложение. Второе предложение! И, наконец, третье?'
sents = sent_tokenize(text)

print(len(sents))
print(*sents, sep='\n')

Out[1]:

3
Первое предложение.
Второе предложение!
И, наконец, третье?
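
The per-period decision can also be sketched as a hand-written rule instead of a trained classifier. The code below is a toy heuristic under stated assumptions, not how `sent_tokenize` works internally: the abbreviation list and the function names `is_sentence_end` and `split_sentences` are hypothetical, and a real system would learn such cues from data.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"г", "ул", "д", "т", "им", "см"}

def is_sentence_end(text, dot_index):
    """Decide whether the period at dot_index ends a sentence."""
    before = re.findall(r"\w+$", text[:dot_index])
    after = text[dot_index + 1:].lstrip()
    if before and before[0].lower() in ABBREVIATIONS:
        return False            # the period belongs to an abbreviation
    if not after:
        return True             # end of the text
    return after[0].isupper()   # a new sentence starts with a capital letter

def split_sentences(text):
    """Split text at the periods classified as sentence ends."""
    sentences, start = [], 0
    for match in re.finditer(r"\.", text):
        if is_sentence_end(text, match.start()):
            sentences.append(text[start:match.end()].strip())
            start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Он живет на ул. Ленина. Дом построен в 1954 г. в Москве."))
```

The rule keeps "ул." and "г." inside their sentences, but it fails as soon as an abbreviation is missing from the list, which is the motivation for a learned classifier.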

Task 2

Count the tokens and sentences in the text from the file task2.txt. Store the list of tokens in the array tokens.

In [44]:

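import re
import nltk
from nltk.tokenize import sent_tokenize

# read the file line by line and join the lines into one string
with open('../data/task2.txt') as f:
    text = ' '.join(line.strip() for line in f)
prog = re.compile(r'[А-Яа-я\-]+')
tokens = prog.findall(text.lower())
d1 = nltk.FreqDist(tokens)
print(tokens)
sents = sent_tokenize(text)
print(sents)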

Out[44]:

['судьба', 'был', 'только', 'один', 'выход', 'ибо', 'наши', 'жизни', 'сплелись', 'в', 'слишком', 'запутанный', 'узел', 'гнева', 'и', 'блаженства', 'чтобы', 'решить', 'все', 'как-нибудь', 'иначе', 'доверимся', 'жребию', 'орел', 'и', 'мы', 'поженимся', 'решка', 'и', 'мы', 'расстанемся', 'навсегда', 'монетка', 'была', 'подброшена', 'она', 'звякнула', 'завертелась', 'и', 'остановилась', 'орел', 'мы', 'уставились', 'на', 'нее', 'с', 'недоумением', 'затем', 'в', 'один', 'голос', 'мы', 'сказали', 'может', 'еще', 'разок', 'джей', 'рип']
['Судьба.', 'Был только один выход, ибо наши жизни сплелись в слишком запутанный узел гнева и блаженства, чтобы решить все как-нибудь иначе.', 'Доверимся жребию: орел — и мы поженимся, решка — и мы расстанемся навсегда.', 'Монетка была подброшена.', 'Она звякнула, завертелась и остановилась.', 'Орел.', 'Мы уставились на нее с недоумением.', 'Затем, в один голос, мы сказали: «Может, еще разок?».', 'Джей Рип']

Frequency analysis of the text

In [48]:

import nltk

d1 = nltk.FreqDist(tokens) # frequency dictionary for the text
d1.most_common(10) # each token and the number of times it occurs in the text

Out[48]:

[('и', 4),
 ('мы', 4),
 ('один', 2),
 ('в', 2),
 ('орел', 2),
 ('судьба', 1),
 ('был', 1),
 ('только', 1),
 ('выход', 1),
 ('ибо', 1)]

Task 3

Count how many words in the task2 text occur more than once.
Count the words that consist of 5 or more letters.

In [46]:

res = [i for i in d1.most_common() if i[1] > 1]
print(*res, sep='\n')
print(len(res))

Out[46]:

('и', 4)
('мы', 4)
('один', 2)
('в', 2)
('орел', 2)
5

In [51]:

res = [i for i in d1 if len(i) >= 5]  # "5 or more letters" means >= 5, not > 5
print(res)
print(len(res))

Out[51]:

['судьба', 'только', 'выход', 'жизни', 'сплелись', 'слишком', 'запутанный', 'гнева', 'блаженства', 'чтобы', 'решить', 'как-нибудь', 'иначе', 'доверимся', 'жребию', 'поженимся', 'решка', 'расстанемся', 'навсегда', 'монетка', 'подброшена', 'звякнула', 'завертелась', 'остановилась', 'уставились', 'недоумением', 'затем', 'голос', 'сказали', 'может', 'разок']
31

Morphological analysis

Tasks of morphological analysis

Word analysis – determining the normal form (lemma), the stem, and the grammatical features of a word
Word synthesis – generating a word form from given grammatical features

Morphological processor – a tool for morphological analysis

Morphological dictionary
Morphological analyzer

Lemmatization

Every word has a lemma (normal form):

кошке, кошку, кошкам, кошкой $\implies$ кошка
бежал, бежит, бегу $\implies$ бежать
белому, белым, белыми $\implies$ белый

In [6]:

sent = 'Действительно, на его лице не отражалось никаких чувств – ни проблеска сочувствия не было на нем, а ведь боль просто невыносима'

In [7]:

from pymorphy2 import MorphAnalyzer

m = MorphAnalyzer()
lemmas1 = [m.parse(word)[0].normal_form for word in sent.split()]
print(' '.join(lemmas1))

Out[7]:

действительно, на он лицо не отражаться никакой чувство – ни проблеск сочувствие не быть на нем, а ведь боль просто невыносимый

In [8]:

from pymystem3 import Mystem

m = Mystem()
lemmas2 = m.lemmatize(sent)
print(''.join(lemmas2))

Out[8]:

действительно, на его лицо не отражаться никакой чувство – ни проблеск сочувствие не быть на немой, а ведь боль просто невыносимый

Stemming

Words consist of morphemes: $word = stem + affixes$. Stemming strips the affixes. The most commonly used algorithm is the Porter stemmer.

Error type 1 (unrelated words collapse into one stem): белый, белка, белье $\implies$ бел
Error type 2 (related words get different stems): трудность, трудный $\implies$ трудност, труд
Error type 3 (prefixes are not removed): быстрый, быстрее $\implies$ быст, побыстрее $\implies$ побыст

The Porter algorithm consists of 5 passes of rules; each pass removes or replaces a suffix. Probabilistic extensions of the algorithm are possible.
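
The idea of suffix stripping, and the first type of error, can be reproduced with a few lines of code. This is a toy sketch, not the Porter or Snowball rule set: the suffix list and the function `toy_stem` are hypothetical.

```python
# Hypothetical suffix list, longest first; not the Porter/Snowball rules.
SUFFIXES = ["ость", "ому", "ыми", "ый", "ая", "ое", "ка", "ье", "ым", "а", "у", "е"]

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["белый", "белка", "белье", "трудность", "трудный"]:
    print(w, "->", toy_stem(w))
```

Because the rules look only at word endings, the three unrelated words "белый", "белка", and "белье" all collapse into the same stem.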

In [59]:

from nltk.stem.snowball import RussianStemmer

stemmer = RussianStemmer()
words = ['распределение', 'приставить', 'сделала', 'словообразование']
for w in words:
    stem = stemmer.stem(w)
    print(stem)

Out[59]:

распределен
пристав
сдела
словообразован

Word analysis

In [95]:

word = 'стекло'

In [96]:

m = MorphAnalyzer()
m.parse(word)

Out[96]:

[Parse(word='стекло', tag=OpencorporaTag('NOUN,inan,neut sing,nomn'), normal_form='стекло', score=0.75, methods_stack=((<DictionaryAnalyzer>, 'стекло', 545, 0),)),
 Parse(word='стекло', tag=OpencorporaTag('NOUN,inan,neut sing,accs'), normal_form='стекло', score=0.1875, methods_stack=((<DictionaryAnalyzer>, 'стекло', 545, 3),)),
 Parse(word='стекло', tag=OpencorporaTag('VERB,perf,intr neut,sing,past,indc'), normal_form='стечь', score=0.0625, methods_stack=((<DictionaryAnalyzer>, 'стекло', 968, 3),))]

In [97]:

m = Mystem()
m.analyze(word)

Out[97]:

[{'analysis': [{'gr': 'S,сред,неод=(вин,ед|им,ед)', 'lex': 'стекло'}],
  'text': 'стекло'},
 {'text': '\n'}]

Task 4

Find all unique female names in the list of characters of the novel "War and Peace" (task4.txt).

In [0]:

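import re
from pymorphy2 import MorphAnalyzer

# Assumption: task4.txt is stored in ../data/ like the other task files
with open('../data/task4.txt') as f:
    raw = f.read()

m = MorphAnalyzer()
prog = re.compile('[А-Я]{1}[а-я]+') # capitalized words
tokens = prog.findall(raw)
lemmas = [m.parse(word)[0].normal_form for word in tokens]

names = set()
for w in lemmas:
    p = m.parse(w)[0].tag
    # keep lemmas tagged both as a given name and as feminine
    if {'Name', 'femn'} in p:
        names.add(w.capitalize())

print(*names, sep='\n')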

Text preprocessing

Stop word removal

In [69]:

from nltk.corpus import stopwords
print(stopwords.words('russian'))

Out[69]:

['и', 'в', 'во', 'не', 'что', 'он', 'на', 'я', 'с', 'со', 'как', 'а', 'то', 'все', 'она', 'так', 'его', 'но', 'да', 'ты', 'к', 'у', 'же', 'вы', 'за', 'бы', 'по', 'только', 'ее', 'мне', 'было', 'вот', 'от', 'меня', 'еще', 'нет', 'о', 'из', 'ему', 'теперь', 'когда', 'даже', 'ну', 'вдруг', 'ли', 'если', 'уже', 'или', 'ни', 'быть', 'был', 'него', 'до', 'вас', 'нибудь', 'опять', 'уж', 'вам', 'ведь', 'там', 'потом', 'себя', 'ничего', 'ей', 'может', 'они', 'тут', 'где', 'есть', 'надо', 'ней', 'для', 'мы', 'тебя', 'их', 'чем', 'была', 'сам', 'чтоб', 'без', 'будто', 'чего', 'раз', 'тоже', 'себе', 'под', 'будет', 'ж', 'тогда', 'кто', 'этот', 'того', 'потому', 'этого', 'какой', 'совсем', 'ним', 'здесь', 'этом', 'один', 'почти', 'мой', 'тем', 'чтобы', 'нее', 'сейчас', 'были', 'куда', 'зачем', 'всех', 'никогда', 'можно', 'при', 'наконец', 'два', 'об', 'другой', 'хоть', 'после', 'над', 'больше', 'тот', 'через', 'эти', 'нас', 'про', 'всего', 'них', 'какая', 'много', 'разве', 'три', 'эту', 'моя', 'впрочем', 'хорошо', 'свою', 'этой', 'перед', 'иногда', 'лучше', 'чуть', 'том', 'нельзя', 'такой', 'им', 'более', 'всегда', 'конечно', 'всю', 'между']

In [71]:

mystopwords = stopwords.words('russian') + ['это', 'наш' , 'тыс', 'млн', 'млрд', 'также',  'т', 'д']
def remove_stopwords(text, mystopwords=mystopwords):
    try:
        return " ".join([token for token in text.split() if token not in mystopwords])
    except AttributeError:  # non-string input, e.g. NaN
        return ""

In [72]:

m = Mystem()
def lemmatize(text, mystem=m):
    try:
        # use the analyzer passed as an argument rather than the global m
        return "".join(mystem.lemmatize(text)).strip()
    except Exception:
        return " "

In [73]:

mystoplemmas = ['который','прошлый','сей', 'свой', 'наш', 'мочь']
def remove_stoplemmas(text, mystoplemmas=mystoplemmas):
    try:
        return " ".join([token for token in text.split() if token not in mystoplemmas])
    except AttributeError:  # non-string input, e.g. NaN
        return ""

In [74]:

df.text = df.text.apply(remove_stopwords) 
df.text = df.text.apply(lemmatize)
df.text = df.text.apply(remove_stoplemmas)

In [75]:

lemmata = []
for index, row in df.iterrows():
    lemmata += row['text'].split()
fd = FreqDist(lemmata)
for i in fd.most_common(10):
    print(i)

Out[75]:

('россия', 5643)
('год', 4750)
('москва', 4632)
('человек', 4556)
('путин', 4357)
('президент', 4109)
('выборы', 2849)
('вопрос', 2672)
('российский', 2312)
('время', 2261)

Syntactic analysis

Dependency grammar

Я купил кофе в большом магазине

[Figure: dependency tree of the sentence]

All words in a sentence are linked by head-dependent relations of various subtypes
A tree node is a word in the sentence
A tree arc is a subordination relation
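
A dependency tree is conveniently stored as a list of head indices, one per word. The sketch below encodes one plausible analysis of the example sentence; since the original figure is not available here, the attachments and the Universal Dependencies-style labels are assumptions rather than the author's tree.

```python
# Hypothetical UD-style analysis; head index 0 denotes the artificial root.
words = ["Я", "купил", "кофе", "в", "большом", "магазине"]
heads = [2, 0, 2, 6, 6, 2]  # 1-based index of each word's head
labels = ["nsubj", "root", "obj", "case", "amod", "obl"]

def children(head_index):
    """Words directly subordinate to the word at the 1-based head_index."""
    return [words[i] for i, h in enumerate(heads) if h == head_index]

for i, (word, head, label) in enumerate(zip(words, heads, labels), start=1):
    head_word = "ROOT" if head == 0 else words[head - 1]
    print(f"{i}\t{word}\t{label}\t<- {head_word}")

print(children(2))  # dependents of "купил"
```

Each word has exactly one head, so the list of heads fully determines the tree.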

Universal Dependencies

SyntaxNet

SyntaxNet is a syntactic parser architecture. Pre-trained models are available for more than 40 languages, including Russian.

D. Chen and C. D. Manning. A Fast and Accurate Dependency Parser using Neural Networks. EMNLP. 2014.

In [79]:

!echo "На северо-западе Москвы два подростка провалились под лед" | docker run --rm -i inemo/syntaxnet_rus

Out[79]:

I syntaxnet/term_frequency_map.cc:103] Loaded 34 terms from ./syntaxnet/models/Russian-SynTagRus/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack(1).child(2).label stack(1).child(-2).label; input.token.morphology-set input(1).token.morphology-set input(2).token.morphology-set input(3).token.morphology-set stack.token.morphology-set stack.child(1).token.morphology-set stack.child(1).sibling(-1).token.morphology-set stack.child(-1).token.morphology-set stack.child(-1).sibling(1).token.morphology-set stack.child(2).token.morphology-set stack.child(-2).token.morphology-set stack(1).token.morphology-set stack(1).child(1).token.morphology-set stack(1).child(1).sibling(-1).token.morphology-set stack(1).child(-1).token.morphology-set stack(1).child(-1).sibling(1).token.morphology-set stack(1).child(2).token.morphology-set stack(1).child(-2).token.morphology-set stack(2).token.morphology-set stack(3).token.morphology-set; input.token.tag input(1).token.tag input(2).token.tag input(3).token.tag stack.token.tag stack.child(1).token.tag stack.child(1).sibling(-1).token.tag stack.child(-1).token.tag stack.child(-1).sibling(1).token.tag stack.child(2).token.tag stack.child(-2).token.tag stack(1).token.tag stack(1).child(1).token.tag stack(1).child(1).sibling(-1).token.tag stack(1).child(-1).token.tag stack(1).child(-1).sibling(1).token.tag stack(1).child(2).token.tag stack(1).child(-2).token.tag stack(2).token.tag stack(3).token.tag; input.token.word input(1).token.word input(2).token.word input(3).token.word stack.token.word stack.child(1).token.word stack.child(1).sibling(-1).token.word stack.child(-1).token.word stack.child(-1).sibling(1).token.word stack.child(2).token.word stack.child(-2).token.word stack(1).token.word stack(1).child(1).token.word stack(1).child(1).sibling(-1).token.word 
stack(1).child(-1).token.word stack(1).child(-1).sibling(1).token.word stack(1).child(2).token.word stack(1).child(-2).token.word stack(2).token.word stack(3).token.word 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: labels;morphology;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 32;32;32;64
I syntaxnet/term_frequency_map.cc:103] Loaded 66 terms from ./syntaxnet/models/Russian-SynTagRus/morphology-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 31 terms from ./syntaxnet/models/Russian-SynTagRus/tag-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 103473 terms from ./syntaxnet/models/Russian-SynTagRus/word-map.
INFO:tensorflow:Building training network with parameters: feature_sizes: [12 20 20 20] domain_sizes: [    37     66     33 103475]
I syntaxnet/term_frequency_map.cc:103] Loaded 34 terms from ./syntaxnet/models/Russian-SynTagRus/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.capitalization input(1).capitalization input(2).capitalization input(3).capitalization input(-1).capitalization input(-2).capitalization input(-3).capitalization input(-4).capitalization; input.token.char-ngram input(1).token.char-ngram input(2).token.char-ngram input(3).token.char-ngram input(-1).token.char-ngram input(-2).token.char-ngram input(-3).token.char-ngram input(-4).token.char-ngram; input.digit input.hyphen input.token.punctuation-amount input.token.quote; input.token.prefix(length=2) input(1).token.prefix(length=2) input(2).token.prefix(length=2) input(3).token.prefix(length=2) input(-1).token.prefix(length=2) input(-2).token.prefix(length=2) input(-3).token.prefix(length=2) input(-4).token.prefix(length=2); input.token.prefix(length=3) input(1).token.prefix(length=3) input(2).token.prefix(length=3) input(3).token.prefix(length=3) input(-1).token.prefix(length=3) input(-2).token.prefix(length=3) input(-3).token.prefix(length=3) input(-4).token.prefix(length=3); input.token.suffix(length=2) input(1).token.suffix(length=2) input(2).token.suffix(length=2) input(3).token.suffix(length=2) input(-1).token.suffix(length=2) input(-2).token.suffix(length=2) input(-3).token.suffix(length=2) input(-4).token.suffix(length=2); input.token.suffix(length=3) input(1).token.suffix(length=3) input(2).token.suffix(length=3) input(3).token.suffix(length=3) input(-1).token.suffix(length=3) input(-2).token.suffix(length=3) input(-3).token.suffix(length=3) input(-4).token.suffix(length=3); input(-1).pred-tag input(-2).pred-tag input(-3).pred-tag input(-4).pred-tag; input.token.word input(1).token.word input(2).token.word input(3).token.word input(-1).token.word input(-2).token.word input(-3).token.word input(-4).token.word
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: capitalization;char_ngram;other;prefix2;prefix3;suffix2;suffix3;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 2;16;8;16;16;16;16;16;64
I syntaxnet/term_frequency_map.cc:103] Loaded 18749 terms from ./syntaxnet/models/Russian-SynTagRus/char-ngram-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 31 terms from ./syntaxnet/models/Russian-SynTagRus/tag-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 34 terms from ./syntaxnet/models/Russian-SynTagRus/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.capitalization input(1).capitalization input(2).capitalization input(3).capitalization input(-1).capitalization input(-2).capitalization input(-3).capitalization input(-4).capitalization; input.token.char-ngram input(1).token.char-ngram input(2).token.char-ngram input(3).token.char-ngram input(-1).token.char-ngram input(-2).token.char-ngram input(-3).token.char-ngram input(-4).token.char-ngram; input.digit input.hyphen input.token.punctuation-amount input.token.quote; input.token.prefix(length=2) input(1).token.prefix(length=2) input(2).token.prefix(length=2) input(3).token.prefix(length=2) input(-1).token.prefix(length=2) input(-2).token.prefix(length=2) input(-3).token.prefix(length=2) input(-4).token.prefix(length=2); input.token.prefix(length=3) input(1).token.prefix(length=3) input(2).token.prefix(length=3) input(3).token.prefix(length=3) input(-1).token.prefix(length=3) input(-2).token.prefix(length=3) input(-3).token.prefix(length=3) input(-4).token.prefix(length=3); input.token.suffix(length=2) input(1).token.suffix(length=2) input(2).token.suffix(length=2) input(3).token.suffix(length=2) input(-1).token.suffix(length=2) input(-2).token.suffix(length=2) input(-3).token.suffix(length=2) input(-4).token.suffix(length=2); input.token.suffix(length=3) input(1).token.suffix(length=3) input(2).token.suffix(length=3) input(3).token.suffix(length=3) input(-1).token.suffix(length=3) input(-2).token.suffix(length=3) input(-3).token.suffix(length=3) input(-4).token.suffix(length=3); input(-1).pred-morph-tag input(-2).pred-morph-tag input(-3).pred-morph-tag input(-4).pred-morph-tag; input.token.word input(1).token.word input(2).token.word input(3).token.word input(-1).token.word input(-2).token.word input(-3).token.word input(-4).token.word
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: capitalization;char_ngram;other;prefix2;prefix3;suffix2;suffix3;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 2;16;8;16;16;16;16;16;64
I syntaxnet/term_frequency_map.cc:103] Loaded 18749 terms from ./syntaxnet/models/Russian-SynTagRus/char-ngram-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 103473 terms from ./syntaxnet/models/Russian-SynTagRus/word-map.
INFO:tensorflow:Building training network with parameters: feature_sizes: [8 8 4 8 8 8 8 4 8] domain_sizes: [     7  18750      5   8502   8502   7249   7249     34 103475]
I syntaxnet/term_frequency_map.cc:103] Loaded 103473 terms from ./syntaxnet/models/Russian-SynTagRus/word-map.
INFO:tensorflow:Building training network with parameters: feature_sizes: [8 8 4 8 8 8 8 4 8] domain_sizes: [     7  18750      5   8502   8502   7249   7249    449 103475]
I syntaxnet/term_frequency_map.cc:103] Loaded 34 terms from ./syntaxnet/models/Russian-SynTagRus/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack(1).child(2).label stack(1).child(-2).label; input.token.morphology-set input(1).token.morphology-set input(2).token.morphology-set input(3).token.morphology-set stack.token.morphology-set stack.child(1).token.morphology-set stack.child(1).sibling(-1).token.morphology-set stack.child(-1).token.morphology-set stack.child(-1).sibling(1).token.morphology-set stack.child(2).token.morphology-set stack.child(-2).token.morphology-set stack(1).token.morphology-set stack(1).child(1).token.morphology-set stack(1).child(1).sibling(-1).token.morphology-set stack(1).child(-1).token.morphology-set stack(1).child(-1).sibling(1).token.morphology-set stack(1).child(2).token.morphology-set stack(1).child(-2).token.morphology-set stack(2).token.morphology-set stack(3).token.morphology-set; input.token.tag input(1).token.tag input(2).token.tag input(3).token.tag stack.token.tag stack.child(1).token.tag stack.child(1).sibling(-1).token.tag stack.child(-1).token.tag stack.child(-1).sibling(1).token.tag stack.child(2).token.tag stack.child(-2).token.tag stack(1).token.tag stack(1).child(1).token.tag stack(1).child(1).sibling(-1).token.tag stack(1).child(-1).token.tag stack(1).child(-1).sibling(1).token.tag stack(1).child(2).token.tag stack(1).child(-2).token.tag stack(2).token.tag stack(3).token.tag; input.token.word input(1).token.word input(2).token.word input(3).token.word stack.token.word stack.child(1).token.word stack.child(1).sibling(-1).token.word stack.child(-1).token.word stack.child(-1).sibling(1).token.word stack.child(2).token.word stack.child(-2).token.word stack(1).token.word stack(1).child(1).token.word stack(1).child(1).sibling(-1).token.word 
stack(1).child(-1).token.word stack(1).child(-1).sibling(1).token.word stack(1).child(2).token.word stack(1).child(-2).token.word stack(2).token.word stack(3).token.word 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: labels;morphology;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 32;32;32;64
I syntaxnet/term_frequency_map.cc:103] Loaded 66 terms from ./syntaxnet/models/Russian-SynTagRus/morphology-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 31 terms from ./syntaxnet/models/Russian-SynTagRus/tag-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 103473 terms from ./syntaxnet/models/Russian-SynTagRus/word-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 31 terms from ./syntaxnet/models/Russian-SynTagRus/tag-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 34 terms from ./syntaxnet/models/Russian-SynTagRus/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.capitalization input(1).capitalization input(2).capitalization input(3).capitalization input(-1).capitalization input(-2).capitalization input(-3).capitalization input(-4).capitalization; input.token.char-ngram input(1).token.char-ngram input(2).token.char-ngram input(3).token.char-ngram input(-1).token.char-ngram input(-2).token.char-ngram input(-3).token.char-ngram input(-4).token.char-ngram; input.digit input.hyphen input.token.punctuation-amount input.token.quote; input.token.prefix(length=2) input(1).token.prefix(length=2) input(2).token.prefix(length=2) input(3).token.prefix(length=2) input(-1).token.prefix(length=2) input(-2).token.prefix(length=2) input(-3).token.prefix(length=2) input(-4).token.prefix(length=2); input.token.prefix(length=3) input(1).token.prefix(length=3) input(2).token.prefix(length=3) input(3).token.prefix(length=3) input(-1).token.prefix(length=3) input(-2).token.prefix(length=3) input(-3).token.prefix(length=3) input(-4).token.prefix(length=3); input.token.suffix(length=2) input(1).token.suffix(length=2) input(2).token.suffix(length=2) input(3).token.suffix(length=2) input(-1).token.suffix(length=2) input(-2).token.suffix(length=2) input(-3).token.suffix(length=2) input(-4).token.suffix(length=2); input.token.suffix(length=3) input(1).token.suffix(length=3) input(2).token.suffix(length=3) input(3).token.suffix(length=3) input(-1).token.suffix(length=3) input(-2).token.suffix(length=3) input(-3).token.suffix(length=3) input(-4).token.suffix(length=3); input(-1).pred-tag input(-2).pred-tag input(-3).pred-tag input(-4).pred-tag; input.token.word input(1).token.word input(2).token.word input(3).token.word input(-1).token.word input(-2).token.word input(-3).token.word input(-4).token.word
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: capitalization;char_ngram;other;prefix2;prefix3;suffix2;suffix3;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 2;16;8;16;16;16;16;16;64
I syntaxnet/term_frequency_map.cc:103] Loaded 18749 terms from ./syntaxnet/models/Russian-SynTagRus/char-ngram-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 103473 terms from ./syntaxnet/models/Russian-SynTagRus/word-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 34 terms from ./syntaxnet/models/Russian-SynTagRus/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.capitalization input(1).capitalization input(2).capitalization input(3).capitalization input(-1).capitalization input(-2).capitalization input(-3).capitalization input(-4).capitalization; input.token.char-ngram input(1).token.char-ngram input(2).token.char-ngram input(3).token.char-ngram input(-1).token.char-ngram input(-2).token.char-ngram input(-3).token.char-ngram input(-4).token.char-ngram; input.digit input.hyphen input.token.punctuation-amount input.token.quote; input.token.prefix(length=2) input(1).token.prefix(length=2) input(2).token.prefix(length=2) input(3).token.prefix(length=2) input(-1).token.prefix(length=2) input(-2).token.prefix(length=2) input(-3).token.prefix(length=2) input(-4).token.prefix(length=2); input.token.prefix(length=3) input(1).token.prefix(length=3) input(2).token.prefix(length=3) input(3).token.prefix(length=3) input(-1).token.prefix(length=3) input(-2).token.prefix(length=3) input(-3).token.prefix(length=3) input(-4).token.prefix(length=3); input.token.suffix(length=2) input(1).token.suffix(length=2) input(2).token.suffix(length=2) input(3).token.suffix(length=2) input(-1).token.suffix(length=2) input(-2).token.suffix(length=2) input(-3).token.suffix(length=2) input(-4).token.suffix(length=2); input.token.suffix(length=3) input(1).token.suffix(length=3) input(2).token.suffix(length=3) input(3).token.suffix(length=3) input(-1).token.suffix(length=3) input(-2).token.suffix(length=3) input(-3).token.suffix(length=3) input(-4).token.suffix(length=3); input(-1).pred-morph-tag input(-2).pred-morph-tag input(-3).pred-morph-tag input(-4).pred-morph-tag; input.token.word input(1).token.word input(2).token.word input(3).token.word input(-1).token.word input(-2).token.word input(-3).token.word input(-4).token.word
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: capitalization;char_ngram;other;prefix2;prefix3;suffix2;suffix3;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 2;16;8;16;16;16;16;16;64
I syntaxnet/term_frequency_map.cc:103] Loaded 18749 terms from ./syntaxnet/models/Russian-SynTagRus/char-ngram-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 103473 terms from ./syntaxnet/models/Russian-SynTagRus/word-map.
INFO:tensorflow:Processed 1 documents
INFO:tensorflow:Total processed documents: 1
INFO:tensorflow:num correct tokens: 0
INFO:tensorflow:total tokens: 8
INFO:tensorflow:Seconds elapsed in evaluation: 0.16, eval metric: 0.00%
INFO:tensorflow:Processed 1 documents
INFO:tensorflow:Total processed documents: 1
INFO:tensorflow:num correct tokens: 0
INFO:tensorflow:total tokens: 8
INFO:tensorflow:Seconds elapsed in evaluation: 0.81, eval metric: 0.00%
INFO:tensorflow:Processed 1 documents
INFO:tensorflow:Total processed documents: 1
INFO:tensorflow:num correct tokens: 1
INFO:tensorflow:total tokens: 8
INFO:tensorflow:Seconds elapsed in evaluation: 1.89, eval metric: 12.50%
1	На	_	ADP	_	fPOS=ADP++	2	case	_	_
2	северо-западе	_	NOUN	_	Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|fPOS=NOUN++	6	nmod	_	_
3	Москвы	_	NOUN	_	Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing|fPOS=NOUN++	2	dobj	_	_
4	два	_	NUM	_	Case=Nom|Gender=Masc|fPOS=NUM++	5	nummod	_	_
5	подростка	_	NOUN	_	Animacy=Anim|Case=Gen|Gender=Masc|Number=Sing|fPOS=NOUN++	6	nsubj	_	_
6	провалились	_	VERB	_	Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act|fPOS=VERB++	0	ROOT	_	_
7	под	_	ADP	_	fPOS=ADP++	8	case	_	_
8	лед	_	NOUN	_	Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|fPOS=NOUN++	6	dobj	_	_
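
Each row of the parser output above has ten tab-separated columns in the CoNLL-U layout: token index, word form, lemma, universal POS tag, language-specific tag, morphological features, head index, dependency relation, and two columns left empty here. As a minimal sketch, the helper below turns one such row into a dictionary. The name `parse_conll_row` and the dictionary keys are illustrative assumptions, not part of SyntaxNet or NLTK.

```python
# Column names follow the CoNLL-U convention; the helper name is illustrative.
CONLL_COLUMNS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_conll_row(row):
    """Split one tab-separated CoNLL row into a dict of named fields."""
    fields = dict(zip(CONLL_COLUMNS, row.rstrip("\n").split("\t")))
    fields["id"] = int(fields["id"])
    fields["head"] = int(fields["head"])
    # FEATS looks like "Animacy=Inan|Case=Loc|...": split it into key/value pairs.
    feats = {}
    if fields["feats"] != "_":
        for pair in fields["feats"].split("|"):
            key, _, value = pair.partition("=")
            feats[key] = value
    fields["feats"] = feats
    return fields


# Row 2 of the parser output shown above.
row = ("2\tсеверо-западе\t_\tNOUN\t_\t"
       "Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|fPOS=NOUN++\t6\tnmod\t_\t_")
token = parse_conll_row(row)
print(token["form"], token["upos"], token["head"], token["deprel"])
print(token["feats"]["Case"])
```

A head index of 0 marks the root of the sentence; any other value points at the `id` of the governing token.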

In [78]:

! cat ../data/sentences.txt | docker run --rm -i inemo/syntaxnet_rus > ../data/sentences.conll

Out[78]:

I syntaxnet/term_frequency_map.cc:103] Loaded 34 terms from ./syntaxnet/models/Russian-SynTagRus/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack(1).child(2).label stack(1).child(-2).label; input.token.morphology-set input(1).token.morphology-set input(2).token.morphology-set input(3).token.morphology-set stack.token.morphology-set stack.child(1).token.morphology-set stack.child(1).sibling(-1).token.morphology-set stack.child(-1).token.morphology-set stack.child(-1).sibling(1).token.morphology-set stack.child(2).token.morphology-set stack.child(-2).token.morphology-set stack(1).token.morphology-set stack(1).child(1).token.morphology-set stack(1).child(1).sibling(-1).token.morphology-set stack(1).child(-1).token.morphology-set stack(1).child(-1).sibling(1).token.morphology-set stack(1).child(2).token.morphology-set stack(1).child(-2).token.morphology-set stack(2).token.morphology-set stack(3).token.morphology-set; input.token.tag input(1).token.tag input(2).token.tag input(3).token.tag stack.token.tag stack.child(1).token.tag stack.child(1).sibling(-1).token.tag stack.child(-1).token.tag stack.child(-1).sibling(1).token.tag stack.child(2).token.tag stack.child(-2).token.tag stack(1).token.tag stack(1).child(1).token.tag stack(1).child(1).sibling(-1).token.tag stack(1).child(-1).token.tag stack(1).child(-1).sibling(1).token.tag stack(1).child(2).token.tag stack(1).child(-2).token.tag stack(2).token.tag stack(3).token.tag; input.token.word input(1).token.word input(2).token.word input(3).token.word stack.token.word stack.child(1).token.word stack.child(1).sibling(-1).token.word stack.child(-1).token.word stack.child(-1).sibling(1).token.word stack.child(2).token.word stack.child(-2).token.word stack(1).token.word stack(1).child(1).token.word stack(1).child(1).sibling(-1).token.word 
stack(1).child(-1).token.word stack(1).child(-1).sibling(1).token.word stack(1).child(2).token.word stack(1).child(-2).token.word stack(2).token.word stack(3).token.word 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: labels;morphology;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 32;32;32;64
I syntaxnet/term_frequency_map.cc:103] Loaded 66 terms from ./syntaxnet/models/Russian-SynTagRus/morphology-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 31 terms from ./syntaxnet/models/Russian-SynTagRus/tag-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 34 terms from ./syntaxnet/models/Russian-SynTagRus/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.capitalization input(1).capitalization input(2).capitalization input(3).capitalization input(-1).capitalization input(-2).capitalization input(-3).capitalization input(-4).capitalization; input.token.char-ngram input(1).token.char-ngram input(2).token.char-ngram input(3).token.char-ngram input(-1).token.char-ngram input(-2).token.char-ngram input(-3).token.char-ngram input(-4).token.char-ngram; input.digit input.hyphen input.token.punctuation-amount input.token.quote; input.token.prefix(length=2) input(1).token.prefix(length=2) input(2).token.prefix(length=2) input(3).token.prefix(length=2) input(-1).token.prefix(length=2) input(-2).token.prefix(length=2) input(-3).token.prefix(length=2) input(-4).token.prefix(length=2); input.token.prefix(length=3) input(1).token.prefix(length=3) input(2).token.prefix(length=3) input(3).token.prefix(length=3) input(-1).token.prefix(length=3) input(-2).token.prefix(length=3) input(-3).token.prefix(length=3) input(-4).token.prefix(length=3); input.token.suffix(length=2) input(1).token.suffix(length=2) input(2).token.suffix(length=2) input(3).token.suffix(length=2) input(-1).token.suffix(length=2) input(-2).token.suffix(length=2) input(-3).token.suffix(length=2) input(-4).token.suffix(length=2); input.token.suffix(length=3) input(1).token.suffix(length=3) input(2).token.suffix(length=3) input(3).token.suffix(length=3) input(-1).token.suffix(length=3) input(-2).token.suffix(length=3) input(-3).token.suffix(length=3) input(-4).token.suffix(length=3); input(-1).pred-tag input(-2).pred-tag input(-3).pred-tag input(-4).pred-tag; input.token.word input(1).token.word input(2).token.word input(3).token.word input(-1).token.word input(-2).token.word input(-3).token.word input(-4).token.word
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: capitalization;char_ngram;other;prefix2;prefix3;suffix2;suffix3;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 2;16;8;16;16;16;16;16;64
I syntaxnet/term_frequency_map.cc:103] Loaded 18749 terms from ./syntaxnet/models/Russian-SynTagRus/char-ngram-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 31 terms from ./syntaxnet/models/Russian-SynTagRus/tag-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 103473 terms from ./syntaxnet/models/Russian-SynTagRus/word-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 34 terms from ./syntaxnet/models/Russian-SynTagRus/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.capitalization input(1).capitalization input(2).capitalization input(3).capitalization input(-1).capitalization input(-2).capitalization input(-3).capitalization input(-4).capitalization; input.token.char-ngram input(1).token.char-ngram input(2).token.char-ngram input(3).token.char-ngram input(-1).token.char-ngram input(-2).token.char-ngram input(-3).token.char-ngram input(-4).token.char-ngram; input.digit input.hyphen input.token.punctuation-amount input.token.quote; input.token.prefix(length=2) input(1).token.prefix(length=2) input(2).token.prefix(length=2) input(3).token.prefix(length=2) input(-1).token.prefix(length=2) input(-2).token.prefix(length=2) input(-3).token.prefix(length=2) input(-4).token.prefix(length=2); input.token.prefix(length=3) input(1).token.prefix(length=3) input(2).token.prefix(length=3) input(3).token.prefix(length=3) input(-1).token.prefix(length=3) input(-2).token.prefix(length=3) input(-3).token.prefix(length=3) input(-4).token.prefix(length=3); input.token.suffix(length=2) input(1).token.suffix(length=2) input(2).token.suffix(length=2) input(3).token.suffix(length=2) input(-1).token.suffix(length=2) input(-2).token.suffix(length=2) input(-3).token.suffix(length=2) input(-4).token.suffix(length=2); input.token.suffix(length=3) input(1).token.suffix(length=3) input(2).token.suffix(length=3) input(3).token.suffix(length=3) input(-1).token.suffix(length=3) input(-2).token.suffix(length=3) input(-3).token.suffix(length=3) input(-4).token.suffix(length=3); input(-1).pred-morph-tag input(-2).pred-morph-tag input(-3).pred-morph-tag input(-4).pred-morph-tag; input.token.word input(1).token.word input(2).token.word input(3).token.word input(-1).token.word input(-2).token.word input(-3).token.word input(-4).token.word
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: capitalization;char_ngram;other;prefix2;prefix3;suffix2;suffix3;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 2;16;8;16;16;16;16;16;64
INFO:tensorflow:Building training network with parameters: feature_sizes: [8 8 4 8 8 8 8 4 8] domain_sizes: [     7  18750      5   8502   8502   7249   7249     34 103475]
I syntaxnet/term_frequency_map.cc:103] Loaded 103473 terms from ./syntaxnet/models/Russian-SynTagRus/word-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 18749 terms from ./syntaxnet/models/Russian-SynTagRus/char-ngram-map.
INFO:tensorflow:Building training network with parameters: feature_sizes: [12 20 20 20] domain_sizes: [    37     66     33 103475]
I syntaxnet/term_frequency_map.cc:103] Loaded 103473 terms from ./syntaxnet/models/Russian-SynTagRus/word-map.
INFO:tensorflow:Building training network with parameters: feature_sizes: [8 8 4 8 8 8 8 4 8] domain_sizes: [     7  18750      5   8502   8502   7249   7249    449 103475]
I syntaxnet/term_frequency_map.cc:103] Loaded 34 terms from ./syntaxnet/models/Russian-SynTagRus/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: stack.child(1).label stack.child(1).sibling(-1).label stack.child(-1).label stack.child(-1).sibling(1).label stack.child(2).label stack.child(-2).label stack(1).child(1).label stack(1).child(1).sibling(-1).label stack(1).child(-1).label stack(1).child(-1).sibling(1).label stack(1).child(2).label stack(1).child(-2).label; input.token.morphology-set input(1).token.morphology-set input(2).token.morphology-set input(3).token.morphology-set stack.token.morphology-set stack.child(1).token.morphology-set stack.child(1).sibling(-1).token.morphology-set stack.child(-1).token.morphology-set stack.child(-1).sibling(1).token.morphology-set stack.child(2).token.morphology-set stack.child(-2).token.morphology-set stack(1).token.morphology-set stack(1).child(1).token.morphology-set stack(1).child(1).sibling(-1).token.morphology-set stack(1).child(-1).token.morphology-set stack(1).child(-1).sibling(1).token.morphology-set stack(1).child(2).token.morphology-set stack(1).child(-2).token.morphology-set stack(2).token.morphology-set stack(3).token.morphology-set; input.token.tag input(1).token.tag input(2).token.tag input(3).token.tag stack.token.tag stack.child(1).token.tag stack.child(1).sibling(-1).token.tag stack.child(-1).token.tag stack.child(-1).sibling(1).token.tag stack.child(2).token.tag stack.child(-2).token.tag stack(1).token.tag stack(1).child(1).token.tag stack(1).child(1).sibling(-1).token.tag stack(1).child(-1).token.tag stack(1).child(-1).sibling(1).token.tag stack(1).child(2).token.tag stack(1).child(-2).token.tag stack(2).token.tag stack(3).token.tag; input.token.word input(1).token.word input(2).token.word input(3).token.word stack.token.word stack.child(1).token.word stack.child(1).sibling(-1).token.word stack.child(-1).token.word stack.child(-1).sibling(1).token.word stack.child(2).token.word stack.child(-2).token.word stack(1).token.word stack(1).child(1).token.word stack(1).child(1).sibling(-1).token.word 
stack(1).child(-1).token.word stack(1).child(-1).sibling(1).token.word stack(1).child(2).token.word stack(1).child(-2).token.word stack(2).token.word stack(3).token.word 
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: labels;morphology;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 32;32;32;64
I syntaxnet/term_frequency_map.cc:103] Loaded 66 terms from ./syntaxnet/models/Russian-SynTagRus/morphology-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 31 terms from ./syntaxnet/models/Russian-SynTagRus/tag-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 103473 terms from ./syntaxnet/models/Russian-SynTagRus/word-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 31 terms from ./syntaxnet/models/Russian-SynTagRus/tag-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 34 terms from ./syntaxnet/models/Russian-SynTagRus/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.capitalization input(1).capitalization input(2).capitalization input(3).capitalization input(-1).capitalization input(-2).capitalization input(-3).capitalization input(-4).capitalization; input.token.char-ngram input(1).token.char-ngram input(2).token.char-ngram input(3).token.char-ngram input(-1).token.char-ngram input(-2).token.char-ngram input(-3).token.char-ngram input(-4).token.char-ngram; input.digit input.hyphen input.token.punctuation-amount input.token.quote; input.token.prefix(length=2) input(1).token.prefix(length=2) input(2).token.prefix(length=2) input(3).token.prefix(length=2) input(-1).token.prefix(length=2) input(-2).token.prefix(length=2) input(-3).token.prefix(length=2) input(-4).token.prefix(length=2); input.token.prefix(length=3) input(1).token.prefix(length=3) input(2).token.prefix(length=3) input(3).token.prefix(length=3) input(-1).token.prefix(length=3) input(-2).token.prefix(length=3) input(-3).token.prefix(length=3) input(-4).token.prefix(length=3); input.token.suffix(length=2) input(1).token.suffix(length=2) input(2).token.suffix(length=2) input(3).token.suffix(length=2) input(-1).token.suffix(length=2) input(-2).token.suffix(length=2) input(-3).token.suffix(length=2) input(-4).token.suffix(length=2); input.token.suffix(length=3) input(1).token.suffix(length=3) input(2).token.suffix(length=3) input(3).token.suffix(length=3) input(-1).token.suffix(length=3) input(-2).token.suffix(length=3) input(-3).token.suffix(length=3) input(-4).token.suffix(length=3); input(-1).pred-tag input(-2).pred-tag input(-3).pred-tag input(-4).pred-tag; input.token.word input(1).token.word input(2).token.word input(3).token.word input(-1).token.word input(-2).token.word input(-3).token.word input(-4).token.word
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: capitalization;char_ngram;other;prefix2;prefix3;suffix2;suffix3;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 2;16;8;16;16;16;16;16;64
I syntaxnet/term_frequency_map.cc:103] Loaded 18749 terms from ./syntaxnet/models/Russian-SynTagRus/char-ngram-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 103473 terms from ./syntaxnet/models/Russian-SynTagRus/word-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 34 terms from ./syntaxnet/models/Russian-SynTagRus/label-map.
I syntaxnet/embedding_feature_extractor.cc:35] Features: input.capitalization input(1).capitalization input(2).capitalization input(3).capitalization input(-1).capitalization input(-2).capitalization input(-3).capitalization input(-4).capitalization; input.token.char-ngram input(1).token.char-ngram input(2).token.char-ngram input(3).token.char-ngram input(-1).token.char-ngram input(-2).token.char-ngram input(-3).token.char-ngram input(-4).token.char-ngram; input.digit input.hyphen input.token.punctuation-amount input.token.quote; input.token.prefix(length=2) input(1).token.prefix(length=2) input(2).token.prefix(length=2) input(3).token.prefix(length=2) input(-1).token.prefix(length=2) input(-2).token.prefix(length=2) input(-3).token.prefix(length=2) input(-4).token.prefix(length=2); input.token.prefix(length=3) input(1).token.prefix(length=3) input(2).token.prefix(length=3) input(3).token.prefix(length=3) input(-1).token.prefix(length=3) input(-2).token.prefix(length=3) input(-3).token.prefix(length=3) input(-4).token.prefix(length=3); input.token.suffix(length=2) input(1).token.suffix(length=2) input(2).token.suffix(length=2) input(3).token.suffix(length=2) input(-1).token.suffix(length=2) input(-2).token.suffix(length=2) input(-3).token.suffix(length=2) input(-4).token.suffix(length=2); input.token.suffix(length=3) input(1).token.suffix(length=3) input(2).token.suffix(length=3) input(3).token.suffix(length=3) input(-1).token.suffix(length=3) input(-2).token.suffix(length=3) input(-3).token.suffix(length=3) input(-4).token.suffix(length=3); input(-1).pred-morph-tag input(-2).pred-morph-tag input(-3).pred-morph-tag input(-4).pred-morph-tag; input.token.word input(1).token.word input(2).token.word input(3).token.word input(-1).token.word input(-2).token.word input(-3).token.word input(-4).token.word
I syntaxnet/embedding_feature_extractor.cc:36] Embedding names: capitalization;char_ngram;other;prefix2;prefix3;suffix2;suffix3;tags;words
I syntaxnet/embedding_feature_extractor.cc:37] Embedding dims: 2;16;8;16;16;16;16;16;64
I syntaxnet/term_frequency_map.cc:103] Loaded 18749 terms from ./syntaxnet/models/Russian-SynTagRus/char-ngram-map.
I syntaxnet/term_frequency_map.cc:103] Loaded 103473 terms from ./syntaxnet/models/Russian-SynTagRus/word-map.
INFO:tensorflow:Processed 6 documents
INFO:tensorflow:Total processed documents: 6
INFO:tensorflow:num correct tokens: 0
INFO:tensorflow:total tokens: 48
INFO:tensorflow:Seconds elapsed in evaluation: 0.24, eval metric: 0.00%
INFO:tensorflow:Processed 6 documents
INFO:tensorflow:Total processed documents: 6
INFO:tensorflow:num correct tokens: 0
INFO:tensorflow:total tokens: 48
INFO:tensorflow:Seconds elapsed in evaluation: 0.52, eval metric: 0.00%
INFO:tensorflow:Processed 6 documents
INFO:tensorflow:Total processed documents: 6
INFO:tensorflow:num correct tokens: 6
INFO:tensorflow:total tokens: 48
INFO:tensorflow:Seconds elapsed in evaluation: 1.30, eval metric: 12.50%

In [0]:

Processing CoNLL files:

In [91]:

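from nltk import DependencyGraph

# Read the CoNLL file and group token rows into sentences;
# a blank line marks the end of a sentence.
processed_sentences = []
sentence = []
with open('data.conll', encoding='utf-8') as conll_file:
    for line in conll_file:
        line = line.rstrip('\r\n')
        if not line.strip():
            if sentence:
                processed_sentences.append(sentence)
            sentence = []
        else:
            sentence.append(line.split('\t'))
# Keep the last sentence if the file does not end with a blank line.
if sentence:
    processed_sentences.append(sentence)

# Join each sentence back into one tab-separated block for DependencyGraph.
deps = ['\n'.join('\t'.join(word) for word in sentence) + '\n'
        for sentence in processed_sentences]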
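
The rows collected in `processed_sentences` already encode the tree: column 7 holds the head index and column 8 the relation. The following is a minimal sketch, independent of NLTK, that groups the dependents of each head. The rows reuse the sentence from the parser output earlier in the notebook, with the morphological features replaced by `_` for brevity; `children_by_head` is an illustrative name.

```python
from collections import defaultdict

# One parsed sentence as a list of CoNLL rows. The features column is
# shortened to "_" here; indices, forms, tags, heads and relations are
# taken from the parser output shown earlier.
sentence = [
    ["1", "На", "_", "ADP", "_", "_", "2", "case", "_", "_"],
    ["2", "северо-западе", "_", "NOUN", "_", "_", "6", "nmod", "_", "_"],
    ["3", "Москвы", "_", "NOUN", "_", "_", "2", "dobj", "_", "_"],
    ["4", "два", "_", "NUM", "_", "_", "5", "nummod", "_", "_"],
    ["5", "подростка", "_", "NOUN", "_", "_", "6", "nsubj", "_", "_"],
    ["6", "провалились", "_", "VERB", "_", "_", "0", "ROOT", "_", "_"],
    ["7", "под", "_", "ADP", "_", "_", "8", "case", "_", "_"],
    ["8", "лед", "_", "NOUN", "_", "_", "6", "dobj", "_", "_"],
]


def children_by_head(rows):
    """Map each head index to the (form, relation) pairs of its dependents."""
    children = defaultdict(list)
    for row in rows:
        children[int(row[6])].append((row[1], row[7]))
    return children


children = children_by_head(sentence)
root_form, _ = children[0][0]   # head index 0 marks the root
print(root_form)
print(children[6])              # direct dependents of the root verb
```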

In [0]:

Syntax trees:

In [93]:

for sent_dep in deps:
    graph = DependencyGraph(tree_str=sent_dep)
    tree = graph.tree()
    # pretty_print() draws the tree itself and returns None,
    # so it is not wrapped in print().
    tree.pretty_print()

Out[93]:

               испек
   ______________|__________
  |       |      |       помощью
  |       |      |     _____|__________
  |       |      |    |            интеллекта
  |       |      |    |                |
Google печенье   .    с          искусственного

             стал
    __________|_______
   |     |         звездой
   |     |     _______|_____
   |     |    |       |    НХЛ
   |     |    |       |     |
Овечкин  .  первой   дня    в

    задержали
  ______|_______________
 |      |         подозреваемого
 |      |               |
 |    Кубани         убийстве
 |      |       ________|___________
 .      На     в                 двойном

    вынес
  ____|_______________
 |    |      |     приговор
 |    |      |        |
 |    |   Ростове   банде
 |    |      |        |
суд   .      В    «амазонок»

                   закрасили
     __________________|____________
    |        |         |         жертвам
    |        |         |            |
    |        |         |         теракта
    |        |         |            |
    |        |         |          метро
    |        |         |       _____|________
Чиновники мемориал     .      в          питерском

     продлил
  ______|____________
 |      |      |  контракт
 |      |      |     |
 |      |      |  Гуламом
 |      |      |     |
Клуб «Наполи»  .     с

In [0]:

Head-relation-dependent triples:

In [89]:

for sent_dep in deps:
    graph = DependencyGraph(tree_str=sent_dep)
    # triples() yields ((head, head_tag), relation, (dependent, dependent_tag)).
    print(list(graph.triples()))
    print()

Out[89]:

[(('испек', 'VERB'), 'dobj', ('Google', 'NOUN')), (('испек', 'VERB'), 'nmod', ('помощью', 'NOUN')), (('помощью', 'NOUN'), 'case', ('с', 'ADP')), (('помощью', 'NOUN'), 'nmod', ('интеллекта', 'NOUN')), (('интеллекта', 'NOUN'), 'amod', ('искусственного', 'ADJ')), (('испек', 'VERB'), 'nsubj', ('печенье', 'NOUN')), (('испек', 'VERB'), 'punct', ('.', 'PUNCT'))]

[(('стал', 'VERB'), 'nsubj', ('Овечкин', 'NOUN')), (('стал', 'VERB'), 'nmod', ('звездой', 'NOUN')), (('звездой', 'NOUN'), 'amod', ('первой', 'ADJ')), (('звездой', 'NOUN'), 'nmod', ('дня', 'NOUN')), (('звездой', 'NOUN'), 'nmod', ('НХЛ', 'NOUN')), (('НХЛ', 'NOUN'), 'case', ('в', 'ADP')), (('стал', 'VERB'), 'punct', ('.', 'PUNCT'))]

[(('задержали', 'VERB'), 'nmod', ('Кубани', 'NOUN')), (('Кубани', 'NOUN'), 'case', ('На', 'ADP')), (('задержали', 'VERB'), 'dobj', ('подозреваемого', 'NOUN')), (('подозреваемого', 'NOUN'), 'nmod', ('убийстве', 'NOUN')), (('убийстве', 'NOUN'), 'case', ('в', 'ADP')), (('убийстве', 'NOUN'), 'amod', ('двойном', 'ADJ')), (('задержали', 'VERB'), 'punct', ('.', 'PUNCT'))]

[(('вынес', 'VERB'), 'nmod', ('Ростове', 'NOUN')), (('Ростове', 'NOUN'), 'case', ('В', 'ADP')), (('вынес', 'VERB'), 'nsubj', ('суд', 'NOUN')), (('вынес', 'VERB'), 'dobj', ('приговор', 'NOUN')), (('приговор', 'NOUN'), 'nmod', ('банде', 'NOUN')), (('банде', 'NOUN'), 'nmod', ('«амазонок»', 'NOUN')), (('вынес', 'VERB'), 'punct', ('.', 'PUNCT'))]

[(('закрасили', 'VERB'), 'nsubj', ('Чиновники', 'NOUN')), (('закрасили', 'VERB'), 'dobj', ('мемориал', 'NOUN')), (('закрасили', 'VERB'), 'nmod', ('жертвам', 'NOUN')), (('жертвам', 'NOUN'), 'nmod', ('теракта', 'NOUN')), (('теракта', 'NOUN'), 'nmod', ('метро', 'NOUN')), (('метро', 'NOUN'), 'case', ('в', 'ADP')), (('метро', 'NOUN'), 'amod', ('питерском', 'ADJ')), (('закрасили', 'VERB'), 'punct', ('.', 'PUNCT'))]

[(('продлил', 'VERB'), 'nsubj', ('Клуб', 'NOUN')), (('продлил', 'VERB'), 'advmod', ('«Наполи»', 'ADV')), (('продлил', 'VERB'), 'dobj', ('контракт', 'NOUN')), (('контракт', 'NOUN'), 'nmod', ('Гуламом', 'NOUN')), (('Гуламом', 'NOUN'), 'case', ('с', 'ADP')), (('продлил', 'VERB'), 'punct', ('.', 'PUNCT'))]
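
Because each triple is a plain tuple, ordinary Python tools are enough to summarise or filter them. As a small sketch, the snippet below takes the triples printed above for the Ovechkin sentence, copied in as literal data so it runs without the parser, counts how often each relation occurs, and picks out the adjectival modifiers.

```python
from collections import Counter

# Triples copied from the output above (second sentence).
triples = [
    (('стал', 'VERB'), 'nsubj', ('Овечкин', 'NOUN')),
    (('стал', 'VERB'), 'nmod', ('звездой', 'NOUN')),
    (('звездой', 'NOUN'), 'amod', ('первой', 'ADJ')),
    (('звездой', 'NOUN'), 'nmod', ('дня', 'NOUN')),
    (('звездой', 'NOUN'), 'nmod', ('НХЛ', 'NOUN')),
    (('НХЛ', 'NOUN'), 'case', ('в', 'ADP')),
    (('стал', 'VERB'), 'punct', ('.', 'PUNCT')),
]

# How often does each dependency relation occur in the sentence?
relation_counts = Counter(rel for _, rel, _ in triples)

# (head, dependent) pairs connected by an adjectival modifier relation.
modifiers = [(head, dep) for (head, _), rel, (dep, _) in triples
             if rel == 'amod']

print(relation_counts.most_common(1))
print(modifiers)
```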

Subject-object-verb triples:

In [83]:

for sent_dep in deps:
    graph = DependencyGraph(tree_str=sent_dep)
    # For every verb that heads a relation, record its subject and direct object.
    sov = {}
    for (head, head_tag), rel, (dep, _) in graph.triples():
        if head_tag != 'VERB':
            continue
        roles = sov.setdefault(head, {'subj': '', 'obj': ''})
        if rel == 'nsubj':
            roles['subj'] = dep
        elif rel == 'dobj':
            roles['obj'] = dep

    for verb, roles in sov.items():
        print(verb, roles)

Out[83]:

испек {'subj': 'печенье', 'obj': 'Google'}
стал {'subj': 'Овечкин', 'obj': ''}
задержали {'subj': '', 'obj': 'подозреваемого'}
вынес {'subj': 'суд', 'obj': 'приговор'}
закрасили {'subj': 'Чиновники', 'obj': 'мемориал'}
продлил {'subj': 'Клуб', 'obj': 'контракт'}

Task 5

Modify the code above so that it also handles:
1. Coordinated (homogeneous) sentence members
- (парк, площадка), (Германия, Швейцария)
2. Compound predicates
- (начнет продавать), (запретил провозить)
3. Indirect (oblique) objects
- (едет, Польшу), (спел, скандале)
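
One possible starting point for item 3 only, offered as a sketch rather than a reference solution. It assumes that oblique objects show up the way they do in the output above: as `nmod` dependents of a verb that carry a preposition through a `case` relation. The function name `indirect_objects` is illustrative, and the triples are copied from the third sentence of the output so the snippet runs without the parser.

```python
# Triples copied from the output above (third sentence).
triples = [
    (('задержали', 'VERB'), 'nmod', ('Кубани', 'NOUN')),
    (('Кубани', 'NOUN'), 'case', ('На', 'ADP')),
    (('задержали', 'VERB'), 'dobj', ('подозреваемого', 'NOUN')),
    (('подозреваемого', 'NOUN'), 'nmod', ('убийстве', 'NOUN')),
    (('убийстве', 'NOUN'), 'case', ('в', 'ADP')),
    (('убийстве', 'NOUN'), 'amod', ('двойном', 'ADJ')),
    (('задержали', 'VERB'), 'punct', ('.', 'PUNCT')),
]


def indirect_objects(triples):
    """Return (verb, preposition, noun) for verb dependents marked by a preposition.

    Assumption: an oblique object is an 'nmod' dependent of a VERB that itself
    has a 'case' dependent. Words are keyed by surface form, so a sentence that
    repeats the same form would need token indices instead.
    """
    # Preposition attached to each word through the 'case' relation.
    prepositions = {head: dep for (head, _), rel, (dep, _) in triples
                    if rel == 'case'}
    found = []
    for (head, head_tag), rel, (dep, _) in triples:
        if head_tag == 'VERB' and rel == 'nmod' and dep in prepositions:
            found.append((head, prepositions[dep], dep))
    return found


print(indirect_objects(triples))
```

The noun-to-noun `nmod` (подозреваемого, убийстве) is skipped on purpose, since its head is not a verb. Coordinated members and compound predicates need other relations and are not covered by this sketch.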