Kernel: Python 3 (Ubuntu Linux)

Visual Data Analysis

Import the required libraries.

In [1]:

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

Load the dataset.

In [2]:

data = pd.read_csv("telecom-churn.csv")

Check that everything was read and parsed correctly.

In [3]:

data.head()

Out[3]:

We can get a summary and a general idea of the data types.

In [4]:

data.info()

Out[4]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
state                     3333 non-null object
account length            3333 non-null int64
area code                 3333 non-null int64
phone number              3333 non-null object
international plan        3333 non-null object
voice mail plan           3333 non-null object
number vmail messages     3333 non-null int64
total day minutes         3333 non-null float64
total day calls           3333 non-null int64
total day charge          3333 non-null float64
total eve minutes         3333 non-null float64
total eve calls           3333 non-null int64
total eve charge          3333 non-null float64
total night minutes       3333 non-null float64
total night calls         3333 non-null int64
total night charge        3333 non-null float64
total intl minutes        3333 non-null float64
total intl calls          3333 non-null int64
total intl charge         3333 non-null float64
customer service calls    3333 non-null int64
churn                     3333 non-null bool
dtypes: bool(1), float64(8), int64(8), object(4)
memory usage: 524.1+ KB

The target variable is churn, which indicates whether a subscriber left (True) or stayed loyal (False). It is a categorical feature, more precisely a binary one. Let's see how its values are distributed.

In [5]:

data['churn'].value_counts()

Out[5]:

False    2850
True      483
Name: churn, dtype: int64

We see that 2850 of the 3333 subscribers are loyal. What is that as a percentage?

In [6]:

data['churn'].value_counts(normalize=True)

Out[6]:

False    0.855086
True     0.144914
Name: churn, dtype: float64
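To make explicit what `normalize=True` does, here is a minimal sketch that rebuilds a churn column with the same counts as in Out[5] (2850 False, 483 True) and divides the counts by their total by hand. The names `churn`, `counts`, `shares` and `percent` are introduced here for illustration only.

```python
import pandas as pd

# Rebuild a column with the same class counts as in Out[5]:
# 2850 loyal subscribers (False) and 483 churned ones (True).
churn = pd.Series([False] * 2850 + [True] * 483, name='churn')

# value_counts() sorts by frequency, so the loyal class comes first.
counts = churn.value_counts()

# normalize=True is equivalent to dividing each count by the total.
shares = counts / counts.sum()

# The same shares expressed as percentages, rounded to two decimals.
percent = (shares * 100).round(2)
print(percent)
```

So roughly 85.5% of subscribers are loyal and 14.5% churned: the classes are clearly imbalanced.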

Let's visualize this.

In [12]:

data['churn'].value_counts(normalize=True).plot(kind='bar',
                                                title='The churn feature');

Out[12]:

We may also want to know how many of our customers have the international plan (roaming).

In [14]:

data['international plan'].value_counts(normalize=True).plot(kind='bar');

Out[14]:

What about the customers who left (churn == True)?

In [16]:

churn_users = data[data['churn'] == True]
churn_users['international plan'].value_counts(normalize=True).plot(kind='bar');

Out[16]:

The share of customers with the international plan is higher among churned customers than in the full sample.

We can hypothesize that the binary features international plan and churn are correlated. Let's now look at them together.

In [19]:

pd.crosstab(data['churn'], data['international plan'], margins=True)

Out[19]:

In [22]:

sns.countplot(x='international plan', hue='churn', data=data);

Out[22]:

Customers with the international plan churned at a much higher rate than those without it!
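One way to put a number on the association between two binary features is the phi coefficient, which is simply the Pearson correlation of the two features encoded as 0/1. The sketch below uses a tiny hypothetical table (`toy`), not the real dataset, so the resulting value is purely illustrative.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from telecom-churn.csv.
toy = pd.DataFrame({
    'international plan': ['no', 'no', 'no', 'no', 'no', 'no',
                           'yes', 'yes', 'yes', 'yes'],
    'churn': [False, False, False, False, False, True,
              False, True, True, True],
})

# Encode both binary features as 0/1 integers.
has_plan = (toy['international plan'] == 'yes').astype(int)
churned = toy['churn'].astype(int)

# The Pearson correlation of two 0/1 variables is the phi coefficient.
phi = has_plan.corr(churned)
print(f"phi = {phi:.3f}")
```

On the real data the same two lines of encoding would be applied to `data['international plan']` and `data['churn']`.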

In [25]:

data.groupby('international plan')['churn'].count()

Out[25]:

international plan
no     3010
yes     323
Name: churn, dtype: int64
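`count()` only gives the group sizes. Since churn is boolean, its sum per group is the number of churned customers and its mean is the churn rate. A minimal sketch on hypothetical toy data (the table `toy` and the column names `customers`, `churned`, `churn_rate` are introduced here for illustration):

```python
import pandas as pd

# Hypothetical toy data, NOT taken from telecom-churn.csv.
toy = pd.DataFrame({
    'international plan': ['no', 'no', 'no', 'no', 'no', 'no',
                           'yes', 'yes', 'yes', 'yes'],
    'churn': [False, False, False, False, False, True,
              False, True, True, True],
})

# For a boolean column: count = group size, sum = number of True values,
# mean = share of True values (the churn rate within the group).
summary = toy.groupby('international plan')['churn'].agg(['count', 'sum', 'mean'])
summary.columns = ['customers', 'churned', 'churn_rate']
print(summary)
```

A churn rate above 0.5 in the `yes` row is what a statement like "most customers with the plan left" would require.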

Let's look at the distribution of the account length feature.

In [27]:

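# distplot is deprecated since seaborn 0.11; histplot with a KDE is the closest replacement
sns.histplot(data['account length'], kde=True, stat='density');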

Out[27]:

It looks like a normal distribution!
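A visual impression of normality can be backed up with a quick numeric check: for a normal distribution both the skewness and the excess kurtosis are close to zero. The sketch below runs on a synthetic sample, not on the real column; the mean of 101 and standard deviation of 40 are assumptions chosen only to resemble the scale of account length.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the account length column (assumed parameters).
rng = np.random.default_rng(42)
sample = pd.Series(rng.normal(loc=101, scale=40, size=3333),
                   name='synthetic account length')

# Both statistics should be near 0 for normally distributed data.
skewness = sample.skew()
excess_kurtosis = sample.kurt()
print(f"skewness: {skewness:.3f}, excess kurtosis: {excess_kurtosis:.3f}")
```

On the real data the same calls would be `data['account length'].skew()` and `data['account length'].kurt()`; values far from zero would argue against the normality impression.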

What can we say about the relationship between account length and loyalty?

In [28]:

data.groupby('churn')['account length'].mean()

Out[28]:

churn
False    100.793684
True     102.664596
Name: account length, dtype: float64

In [29]:

data.groupby('churn')['account length'].std()

Out[29]:

churn
False    39.88235
True     39.46782
Name: account length, dtype: float64

In [30]:

data.groupby('churn')['account length'].median()

Out[30]:

churn
False    100
True     103
Name: account length, dtype: int64

At first glance, they are unrelated: the mean, standard deviation, and median of account length are nearly identical in both groups.

In [36]:

fig, ax = plt.subplots(1, 2, sharey=True)
# histplot with a KDE replaces the deprecated distplot
sns.histplot(data[data['churn'] == False]['account length'], kde=True,
             stat='density', ax=ax[0]).set_title('Loyal');
sns.histplot(churn_users['account length'], kde=True,
             stat='density', ax=ax[1]).set_title('Churned');

Out[36]:

A second look gives the same answer.
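The same conclusion can be checked roughly from the summary statistics alone. The sketch below plugs the group means and standard deviations from Out[28] and Out[29], together with the group sizes from Out[5], into a Welch-style test statistic. Using the normal approximation for the p-value is an assumption made here to keep the example dependency-free; it is reasonable for groups this large.

```python
import math

# Group sizes from Out[5], means from Out[28], standard deviations from Out[29].
n_loyal, mean_loyal, std_loyal = 2850, 100.793684, 39.88235
n_churn, mean_churn, std_churn = 483, 102.664596, 39.46782

# Standard error of the difference in means (unequal variances).
standard_error = math.sqrt(std_loyal ** 2 / n_loyal + std_churn ** 2 / n_churn)

# Test statistic for the difference in mean account length.
t_stat = (mean_churn - mean_loyal) / standard_error

# Two-sided p-value under the normal approximation (an assumption).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")
```

A statistic below 1 in absolute value means the difference of about two days in mean account length is well within sampling noise.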

Now let's check whether the durations of day and night calls are related.

In [37]:

# Recent seaborn versions require x and y to be passed as keywords
sns.regplot(x='total day minutes', y='total night minutes', data=data);

Out[37]:

What about the number of calls?

In [38]:

# Recent seaborn versions require x and y to be passed as keywords
sns.regplot(x='total day calls', y='total night calls', data=data);

Out[38]:

No relationship is visible so far.
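A flat regression line corresponds to a Pearson correlation coefficient near zero, which can be computed directly. The sketch below uses two independently generated synthetic columns (`day` and `night` are hypothetical names and parameters, not the real data) to show what "no linear relationship" looks like numerically.

```python
import numpy as np
import pandas as pd

# Two independent synthetic columns standing in for day and night minutes.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'day': rng.normal(180, 50, 3333),
    'night': rng.normal(200, 50, 3333),
})

# Pearson correlation: close to 0 when there is no linear relationship.
r = toy['day'].corr(toy['night'])
print(f"r = {r:.3f}")
```

On the real data the equivalent call would be `data['total day minutes'].corr(data['total night minutes'])`.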

Let's build a correlation matrix for the numeric features.

In [54]:

numeric_data = data.select_dtypes(['int64', 'float64'])

In [45]:

numeric_data.head()

Out[45]:

In [60]:

corr_matrix = numeric_data.drop('area code', axis=1).corr()
corr_matrix

Out[60]:

In [62]:

sns.heatmap(corr_matrix);

Out[62]:

In [58]:

sns.pairplot(numeric_data[['total day minutes', 
                           'total day calls', 
                           'total day charge']]);

Out[58]:
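The heatmap and the pair plot can be complemented by listing the strongly correlated pairs explicitly. A minimal sketch on synthetic data, assuming (hypothetically) that the charge is a fixed rate times the minutes; the rate of 0.17 and the 0.9 threshold are illustrative choices, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: charge is an exact multiple of minutes (assumption),
# while the number of calls is generated independently.
rng = np.random.default_rng(0)
n = 500
minutes = rng.normal(180, 50, n)
toy = pd.DataFrame({
    'total day minutes': minutes,
    'total day calls': rng.normal(100, 20, n),
    'total day charge': 0.17 * minutes,
})

corr = toy.corr()

# Keep only the upper triangle so every pair appears once, without the diagonal.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs whose absolute correlation exceeds the (illustrative) threshold.
strong_pairs = pairs[pairs.abs() > 0.9]
print(strong_pairs)
```

Features that are almost perfectly correlated carry the same information, so one of each such pair can usually be dropped before modeling.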
