Heigth_Weights Data Exploratory¶

1. glance of data¶

import pandas as pd

df = pd.read_csv('01_heights_weights_genders.csv')
df.head()

2. Distribution of height and weight¶

2.1 Histogram and density¶

%matplotlib inline
import seaborn as sns
sns.set(rc={"figure.figsize": (8, 6)})
sns.distplot(df['Height'])

<matplotlib.axes.AxesSubplot at 0x7a42130>

sns.distplot(df['Weight']);

As seen in the two density plot, there's hiden pattern in the data. So the third varible is added to the plot.

sns.distplot(df[df['Gender']=='Male']['Height'],hist=False,label='Male');
sns.distplot(df[df['Gender']=='Female']['Height'],hist=False,label="Female");

Now, it's clear that both male and female have a bell curve and it also expleained why the total curve has two peak.

sns.distplot(df[df['Gender']=='Male']['Weight'],hist=False,label='Male');
sns.distplot(df[df['Gender']=='Female']['Weight'],hist=False,label="Female");

3. Correlation between weight and height¶

sns.regplot(x="Height", y="Weight", data=df)

<matplotlib.axes.AxesSubplot at 0x7d5eeb0>

D:\Python27\lib\site-packages\matplotlib\collections.py:548: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == 'face':

As shown in the plot, height and weight have a strong correlation.

import scipy
p = scipy.stats.pearsonr(df['Height'],df['Weight'])
print "pearson correlation coeff = %s" % str(p[0])

pearson correlation coeff = 0.924756298741

And next is a scatterplot of heights versus weights colored by gender with linear fit.

from ggplot import *
from sklearn.linear_model import LogisticRegression   
import numpy as np
import matplotlib.pyplot as plt

def score_to_numeric(x):
    if x=='Female':
        return 1
    if x=='Male':
        return 2

df['Gender_copy'] = df['Gender'].apply(score_to_numeric)
classifier = LogisticRegression()  # 使用类，参数全是默认的  
classifier.fit(df[['Height',"Weight"]], df['Gender_copy'])  # 训练数据来学习，不需要返回值  
xx, yy = np.mgrid[min(df['Height']):max(df['Height']):1, min(df['Weight']):max(df['Weight']):1]
grid = np.c_[xx.ravel(), yy.ravel()]
probs = classifier.predict_proba(grid)[:, 1].reshape(xx.shape)
g = sns.lmplot(x="Height", y="Weight", hue='Gender', data=df, fit_reg=False)
plt.contour(xx, yy, probs, levels=[.5], cmap="Greys", vmin=0, vmax=.6)
plt.figure(figsize=(20, 10))

<matplotlib.figure.Figure at 0x99b39b0>

<matplotlib.figure.Figure at 0x99b39b0>

The line we’ve drawn has a very fancy-sounding name: the “separating hyperplane.”It’s a “separating” hyperplane because it splits the data into two groups: on one side, you guess that someone is female given her height and weight, and on the other side, you guess that he’s male. This is a pretty good way to make guesses

	Gender	Height	Weight
0	Male	73.847017	241.893563
1	Male	68.781904	162.310473
2	Male	74.110105	212.740856
3	Male	71.730978	220.042470
4	Male	69.881796	206.349801