Big Data and Cloud Computing Study Notes: Data Analysis (Part 1)

dunizb, published 2019-07-30 14:48 / 2844 views

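Python basics

Start with the basics

Points to note

Slicing

Slicing a list uses a half-open interval that is closed on the left and open on the right: [3,6) covers indices 3, 4 and 5, so the element at index 6 (the value 7) is not included in the output.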
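The half-open rule can be checked with a short sketch. The list nums is an assumed example, chosen so that index 6 holds the value 7 as in the note above.

```python
# Assumed example list: index 6 holds the value 7
nums = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# nums[3:6] takes indices 3, 4 and 5; index 6 is excluded
part = nums[3:6]
print(part)       # [4, 5, 6]
print(nums[6])    # 7, the first element left out of the slice
```

A handy consequence of this convention is that len(nums[a:b]) equals b - a whenever both bounds lie inside the list.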

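Modifying elements of a list

Adding and removing elements

Two ways to copy a list

Assign list2 to y: changing y also changes list2

Copy list2 into y: changing y leaves list2 unchanged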
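The original screenshots for the two cases are not available, so here is a minimal sketch of what they presumably showed. The names list2 and y come from the text; list3 and z are hypothetical names added to keep the two cases apart.

```python
# Case 1: plain assignment binds a second name to the same list object
list2 = [1, 2, 3]
y = list2
y.append(4)
print(list2)        # [1, 2, 3, 4], list2 changed as well
print(y is list2)   # True

# Case 2: a slice copy creates a new (shallow) list
list3 = [1, 2, 3]   # hypothetical name for the second example
z = list3[:]        # list3.copy() or list(list3) behave the same way
z.append(4)
print(list3)        # [1, 2, 3], the original is untouched
print(z is list3)   # False
```

Both copies here are shallow: nested lists inside the copy would still be shared, and copy.deepcopy is the usual tool when that matters.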

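The difference between deleting an instance attribute and deleting a dictionary key

a = {"a":1,"b":2}
del a["a"]              # remove a key from a dict

class classname: pass   # placeholder class for illustration
a = classname()
a.attrname = 1
del a.attrname          # remove an attribute from the instance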

with as

https://www.cnblogs.com/DswCn...

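if __name__ == "__main__":

if __name__ == "__main__":

A Python file can be used in two ways:
first, run directly as a script;
second, imported into another Python script and called from there (module reuse).
The check if __name__ == "__main__":
controls which code runs in each of these two cases:
the code under if __name__ == "__main__": runs only in the first case (the file is executed directly as a script),
and it is skipped when the file is imported into another script.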
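A minimal sketch of the pattern follows. The function name greet is a hypothetical example, not something from the original post.

```python
def greet(name):
    """Reusable function: available both to scripts and to importers."""
    return "hello, " + name


if __name__ == "__main__":
    # Runs only when this file is executed directly (python thisfile.py).
    # When another script does `import thisfile`, __name__ is "thisfile"
    # and this block is skipped.
    print(greet("world"))
```

Keeping demo or test code under the guard lets the same file serve as both a runnable script and an importable module.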

Functions / methods, regular expressions

See here for the basics

import re
line = "jwxddxsw33"
if line == "jxdxsw33":
    print("yep")
else:
    print("no")

# ^ anchors the match at the start of the string
regex_str = "^j.*"
if re.match(regex_str, line):
    print("yes")
# $ anchors the match at the end of the string
regex_str1 = "^j.*3$"
if re.match(regex_str1, line):
    print("yes")

# "^j.3$" only matches a three-character string, so nothing is printed here
regex_str1 = "^j.3$"
if re.match(regex_str1, line):
    print("yes")
# Greedy matching: the leading .* consumes as much as it can,
# so the group starts at the last d that still allows a match
regex_str2 = ".*(d.*w).*"
match_obj = re.match(regex_str2, line)
if match_obj:
    print(match_obj.group(1))  # dxsw
# Non-greedy matching
# the ? makes the leading .* stop at the first d
regex_str3 = ".*?(d.*w).*"
match_obj = re.match(regex_str3, line)
if match_obj:
    print(match_obj.group(1))  # ddxsw
# * means zero or more repetitions
# ? after a quantifier switches it to non-greedy mode
# + means at least one repetition, so .+ is any character occurring one or more times
line1 = "jxxxxxxdxsssssswwwwjjjww123"
regex_str3 = ".*(w.+w).*"
match_obj = re.match(regex_str3, line1)
if match_obj:
    print(match_obj.group(1))
# {2} fixes the number of repetitions of the preceding item; {2,} is two or more; {2,5} is at least 2 and at most 5
line2 = "jxxxxxxdxsssssswwaawwjjjww123"
regex_str3 = ".*(w.{3}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

line2 = "jxxxxxxdxsssssswwaawwjjjww123"
regex_str3 = ".*(w.{2}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

line2 = "jxxxxxxdxsssssswbwaawwjjjww123"
regex_str3 = ".*(w.{5,}w).*"
match_obj = re.match(regex_str3, line2)
if match_obj:
    print(match_obj.group(1))

# | means alternation (or)

line3 = "jx123"
regex_str4 = "((jx|jxjx)123)"
match_obj = re.match(regex_str4, line3)
if match_obj:
    print(match_obj.group(1))
    print(match_obj.group(2))
# [] matches any single character listed inside the brackets
line4 = "ixdxsw123"
regex_str4 = "([hijk]xdxsw123)"
match_obj = re.match(regex_str4, line4)
if match_obj:
    print(match_obj.group(1))
# [0-9]{9}: any digit from 0 to 9, repeated 9 times (nine digits)
line5 = "15955224326"
regex_str5 = "(1[234567][0-9]{9})"
match_obj = re.match(regex_str5, line5)
if match_obj:
    print(match_obj.group(1))
# [^1]{9}: any character except 1, repeated 9 times
line6 = "15955224326"
regex_str6 = "(1[234567][^1]{9})"
match_obj = re.match(regex_str6, line6)
if match_obj:
    print(match_obj.group(1))

# [.*]: inside brackets, . and * stand for the literal characters . and *
line7 = "1.*59224326"
regex_str7 = "(1[.*][^1]{9})"
match_obj = re.match(regex_str7, line7)
if match_obj:
    print(match_obj.group(1))

# \s matches a whitespace character
line8 = "你 好"
regex_str8 = r"(你\s好)"
match_obj = re.match(regex_str8, line8)
if match_obj:
    print(match_obj.group(1))

# \S matches any character that is not whitespace
line9 = "你真好"
regex_str9 = r"(你\S好)"
match_obj = re.match(regex_str9, line9)
if match_obj:
    print(match_obj.group(1))

# \w matches a word character; unlike ., in ASCII terms it means [A-Za-z0-9_]
# (for str patterns in Python 3 it also matches Unicode letters and digits)
line9 = "你adsfs好"
regex_str9 = r"(你\w\w\w\w\w好)"
match_obj = re.match(regex_str9, line9)
if match_obj:
    print(match_obj.group(1))

line10 = "你adsf_好"
regex_str10 = r"(你\w\w\w\w\w好)"
match_obj = re.match(regex_str10, line10)
if match_obj:
    print(match_obj.group(1))
# \W (upper case) matches anything that is not a word character
line11 = "你 好"
regex_str11 = r"(你\W好)"
match_obj = re.match(regex_str11, line11)
if match_obj:
    print(match_obj.group(1))

# Unicode range [\u4E00-\u9FA5] matches Chinese characters
line12= "鏡心的小樹屋"
regex_str12= "([\u4E00-\u9FA5]+)"
match_obj = re.match(regex_str12,line12)
if match_obj:
    print(match_obj.group(1))

print("-----greedy match----")
line13 = "reading in 鏡心的小樹屋"
regex_str13 = ".*([\u4E00-\u9FA5]+樹屋)"
match_obj = re.match(regex_str13, line13)
if match_obj:
    print(match_obj.group(1))

print("----non-greedy match----")
line13 = "reading in 鏡心的小樹屋"
regex_str13 = ".*?([\u4E00-\u9FA5]+樹屋)"
match_obj = re.match(regex_str13, line13)
if match_obj:
    print(match_obj.group(1))

# \d matches a digit
line14 = "XXX出生于2011年"
regex_str14 = r".*(\d{4})年"
match_obj = re.match(regex_str14, line14)
if match_obj:
    print(match_obj.group(1))

regex_str15 = r".*?(\d+)年"
match_obj = re.match(regex_str15, line14)
if match_obj:
    print(match_obj.group(1))

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

###
# Exercise: write a regular expression that validates email addresses. Version 1 should accept addresses such as:
# someone@gmail.com
# bill.gates@microsoft.com
###

import re
addr = "someone@gmail.com"
addr2 = "bill.gates@microsoft.com"
def is_valid_email(addr):
    # one or more allowed characters on each side of @, anchored at both ends
    if re.match(r"^[a-zA-Z_.]+@[a-zA-Z.]+$",addr):
        return True
    else:
        return False

print(is_valid_email(addr))
print(is_valid_email(addr2))

# Version 2 extracts the name from an address that carries one:
# <Tom Paris> tom@voyager.org => Tom Paris
# bob@example.com => bob
# Assumption: the name is given in angle brackets before the address,
# as the expected result "Tom Paris" suggests

addr3 = "<Tom Paris> tom@voyager.org"
addr4 = "bob@example.com"

def name_of_email(addr):
    r=re.compile(r"^(<?)([\w\s]*)(>?)([\w\s]*)@([\w.]*)$")
    m = r.match(addr)
    if not m:
        return None
    return m.group(2)

print(name_of_email(addr3))
print(name_of_email(addr4))

Examples

Find the most frequent word in a text

text = "the clown ran after the car and the car ran into the tent and the tent fell down on the clown and the car"
words = text.split()
print(words)

for word in words:  # print each word
    print(word)


# Step 1: build the list of distinct words (de-duplication)
unique_words = list()  # start from an empty list
for word in words:
    if word not in unique_words:  # use in to test whether an element is already in the list
        unique_words.append(word)
print(unique_words)


# Step 2: initialize the list of counts

# [e]*n is a quick way to initialize a list
counts = [0] * len(unique_words)
print(counts)

# Step 3: count the occurrences
for word in words:
    index = unique_words.index(word)

    counts[index] = counts[index] + 1
    print(counts[index])
print(counts)
# Step 4: find the highest count and the word it belongs to
bigcount = None  # None means empty; initialize bigcount
bigword = None

for i in range(len(counts)):
    if bigcount is None or counts[i] > bigcount:
        bigword = unique_words[i]
        bigcount = counts[i]
print(bigword,bigcount)

Using a dictionary:

# Example revisited: find the most frequent word in a text

text = """the clown ran after the car and the car ran into the tent 
        and the tent fell down on the clown and the car"""
words = text.split() # get the list of words

# a dictionary simplifies the steps considerably
# build the word -> count dictionary
counts = dict() # start from an empty dictionary
for word in words:
    counts[word] = counts.get(word, 0) + 1  # get needs the default value 0, so a word seen for the first time ends up with count 1
print(counts)

# look up the most frequent word in the dictionary
bigcount = None
bigword = None
for word,count in counts.items():
    if bigcount is None or count > bigcount:
        bigword = word
        bigcount = count

print(bigword, bigcount)
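As a further shortcut that the original post does not use, the standard library's collections.Counter wraps the same dictionary-counting idea. A minimal sketch on the same text:

```python
from collections import Counter

text = ("the clown ran after the car and the car ran into the tent "
        "and the tent fell down on the clown and the car")

# Counter builds the word -> count mapping in one step
counts = Counter(text.split())

# most_common(1) returns a list holding the single (word, count) pair with the highest count
bigword, bigcount = counts.most_common(1)[0]
print(bigword, bigcount)  # the 7
```

The manual loop is still worth knowing, since it shows what Counter does under the hood.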

A custom weekly pay calculator function

# input() reads a line of text from the keyboard
# a = input("Enter some text: ")
# print("You entered:", a)

def salary_calculator():  # a function without parameters
    user = ""  # initialize user as an empty string
    print("----Pay calculator----")

    while True:
        user = input("\nEnter your name, or enter 0 to end the report: ")

        if user == "0":
            print("Report ended")
            break
        else:
            hours = float(input("Enter the number of hours worked: "))
            payrate = float(input("Enter your hourly pay rate: ￥"))

            if hours <= 40:
                print("Employee name:", user)
                print("Overtime hours: 0")
                print("Overtime pay: ￥0.00")
                regularpay = round(hours * payrate, 2)  # round keeps two decimal places
                print("Gross pay: ￥" + str(regularpay))
            else:
                overtimehours = round(hours - 40, 2)
                print("Employee name: " + user)
                print("Overtime hours: " + str(overtimehours))
                regularpay = round(40 * payrate, 2)
                overtimerate = round(payrate * 1.5, 2)
                overtimepay = round(overtimehours * overtimerate, 2)
                grosspay = round(regularpay + overtimepay, 2)
                print("Regular pay: ￥" + str(regularpay))
                print("Overtime pay: ￥" + str(overtimepay))
                print("Gross pay: ￥" + str(grosspay))

# call salary_calculator

salary_calculator()

In this example, watch out for the small pitfalls of Python's round function
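The post does not spell the pitfalls out. The two usually meant are that round sends exact halves to the nearest even number, and that binary floats cannot represent most decimal fractions exactly. A short sketch, with the decimal module shown as one possible workaround (an addition, not part of the original example):

```python
from decimal import Decimal, ROUND_HALF_UP

# Pitfall 1: exact halves go to the nearest even integer ("banker's rounding")
half_cases = [round(0.5), round(1.5), round(2.5)]
print(half_cases)    # [0, 2, 2]

# Pitfall 2: 2.675 is stored as a binary float slightly below 2.675
float_result = round(2.675, 2)
print(float_result)  # 2.67

# Workaround: Decimal built from a string, with explicit half-up rounding
exact = Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(exact)         # 2.68
```

For money amounts such as the pay figures above, Decimal is often the safer choice when results have to match hand calculations to the cent.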

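Data structures, functions, conditionals and loops; package management

For a list of popular Python packages, see awesome-python

NumPy: array processing / numerical computing extension

ndarray: a multidimensional array object

Data processing with arrays

File input and output for arrays

Multidimensional operations

Linear algebra

Random number generation

Random walks

Advanced NumPy

Internals of the ndarray object

Advanced array manipulation

Broadcasting

Advanced ufunc usage

Structured and record arrays

More on sorting

NumPy's matrix class

Advanced array input and output

Matplotlib: data visualization

Pandas: data analysis

pandas data structures

Basic functionality

Summarizing and computing descriptive statistics

Handling missing data

Hierarchical indexing

Aggregation and grouping

Basic principles of logistic regression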

jupyter

pip3 install jupyter
jupyter notebook

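scipy

Descriptive statistics

Scikit-learn: data mining, machine learning

keras: artificial neural networks

tensorflow: neural networks

Install pip, Python's package management tool, which is mainly used to install packages from PyPI

Installation guide

sudo apt-get install python3-pip
pip3 install numpy
pip3 install scipy
pip3 install matplotlib

Alternatively, download the install script get-pip.py

How to import packages

Because Python is an object-oriented language, the recommended way to import is still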

import numpy
numpy.array([1,2,3])

Data storage, data manipulation, generating data

Generate a two-dimensional array with 5000 elements, each element holding a height and a weight

import numpy as np
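Only the import survives for this step, so what follows is a minimal sketch rather than the author's code. It assumes normally distributed heights and weights; the means, standard deviations and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, assumed for reproducibility
height = rng.normal(1.70, 0.10, 5000)    # assumed mean/std in metres
weight = rng.normal(65.0, 10.0, 5000)    # assumed mean/std in kilograms

# stack the two columns into a (5000, 2) array: one [height, weight] row per person
people = np.column_stack((height, weight))
print(people.shape)   # (5000, 2)
print(people[:3])     # first three [height, weight] rows
```

Column 0 can then be read back as people[:, 0] and column 1 as people[:, 1].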

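Generate 1000 longitude/latitude positions close to (117, 32) and write them out as a CSV file

import pandas as pd
import numpy as np

# any number of equal-length arrays
lng = np.random.normal(117,0.20,1000)

lat = np.random.normal(32.00,0.20,1000)

# the dictionary keys become the column names in the CSV
dataframe = pd.DataFrame({"lng":lng,"lat":lat})


# save the DataFrame as CSV; index controls whether row labels are written, default=True
# the data/ directory must already exist
dataframe.to_csv("data/lng-lat.csv",index = False, sep="," )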

Common NumPy operations

#encoding=utf-8
import numpy as np
def main():
    lst = [[1,3,5],[2,4,6]]
    print(type(lst))
    np_lst = np.array(lst)
    print(type(np_lst))
    # a single numpy.array holds exactly one data type
    # the dtype can be set explicitly
    # available types include: bool int int8 int16 int32 int64 uint8 uint16 uint32 uint64 float16/32/64 complex64/128
    np_lst = np.array(lst,dtype=np.float64)  # the np.float alias was removed in NumPy 1.24

    print(np_lst.shape)
    print(np_lst.ndim)  # number of dimensions
    print(np_lst.dtype)  # data type
    print(np_lst.itemsize)  # size of each element in bytes
    print(np_lst.size)  # total number of elements

    # numpy array
    print(np.zeros([2,4]))  # a 2-row, 4-column array of zeros
    print(np.ones([3,5]))

    print("---------Random numbers: rand-------")
    print(np.random.rand(2,4))  # rand draws uniform random numbers in [0, 1); here a 2*4 array
    print(np.random.rand())
    print("---------Random numbers: randint-------")
    print(np.random.randint(1,10))  # a random integer from 1 to 9 (the upper bound 10 is excluded)
    print(np.random.randint(1,10,3))  # three random integers from 1 to 9
    print("---------Random numbers: randn (standard normal distribution)-------")
    print(np.random.randn(2,4))  # a 2-row, 4-column array of standard normal random numbers
    print("---------Random numbers: choice-------")
    print(np.random.choice([10,20,30]))  # pick one value at random from 10, 20 and 30
    print("---------Distributions-------")
    print(np.random.beta(1,10,100))  # 100 samples from a beta distribution
if __name__ == "__main__":
    main()

Examples of common functions

Compute the mean of every attribute in the red wine data (that is, the mean of each column)
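The wine file itself is not included in the post, so the sketch below uses a tiny hypothetical array as a stand-in. The point is the axis=0 argument, which averages down each column.

```python
import numpy as np

# hypothetical stand-in for the wine table: 3 samples (rows) x 2 attributes (columns)
wine = np.array([[7.4, 0.70],
                 [7.8, 0.88],
                 [11.2, 0.28]])

# axis=0 collapses the rows, leaving one mean per column
col_means = wine.mean(axis=0)
print(col_means)   # approximately [8.8, 0.62]
```

With the real data, the array would typically come from np.loadtxt or pandas.read_csv, with delimiter and header options that depend on the file.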

Data analysis tools: data visualization

Exploring data
Presenting data
Data ---> story

matplotlib plotting basics

Plotting function curves

Setting figure details

Case study: visualizing sales records

Bar charts

Multiple plots

Pie charts

Scatter plots

Histograms

seaborn: a data visualization package

Scatter plots for categorical data

Box plots for categorical data

Multivariate plots

For more, see: Data visualization

Installing matplotlib

Note that the following error may be raised here

ImportError: No module named "_tkinter", please install the python3-tk package

The fix is to install python3-tk

More examples: line plots

Scatter plots & bar charts

Data analysis

pandas

High-level data operations

The DataFrame data structure

import pandas as pd
brics = pd.read_csv("/home/wyc/study/python_lession/python_lessions/數據分析/brics.csv",index_col = 0)

Basic pandas operations


import numpy as np
import pandas as pd

def main():

    #Data Structure
    s = pd.Series([i*2 for i in range(1,11)])
    print(type(s))

    dates = pd.date_range("20170301",periods=8)
    df = pd.DataFrame(np.random.randn(8,5),index=dates,columns=list("ABCDE"))
    print(df)
    # basic

    print(df.head(3))
    print(df.tail(3))
    print(df.index)
    print(df.values)
    print(df.T)
    print(df.sort_values(by="C"))  # sort_values replaces the removed df.sort(columns=...)
    print(df.sort_index(axis=1,ascending=False))
    print(df.describe())

    #select
    print(type(df["A"]))
    print(df[:3])
    print(df["20170301":"20170304"])
    print(df.loc[dates[0]])
    print(df.loc["20170301":"20170304",["B","D"]])
    print(df.at[dates[0],"C"])


    print(df.iloc[1:3,2:4])
    print(df.iloc[1,4])
    print(df.iat[1,4])

    print(df[(df.B>0) & (df.A<0)])  # combine both conditions in a single boolean mask
    print(df[df>0])
    print(df[df["E"].isin([1,2])])

    # Set
    s1 = pd.Series(list(range(10,18)),index = pd.date_range("20170301",periods=8))
    df["F"]= s1
    print(df)
    df.at[dates[0],"A"] = 0
    print(df)
    df.iat[1,1] = 1
    df.loc[:,"D"] = np.array([4]*len(df))
    print(df)

    df2 = df.copy()
    df2[df2>0] = -df2
    print(df2)

    # Missing Value
    df1 = df.reindex(index=dates[:4],columns = list("ABCD") + ["G"])
    df1.loc[dates[0]:dates[1],"G"]=1
    print(df1)
    print(df1.dropna())
    print(df1.fillna(value=1))

    # Statistic
    print(df.mean())
    print(df.var())

    s = pd.Series([1,2,4,np.nan,5,7,9,10],index=dates)
    print(s)
    print(s.shift(2))
    print(s.diff())
    print(s.value_counts())
    print(df.apply(np.cumsum))
    print(df.apply(lambda x:x.max()-x.min()))

    #Concat
    pieces = [df[:3],df[-3:]]
    print(pd.concat(pieces))

    left = pd.DataFrame({"key":["x","y"],"value":[1,2]})
    right = pd.DataFrame({"key":["x","z"],"value":[3,4]})
    print("LEFT",left)
    print("RIGHT", right)
    print(pd.merge(left,right,on="key",how="outer"))
    df3 = pd.DataFrame({"A": ["a","b","c","b"],"B":list(range(4))})
    print(df3.groupby("A").sum())



if __name__ == "__main__":
    main()

# first create a dictionary called gdp
gdp = {"country":["United States", "China", "Japan", "Germany", "United Kingdom"],
       "capital":["Washington, D.C.", "Beijing", "Tokyo", "Berlin", "London"],
       "population":[323, 1389, 127, 83, 66],
       "gdp":[19.42, 11.8, 4.84, 3.42, 2.5],
       "continent":["North America", "Asia", "Asia", "Europe", "Europe"]}

import pandas as pd
gdp_df = pd.DataFrame(gdp)
print(gdp_df)

# the index option adds custom row labels
# the columns option chooses the order of the columns
gdp_df = pd.DataFrame(gdp, columns = ["country", "capital", "population", "gdp", "continent"],index = ["us", "cn", "jp", "de", "uk"])
print(gdp_df)

# changing the row and column labels
# index and columns can also be assigned directly
gdp_df.index=["US", "CN", "JP", "DE", "UK"]
gdp_df.columns = ["Country", "Capital", "Population", "GDP", "Continent"]
print(gdp_df)
# add a rank column marking these countries as the top 5 by GDP
gdp_df["rank"] = "Top5 GDP"
# add a land-area variable, in millions of square kilometers (source: http://data.worldbank.org/)
gdp_df["Area"] = [9.15, 9.38, 0.37, 0.35, 0.24]
print(gdp_df)


# a minimal Series
series = pd.Series([2,4,5,7,3],index = ["a","b","c","d","e"])
print(series)
# accessing a column with the dot operator returns a pandas Series
# dot access keeps the boolean filters further down short
# US,...,UK are the index labels
print(gdp_df.GDP)


# the index can be inspected directly
print(gdp_df.GDP.index)
# the type is pandas.core.series.Series
print(type(gdp_df.GDP))

# returns a boolean Series; the boolean indexing on DataFrames shown later relies on this heavily
print(gdp_df.GDP > 4)

# a Series can also be seen as a fixed-length, ordered dictionary, and some dictionary operations work on it too
gdp_dict = {"US": 19.42, "CN": 11.80, "JP": 4.84, "DE": 3.42, "UK": 2.5}
gdp_series = pd.Series(gdp_dict)
print(gdp_series)

# check whether the label "US" is in gdp_series

print("US" in gdp_series)
# select a column with the variable name plus [[ ]]
print(gdp_df[["Country"]])
# several columns can be selected at once
print(gdp_df[["Country", "GDP"]])


# a single [] returns a Series
print(type(gdp_df["Country"]))
# selecting rows works much like on a 2-D array
# with [], slicing is the only way to select rows
print(gdp_df[2:5])  # the end index is excluded!

# the loc method
# the example above selected rows by position; can rows be selected by label instead?
# loc is exactly the label-based way to select data
print(gdp_df.loc[["JP","DE"]])
# the example above returned all columns
# the wanted column labels can be added
print(gdp_df.loc[["JP","DE"],["Country","GDP","Continent"]])

# use : to select all rows
print(gdp_df.loc[:,["Country","GDP","Continent"]])

# iloc selects by integer position; this is equivalent to gdp_df.loc[["JP","DE"]]
print(gdp_df.iloc[[2,3]])

print(gdp_df.loc[["JP","DE"],["Country", "GDP", "Continent"]])
print(gdp_df.iloc[[2,3],[0,3,4]])

# select the Asian countries; the next two statements give the same result
print(gdp_df[gdp_df.Continent == "Asia"])

print(gdp_df.loc[gdp_df.Continent == "Asia"])
# select the European countries with a GDP above 3 trillion US dollars
print(gdp_df[(gdp_df.Continent == "Europe") & (gdp_df.GDP > 3)])

Missing-value handling, data mining

Case study: the Iris data set
Let us look at the classic iris data:

The iris flower data set, from the UCI Machine Learning Repository

It was originally collected by Edgar Anderson

Four features are used for the quantitative analysis of the samples: the length and width of the sepal and of the petal

#####
# Loading and inspecting the data
#####
import pandas as pd
# store the column labels in a list
col_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
# read the data and assign a label to every column
iris = pd.read_csv("data/iris.txt", names = col_names)

# use head/tail to look at the first and last rows

print(iris.head(10))

# use info to get an overview of the data
iris.info()

# shape gives the number of rows and columns of the DataFrame
# iris has 150 observations and 5 variables
print(iris.shape)
# species is a categorical variable
# unique lists the species names that occur in the Series
print(iris.species.unique())


# count the observations per species
# using the value_counts method of the Series
print(iris.species.value_counts())

# select the petal data, that is, the columns petal_length and petal_width
# option 1: use [[ ]]
petal = iris[["petal_length","petal_width"]]
print(petal.head())
# option 2: use .loc[ ]
petal = iris.loc[:,["petal_length","petal_width"]]
print(petal.head())
# option 3: use .iloc[ ]
petal = iris.iloc[:,2:4]
print(petal.head())

# select the rows with index 5 to 10
# option 1: use []
print(iris[5:11])
# option 2: use .iloc[]
print(iris.iloc[5:11,:])

# select the rows whose species is Iris-versicolor
versicolor = iris[iris.species == "Iris-versicolor"]
print(versicolor.head())


####
# Visualizing the data
####
# scatter plot
import matplotlib.pyplot as plt
# start with a scatter plot: petal length on the x axis, petal width on the y axis
# what do we observe?
iris.plot(kind = "scatter", x="petal_length", y="petal_width")
# plt.show()

# use boolean indexing to get the data of each of the three species
setosa = iris[iris.species == "Iris-setosa"]
versicolor = iris[iris.species == "Iris-versicolor"]
virginica = iris[iris.species == "Iris-virginica"]

ax = setosa.plot(kind="scatter", x="petal_length", y="petal_width", color="Red", label="setosa", figsize=(10,6))
versicolor.plot(kind="scatter", x="petal_length", y="petal_width", color="Green", ax=ax, label="versicolor")
virginica.plot(kind="scatter", x="petal_length", y="petal_width", color="Orange", ax=ax, label="virginica")
plt.show()

# box plot
# mean() gives the mean petal width
print(iris.petal_width.mean())
# median() gives the median petal width
print(iris.petal_width.median())
# describe summarizes the numeric variables
print(iris.describe())


# draw a box plot of the petal width
# a box plot shows the median, the quartiles, the maximum and the minimum of the data
iris.petal_width.plot(kind="box")
# plt.show()

# group by species and draw one petal-width box plot per species
iris[["petal_width","species"]].boxplot(grid=False,by="species",figsize=(10,6))
# plt.show()

print(setosa.describe())

# For each species, what are the minimum and the mean of every attribute (sepal and petal length and width)? (Hint: use the min and mean methods.)
print(iris.groupby(["species"]).agg(["min","mean"]))

# For each species, count the observations with a sepal length (sepal_length) above 6 cm.
# option 1
print(iris[iris["sepal_length"]> 6].groupby("species").size())
# option 2
def more_len(group,length=6):
    return len(group[group["sepal_length"] > length])
print(iris.groupby(["species"]).apply(more_len,6))
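The two counting options above do not behave identically when a group has no matching rows. The sketch below shows this on a tiny hypothetical table (the species names a and b and the lengths are made up for illustration): filtering first drops empty groups, while counting inside each group reports them as 0.

```python
import pandas as pd

# hypothetical mini data set: species "a" has no sepal longer than 6
df = pd.DataFrame({
    "species": ["a", "a", "b", "b"],
    "sepal_length": [5.0, 5.5, 6.5, 7.0],
})

# Option 1: filter, then group; groups without matches disappear
filtered_counts = df[df["sepal_length"] > 6].groupby("species").size()
print(filtered_counts)

# Option 2: group, then count matches inside each group; empty groups show 0
grouped_counts = df.groupby("species")["sepal_length"].apply(lambda s: int((s > 6).sum()))
print(grouped_counts)
```

Which behaviour is preferable depends on whether a zero row is wanted in the report.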

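Missing-value handling, pivot tables

Missing-value handling: the fillna() method in pandas

pandas uses NaN (not a number) to represent missing data. Missing data can be handled in the following ways:

dropna removes the NaN entries

fillna fills them with a default value

isnull returns an object of booleans that shows which entries are NaN and which are not

notnull is the negation of isnull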
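The four helpers listed above can be tried on a small Series. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

dropped = s.dropna()        # keeps only the non-missing entries
filled = s.fillna(0)        # replaces every NaN with the default value 0
mask_null = s.isnull()      # True where the entry is NaN
mask_notnull = s.notnull()  # the exact opposite of isnull

print(dropped.tolist())     # [1.0, 3.0]
print(filled.tolist())      # [1.0, 0.0, 3.0, 0.0]
print(mask_null.tolist())   # [False, True, False, True]
```

None of these calls changes s itself; each returns a new object, so the result has to be assigned if it is to be kept.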

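Pivot tables: the pivot_table function in pandas

The Titanic data serves as the case study for both topics
Missing-value handling:

In real data, some variables often have missing values.

Here, Cabin has more than 70% missing values, so we can consider dropping the variable altogether. -- delete a column

An important variable such as Age has about 20% missing values, which we can consider filling with the median. -- fill in missing values

We generally advise against dropping rows that contain missing values, because their non-missing variables may still carry useful information. -- delete rows with missing values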

# load the usual packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# read the data
titanic_df = pd.read_csv("data/titanic.csv")

# look at the first five rows
print(titanic_df.head())

# summary statistics
# describe shows the distribution of selected variables
# Survived is a 0-1 variable, so its mean is the share of survivors; a very handy trick
print(titanic_df[["Survived","Age","SibSp","Parch"]].describe())

# use include=[object] to look at the categorical variables
# (the np.object alias was removed in NumPy 1.24)
# count: number of non-missing values
# unique: number of distinct values
# top: most frequent value
# freq: number of occurrences of the most frequent value

print(titanic_df.describe(include=[object]))

# How are passengers distributed over the classes?
# option 1: value_counts
# distribution over the classes
# first class: 24%; second class: 21%; third class: 55%
# value_counts counts frequencies, len() gives the number of rows
print(titanic_df.Pclass.value_counts() / len(titanic_df))
# there are 891 passengers in total
# Age has 714 non-missing values, Cabin only 204; handling the missing values is covered below
titanic_df.info()

# option 2: groupby
# sort_values sorts the result
print((titanic_df.groupby("Pclass").agg("size")/len(titanic_df)).sort_values(ascending=False))

# Fill the missing values in the age data
# first approach: fill with the median age of all passengers
# before filling, look at the statistics of the Age column
print(titanic_df.Age.describe())

# reload the original data
titanic_df=pd.read_csv("data/titanic.csv")

# compute the median age of all passengers
age_median1 = titanic_df.Age.median()

# fill the missing values with fillna and assign the result back to the column
# (calling fillna with inplace=True on titanic_df.Age is unreliable in recent pandas versions)
titanic_df["Age"] = titanic_df["Age"].fillna(age_median1)
# look at the statistics of the Age column
print(titanic_df.Age.describe())
#print(titanic_df.info())

# take sex into account: fill with the median age of male and female passengers respectively
# reload the original data
titanic_df=pd.read_csv("data/titanic.csv")
# group by sex and compute the median age; the result is a Series indexed by Sex
age_median2 = titanic_df.groupby("Sex").Age.median()
# make Sex the index
titanic_df.set_index("Sex",inplace=True)
# fill the missing values with fillna, matched by index value
titanic_df["Age"] = titanic_df["Age"].fillna(age_median2)
# reset the index, that is, undo the Sex index
titanic_df.reset_index(inplace=True)
# look at the statistics of the Age column
print(titanic_df.Age.describe())

# take both sex and passenger class into account

# reload the original data
titanic_df=pd.read_csv("data/titanic.csv")
# group by class and sex and compute the median age; the result is a Series indexed by Pclass, Sex
age_median3 = titanic_df.groupby(["Pclass", "Sex"]).Age.median()
# make Pclass and Sex the index; inplace=True modifies titanic_df directly
titanic_df.set_index(["Pclass","Sex"], inplace=True)
print(titanic_df)

# fill the missing values with fillna, matched by index value
titanic_df["Age"] = titanic_df["Age"].fillna(age_median3)
# reset the index, that is, undo the Pclass, Sex index
titanic_df.reset_index(inplace=True)

# look at the statistics of the Age column
print(titanic_df.Age.describe())

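Discretizing continuous variables

Discretizing continuous variables is a common technique in modelling

Discretization splits the range of a variable into several smaller intervals; observations that fall into the same interval are represented by the same symbol

Take age as an example: the minimum is 0.42 (an infant) and the maximum is 80. To produce a variable with five levels, we can use the cut or the qcut function

cut splits the age range into 5 intervals of equal width, whereas qcut chooses the intervals so that each one holds the same number of observations (quintiles). The demonstration here uses cut.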
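The difference between the two functions shows up clearly on a small sample. The ten ages below are made-up values spanning the same range as the text (0.42 to 80), not the Titanic data.

```python
import pandas as pd

# hypothetical ages covering the range mentioned above
ages = pd.Series([0.42, 5, 18, 22, 25, 30, 35, 40, 60, 80])

# cut: 5 intervals of equal width, so the counts per interval differ
equal_width = pd.cut(ages, 5)
print(equal_width.value_counts(sort=False))

# qcut: interval edges placed at the quintiles, so each interval holds 2 of the 10 ages
equal_count = pd.qcut(ages, 5)
print(equal_count.value_counts(sort=False))
```

Equal-width bins are easier to read, while equal-count bins avoid nearly empty groups when the data is skewed.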


###
# Which factors determine the chance of survival?
###

# passenger class and survival rate
# compute the survival rate per class
# option 1: the classic group - aggregate - compute
# note: Survived is a 0-1 variable, so its mean is the share of survivors
print(titanic_df[["Pclass", "Survived"]].groupby("Pclass").mean()
    .sort_values(by="Survived", ascending=False))

# option 2: the pivot_table function does the same job
# values: the values that are aggregated; here we apply the mean
# index: the grouping variable
# aggfunc: the function to apply
print(titanic_df.pivot_table(values="Survived", index="Pclass", aggfunc="mean"))

# bar chart of class against survival rate
# sns.barplot draws the bars; the y axis shows the point estimate of the Survived mean
#sns.barplot(data=titanic_df,x="Pclass",y="Survived",ci=None)
# plt.show()

#####
# sex and survival rate
#####
# option 1: groupby
print(titanic_df[["Sex", "Survived"]].groupby("Sex").mean()
    .sort_values(by="Survived", ascending=False))
# option 2: pivot_table
print(titanic_df.pivot_table(values="Survived",index="Sex",aggfunc="mean"))

# bar chart
#sns.barplot(data=titanic_df,x="Sex",y="Survived",ci=None)
#plt.show()


#####
# class and sex combined, against the survival rate
#####
# option 1: groupby
print(titanic_df[["Pclass","Sex", "Survived"]].groupby(["Pclass", "Sex"]).mean())

# option 2: pivot_table
print(titanic_df.pivot_table(values="Survived", index=["Pclass", "Sex"], aggfunc="mean"))

# option 3: pivot_table
# columns names a second categorical variable, laid out across the columns instead of down the rows, hence the name columns
print(titanic_df.pivot_table(values="Survived",index="Pclass",columns="Sex",aggfunc="mean"))

# bar chart: use sns.barplot
#sns.barplot(data=titanic_df,x="Pclass",y="Survived",hue="Sex",ci=None)
# plt.show()

# line chart: use sns.pointplot
sns.pointplot(data=titanic_df,x="Pclass",y="Survived",hue="Sex",ci=None)
#plt.show()

####
# age and survival
####
# unlike the categorical variables class and sex above, age is a continuous variable

# histograms of the age distribution for survivors and for victims
# FacetGrid().map() from seaborn produces good-looking figures quickly
# col="Survived" puts the two panels (survived / died, by age) side by side in one row
# density=True replaces the normed=True argument that was removed from matplotlib
sns.FacetGrid(titanic_df,col="Survived").map(plt.hist,"Age",bins=20,density=True)
# plt.show()


###
# discretizing the continuous variable
###
# we use the cut function
# every interval has the same fixed width, roughly 16 years

titanic_df["AgeBand"] = pd.cut(titanic_df["Age"],5)
print(titanic_df.head())

# number of passengers per age interval
# option 1: value_counts(); sort=False leaves the result unsorted
print(titanic_df.AgeBand.value_counts(sort=False))

# option 2: pivot_table
print(titanic_df.pivot_table(values="Survived",index="AgeBand",aggfunc="count"))

# survival rate per age interval
print(titanic_df.pivot_table(values="Survived",index="AgeBand",aggfunc="mean"))
sns.barplot(data=titanic_df,x="AgeBand",y="Survived",ci=None)
plt.xticks(rotation=60)
plt.show()


####
# age, sex and survival rate
####
# survival rate of men and women per age interval
print(titanic_df.pivot_table(values="Survived",index="AgeBand", columns="Sex", aggfunc="mean"))

sns.pointplot(data=titanic_df, x="AgeBand", y="Survived", hue="Sex", ci=None)
plt.xticks(rotation=60)

plt.show()

####
# age, class, sex and survival rate
####
print(titanic_df.pivot_table(values="Survived",index="AgeBand", columns=["Sex", "Pclass"], aggfunc="mean"))



# recap: sns.pointplot of class and sex against the survival rate
sns.pointplot(data=titanic_df, x="Pclass", y="Survived", hue="Sex", ci=None)

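Artificial neural networks

https://keras.io

Machine learning: feature engineering

What exactly is feature engineering?

Case study: bike-sharing demand
Feature engineering

Data and features determine the upper limit of what machine learning can achieve; a good model merely gets close to that limit

The goal is to extract as much useful information as possible from the raw data, since raw data often cannot be used directly as model variables.

Feature engineering is the process of using domain knowledge about the data to create features that let machine learning algorithms perform at their best.

Handling date variables

Take datetime as an example: this feature carries two important pieces of information, the date and the time of day. From the date we can further derive the month and the day of the week.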

# Which factors determine the number of rentals?
# import the data analysis packages
import numpy as np
import pandas as pd

# import the plotting packages
import matplotlib.pyplot as plt
import seaborn as sns

# import the tools for handling dates and times
import calendar
from datetime import datetime

# read the data
BikeData = pd.read_csv("data/bike.csv")


#####
# get to know the size of the data
# look at the first / last rows
# check data types and missing values
####
# step 1: size of the data

print(BikeData.shape)

# step 2: the first 10 rows
print(BikeData.head(10))


# step 3: data types and missing values
# most variables are integers; temperature and wind speed are floats
# datetime has type object and is processed further below
# there are no missing values!
BikeData.info()


####
# handling the date variable
####

# take one element of datetime (index 1) as an example; it is a string, so split can take it apart
# date plus timestamp is a very common data format
ex = BikeData.datetime[1]
print(ex)

print(type(ex))

# split the string
ex.split()

# get the date part
ex.split()[0]

# first the date: define a function that uses split to separate the date+timestamp string into a date and a time
def get_date(x):
    return(x.split()[0])

# use the pandas apply method to run get_date over datetime
BikeData["date"] = BikeData.datetime.apply(get_date)

print(BikeData.head())

# derive the rental hour (24-hour clock)
# getting the hour needs one more split
print(ex.split()[1])
# ":" is the separator
print(ex.split()[1].split(":")[0])

# wrap the steps above in a function get_hour and apply it to the datetime feature
# note that the hour is returned as a string such as "01"
def get_hour(x):
    return (x.split()[1].split(":")[0])
# use apply to get the hour for the whole column
BikeData["hour"] = BikeData.datetime.apply(get_hour)

print(BikeData.head())

####
# derive the day of the week for each date
####
# calendar.day_name lists Monday to Sunday
print(calendar.day_name[:])

# get the date as a string
dateString = ex.split()[0]

# use strptime from datetime to convert the string into a date-time object
# note that datetime here is the class imported from the datetime module, not the column of our DataFrame
# "%Y-%m-%d" states that the input is ordered year-month-day; month-day-year input also occurs in practice
print(dateString)
dateDT = datetime.strptime(dateString,"%Y-%m-%d")
print(dateDT)
print(type(dateDT))

# then use the weekday method to get the day of the week
# it is an integer from 0 to 6: Monday is 0, Sunday is 6
week_day = dateDT.weekday()

print(week_day)
# map the weekday number to its name
print(calendar.day_name[week_day])


# now combine the steps above into one function that returns the weekday
def get_weekday(dateString):
    week_day = datetime.strptime(dateString,"%Y-%m-%d").weekday()
    return (calendar.day_name[week_day])

# use apply to get the weekday for the whole date column
BikeData["weekday"] = BikeData.date.apply(get_weekday)

print(BikeData.head())


####
# derive the month for each date
####

# following the same pattern, we can extract the month of each date
# note: month is an attribute, not a function, so it takes no parentheses

def get_month(dateString):
    return (datetime.strptime(dateString,"%Y-%m-%d").month)
# use apply to get the month for the whole date column
BikeData["month"] = BikeData.date.apply(get_month)
print(BikeData.head())
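The apply-based steps above can also be written with pandas' own datetime support. This is an alternative sketch, not the approach used in the post: it parses the column once with pd.to_datetime and reads the parts through the .dt accessor. The two-row frame is hypothetical stand-in data in the same "YYYY-MM-DD HH:MM:SS" format.

```python
import pandas as pd

# hypothetical stand-in for the datetime column of the bike data
bikes = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"]})

# parse the strings once, then derive each feature from the parsed timestamps
ts = pd.to_datetime(bikes["datetime"])
bikes["date"] = ts.dt.date           # datetime.date objects
bikes["hour"] = ts.dt.hour           # integers 0-23 rather than strings like "01"
bikes["weekday"] = ts.dt.day_name()  # "Monday" ... "Sunday"
bikes["month"] = ts.dt.month         # integers 1-12

print(bikes)
```

One practical difference: hour comes out as an integer here, so it sorts numerically on a plot axis.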

####
# visualization examples
####

# box plot of the rental count, and box plots of the count by hour of the day
# set the figure size
fig = plt.figure(figsize=(18,5))

# add the first subplot
# box plot of the rental count
ax1 = fig.add_subplot(121)
sns.boxplot(data=BikeData,y="count")
ax1.set(ylabel="Count",title="Box Plot On Count")


# add the second subplot
# box plots of the rental count by hour
# business insight: how does the number of rentals change over the day?
ax2 = fig.add_subplot(122)
sns.boxplot(data=BikeData,y="count",x="hour")
ax2.set(xlabel="Hour",ylabel="Count",title="Box Plot On Count Across Hours")
plt.show()

Machine learning

Machine learning is a branch of artificial intelligence. Its goal is to use algorithms to build models from existing data (learning) in order to solve problems.

Machine learning is an interdisciplinary field that draws on probability and statistics, optimization, computer programming and more.

It is used very widely: from predicting credit card default risk and the five-year survival probability of cancer patients to self-driving cars, machine learning is involved.

It receives a great deal of attention: in decision analysis, people increasingly use a quantitative approach to judge how good a decision is.

Supervised learning:

Supervised learning: a function is learned from a given training data set, and when new data arrives, the function is used to predict the result. The training data must contain both inputs and outputs, in other words features and targets.

Supervised learning splits further into two main kinds of problem: prediction (regression) and classification. House price prediction is a typical prediction problem: the target, the house price, is a continuous variable. Credit card default prediction is a typical classification problem: the target, whether a default occurs, is a categorical variable.
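The two problem types can be contrasted in a few lines of scikit-learn. This is a minimal sketch on made-up numbers, not real house prices or credit data; the choice of LinearRegression and LogisticRegression is an assumption made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Prediction (regression): the target is continuous, here y = 2x + 1
X_reg = np.array([[0.0], [1.0], [2.0], [3.0]])
y_reg = 2 * X_reg.ravel() + 1
reg = LinearRegression().fit(X_reg, y_reg)
pred_value = reg.predict([[4.0]])[0]
print(pred_value)    # close to 9.0

# Classification: the target is a category (0 or 1)
X_clf = np.array([[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]])
y_clf = np.array([0, 0, 0, 0, 1, 1, 1, 1])
clf = LogisticRegression().fit(X_clf, y_clf)
pred_labels = clf.predict([[1.5], [11.5]])
print(pred_labels)   # one label per input row
```

In both cases the model is fitted on features together with known targets, which is what makes the setting supervised.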

Unsupervised learning

Unsupervised learning: the training set has no human-annotated results. We explore the patterns in the input data itself.

Examples of unsupervised learning include image clustering, topic modelling of articles, gene sequence analysis and dimensionality reduction of high-dimensional data.
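As a small illustration of learning without labels, the sketch below clusters six made-up 2-D points with k-means. The data, the choice of KMeans and its parameters are assumptions for the example; the post itself does not prescribe an algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# six unlabeled points forming two obvious groups
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# ask for 2 clusters; n_init and random_state are fixed for repeatable runs
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_
print(labels)   # the first three points share one label, the last three the other
```

The numeric label given to each cluster is arbitrary; only the grouping itself is meaningful.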

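Case study: Boston-area house prices
Note that the Boston house-price data is one of the toy datasets in scikit-learn and could be loaded directly with datasets.load_boston() (this function was removed in scikit-learn 1.2, so it is only available in older releases)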

Learning resources

Machine learning tutorials and notes
https://www.datacamp.com/
http://matplotlib.org/2.1.0/g...
https://www.kesci.com/
https://keras.io

Competitions

https://www.kaggle.com/
Tianchi Big Data Competition compared with Kaggle and DataCastle: which is better?
Tianchi beginner practice competition
