Using Machine Learning to Predict the Weather (Part 1)

Posted by liukai90 on 2019-07-30 15:13 / 3,699 views

Overview

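This is the first part of a tutorial series on predicting the weather with machine learning. We use Python and machine learning to build models that predict temperature from data collected from Weather Underground. The series consists of three parts, covering the following topics: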

Data collection and processing (this article)

Linear regression models (Part 2)

Neural network models (Part 3)

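The data used in this tutorial is collected from the free tier of Weather Underground's API service. I will use Python's requests library to call the API and retrieve weather data for Lincoln, Nebraska from 2015 onward. Once collected, the data needs to be processed, aggregated into a suitable format, and then cleaned.
The second article focuses on analyzing trends in the data, with the goal of selecting appropriate features and building a linear regression model using Python's statsmodels and scikit-learn libraries. I will discuss the assumptions that a linear regression model requires, show how to evaluate features in order to build a robust model, and finish by testing and validating the model.
The final article focuses on neural networks. I will compare building a neural network model with building a linear regression model in terms of process, results, and accuracy.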

Introduction to Weather Underground

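Weather Underground is a company that collects and distributes weather measurements from around the world. It offers a range of APIs for both commercial and non-commercial use. In this article I describe how to use the non-commercial API to fetch daily weather data. To follow along with this tutorial, you need to sign up for their free developer account. The account comes with an API key that is limited to 10 requests per minute and 500 requests per day.
The API for retrieving historical data has the following form: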

http://api.wunderground.com/api/API_KEY/history_YYYYMMDD/q/STATE/CITY.json

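API_KEY: obtained when you register an account

YYYYMMDD: the date of the weather data you want to retrieve

STATE: the state abbreviation (for example, NE)

CITY: the name of the city you are requesting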
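
As a rough illustration of how the four placeholders fit together, the sketch below fills the URL template for one day. The helper name build_history_url and the named placeholders are my own assumptions for illustration; only the URL pattern itself comes from the article.

```python
from datetime import datetime

# URL pattern from the article; the named placeholders are an assumption
# made here for readability.
HISTORY_URL = (
    "http://api.wunderground.com/api/{api_key}/history_{date}/q/{state}/{city}.json"
)


def build_history_url(api_key, day, state, city):
    """Fill the history endpoint template for a single calendar day."""
    return HISTORY_URL.format(
        api_key=api_key,
        date=day.strftime("%Y%m%d"),  # the endpoint expects YYYYMMDD
        state=state,
        city=city,
    )


example_url = build_history_url("YOUR_API_KEY", datetime(2016, 5, 16), "NE", "Lincoln")
print(example_url)
```

Formatting the date with strftime("%Y%m%d") keeps the zero padding that the YYYYMMDD segment requires.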

Calling the API

To call the Weather Underground API for historical data, this tutorial uses the following Python libraries.

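Name	Description	Source
datetime	Date handling	Standard library
time	Time handling	Standard library
collections	Its namedtuple is used to structure the data	Standard library
pandas	Data processing	Third party
requests	HTTP request library	Third party
matplotlib	Plotting library	Third party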

First, import these libraries:

from datetime import datetime, timedelta  
import time  
from collections import namedtuple  
import pandas as pd  
import requests  
import matplotlib.pyplot as plt

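Next, define constants for API_KEY and BASE_URL. Note that the API_KEY in this example is not valid; you need to register to obtain your own. The code is as follows: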

API_KEY = "7052ad35e3c73564"  
# The first placeholder is the API_KEY, the second is the date
BASE_URL = "http://api.wunderground.com/api/{}/history_{}/q/NE/Lincoln.json"

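Then we initialize a variable holding the start date and define a list naming the fields to extract from the API response. We also define a namedtuple type, DailySummary, to store the returned data. The code is as follows: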

target_date = datetime(2016, 5, 16)  
features = ["date", "meantempm", "meandewptm", "meanpressurem", "maxhumidity", "minhumidity", "maxtempm",  
            "mintempm", "maxdewptm", "mindewptm", "maxpressurem", "minpressurem", "precipm"]
DailySummary = namedtuple("DailySummary", features)

Define a function that calls the API and fetches data for the number of days given by days, starting at target_date. The code is as follows:

def extract_weather_data(url, api_key, target_date, days):  
    records = []
    for _ in range(days):
        # use the url and api_key arguments rather than the module-level constants
        request = url.format(api_key, target_date.strftime("%Y%m%d"))
        response = requests.get(request)
        if response.status_code == 200:
            data = response.json()["history"]["dailysummary"][0]
            records.append(DailySummary(
                date=target_date,
                meantempm=data["meantempm"],
                meandewptm=data["meandewptm"],
                meanpressurem=data["meanpressurem"],
                maxhumidity=data["maxhumidity"],
                minhumidity=data["minhumidity"],
                maxtempm=data["maxtempm"],
                mintempm=data["mintempm"],
                maxdewptm=data["maxdewptm"],
                mindewptm=data["mindewptm"],
                maxpressurem=data["maxpressurem"],
                minpressurem=data["minpressurem"],
                precipm=data["precipm"]))
        time.sleep(6)
        target_date += timedelta(days=1)
    return records

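First, we define a list, records, to hold the DailySummary tuples, and use a for loop to iterate over the requested dates. For each date, the function builds the URL, issues the HTTP request, uses the returned data to create a DailySummary, and appends it to records. The time.sleep(6) call keeps the request rate within the limit of 10 requests per minute. The function returns the historical weather data for the N days starting at the given date.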
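
The parsing step can be separated from the HTTP call so that it is testable without network access. The following is a minimal sketch under that assumption: parse_daily_summary is a hypothetical helper, and fake_payload is a hand-written stand-in that only mimics the history/dailysummary nesting used in the function above; it is not real API output.

```python
from collections import namedtuple
from datetime import datetime

features = ["date", "meantempm", "meandewptm", "meanpressurem", "maxhumidity",
            "minhumidity", "maxtempm", "mintempm", "maxdewptm", "mindewptm",
            "maxpressurem", "minpressurem", "precipm"]
DailySummary = namedtuple("DailySummary", features)


def parse_daily_summary(payload, day):
    """Return a DailySummary for `day`, or None if no daily summary is present."""
    summaries = payload.get("history", {}).get("dailysummary", [])
    if not summaries:
        return None
    data = summaries[0]
    # A field missing from the response becomes None instead of raising KeyError.
    values = {name: data.get(name) for name in features if name != "date"}
    return DailySummary(date=day, **values)


# Hand-written stand-in for a decoded response body (not real API output).
fake_payload = {
    "history": {"dailysummary": [{"meantempm": "18", "maxtempm": "24", "mintempm": "12"}]}
}
record = parse_daily_summary(fake_payload, datetime(2016, 5, 16))
empty = parse_daily_summary({}, datetime(2016, 5, 17))
print(record.meantempm, record.precipm, empty)
```

Returning None for an empty payload lets the caller decide whether to skip the day or retry, rather than failing partway through a long collection run.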

Fetching 500 days of weather data

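Because of the API limit of 500 requests per day, a single 500-day batch uses up a full day's quota, so collecting the 1000 days of data shown in the info() output later in this article takes two days. You can also download my test data to save time.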

records = extract_weather_data(BASE_URL, API_KEY, target_date, 500)
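
The limits quoted earlier (10 requests per minute, 500 per day) determine both the pause between requests and how many calendar days a collection run needs. The arithmetic can be sketched as follows; collection_plan is a hypothetical helper, and the estimate counts only the sleep time, not the duration of the requests themselves.

```python
import math

REQUESTS_PER_MINUTE = 10  # free-tier limits quoted in the article
REQUESTS_PER_DAY = 500


def collection_plan(total_days):
    """Return (seconds between requests, calendar days needed, minutes per batch)."""
    min_delay = 60 / REQUESTS_PER_MINUTE
    calendar_days = math.ceil(total_days / REQUESTS_PER_DAY)
    # Time spent sleeping in one daily batch, ignoring request latency.
    batch_minutes = min(total_days, REQUESTS_PER_DAY) * min_delay / 60
    return min_delay, calendar_days, batch_minutes


delay, days_needed, batch_minutes = collection_plan(1000)
print(delay, days_needed, batch_minutes)
```

With these limits, a 6-second pause is the shortest delay that stays within the per-minute quota, which matches the time.sleep(6) in the function above.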

Converting the data to a Pandas DataFrame

We use the list of DailySummary records to initialize a Pandas DataFrame. The DataFrame is a data structure used very frequently in machine learning.

df = pd.DataFrame(records, columns=features).set_index("date")
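
To see what this conversion produces without calling the API, here is a small sketch with a trimmed-down namedtuple. Reading and its values are made up for illustration; the measurements are written as strings on the assumption that the API delivers them that way, which the object dtypes reported by info() later in the article suggest.

```python
from collections import namedtuple
from datetime import datetime

import pandas as pd

# Trimmed-down, hypothetical stand-in for DailySummary.
Reading = namedtuple("Reading", ["date", "meantempm", "meandewptm"])

# Made-up values, stored as strings.
sample = [
    Reading(datetime(2016, 5, 16), "18", "9"),
    Reading(datetime(2016, 5, 17), "20", "11"),
]

# Each namedtuple becomes one row; the date column becomes the index.
demo = pd.DataFrame(sample, columns=list(Reading._fields)).set_index("date")
print(demo)
```

Because the dates are datetime objects, set_index("date") yields a DatetimeIndex, which is what the later info() output reports.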

Feature extraction

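Machine learning is experimental in nature, so you may run into contradictory data or behavior. When you apply machine learning to a problem, you therefore need some understanding of the problem domain so that you can derive better features.
I will use the following fields, and use the measurements from the previous three days as predictors.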

mean temperature

mean dewpoint

mean pressure

max humidity

min humidity

max dewpoint

min dewpoint

max pressure

min pressure

precipitation

First I need to add columns to the DataFrame to hold the new features. To make testing easier, I create a variable tmp containing the first 10 rows with only the meantempm and meandewptm columns. The code is as follows:

tmp = df[["meantempm", "meandewptm"]].head(10)  
tmp

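For each row, we take the corresponding values from one, two, and three days earlier and store them in that row, in columns named feature_N (for example, meantempm_1). The code is as follows: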

# 1 day prior
N = 1

# target measurement of mean temperature
feature = "meantempm"

# total number of rows
rows = tmp.shape[0]

# a list representing Nth prior measurements of feature
# notice that the front of the list needs to be padded with N
# None values to maintain the consistent row length for each N
# (.iloc makes the lookup positional, since the index holds dates)
nth_prior_measurements = [None]*N + [tmp[feature].iloc[i-N] for i in range(N, rows)]

# make a new column name of feature_N and add to DataFrame
col_name = "{}_{}".format(feature, N)  
tmp[col_name] = nth_prior_measurements  
tmp

We now wrap the steps above in a function so that it is easy to reuse.

def derive_nth_day_feature(df, feature, N):  
    rows = df.shape[0]
    nth_prior_measurements = [None]*N + [df[feature].iloc[i-N] for i in range(N, rows)]
    col_name = "{}_{}".format(feature, N)
    df[col_name] = nth_prior_measurements
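
The same lagged columns can also be produced with pandas' built-in Series.shift, which moves values down by N rows and pads the top with NaN. This is an alternative sketch, not the author's method; derive_nth_day_feature_shift is a hypothetical name and the temperatures are made up.

```python
import pandas as pd


def derive_nth_day_feature_shift(df, feature, N):
    """Add a column feature_N holding the value of `feature` from N rows earlier."""
    df["{}_{}".format(feature, N)] = df[feature].shift(N)


demo = pd.DataFrame(
    {"meantempm": [10, 12, 11, 15, 14]},  # made-up values
    index=pd.date_range("2016-05-16", periods=5, freq="D"),
)
for n in range(1, 4):
    derive_nth_day_feature_shift(demo, "meantempm", n)
print(demo)
```

Like the list-based version, shift works on row positions, so both approaches assume one row per consecutive day. If a failed request leaves a gap in the dates, "N rows earlier" no longer means "N days earlier" in either version.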

Now, for every feature, we add the values from the previous three days to each row.

for feature in features:  
    if feature != "date":
        for N in range(1, 4):
            derive_nth_day_feature(df, feature, N)

After processing, the full set of columns is:

df.columns  

Index(["meantempm", "meandewptm", "meanpressurem", "maxhumidity",  
       "minhumidity", "maxtempm", "mintempm", "maxdewptm", "mindewptm",
       "maxpressurem", "minpressurem", "precipm", "meantempm_1", "meantempm_2",
       "meantempm_3", "meandewptm_1", "meandewptm_2", "meandewptm_3",
       "meanpressurem_1", "meanpressurem_2", "meanpressurem_3",
       "maxhumidity_1", "maxhumidity_2", "maxhumidity_3", "minhumidity_1",
       "minhumidity_2", "minhumidity_3", "maxtempm_1", "maxtempm_2",
       "maxtempm_3", "mintempm_1", "mintempm_2", "mintempm_3", "maxdewptm_1",
       "maxdewptm_2", "maxdewptm_3", "mindewptm_1", "mindewptm_2",
       "mindewptm_3", "maxpressurem_1", "maxpressurem_2", "maxpressurem_3",
       "minpressurem_1", "minpressurem_2", "minpressurem_3", "precipm_1",
       "precipm_2", "precipm_3"],
      dtype="object")

Data cleaning

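Data cleaning is the most important step in the machine learning process, and it is also time-consuming and laborious. In this tutorial we drop samples we do not need and samples with incomplete data, check the consistency of the data, and so on.
First, drop the data I am not interested in to reduce the dataset. Our goal is to predict temperature from the weather data of the previous three days, so among the same-day measurements we keep only meantempm, mintempm, and maxtempm; all of the derived previous-day columns are kept.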

# make list of original features without meantempm, mintempm, and maxtempm
to_remove = [feature  
             for feature in features 
             if feature not in ["meantempm", "mintempm", "maxtempm"]]

# make a list of columns to keep
to_keep = [col for col in df.columns if col not in to_remove]

# select only the columns in to_keep and assign to df
df = df[to_keep]  
df.columns
Index(["meantempm", "maxtempm", "mintempm", "meantempm_1", "meantempm_2",  
       "meantempm_3", "meandewptm_1", "meandewptm_2", "meandewptm_3",
       "meanpressurem_1", "meanpressurem_2", "meanpressurem_3",
       "maxhumidity_1", "maxhumidity_2", "maxhumidity_3", "minhumidity_1",
       "minhumidity_2", "minhumidity_3", "maxtempm_1", "maxtempm_2",
       "maxtempm_3", "mintempm_1", "mintempm_2", "mintempm_3", "maxdewptm_1",
       "maxdewptm_2", "maxdewptm_3", "mindewptm_1", "mindewptm_2",
       "mindewptm_3", "maxpressurem_1", "maxpressurem_2", "maxpressurem_3",
       "minpressurem_1", "minpressurem_2", "minpressurem_3", "precipm_1",
       "precipm_2", "precipm_3"],
      dtype="object")
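
The same selection can be written with DataFrame.drop, shown here as an alternative sketch on a tiny made-up frame. Passing errors="ignore" is needed in this sketch because "date" appears in the features list but is the index rather than a column.

```python
import pandas as pd

features = ["date", "meantempm", "meandewptm"]  # shortened list for illustration

demo = pd.DataFrame({
    "meantempm": [10, 12],       # made-up values
    "meandewptm": [4, 5],
    "meandewptm_1": [None, 4],
})

# Same-day measurements other than the three temperature fields.
to_remove = [f for f in features if f not in ["meantempm", "mintempm", "maxtempm"]]

# errors="ignore" skips labels that are not columns (here: "date").
trimmed = demo.drop(columns=to_remove, errors="ignore")
print(list(trimmed.columns))
```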

To get a better look at the data, we use some built-in Pandas functions. First, the info() function, which prints information about the data stored in the DataFrame.

df.info()
  
DatetimeIndex: 1000 entries, 2015-01-01 to 2017-09-27  
Data columns (total 39 columns):  
meantempm          1000 non-null object  
maxtempm           1000 non-null object  
mintempm           1000 non-null object  
meantempm_1        999 non-null object  
meantempm_2        998 non-null object  
meantempm_3        997 non-null object  
meandewptm_1       999 non-null object  
meandewptm_2       998 non-null object  
meandewptm_3       997 non-null object  
meanpressurem_1    999 non-null object  
meanpressurem_2    998 non-null object  
meanpressurem_3    997 non-null object  
maxhumidity_1      999 non-null object  
maxhumidity_2      998 non-null object  
maxhumidity_3      997 non-null object  
minhumidity_1      999 non-null object  
minhumidity_2      998 non-null object  
minhumidity_3      997 non-null object  
maxtempm_1         999 non-null object  
maxtempm_2         998 non-null object  
maxtempm_3         997 non-null object  
mintempm_1         999 non-null object  
mintempm_2         998 non-null object  
mintempm_3         997 non-null object  
maxdewptm_1        999 non-null object  
maxdewptm_2        998 non-null object  
maxdewptm_3        997 non-null object  
mindewptm_1        999 non-null object  
mindewptm_2        998 non-null object  
mindewptm_3        997 non-null object  
maxpressurem_1     999 non-null object  
maxpressurem_2     998 non-null object  
maxpressurem_3     997 non-null object  
minpressurem_1     999 non-null object  
minpressurem_2     998 non-null object  
minpressurem_3     997 non-null object  
precipm_1          999 non-null object  
precipm_2          998 non-null object  
precipm_3          997 non-null object  
dtypes: object(39)  
memory usage: 312.5+ KB

Note: every column has dtype object, so we need to convert the data to numeric values.

df = df.apply(pd.to_numeric, errors="coerce")  
df.info()
  
DatetimeIndex: 1000 entries, 2015-01-01 to 2017-09-27  
Data columns (total 39 columns):  
meantempm          1000 non-null int64  
maxtempm           1000 non-null int64  
mintempm           1000 non-null int64  
meantempm_1        999 non-null float64  
meantempm_2        998 non-null float64  
meantempm_3        997 non-null float64  
meandewptm_1       999 non-null float64  
meandewptm_2       998 non-null float64  
meandewptm_3       997 non-null float64  
meanpressurem_1    999 non-null float64  
meanpressurem_2    998 non-null float64  
meanpressurem_3    997 non-null float64  
maxhumidity_1      999 non-null float64  
maxhumidity_2      998 non-null float64  
maxhumidity_3      997 non-null float64  
minhumidity_1      999 non-null float64  
minhumidity_2      998 non-null float64  
minhumidity_3      997 non-null float64  
maxtempm_1         999 non-null float64  
maxtempm_2         998 non-null float64  
maxtempm_3         997 non-null float64  
mintempm_1         999 non-null float64  
mintempm_2         998 non-null float64  
mintempm_3         997 non-null float64  
maxdewptm_1        999 non-null float64  
maxdewptm_2        998 non-null float64  
maxdewptm_3        997 non-null float64  
mindewptm_1        999 non-null float64  
mindewptm_2        998 non-null float64  
mindewptm_3        997 non-null float64  
maxpressurem_1     999 non-null float64  
maxpressurem_2     998 non-null float64  
maxpressurem_3     997 non-null float64  
minpressurem_1     999 non-null float64  
minpressurem_2     998 non-null float64  
minpressurem_3     997 non-null float64  
precipm_1          889 non-null float64  
precipm_2          889 non-null float64  
precipm_3          888 non-null float64  
dtypes: float64(36), int64(3)  
memory usage: 312.5 KB
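
With errors="coerce", any value that cannot be parsed as a number becomes NaN instead of raising an error. The drop in non-null counts for the precipm columns between the two info() outputs above (999 to 889 for precipm_1) is consistent with that behavior. A small sketch with made-up strings follows; the "T" and empty-string entries are hypothetical placeholders for non-numeric readings.

```python
import pandas as pd

raw = pd.DataFrame({
    "meantempm": ["18", "20", "17"],
    "precipm": ["0.00", "T", ""],  # made-up non-numeric placeholders
})

# Unparseable entries are replaced by NaN rather than raising ValueError.
numeric = raw.apply(pd.to_numeric, errors="coerce")
print(numeric)
print(numeric.isna().sum())
```

A column that parses cleanly to whole numbers stays an integer column, while a column containing NaN is stored as float, which explains the mix of int64 and float64 in the output above.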

Now the data is in the form I want. Next we call the describe() function, which returns a DataFrame containing the count, mean, standard deviation, minimum, 25th percentile, 50th percentile, 75th percentile, and maximum of each column.

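Next, we use the interquartile range (IQR) to flag features that contain extreme outliers: values more than 3 IQRs below the first quartile or more than 3 IQRs above the third quartile.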

# Call describe on df and transpose it due to the large number of columns
spread = df.describe().T

# precalculate interquartile range for ease of use in next calculation
IQR = spread["75%"] - spread["25%"]

# create an outliers column which is either 3 IQRs below the first quartile or
# 3 IQRs above the third quartile
spread["outliers"] = (spread["min"]<(spread["25%"]-(3*IQR)))|(spread["max"] > (spread["75%"]+3*IQR))

# just display the features containing extreme outliers
# (.loc replaces the .ix indexer, which was removed in pandas 1.0)
spread.loc[spread.outliers]
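
To make the 3-IQR rule concrete, the sketch below applies the same steps to two made-up columns, one without and one with an extreme value, and then looks up the offending rows. The column names and numbers are invented for illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    "steady": [10, 11, 12, 13, 14, 15, 16, 17],
    "spiky": [10, 11, 12, 13, 14, 15, 16, 100],
})  # made-up values

spread = demo.describe().T
IQR = spread["75%"] - spread["25%"]
lower = spread["25%"] - 3 * IQR
upper = spread["75%"] + 3 * IQR

# A feature is flagged if its min or max falls outside the 3-IQR fences.
spread["outliers"] = (spread["min"] < lower) | (spread["max"] > upper)
flagged = spread.loc[spread["outliers"]]
print(flagged[["min", "25%", "75%", "max"]])

# The rows responsible for the flag in one feature.
extreme_rows = demo[(demo["spiky"] < lower["spiky"]) | (demo["spiky"] > upper["spiky"])]
print(extreme_rows)
```

Note that this only identifies features and rows with extreme values; whether to drop, cap, or keep them remains a judgment call.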

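Assessing the potential impact of outliers is a difficult part of any analytics project. On the one hand, you need to be concerned about the possibility of introducing spurious data samples that will significantly distort your model. On the other hand, outliers can be highly meaningful for predicting outcomes that arise under special circumstances. We will discuss each feature that contains outliers and see whether we can reach a reasonable conclusion about how to treat them.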

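The first set of features all appear to be related to maximum humidity. Looking at the data, I can tell that the outliers for this feature category come from very low minimum values. These values look questionable, so I want to examine them more closely, preferably graphically. To do this I will use a histogram.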

%matplotlib inline
plt.rcParams["figure.figsize"] = [14, 8]  
df.maxhumidity_1.hist()  
plt.title("Distribution of maxhumidity_1")  
plt.xlabel("maxhumidity_1")  
plt.show()

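Looking at the histogram of maxhumidity_1, the data shows considerable negative skew. I will keep this in mind when selecting prediction models and evaluating the strength of the effect of maximum humidity, since many basic statistical methods assume that the data is normally distributed. For now we leave it as is, but remember this anomaly.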
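
The skew visible in a histogram can also be expressed as a number with pandas' Series.skew, where a negative result indicates a longer tail toward low values. A minimal sketch with made-up series; these are not the article's humidity measurements.

```python
import pandas as pd

# Made-up, humidity-like values: most are high, a few are very low.
left_skewed = pd.Series([95, 96, 97, 98, 99, 100, 100, 100, 60, 40])
symmetric = pd.Series([1, 2, 3, 4, 5])

print(left_skewed.skew())  # negative: long tail toward low values
print(symmetric.skew())    # approximately zero for symmetric data
```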

Next, let's look at the histogram of another field.

df.minpressurem_1.hist()  
plt.title("Distribution of minpressurem_1")  
plt.xlabel("minpressurem_1")  
plt.show()

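The last data quality issue to address is missing values. Because of the way I built the DataFrame, missing values are represented by NaN. You may recall that by deriving features representing the measurements of the three preceding days, I intentionally introduced missing values for the first three days of the collected data. Not until three days in can we start deriving those features, so clearly I will want to exclude those first three days from the dataset.
Looking again at the info() output above, very few features contain NaN values apart from the ones I just mentioned; the exception is the precipitation columns. Machine learning requires complete samples, and if we dropped every sample whose precipitation field is empty, we would lose a large number of usable samples. In this situation we can fill the empty precipitation fields with a value instead. Based on experience, and to minimize the impact of the filled-in value on the model, I decided to fill the empty precipitation fields with 0.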

# iterate over the precip columns
for precip_col in ["precipm_1", "precipm_2", "precipm_3"]:  
    # create a boolean array of values representing nans
    missing_vals = pd.isnull(df[precip_col])
    # assign through .loc; chained indexing may write to a temporary copy
    df.loc[missing_vals, precip_col] = 0

After filling in the values, we can drop the samples that still contain empty fields by simply calling dropna().

df = df.dropna()
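
The fill-then-drop sequence can be sketched compactly with fillna on a tiny made-up frame. The order matters: filling the precipitation column first means that only rows with gaps in other columns are removed by dropna().

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "meantempm": [10, 12, 11, 15],
    "meantempm_1": [np.nan, 10, 12, 11],          # first row has no prior day
    "precipm_1": [np.nan, np.nan, 0.5, np.nan],   # made-up gaps
})

precip_cols = ["precipm_1"]
# Fill only the precipitation columns with 0, then drop remaining incomplete rows.
demo[precip_cols] = demo[precip_cols].fillna(0)
cleaned = demo.dropna()
print(cleaned)
```

Had dropna() been called first, three of the four rows in this example would have been discarded because of the empty precipitation values.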

Summary

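This article covered the workflow of collecting, processing, and cleaning the data. The cleaned data produced here will be used for model training in the next article.
This article may have seemed dry and short on substance, but only good sample data can train a good model, so your ability to collect and process sample data directly affects the results of your later machine learning work.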

Original article (in English)

Reposted from my blog, 捕蛇者說

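Copyright belongs to the author. Please do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reposting, please cite the address of this article: http://m.hztianpu.com/yun/41118.html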
