概述

在Part 1的B问中，我们使用了多元线性回归模型分别实现4个影响因素的回归分析。分别为Coal，Petroleum，Natural和Energy，本次介绍下我利用sklearn的实现多元线性回归。

基础

所使用的主要是sklearn.linear_model.LinearRegression类。
官方文档文档中说明使用的是普通线性最小二乘算法

Ordinary least squares Linear Regression

类中的具体方法这里不再详细介绍，可以参考官方手册。

代码实现

对于每个影响因素，我们从指标中寻找5个相关指标进行拟合，拟合的目标输出是每个变量的总量，这个可以由指标的描述信息进行确定，例如coal的总量为CLTCB。
我先把输入数据和拟合目标从表中进行提取，其中dataMSNNum是一个包含数据类型号的数组，长度为6，第1个为拟合的目标，余下为输入数据。具体实现如下：

# get the data of 6 certain MSNs' dataset
# the test number is 77,39,44,48,54,72
# the 77 is the aim
def getRegData(stateNum):
    regData = []
    for i in range(0,6):
        MSNNum = dataMSNNum[i]
        MSNName = sheet2.cell(MSNNum,0).value
        stateName = getNameOfState(stateNum)

        oriX = 0
        oriY = []
        for i in range(1,105744):
            if sheet1.cell(i,0).value == MSNName and sheet1.cell(i,1).value == stateName:
                oriX = oriX + 1
                oriY.append(sheet1.cell(i,3).value)
        regData.append(oriY)

    return regData

之后进行多元线性拟合，实现如下：

# to achieve the regression
def regreAim():
    oriData = getRegData(1)
    X = oriData[1:6]
    X = trans(X)
    Y = oriData[0]

    # define the model and train
    model = linear_model.LinearRegression()
    model.fit(X,Y)

    # get the aimCoefficients and aimIndependent
    aimCoefficients =  model.coef_
    aimIndependent = model.intercept_

    print 'aimCoefficients:'
    print aimCoefficients
    print 'aimIndependent:'
    print aimIndependent

    # get the output by the originalData
    test_x = X
    testPredict = model.predict(test_x)

    # get the means of difference between oriY and new Y
    comparation = (testPredict - Y)**2
    mean_Y = numpy.mean(comparation)  # get the mean to detect

    # ouput the meas
    print 'mean_Y:'
    print mean_Y


    # output is the original data:Y and the prodict:testPredict
    return Y,testPredict

第一个输出为拟合的数据，第二个数据为以样本为输入，利用训练好的模型得到的输出，并打印出模型的系数，y轴截距和均方误差。之后将原始数据和模型输出作图对比：

# get the photo to compare to the original data
def getPhotoAimPredict():
    ori_Y,new_Y = regreAim()


    X = range(1960,2010)
    plt.subplot(311)
    plt.title('Original')
    l1 = plt.plot(X, ori_Y , marker='.', mec='r', mfc='w',label=u'Original')
    plt.ylabel('Billion Btu')
    plt.legend(loc='upper left')

    plt.subplot(312)
    plt.title('Predict')
    l2 = plt.plot(X, new_Y , marker='.', mec='b', mfc='w',label=u'Predict')
    plt.ylabel('Billion Btu')
    plt.legend(loc='upper left')

    plt.subplot(313)
    l1 = plt.plot(X, ori_Y , marker='.', mec='r', mfc='w',label=u'Original')
    l2 = plt.plot(X, new_Y , marker='.', mec='b', mfc='w',label=u'Predict')
    plt.xlabel('time/year')
    plt.ylabel('Billion Btu')
    plt.legend(loc='upper left')
    plt.title('Compare')

以Coal为例，输出为：

$ python regression2.py 
aimCoefficients:
[ 0.99999913  1.00000457  1.          1.          1.00000498]
aimIndependent:
2.20703077503e-06
MSE_Y:
8.9296998712e-10

输出图像如下：
$photo1$
由于我把大部分函数封装了起来，所以对于其余因素进行回归的时候，只需要在给出其对应的dataMSNNum数组即可。

小结

模型的回归分析在相关因素寻找合适的时候，误差是较小的，但是所寻找的相关项不合适的时候，其误差也是比较大的，例如一开始natual的相关项选择的并不合适，发现结果如下：

$ python regression2.py 
aimCoefficients:
[ 0.          2.58540075 -2.05603511  0.85646644  4.78450319]
aimIndependent:
-20150.6439576
MSE_Y:
162743622.205

对比图输出如下：
$photo2$
可以发现第一个因素为无关变量，并且均方误差较大。不过这样也侧面说明其他变量影响因素选择比较合适。