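Overview
After a simple grouping of the data, we found that part of it is duplicated: the same quantity appears several times, differing only in unit. We therefore apply a filtering step first and a dimensionality-reduction step afterwards. Reducing dimensions by hand is inefficient and not guaranteed to work well, so we use PCA. The implementation is described below.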
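Implementation
Data filtering
Among the 605 indicators, many describe the same variable and differ only in unit. For example, the two line charts below show the 1st and 2nd indicators: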
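The two differ only in unit. We therefore selected 9 units out of all the units as the first filtering step. The implementation starts by filtering for a single kind: from the first table, it collects all indices that match, as follows:
# Select the MSN indicators of one kind; kindNum is the number of the kind.
# Returns the indices of all indicators of that kind.
def getkindArrNum(kindNum):
    kindName = getkindName(kindNum)
    kindArr = []
    # Scan rows 1 to 604 of sheet2 and compare column 2 with the kind name.
    for i in range(1, 605):
        if sheet2.cell(i, 2).value == kindName:
            kindArr.append(i)
    return kindArr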
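The index-selection step can be tried without the spreadsheet. The following is a minimal sketch, assuming the unit column has already been read into a plain Python list. The names `select_indices_by_unit` and `units` are hypothetical and not part of the original code, and the unit strings are made up for illustration.

```python
def select_indices_by_unit(units, wanted_unit):
    """Return the 1-based positions whose unit equals wanted_unit."""
    return [pos for pos, unit in enumerate(units, start=1) if unit == wanted_unit]


# Hypothetical unit column: one entry per indicator.
units = ["Billion Btu", "Thousand barrels", "Billion Btu", "Million dollars"]
btu_indices = select_indices_by_unit(units, "Billion Btu")
print(btu_indices)
```

Counting from 1 mirrors the original loop, which starts at row 1 of the sheet.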
Next, we use these indices to gather the data into a single array. PCA requires every series to have the same length, but the lengths differ between series. Since series of length 50 make up more than 85% of the data, we keep only those for the analysis, as follows:
# Collect the data for one kind.
# Returns a list with one data array per indicator of the given kindNum.
def getkindArrData(kindNum, stateNum):
    kindArr = getkindArrNum(kindNum)
    kindArrData = []
    for index in kindArr:
        kindArrData.append(getDataArr(stateNum, index))
    return kindArrData

# The arrays returned by getkindArrData have different lengths, e.g. 0, 30, 40, 50.
# PCA needs arrays of equal length, so keep only those of length 50.
def screenKindArrData(kindNum, stateNum):
    kindArrData = getkindArrData(kindNum, stateNum)
    scrKindArrData = []
    for kindData in kindArrData:
        if len(kindData) == 50:
            scrKindArrData.append(kindData)
    return scrKindArrData
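The length filter, and the kind of count that backs a figure like "more than 85%", can be sketched on toy data. This is only an illustration under assumptions: `keep_series_of_length` and `series_list` are hypothetical names, and the toy series are not taken from the dataset.

```python
from collections import Counter


def keep_series_of_length(series_list, length=50):
    """Keep only the series with exactly `length` observations."""
    return [series for series in series_list if len(series) == length]


# Toy input: two full series, one short series, one empty series.
series_list = [list(range(50)), list(range(30)), [], list(range(50))]

length_counts = Counter(len(series) for series in series_list)
share_of_50 = length_counts[50] / float(len(series_list))
kept = keep_series_of_length(series_list)
print(length_counts, share_of_50, len(kept))
```

Running the same count on the real series shows how much data is dropped by keeping only length 50.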
Filtering across all 9 kinds follows the same principle:
# Important function in this code.
# Returns the indices of all indicators that belong to one of the 9 kinds,
# i.e. the indicators that take part in the dimensionality reduction.
def whetherIndex():
    kindArr = []
    for kindNum in range(0, 9):
        kindName = getkindName(kindNum)
        for i in range(1, 605):
            if sheet2.cell(i, 2).value == kindName:
                kindArr.append(i)
    return kindArr

# Important function in this code.
# Collect the series of all 9 kinds and keep only those of length 50.
def screenKindArrData2(stateNum):
    kindArrData = []
    scrKindArrData = []
    for kindNum in range(0, 9):
        for index in getkindArrNum(kindNum):
            kindArrData.append(getDataArr(stateNum, index))
    for kindData in kindArrData:
        if len(kindData) == 50:
            scrKindArrData.append(kindData)
    return scrKindArrData
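Dimensionality reduction
We reduce the dimensionality with PCA. The PCA function is:
import numpy

# Reduce the dataset with PCA.
# dataMat: numpy matrix with one sample per row and one feature per column.
# topNfeat: number of principal components to keep.
def pca(dataMat, topNfeat=9999999):
    meanVals = numpy.mean(dataMat, axis=0)
    meanRemoved = dataMat - meanVals                    # center each feature
    covMat = numpy.cov(meanRemoved, rowvar=0)           # covariance between features
    eigVals, eigVects = numpy.linalg.eig(numpy.asmatrix(covMat))
    eigValInd = numpy.argsort(eigVals)                  # ascending order
    eigValInd = eigValInd[:-(topNfeat + 1):-1]          # indices of the largest topNfeat
    redEigVects = eigVects[:, eigValInd]
    lowDDataMat = meanRemoved * redEigVects             # matrix product: project the data
    reconMat = (lowDDataMat * redEigVects.T) + meanVals # map back to the original space
    return lowDDataMat, reconMat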
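As a cross-check of the steps above, here is a minimal sketch of the same procedure on plain arrays. It assumes `numpy.linalg.eigh` is an acceptable substitute for `eig`, since a covariance matrix is symmetric and `eigh` returns real eigenvalues. The function `pca_eigh` and the random test data are hypothetical and not part of the original code.

```python
import numpy as np


def pca_eigh(data, n_components):
    """PCA on an (n_samples, n_features) array using a symmetric eigensolver."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    centered = data - mean
    cov = np.cov(centered, rowvar=False)              # (n_features, n_features)
    eig_vals, eig_vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eig_vals)[::-1][:n_components] # largest eigenvalues first
    components = eig_vecs[:, order]
    low_d = centered @ components                     # (n_samples, n_components)
    recon = low_d @ components.T + mean               # back in the original space
    return low_d, recon


rng = np.random.default_rng(0)
data = rng.normal(size=(50, 8))                       # 50 samples, 8 features
low_d, recon = pca_eigh(data, 3)
full_low_d, full_recon = pca_eigh(data, 8)
print(low_d.shape, recon.shape)
```

Keeping every component reconstructs the input up to rounding error, which is a quick way to confirm that rows are samples and columns are features.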
We then apply PCA and plot the result. While plotting we found that data of type "mat" could not be drawn as a line chart, so we first convert it back to a list and then plot it, after which the output is correct. The output is one line chart per dimension of the reduced data, as follows:
import matplotlib.pyplot as plt

# Reduce the data to 5 dimensions.
def downData():
    kindArrData1 = numpy.asmatrix(screenKindArrData2(1))
    kindArrData1 = kindArrData1.T      # rows become samples, columns become indicators
    lowDMat, reconMat = pca(kindArrData1, 5)
    print(numpy.shape(reconMat))
    print(numpy.shape(lowDMat))
    return lowDMat.T

# Save one line chart for each reduced dimension.
def getPhotoDownData():
    xcord1 = list(range(0, 50))
    lowDData = downData()
    for i in range(0, 5):
        # Turn the 1 x 50 matrix row into a flat list so it can be drawn as a line.
        ycord1 = lowDData[i].tolist()[0]
        plt.plot(xcord1, ycord1, 'bo-')
        plt.plot(xcord1, ycord1, marker='o', mec='r', mfc='w', label=u'test1')
        f_name = '第' + str(i) + '种' + '.png'
        print(f_name)
        f_path = source + f_name       # source is defined elsewhere in the script
        plt.savefig(f_path)
        plt.close()
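The conversion used before plotting can be looked at in isolation. The sketch below assumes a small stand-in matrix instead of the real reduced data, so `low_d` here is hypothetical. It shows that indexing a numpy matrix still yields a two-dimensional 1 x n matrix, and two equivalent ways to flatten it into a list.

```python
import numpy as np

# Stand-in for the reduced data: 2 dimensions, 5 observations each.
low_d = np.asmatrix(np.arange(10.0).reshape(2, 5))

row = low_d[0]                                # still a 1 x 5 matrix, not a 1-D array
y_nested = row.tolist()[0]                    # conversion used in the plotting code
y_flat = np.asarray(row).ravel().tolist()     # alternative via a plain array
print(row.shape, y_flat)
```

Either list can be passed to `plt.plot` together with an x list of the same length.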
Output
I write the output data directly to a txt file, wrapped in a function:
def writeToTxt(list_name, file_path):
    try:
        # "w" creates the file if needed and writes text, one item per line.
        with open(file_path, "w") as fp:
            for item in list_name:
                fp.write(str(item) + "\n")
    except IOError:
        print("fail to open file")
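For numeric results such as the reduced matrix, an alternative is to let NumPy write and read the file. This is a sketch under assumptions rather than the approach used in this post: the array `low_d` and the temporary output path are hypothetical.

```python
import os
import tempfile

import numpy as np

# Stand-in for a reduced data matrix: 4 rows, 3 columns.
low_d = np.arange(12.0).reshape(4, 3)

out_path = os.path.join(tempfile.mkdtemp(), "low_d.txt")
np.savetxt(out_path, low_d, fmt="%.6f")   # one matrix row per text line
loaded = np.loadtxt(out_path)             # read the file back as an array
print(loaded.shape)
```

Reading the file back makes it easy to confirm that the saved shape matches what was computed.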
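Summary
When I first applied PCA I mixed up the dimensions: the reduced data had the wrong shape because I had forgotten to transpose the array, and after transposing it the output was correct. At first I reduced the 287-dimensional data to 30 dimensions and found that only about 5 of them showed clear features. Reducing to fewer dimensions then gave better results.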
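One way to judge how many dimensions to keep is the share of variance explained by each component. The following is a minimal sketch on synthetic data, not on the dataset from this post. `explained_variance_ratio` is a hypothetical helper, and the data is generated so that two directions carry almost all of the variance.

```python
import numpy as np


def explained_variance_ratio(data):
    """Share of total variance per principal component, largest first.

    data must have one sample per row and one feature per column.
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the eigenvalues of its covariance matrix.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2))                   # 2 underlying factors
mixing = rng.normal(size=(2, 6))                    # spread over 6 features
data = latent @ mixing + 0.01 * rng.normal(size=(50, 6))

ratio = explained_variance_ratio(data)
cumulative = np.cumsum(ratio)
print(cumulative)
```

Choosing the smallest number of components whose cumulative share passes a chosen threshold gives a more systematic alternative to trying 30 and then fewer.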