HW3
Submission requirements:
Please submit your solutions to our class website.
Q1
Suppose that the data mining task is to cluster the following ten points (with (x, y, z) representing location) into three clusters:
A1(4,2,5), A2(10,5,2), A3(5,8,7), B1(1,1,1), B2(2,3,2), B3(3,6,9), C1(11,9,2), C2(1,4,6), C3(9,1,7), C4(5,6,7)
The distance function is Euclidean distance. Suppose initially we assign A2, B2, and C2 as the centers of the three clusters, respectively. Use the K-Means algorithm to show only
(a) The three cluster centers after the first round of execution
(b) The final three clusters
Answer:
The K-Means implementation used for this question is as follows:

```python
import math

# The ten points in the order A1, A2, A3, B1, B2, B3, C1, C2, C3, C4.
Poi = [[4, 2, 5], [10, 5, 2], [5, 8, 7],
       [1, 1, 1], [2, 3, 2], [3, 6, 9],
       [11, 9, 2], [1, 4, 6], [9, 1, 7], [5, 6, 7]]


def k_Means(poi, k=3, epochs=100):
    dim = len(poi[0])
    # Per-cluster accumulator: coordinate sums followed by a point count.
    cluster = [[0] * dim + [0] for _ in range(k)]
    PoiClass = [-1] * len(poi)  # current cluster index of each point
    Break = [-1] * len(poi)     # assignment from the previous round
    centroid = [poi[1], poi[4], poi[7]]  # initial centers A2, B2, C2

    def edis(x, y):
        """Euclidean distance between two points."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    for epoch in range(epochs):
        # Assignment step: attach each point to its nearest centroid.
        for pId, p in enumerate(poi):
            dis, idx = float("inf"), 0
            for Idx, c in enumerate(centroid):
                if (v := edis(p, c)) < dis:  # strict <, so ties go to the lower index
                    dis, idx = v, Idx
            for i in range(dim):
                cluster[idx][i] += p[i]
            cluster[idx][-1] += 1
            PoiClass[pId] = idx
        # Update step: move each centroid to the mean of its points.
        for i, v in enumerate(cluster):
            centroid[i] = [s / v[-1] for s in v[:-1]]
            cluster[i] = [0] * dim + [0]
        if epoch == 0:
            print(centroid)
        # Stop once no point changes cluster between two rounds.
        if PoiClass == Break:
            return PoiClass
        Break = PoiClass[:]
    return PoiClass


print(k_Means(Poi))
```

(a) The three centers printed at the end of the first round (rounded to two decimals):

```
[[10.0, 5.0, 3.67], [2.33, 2.0, 2.67], [3.5, 6.0, 7.25]]
```
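(b) With A2, B2, and C2 as the initial centers, the algorithm converges to the following assignment:

| Point | Cluster |
| --- | --- |
| A1 | 1 |
| A2 | 0 |
| A3 | 2 |
| B1 | 1 |
| B2 | 1 |
| B3 | 2 |
| C1 | 0 |
| C2 | 2 |
| C3 | 0 |
| C4 | 2 |

The final three clusters are therefore {A2, C1, C3} (cluster 0), {A1, B1, B2} (cluster 1), and {A3, B3, C2, C4} (cluster 2).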
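As a quick check on the assignment above, the following minimal sketch recomputes the centers from the reported labels and reassigns every point to its nearest center. A converged K-Means solution should come back unchanged. The names `points`, `labels`, and `mean` are illustrative and not part of the original solution; the coordinates are taken from the problem statement.

```python
import math

# Points as listed in the problem statement.
points = {
    "A1": (4, 2, 5), "A2": (10, 5, 2), "A3": (5, 8, 7),
    "B1": (1, 1, 1), "B2": (2, 3, 2), "B3": (3, 6, 9),
    "C1": (11, 9, 2), "C2": (1, 4, 6), "C3": (9, 1, 7), "C4": (5, 6, 7),
}
# Cluster labels as reported in the answer to (b).
labels = {"A1": 1, "A2": 0, "A3": 2, "B1": 1, "B2": 1,
          "B3": 2, "C1": 0, "C2": 2, "C3": 0, "C4": 2}


def mean(vectors):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(vectors) for c in zip(*vectors))


# Group point names by cluster, then compute each cluster's center.
clusters = {k: [n for n, l in labels.items() if l == k] for k in range(3)}
centers = {k: mean([points[n] for n in names]) for k, names in clusters.items()}

# Reassign every point to its nearest center (Euclidean distance).
reassigned = {
    n: min(centers, key=lambda k: math.dist(p, centers[k]))
    for n, p in points.items()
}
is_fixed_point = reassigned == labels

print(clusters)
print(is_fixed_point)
```

If `is_fixed_point` is true, another K-Means round would not move any point, which is the stopping condition used in this answer.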
Q2
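| | Product 1 | Product 2 | Product 3 | Product 4 |
| --- | --- | --- | --- | --- |
| User 1 | 1 | 1 | 5 | 3 |
| User 2 | 3 | ? | 5 | 4 |
| User 3 | 1 | 3 | 1 | 1 |
| User 4 | 4 | 3 | 2 | 1 |
| User 5 | 2 | 2 | 2 | 4 |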
(a) List the top 3 most similar users of user 2 based on Cosine Similarity.
(b) Predict User 2’s rating for Product 2.
Answer:
(a) User 2 has not rated Product 2, so each similarity is computed over Products 1, 3, and 4 only:

$$sim(U_1,U_2)=\frac{1*3+5*5+3*4}{\sqrt{1^2+5^2+3^2}\sqrt{3^2+5^2+4^2}}=0.956$$

$$sim(U_3,U_2)=\frac{1*3+1*5+1*4}{\sqrt{1^2+1^2+1^2}\sqrt{3^2+5^2+4^2}}=0.980$$

$$sim(U_4,U_2)=\frac{4*3+2*5+1*4}{\sqrt{4^2+2^2+1^2}\sqrt{3^2+5^2+4^2}}=0.802$$

$$sim(U_5,U_2)=\frac{2*3+2*5+4*4}{\sqrt{2^2+2^2+4^2}\sqrt{3^2+5^2+4^2}}=0.924$$

The top 3 most similar users, in decreasing order of similarity, are:

$$U_3,U_1,U_5$$
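The similarities can be reproduced with a short script. This is a minimal sketch, assuming the cosine is taken only over the products that both users have rated (Products 1, 3, and 4 here); `ratings`, `cosine`, and `top3` are illustrative names.

```python
import math

# Ratings from the table; None marks User 2's unknown rating for Product 2.
ratings = {
    1: [1, 1, 5, 3],
    2: [3, None, 5, 4],
    3: [1, 3, 1, 1],
    4: [4, 3, 2, 1],
    5: [2, 2, 2, 4],
}


def cosine(u, v):
    """Cosine similarity over the products both users have rated."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_u * norm_v)


# Similarity of every other user to User 2, then the three largest.
sims = {u: cosine(r, ratings[2]) for u, r in ratings.items() if u != 2}
top3 = sorted(sims, key=sims.get, reverse=True)[:3]

print({u: round(s, 3) for u, s in sims.items()})
print(top3)
```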
(b)
The average rating of User 2 over the products it has rated is:

$$\bar r_2=\frac{3+5+4}{3}=4$$

The average ratings of the three most similar users (Users 1, 3, and 5) are:

$$\bar r_1=\frac{1+1+5+3}{4}=2.5$$

$$\bar r_3=\frac{1+3+1+1}{4}=1.5$$

$$\bar r_5=\frac{2+2+2+4}{4}=2.5$$

Each neighbor contributes its deviation on Product 2 from its own mean, $r_{u,p2}-\bar r_u$, weighted by its similarity to User 2. The predicted rating of User 2 for Product 2 is:

$$r_{u2,p2}=\bar r_2+\frac{0.956*(1-2.5)+0.980*(3-1.5)+0.924*(2-2.5)}{0.956+0.980+0.924}=4-0.149=3.851$$
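The prediction step can be sketched in code as follows. This is a minimal sketch under two stated assumptions: each neighbor's deviation is its Product 2 rating minus its own mean (r_{u,p2} − r̄_u), and neighbor means are taken over all the products that neighbor has rated. The names `ratings`, `cosine`, `mean`, and `neighbors` are illustrative.

```python
import math

# Ratings from the table; None marks User 2's unknown rating for Product 2.
ratings = {
    1: [1, 1, 5, 3],
    2: [3, None, 5, 4],
    3: [1, 3, 1, 1],
    4: [4, 3, 2, 1],
    5: [2, 2, 2, 4],
}
target, product = 2, 1  # User 2, Product 2 (0-based column index 1)


def cosine(u, v):
    """Cosine similarity over the products both users have rated."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    dot = sum(a * b for a, b in pairs)
    return dot / (math.sqrt(sum(a * a for a, _ in pairs))
                  * math.sqrt(sum(b * b for _, b in pairs)))


def mean(row):
    """Average of the known ratings in a row."""
    known = [x for x in row if x is not None]
    return sum(known) / len(known)


sims = {u: cosine(r, ratings[target]) for u, r in ratings.items() if u != target}
neighbors = sorted(sims, key=sims.get, reverse=True)[:3]

# Similarity-weighted average of the neighbors' mean-centered ratings.
numerator = sum(sims[u] * (ratings[u][product] - mean(ratings[u])) for u in neighbors)
denominator = sum(abs(sims[u]) for u in neighbors)
prediction = mean(ratings[target]) + numerator / denominator

print(round(prediction, 3))
```

Mean-centering compensates for users who rate systematically high or low: Users 1 and 5 rate Product 2 below their own averages, so they pull the prediction down, while User 3 pulls it up.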
Part II: Lab
Q1.
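Confusion matrix of the decision tree: [figure]
Confusion matrix of the neural network: [figure]
Confusion matrix of logistic regression: [figure]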
For the decision tree, the evaluation metrics are:

$$Precision_1=\frac{TP}{TP+FP}=\frac{990}{990+43}=0.958$$

$$Recall_1=\frac{TP}{TP+FN}=\frac{990}{990+0}=1$$

For the neural network, the evaluation metrics are:

$$Precision_2=\frac{TP}{TP+FP}=\frac{987}{987+37}=0.964$$

$$Recall_2=\frac{TP}{TP+FN}=\frac{987}{987+3}=0.997$$

For logistic regression, the evaluation metrics are:

$$Precision_3=\frac{TP}{TP+FP}=\frac{988}{988+36}=0.965$$

$$Recall_3=\frac{TP}{TP+FN}=\frac{988}{988+2}=0.998$$

By comparison, logistic regression classifies one more example correctly than the neural network, so both its recall and its precision are higher. The decision tree predicts every example as class 0, which gives it a recall of 1 but a precision slightly lower than that of the other two models.
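The three pairs of metrics can be recomputed from the counts used in the formulas above. This is a minimal sketch; the `counts` dictionary and the helper names are illustrative, and the (TP, FP, FN) values are the ones read off the confusion matrices in this answer.

```python
# (TP, FP, FN) counts as used in the formulas above.
counts = {
    "decision tree": (990, 43, 0),
    "neural network": (987, 37, 3),
    "logistic regression": (988, 36, 2),
}


def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of true positives that the model finds."""
    return tp / (tp + fn)


metrics = {
    name: (round(precision(tp, fp), 3), round(recall(tp, fn), 3))
    for name, (tp, fp, fn) in counts.items()
}
for name, (p, r) in metrics.items():
    print(f"{name}: precision={p}, recall={r}")
```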
Q2
The association rule table is:

| Consequent | Antecedent | Support (%) | Confidence (%) | Lift |
| --- | --- | --- | --- | --- |
| milk | pasta | 35.03449171 | 45.85519412 | 0.993991348 |
| milk | water | 27.85070173 | 46.70393664 | 1.012389323 |
| milk | biscuits | 20.47445019 | 51.53147444 | 1.117034628 |
| milk | brioches | 15.31907532 | 49.67532468 | 1.076799343 |
| milk | yoghurt | 15.23473823 | 52.16465578 | 1.130759939 |
| milk | coffee | 15.02713924 | 49.87768024 | 1.081185753 |
| pasta | tomato souce | 11.59743096 | 53.23512959 | 1.519506264 |
| milk | tomato souce | 11.59743096 | 51.27727018 | 1.111524307 |
| milk | beer | 10.92057176 | 45.42574257 | 0.984682236 |
| milk | coke | 10.70648531 | 47.30357504 | 1.025387531 |
| milk | tunny | 10.38643687 | 46.42931501 | 1.00643642 |
| milk | water and pasta | 9.551715935 | 55.9882273 | 1.213642523 |
| milk | juices | 8.25638475 | 53.27396543 | 1.154806161 |
| milk | biscuits and pasta | 7.763337154 | 57.8551532 | 1.2541114 |
| pasta | coffee and milk | 7.495188461 | 45.52798615 | 1.299518958 |
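The three measures in this table are linked by the standard definition lift = confidence / support(consequent), so the consequent's support can be recovered from any row as confidence / lift. The sketch below does this for four rows copied from the table; `rules` and `implied` are illustrative names.

```python
# (consequent, antecedent, support %, confidence %, lift) copied from the table.
rules = [
    ("milk", "pasta", 35.03449171, 45.85519412, 0.993991348),
    ("milk", "water", 27.85070173, 46.70393664, 1.012389323),
    ("pasta", "tomato souce", 11.59743096, 53.23512959, 1.519506264),
    ("milk", "tomato souce", 11.59743096, 51.27727018, 1.111524307),
]

# lift = confidence / support(consequent), so the consequent's support
# can be recovered from each row as confidence / lift.
implied = {}
for consequent, antecedent, support, confidence, lift in rules:
    implied.setdefault(consequent, []).append(confidence / lift)

for consequent, values in implied.items():
    print(consequent, [round(v, 2) for v in values])
```

Every milk row should give the same implied support for milk, which is a useful consistency check on the table. The value recovered for pasta can also be compared with the support listed on the pasta row. If they agree, that suggests the support column reports how often the antecedent occurs rather than how often both sides occur together. This is an inference from the numbers and is not stated in the source; the fact that the two tomato souce rows share one support value but have different confidences points the same way.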
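Ranked by support:

| Consequent | Antecedent | Support |
| --- | --- | --- |
| milk | pasta | 35.03% |
| milk | water | 27.85% |
| milk | biscuits | 20.47% |
| milk | brioches | 15.32% |
| milk | yoghurt | 15.23% |

Sorted by support, the rule pairing pasta with milk ranks highest, followed by the rule pairing water with milk. Most of the top five rules pair milk with a relatively dry food (pasta, biscuits, brioches), so joint promotions of these items with milk are worth considering.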
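Ranked by confidence:

| Consequent | Antecedent | Confidence |
| --- | --- | --- |
| milk | biscuits, pasta | 57.86% |
| milk | water, pasta | 55.99% |
| milk | juices | 53.27% |
| pasta | tomato souce | 53.24% |
| milk | yoghurt | 52.16% |

Sorted by confidence, customers who buy accompanying foods such as biscuits and pasta are the most likely to also buy milk, and customers who buy drinks such as water or juices show the same tendency. The combination of tomato souce and pasta is also popular: 53.24% of the baskets containing tomato souce also contain pasta.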
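Ranked by lift:

| Consequent | Antecedent | Lift |
| --- | --- | --- |
| pasta | tomato souce | 1.520 |
| pasta | coffee, milk | 1.300 |
| milk | biscuits, pasta | 1.254 |
| milk | water, pasta | 1.214 |
| milk | juices | 1.155 |

Sorted by lift, tomato souce and pasta show the strongest association, followed by pasta with coffee and milk. Milk also shows a positive association with both foods and drinks, with every rule in this ranking having a lift above 1.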
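The three rankings can be reproduced by sorting the full rule table on each measure. This is a minimal sketch; `rules` holds the fifteen rows of the association rule table, and `top5` is an illustrative helper rather than part of the lab tooling.

```python
# (consequent, antecedent, support %, confidence %, lift), one tuple per table row.
rules = [
    ("milk", "pasta", 35.03449171, 45.85519412, 0.993991348),
    ("milk", "water", 27.85070173, 46.70393664, 1.012389323),
    ("milk", "biscuits", 20.47445019, 51.53147444, 1.117034628),
    ("milk", "brioches", 15.31907532, 49.67532468, 1.076799343),
    ("milk", "yoghurt", 15.23473823, 52.16465578, 1.130759939),
    ("milk", "coffee", 15.02713924, 49.87768024, 1.081185753),
    ("pasta", "tomato souce", 11.59743096, 53.23512959, 1.519506264),
    ("milk", "tomato souce", 11.59743096, 51.27727018, 1.111524307),
    ("milk", "beer", 10.92057176, 45.42574257, 0.984682236),
    ("milk", "coke", 10.70648531, 47.30357504, 1.025387531),
    ("milk", "tunny", 10.38643687, 46.42931501, 1.00643642),
    ("milk", "water and pasta", 9.551715935, 55.9882273, 1.213642523),
    ("milk", "juices", 8.25638475, 53.27396543, 1.154806161),
    ("milk", "biscuits and pasta", 7.763337154, 57.8551532, 1.2541114),
    ("pasta", "coffee and milk", 7.495188461, 45.52798615, 1.299518958),
]


def top5(column):
    """Labels of the five rules with the largest value in the given column."""
    ranked = sorted(rules, key=lambda rule: rule[column], reverse=True)[:5]
    return [f"{antecedent} => {consequent}" for consequent, antecedent, *_ in ranked]


by_support = top5(2)
by_confidence = top5(3)
by_lift = top5(4)

for name, ranking in [("support", by_support),
                      ("confidence", by_confidence),
                      ("lift", by_lift)]:
    print(name, ranking)
```

Each label is written as antecedent => consequent, so a rule reads as "baskets containing the antecedent also tend to contain the consequent".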