HW3

Submission requirements:

Please submit your solutions to our class website.


Q1

Suppose that the data mining task is to cluster the following ten points (with (x, y, z) representing location) into three clusters:

A1(4,2,5), A2(10,5,2), A3(5,8,7), B1(1,1,1), B2(2,3,2), B3(3,6,9), C1(11,9,2), C2(1,4,6), C3(9,1,7), C4(5,6,7)

The distance function is Euclidean distance. Suppose initially we assign A2, B2, C2 as the center of each cluster, respectively. Use the K-Means algorithm to show only

(a) The three cluster centers after the first round of execution

(b) The final three clusters

Answer:

The K-Means implementation used for this question is as follows:

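import math

# Input points in the order A1, A2, A3, B1, B2, B3, C1, C2, C3, C4.
# Note: B3 is entered as (3, 6, 8), whereas the problem statement lists B3(3, 6, 9).
Poi = [[4, 2, 5], [10, 5, 2], [5, 8, 7],
       [1, 1, 1], [2, 3, 2], [3, 6, 8],
       [11, 9, 2], [1, 4, 6], [9, 1, 7], [5, 6, 7]]


def k_Means(poi, k=3, epochs=100):
    dim = len(poi[0])
    # Initialization: per-cluster coordinate sums, with the member count in the last slot
    cluster = [[0] * dim + [0] for _ in range(k)]
    PoiClass = [-1] * len(poi)  # current cluster index of each point
    Break = [-1] * len(poi)     # assignment from the previous round

    # Initial centers: A2, B2, C2
    centroid = [poi[1], poi[4], poi[7]]

    # Euclidean distance between two points
    def edis(x, y):
        return math.sqrt(sum((x[i] - y[i]) ** 2 for i in range(len(x))))

    for epoch in range(epochs):
        # Assign each point to its nearest center (ties go to the lower index)
        for pId, p in enumerate(poi):
            dis, idx = math.inf, 0
            for Idx, c in enumerate(centroid):
                if (v := edis(p, c)) < dis:
                    dis, idx = v, Idx
            for i in range(len(p)):
                cluster[idx][i] += p[i]
            cluster[idx][-1] += 1
            # Update the point-to-cluster mapping
            PoiClass[pId] = idx

        # Recompute each center as the mean of its members;
        # the mean is not necessarily an existing point
        for i, v in enumerate(cluster):
            if v[-1] > 0:  # keep the old center if a cluster ends up empty
                centroid[i] = [v[j] / v[-1] for j in range(dim)]
            cluster[i] = [0] * dim + [0]
        if epoch == 0:
            print(centroid)  # centers after the first round
        # Stop once no assignment has changed since the previous round
        if PoiClass == Break:
            return PoiClass
        Break = PoiClass[:]
    return PoiClass  # not converged within `epochs` rounds


print(k_Means(Poi))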

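a. The three centers printed at the end of the first round (rounded to two decimals):

[[10.0, 5.0, 3.67], [2.33, 2.0, 2.67], [3.5, 6.0, 7.0]]

These values use B3 = (3, 6, 8) as entered in the code. With B3 = (3, 6, 9) as given in the problem statement, the third center becomes [3.5, 6.0, 7.25]; the other two centers are unchanged.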

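b. With A2, B2, C2 as the initial centers, the algorithm converges to the following assignment (point, cluster index):

A1 1
A2 0
A3 2
B1 1
B2 1
B3 2
C1 0
C2 2
C3 0
C4 2

That is, the final three clusters are {A2, C1, C3}, {A1, B1, B2}, and {A3, B3, C2, C4}.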
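One detail of the first round is worth checking by hand: A1 is equidistant from the initial centers B2 and C2, so the cluster it lands in depends on how ties are broken. The sketch below compares squared Euclidean distances, which preserve the ordering of the true distances while keeping the arithmetic in exact integers. The helper name `sq_dist` is introduced here for illustration and is not part of the solution code.

```python
def sq_dist(p, q):
    """Squared Euclidean distance; its ordering matches the true distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

A1 = (4, 2, 5)
centers = {"A2": (10, 5, 2), "B2": (2, 3, 2), "C2": (1, 4, 6)}

d = {name: sq_dist(A1, c) for name, c in centers.items()}
print(d)  # {'A2': 54, 'B2': 14, 'C2': 14}

# min() returns the first minimal key in insertion order,
# which mirrors a strict "<" comparison over the centers in order.
nearest = min(d, key=d.get)
print(nearest)  # B2
```

Because the solution compares with a strict `<`, A1 joins the B2 cluster (index 1). A `<=` comparison would send it to the C2 cluster instead and change the first-round centers.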

Q2

         Product 1   Product 2   Product 3   Product 4
User 1       1           1           5           3
User 2       3           ?           5           4
User 3       1           3           1           1
User 4       4           3           2           1
User 5       2           2           2           4

(a) List the top 3 most similar users of user 2 based on Cosine Similarity.

(b) Predict User 2’s rating for Product 2.

Answer:

(a) Cosine similarity is computed over the products that User 2 has rated (Products 1, 3, and 4):

$$sim(U_1,U_2)=\frac{1\cdot 3+5\cdot 5+3\cdot 4}{\sqrt{1^2+5^2+3^2}\,\sqrt{3^2+5^2+4^2}}=0.956$$

$$sim(U_3,U_2)=\frac{1\cdot 3+1\cdot 5+1\cdot 4}{\sqrt{1^2+1^2+1^2}\,\sqrt{3^2+5^2+4^2}}=0.980$$

$$sim(U_4,U_2)=\frac{4\cdot 3+2\cdot 5+1\cdot 4}{\sqrt{4^2+2^2+1^2}\,\sqrt{3^2+5^2+4^2}}=0.802$$

$$sim(U_5,U_2)=\frac{2\cdot 3+2\cdot 5+4\cdot 4}{\sqrt{2^2+2^2+4^2}\,\sqrt{3^2+5^2+4^2}}=0.924$$

The top 3 users most similar to User 2 are therefore:

$$U_3,\ U_1,\ U_5$$
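The similarity ranking can be reproduced with a short script. This is a minimal sketch that assumes, as in the calculation above, that only the products rated by both users enter the dot product and the norms. The names `ratings` and `cosine_on_corated` are chosen here for illustration.

```python
import math

# Ratings on Products 1-4; None marks User 2's missing rating for Product 2.
ratings = {
    "U1": [1, 1, 5, 3],
    "U2": [3, None, 5, 4],
    "U3": [1, 3, 1, 1],
    "U4": [4, 3, 2, 1],
    "U5": [2, 2, 2, 4],
}

def cosine_on_corated(a, b):
    """Cosine similarity restricted to items rated by both users."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b)

sims = {u: cosine_on_corated(r, ratings["U2"])
        for u, r in ratings.items() if u != "U2"}
top3 = sorted(sims, key=sims.get, reverse=True)[:3]

print({u: round(s, 3) for u, s in sims.items()})
print(top3)  # ['U3', 'U1', 'U5']
```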

(b)

The mean rating of User 2 is:

$$\bar r_2=\frac{3+5+4}{3}=4$$

The mean ratings of the three most similar users are:

$$\bar r_1=\frac{1+1+5+3}{4}=2.5$$

$$\bar r_3=\frac{1+3+1+1}{4}=1.5$$

$$\bar r_5=\frac{2+2+2+4}{4}=2.5$$

Each neighbor's rating of Product 2 (1, 3, and 2 for Users 1, 3, and 5) enters as a deviation from that neighbor's own mean, weighted by its similarity to User 2. The predicted rating of User 2 for Product 2 is:

$$r_{u_2,p_2}=\bar r_2+\frac{0.956(1-2.5)+0.980(3-1.5)+0.924(2-2.5)}{0.956+0.980+0.924}=4+\frac{-0.426}{2.860}=4-0.149=3.851$$
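As a cross-check of the mean-centered prediction, here is a minimal sketch in which each neighbor contributes its Product 2 rating minus its own mean rating, weighted by its cosine similarity to User 2. The similarities are written out as exact expressions rather than rounded values, and the variable names are illustrative.

```python
import math

user2_mean = (3 + 5 + 4) / 3  # mean over the three products User 2 has rated

# (similarity to User 2, rating of Product 2, all four ratings) for the top-3 neighbors
neighbors = {
    "U1": (40 / math.sqrt(35 * 50), 1, [1, 1, 5, 3]),
    "U3": (12 / math.sqrt(3 * 50), 3, [1, 3, 1, 1]),
    "U5": (32 / math.sqrt(24 * 50), 2, [2, 2, 2, 4]),
}

num = 0.0
den = 0.0
for sim, p2_rating, all_ratings in neighbors.values():
    mean = sum(all_ratings) / len(all_ratings)
    num += sim * (p2_rating - mean)  # deviation from the neighbor's own mean
    den += sim

prediction = user2_mean + num / den
print(round(prediction, 2))
```

A quick sanity check on any such prediction is that it should fall inside the rating scale used in the table (1 to 5).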


Part II: Lab

Q1.

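Confusion matrix of the decision tree (rows: actual class, columns: predicted class):

      0     1
0   990     0
1    43     0

Confusion matrix of the neural network:

      0     1
0   987     3
1    37     6

Confusion matrix of logistic regression:

      0     1
0   988     2
1    36     7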

Taking class 0 as the positive class, the metrics for the decision tree are:

$$Precision_1=\frac{TP}{TP+FP}=\frac{990}{990+43}=0.958$$

$$Recall_1=\frac{TP}{TP+FN}=\frac{990}{990+0}=1$$

For the neural network:

$$Precision_2=\frac{TP}{TP+FP}=\frac{987}{987+37}=0.964$$

$$Recall_2=\frac{TP}{TP+FN}=\frac{987}{987+3}=0.997$$

For logistic regression:

$$Precision_3=\frac{TP}{TP+FP}=\frac{988}{988+36}=0.965$$

$$Recall_3=\frac{TP}{TP+FN}=\frac{988}{988+2}=0.998$$

By comparison, logistic regression gets one more example right in each class than the neural network (988 vs. 987 for class 0, 7 vs. 6 for class 1), so both its precision and its recall are slightly higher. The decision tree predicts every example as 0, so its recall reaches 1, but its precision is slightly lower than that of the other two models and it never identifies a class-1 example.
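The metrics above can be recomputed directly from the three confusion matrices. The sketch below assumes rows are actual classes and columns are predicted classes, with class 0 as the positive class by default. The function name `precision_recall` and its `positive` parameter are introduced here for illustration.

```python
# Each matrix is [[actual 0 / pred 0, actual 0 / pred 1],
#                 [actual 1 / pred 0, actual 1 / pred 1]]
matrices = {
    "decision tree": [[990, 0], [43, 0]],
    "neural network": [[987, 3], [37, 6]],
    "logistic regression": [[988, 2], [36, 7]],
}

def precision_recall(cm, positive=0):
    """Precision and recall for the chosen positive class."""
    negative = 1 - positive
    tp = cm[positive][positive]
    fp = cm[negative][positive]  # actual negative, predicted positive
    fn = cm[positive][negative]  # actual positive, predicted negative
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

for name, cm in matrices.items():
    p, r = precision_recall(cm)
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
```

Calling `precision_recall(cm, positive=1)` gives the view for the minority class, where the decision tree's recall is 0 because it never predicts class 1.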


Q2

The association rule table is:

Consequent   Antecedent            Support (%)    Confidence (%)   Lift
milk         pasta                 35.03449171    45.85519412      0.993991348
milk         water                 27.85070173    46.70393664      1.012389323
milk         biscuits              20.47445019    51.53147444      1.117034628
milk         brioches              15.31907532    49.67532468      1.076799343
milk         yoghurt               15.23473823    52.16465578      1.130759939
milk         coffee                15.02713924    49.87768024      1.081185753
pasta        tomato souce          11.59743096    53.23512959      1.519506264
milk         tomato souce          11.59743096    51.27727018      1.111524307
milk         beer                  10.92057176    45.42574257      0.984682236
milk         coke                  10.70648531    47.30357504      1.025387531
milk         tunny                 10.38643687    46.42931501      1.00643642
milk         water and pasta       9.551715935    55.9882273       1.213642523
milk         juices                8.25638475     53.27396543      1.154806161
milk         biscuits and pasta    7.763337154    57.8551532       1.2541114
pasta        coffee and milk       7.495188461    45.52798615      1.299518958

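Ranked by support:

Consequent   Antecedent   Support
milk         pasta        35.03%
milk         water        27.85%
milk         biscuits     20.47%
milk         brioches     15.32%
milk         yoghurt      15.23%

Ranked by support, the rule pairing milk with pasta comes first (35.03%), followed by the rule pairing milk with water (27.85%). Three of the top five rules (pasta, biscuits, brioches) pair milk with relatively dry foods, so these combinations are reasonable candidates for joint promotions.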

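Ranked by confidence:

Consequent   Antecedent        Confidence
milk         biscuits, pasta   57.86%
milk         water, pasta      55.99%
milk         juices            53.27%
pasta        tomato souce      53.24%
milk         yoghurt           52.16%

Ranked by confidence, customers who buy foods such as biscuits and pasta are the most likely to also buy milk, and customers who buy drinks such as water or juices often buy milk as well. The combination of tomato sauce and pasta is also popular: more than half of the baskets containing tomato sauce also contain pasta.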

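Ranked by lift:

Consequent   Antecedent        Lift
pasta        tomato souce      1.520
pasta        coffee, milk      1.300
milk         biscuits, pasta   1.254
milk         water, pasta      1.214
milk         juices            1.155

Ranked by lift, pasta and tomato sauce show the strongest association (1.520), followed by pasta with coffee and milk (1.300). Milk also shows a positive association (lift above 1) with both foods and drinks.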
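The three rankings can be reproduced by sorting the rule table on each metric. The sketch below copies the rows from the table above. The helper `top` and the column indices are illustrative. It also uses the standard relation lift = confidence / P(consequent) to recover each consequent's overall frequency.

```python
# (consequent, antecedent, support %, confidence %, lift)
rules = [
    ("milk", "pasta", 35.03449171, 45.85519412, 0.993991348),
    ("milk", "water", 27.85070173, 46.70393664, 1.012389323),
    ("milk", "biscuits", 20.47445019, 51.53147444, 1.117034628),
    ("milk", "brioches", 15.31907532, 49.67532468, 1.076799343),
    ("milk", "yoghurt", 15.23473823, 52.16465578, 1.130759939),
    ("milk", "coffee", 15.02713924, 49.87768024, 1.081185753),
    ("pasta", "tomato souce", 11.59743096, 53.23512959, 1.519506264),
    ("milk", "tomato souce", 11.59743096, 51.27727018, 1.111524307),
    ("milk", "beer", 10.92057176, 45.42574257, 0.984682236),
    ("milk", "coke", 10.70648531, 47.30357504, 1.025387531),
    ("milk", "tunny", 10.38643687, 46.42931501, 1.00643642),
    ("milk", "water and pasta", 9.551715935, 55.9882273, 1.213642523),
    ("milk", "juices", 8.25638475, 53.27396543, 1.154806161),
    ("milk", "biscuits and pasta", 7.763337154, 57.8551532, 1.2541114),
    ("pasta", "coffee and milk", 7.495188461, 45.52798615, 1.299518958),
]

def top(rows, column, n=5):
    """Return the n rules with the largest value in the given column."""
    return sorted(rows, key=lambda r: r[column], reverse=True)[:n]

for label, col in [("support", 2), ("confidence", 3), ("lift", 4)]:
    print(label, [(r[1], "->", r[0], round(r[col], 3)) for r in top(rules, col)])

# lift = confidence / P(consequent), so confidence / lift gives P(consequent) in %
pasta_rule = rules[6]  # tomato souce -> pasta
pasta_freq = pasta_rule[3] / pasta_rule[4]
print(round(pasta_freq, 2))
```

The recovered frequency of pasta matches the support listed for the pasta -> milk rule, which suggests that the support column reports how often the antecedent occurs rather than how often both items are bought together. Under that reading, the joint frequency of a rule would be support × confidence.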