HW3

Submission requirements:

Please submit your solutions to our class website.

Q1

Suppose that the data mining task is to cluster the following ten points (with (x, y, z) representing location) into three clusters:

A1(4,2,5), A2(10,5,2), A3(5,8,7), B1(1,1,1), B2(2,3,2), B3(3,6,9), C1(11,9,2), C2(1,4,6), C3(9,1,7), C4(5,6,7)

The distance function is Euclidean distance. Suppose initially we assign A2, B2, C2 as the centers of the three clusters, respectively. Use the K-Means algorithm to show only

(a) The three cluster centers after the first round of execution

(b) The final three clusters

Answer:

The K-Means implementation used for this question is as follows:

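import math

# Points in the order A1, A2, A3, B1, B2, B3, C1, C2, C3, C4.
# Note: B3 is entered as (3,6,8) here, while the problem statement lists B3(3,6,9).
Poi=[[4,2,5],[10,5,2],[5,8,7],
     [1,1,1],[2,3,2],[3,6,8],
     [11,9,2],[1,4,6],[9,1,7],[5,6,7]]

def k_Means(poi,k=3,epochs=100):

    # Initialization: per-cluster coordinate sums plus a trailing member count
    cluster=[[0]*(len(poi[0]))+[0] for i in range(k)]
    PoiClass=[-1 for i in range(len(poi))]  # current cluster index of each point
    Break=[-1 for i in range(len(poi))]     # assignment from the previous round

    # Initial centers: A2, B2, C2
    centroid=[poi[1],poi[4],poi[7]]

    # Euclidean distance between two points
    def edis(x,y):
        return math.sqrt(sum([(x[i]-y[i])**2 for i in range(len(x))]))

    for epoch in range(epochs):

        # Assign each point to its nearest center (ties go to the lower cluster index)
        for pId,p in enumerate(poi):
            dis,idx=float("inf"),0
            for Idx,c in enumerate(centroid):
                if (v:=edis(p,c))<dis:
                    dis,idx=v,Idx
            for i in range(len(p)):
                cluster[idx][i]+=p[i]
            cluster[idx][-1]+=1
            # Update the point-to-cluster mapping
            PoiClass[pId]=idx

        # Recompute each center as the mean of its members; the mean need not be an input point
        for i,v in enumerate(cluster):
            if v[-1]>0:  # keep the previous center if a cluster ends up empty
                centroid[i]=[v[j]/v[-1] for j in range(len(v)-1)]
            cluster[i]=[0]*len(poi[0])+[0]
        if epoch==0:
            print(centroid)
        # Stop once no assignment has changed since the previous round
        if PoiClass==Break:
            return PoiClass
        Break=PoiClass[:]
    return PoiClass
print(k_Means(Poi))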

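(a) The three centers printed at the end of the first round (rounded to two decimals):

[[10.0, 5.0, 3.67], [2.33, 2.0, 2.67], [3.5, 6.0, 7.0]]

These values come from the code's data, which enters B3 as (3,6,8). With B3(3,6,9) as listed in the problem statement, the third center becomes [3.5, 6.0, 7.25]; the other two centers and the final clusters are unchanged.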

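(b) With $A2,B2,C2$ as the initial centers, the assignment stops changing in the second round. The final clusters are {A2, C1, C3} (cluster 0), {A1, B1, B2} (cluster 1), and {A3, B3, C2, C4} (cluster 2):

Point	Cluster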
A1	1
A2	0
A3	2
B1	1
B2	1
B3	2
C1	0
C2	2
C3	0
C4	2
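As a cross-check, the assignment and update steps can be sketched independently of the code above. This is a minimal sketch, not the implementation used for the answer: the helper `one_round` and the dictionary layout are illustrative, and the points follow the problem statement (so B3 is (3,6,9)). Squared distances are compared so that the tie for A1 between B2 and C2 is resolved exactly, in favor of the lower cluster index.

```python
# Points as listed in the problem statement (B3 = (3, 6, 9)).
points = {
    "A1": (4, 2, 5), "A2": (10, 5, 2), "A3": (5, 8, 7),
    "B1": (1, 1, 1), "B2": (2, 3, 2), "B3": (3, 6, 9),
    "C1": (11, 9, 2), "C2": (1, 4, 6), "C3": (9, 1, 7), "C4": (5, 6, 7),
}

def one_round(points, centers):
    """Assign each point to its nearest center, then recompute the centers."""
    labels = {}
    for name, p in points.items():
        # Squared Euclidean distance gives the same ordering as the distance itself.
        d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels[name] = d2.index(min(d2))  # ties go to the lower index
    new_centers = []
    for k in range(len(centers)):
        members = [points[n] for n, lab in labels.items() if lab == k]
        # Assumes no cluster is empty, which holds for this data set.
        new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return labels, new_centers

start = [points["A2"], points["B2"], points["C2"]]
labels1, centers1 = one_round(points, start)
labels2, centers2 = one_round(points, centers1)

print([tuple(round(x, 2) for x in c) for c in centers1])
print(labels1 == labels2)  # True means the assignment is already stable
```

With B3 taken from the problem statement, the third first-round center has z = 7.25 rather than 7.0, and the second round reproduces the first-round assignment.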

Q2

	Product 1	Product 2	Product 3	Product 4
User 1	1	1	5	3
User 2	3	?	5	4
User 3	1	3	1	1
User 4	4	3	2	1
User 5	2	2	2	4

(a) List the top 3 most similar users of user 2 based on Cosine Similarity.

(b) Predict User 2’s rating for Product 2.

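Answer:

(a) Computing the cosine similarity over the three products that User 2 has rated (Products 1, 3, and 4) gives: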

$sim(U_1,U_2)=\frac{1*3+5*5+3*4}{\sqrt{1^2+5^2+3^2}\sqrt{3^2+5^2+4^2}}=0.956$

$sim(U_3,U_2)=\frac{1*3+1*5+1*4}{\sqrt{1^2+1^2+1^2}\sqrt{3^2+5^2+4^2}}=0.980$

$sim(U_4,U_2)=\frac{4*3+2*5+1*4}{\sqrt{4^2+2^2+1^2}\sqrt{3^2+5^2+4^2}}=0.802$

$sim(U_5,U_2)=\frac{2*3+2*5+4*4}{\sqrt{2^2+2^2+4^2}\sqrt{3^2+5^2+4^2}}=0.924$

The top 3 most similar users, in descending order of similarity, are:

$U_3,U_1,U_5$
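The similarities above can be reproduced with a short script. This is a sketch under one assumption taken from the worked formulas: the cosine is computed only over products rated by both users, so Product 2 is skipped. The function name `cosine_on_common` is illustrative.

```python
import math

# Ratings from the table; None marks User 2's missing rating for Product 2.
ratings = {
    "U1": [1, 1, 5, 3],
    "U2": [3, None, 5, 4],
    "U3": [1, 3, 1, 1],
    "U4": [4, 3, 2, 1],
    "U5": [2, 2, 2, 4],
}

def cosine_on_common(a, b):
    """Cosine similarity restricted to items rated by both users."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    dot = sum(x * y for x, y in pairs)
    norm_a = math.sqrt(sum(x * x for x, _ in pairs))
    norm_b = math.sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_a * norm_b)

sims = {u: cosine_on_common(r, ratings["U2"]) for u, r in ratings.items() if u != "U2"}
top3 = sorted(sims, key=sims.get, reverse=True)[:3]

print({u: round(s, 3) for u, s in sims.items()})
print(top3)  # ['U3', 'U1', 'U5']
```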

(b)

User 2's average rating over the products they have rated is:

$\bar r_2=\frac{3+5+4}{3}=4$

The average ratings of the three most similar users are:

$\bar r_1=\frac{1+1+5+3}{4}=2.5$

$\bar r_3=\frac{1+3+1+1}{4}=1.5$

$\bar r_5=\frac{2+2+2+4}{4}=2.5$

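User 2's predicted rating for Product 2, weighting each neighbor's deviation from their own mean ($r_{v,p2}-\bar r_v$) by similarity, is:

$r_{u2,p2}=\bar r_2+\frac{0.956*(1-2.5)+0.980*(3-1.5)+0.924*(2-2.5)}{0.956+0.980+0.924}\\ \ \\=4-0.149=3.851$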
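The prediction step can be checked numerically. This is a minimal sketch that assumes the usual mean-centered formula, in which each neighbor contributes their Product 2 rating minus their own mean, weighted by similarity. The similarities are the three-decimal roundings of the cosine values, and the function name `predict` is illustrative.

```python
# Neighbor data: (similarity to User 2, rating for Product 2,
# mean rating over all four products).
neighbors = {
    "U1": (0.956, 1, 2.5),
    "U3": (0.980, 3, 1.5),
    "U5": (0.924, 2, 2.5),
}
mean_u2 = (3 + 5 + 4) / 3  # User 2's mean over the products they rated

def predict(mean_target, neighbors):
    """Mean-centered, similarity-weighted rating prediction."""
    num = sum(s * (r - m) for s, r, m in neighbors.values())
    den = sum(abs(s) for s, _, _ in neighbors.values())
    return mean_target + num / den

print(round(predict(mean_u2, neighbors), 3))  # 3.851
```

Because U1 and U5 rate Product 2 below their own averages, their terms are negative, and the prediction lands slightly below User 2's mean of 4.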

Part II: Lab

Q1.

The confusion matrix of the decision tree is (rows: actual class, columns: predicted class):

	0	1
0	990	0
1	43	0

The confusion matrix of the neural network is:

	0	1
0	987	3
1	37	6

The confusion matrix of logistic regression is:

	0	1
0	988	2
1	36	7

For the decision tree, taking class 0 as the positive class, the evaluation metrics are:

$Precision_1=\frac{TP}{TP+FP}=\frac{990}{990+43}=0.958 \\ \ \\ Recall_1=\frac{TP}{TP+FN}=\frac{990}{990+0}=1$

For the neural network, taking class 0 as the positive class, the evaluation metrics are:

$Precision_2=\frac{TP}{TP+FP}=\frac{987}{987+37}=0.964 \\ \ \\ Recall_2=\frac{TP}{TP+FN}=\frac{987}{987+3}=0.997$

For logistic regression, taking class 0 as the positive class, the evaluation metrics are:

$Precision_3=\frac{TP}{TP+FP}=\frac{988}{988+36}=0.965 \\ \ \\ Recall_3=\frac{TP}{TP+FN}=\frac{988}{988+2}=0.998$

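In comparison, logistic regression classifies one more sample correctly than the neural network in each class (988 vs. 987 for class 0, 7 vs. 6 for class 1), so both its recall and its precision are higher. The decision tree predicts every sample as class 0, which gives it a recall of 1 but the lowest precision of the three (0.958), and it identifies none of the 43 class-1 samples.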
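The metrics can be recomputed directly from the three confusion matrices. This sketch assumes rows are actual classes and columns are predicted classes, which matches the observation that the decision tree predicts every sample as 0. The function name `precision_recall` and its `positive` parameter are illustrative.

```python
# Confusion matrices from the lab results: rows are actual classes (0, 1),
# columns are predicted classes (0, 1).
matrices = {
    "decision tree": [[990, 0], [43, 0]],
    "neural network": [[987, 3], [37, 6]],
    "logistic regression": [[988, 2], [36, 7]],
}

def precision_recall(cm, positive=0):
    """Precision and recall for the class treated as positive."""
    tp = cm[positive][positive]
    fp = sum(cm[r][positive] for r in range(len(cm)) if r != positive)
    fn = sum(cm[positive][c] for c in range(len(cm)) if c != positive)
    # Return NaN when a ratio is undefined (nothing predicted or nothing actual).
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

for name, cm in matrices.items():
    p, r = precision_recall(cm)
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
```

Calling `precision_recall(cm, positive=1)` gives the minority-class view, which is where the three models differ most.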

Q2

The association rules are:

Consequent	Antecedent	Support %	Confidence %	Lift
milk	pasta	35.03449171	45.85519412	0.993991348
milk	water	27.85070173	46.70393664	1.012389323
milk	biscuits	20.47445019	51.53147444	1.117034628
milk	brioches	15.31907532	49.67532468	1.076799343
milk	yoghurt	15.23473823	52.16465578	1.130759939
milk	coffee	15.02713924	49.87768024	1.081185753
pasta	tomato souce	11.59743096	53.23512959	1.519506264
milk	tomato souce	11.59743096	51.27727018	1.111524307
milk	beer	10.92057176	45.42574257	0.984682236
milk	coke	10.70648531	47.30357504	1.025387531
milk	tunny	10.38643687	46.42931501	1.00643642
milk	water and pasta	9.551715935	55.9882273	1.213642523
milk	juices	8.25638475	53.27396543	1.154806161
milk	biscuits and pasta	7.763337154	57.8551532	1.2541114
pasta	coffee and milk	7.495188461	45.52798615	1.299518958

Ranked by support:

Consequent	Antecedent	Support
milk	pasta	35.03%
milk	water	27.85%
milk	biscuits	20.47%
milk	brioches	15.32%
milk	yoghurt	15.23%

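Sorted by support, the rule with pasta as the antecedent ranks first (35.03%), followed by the rule with water (27.85%). Support in this table is measured on the antecedent, which is why both tomato souce rules share the same value. Three of the top five antecedents (pasta, biscuits, brioches) are relatively dry foods paired with milk, so these combinations are good candidates for joint promotions.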

Ranked by confidence:

Consequent	Antecedent	Confidence
milk	biscuits,pasta	57.86%
milk	water,pasta	55.99%
milk	juices	53.27%
pasta	tomato souce	53.24%
milk	yoghurt	52.16%

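Sorted by confidence, customers who buy biscuits and pasta, or water and pasta, are the most likely to also buy milk (57.86% and 55.99%). Buyers of juices and yoghurt also add milk more than half the time. The rule from tomato souce to pasta (53.24%) shows that the pasta and tomato sauce combination is popular.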

Ranked by lift:

Consequent	Antecedent	Lift
pasta	tomato souce	1.52
pasta	coffee,milk	1.30
milk	biscuits,pasta	1.25
milk	water,pasta	1.21
milk	juices	1.15

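Sorted by lift, tomato souce and pasta show the strongest association (1.52), followed by coffee and milk with pasta (1.30). Among the top five rules, milk shows a positive association (lift above 1) with both food and drink antecedents.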
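The three rankings can be reproduced from the rule table with a few lines of Python. This is a sketch over the values copied from the table; the helper `top` is illustrative. The last line relies on the standard identity that lift equals confidence divided by the consequent's overall frequency, which allows the implied share of transactions containing milk to be recovered from any milk rule.

```python
# (consequent, antecedent, support %, confidence %, lift) copied from the rule table.
rules = [
    ("milk", "pasta", 35.03449171, 45.85519412, 0.993991348),
    ("milk", "water", 27.85070173, 46.70393664, 1.012389323),
    ("milk", "biscuits", 20.47445019, 51.53147444, 1.117034628),
    ("milk", "brioches", 15.31907532, 49.67532468, 1.076799343),
    ("milk", "yoghurt", 15.23473823, 52.16465578, 1.130759939),
    ("milk", "coffee", 15.02713924, 49.87768024, 1.081185753),
    ("pasta", "tomato souce", 11.59743096, 53.23512959, 1.519506264),
    ("milk", "tomato souce", 11.59743096, 51.27727018, 1.111524307),
    ("milk", "beer", 10.92057176, 45.42574257, 0.984682236),
    ("milk", "coke", 10.70648531, 47.30357504, 1.025387531),
    ("milk", "tunny", 10.38643687, 46.42931501, 1.00643642),
    ("milk", "water and pasta", 9.551715935, 55.9882273, 1.213642523),
    ("milk", "juices", 8.25638475, 53.27396543, 1.154806161),
    ("milk", "biscuits and pasta", 7.763337154, 57.8551532, 1.2541114),
    ("pasta", "coffee and milk", 7.495188461, 45.52798615, 1.299518958),
]

def top(rules, column, n=5):
    """Return the n rules with the largest value in the given column index."""
    return sorted(rules, key=lambda r: r[column], reverse=True)[:n]

for label, col in (("support", 2), ("confidence", 3), ("lift", 4)):
    print(label, [(r[1], r[0], round(r[col], 2)) for r in top(rules, col)])

# lift = confidence / P(consequent), so P(milk) in percent is confidence / lift.
milk_share = rules[0][3] / rules[0][4]
print(round(milk_share, 2))
```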