HW2

Submission requirements:

Please submit your solutions to our class website.


Part I: written assignment

(figure: training data table for Q1)

a) Compute the Information Gain for Gender, Car Type and Shirt Size. (15 points)

b) Construct a decision tree with Information Gain. (10 points)


Answer A

Overall entropy

| C0 | C1 |
| --- | --- |
| 10 | 10 |

By the entropy formula

$$E(s)=-\sum_{i=1}^{n} p_i \log_2 p_i$$

we have:

$$E(s)=-\frac{10}{20}\log_2\frac{10}{20}-\frac{10}{20}\log_2\frac{10}{20}=1$$

Next we compute the information gain of each attribute.

Gender attribute:

| Gender | C0 | C1 |
| --- | --- | --- |
| F | 6 | 4 |
| M | 6 | 4 |

$$E(F)=-\frac{6}{10}\log_2\frac{6}{10}-\frac{4}{10}\log_2\frac{4}{10}=0.97095$$

$$E(M)=-\frac{6}{10}\log_2\frac{6}{10}-\frac{4}{10}\log_2\frac{4}{10}=0.97095$$

By the information gain formula

$$g(D,A)=E(D)-E(D|A)$$

we obtain:

$$Gain(Gender)=E(s)-\frac{10}{20}E(F)-\frac{10}{20}E(M)=0.029$$

Similarly, for the CarType attribute:

| CarType | C0 | C1 |
| --- | --- | --- |
| F | 1 | 3 |
| L | 1 | 7 |
| S | 8 | 0 |

$$E(F)=-\frac{1}{4}\log_2\frac{1}{4}-\frac{3}{4}\log_2\frac{3}{4}=0.81127$$

$$E(L)=-\frac{1}{8}\log_2\frac{1}{8}-\frac{7}{8}\log_2\frac{7}{8}=0.543564$$

$$E(S)=-1\cdot\log_2 1=0$$

The final result is

$$Gain(CarType)=E(s)-p(F)E(F)-p(L)E(L)-p(S)E(S)=1-\frac{4}{20}E(F)-\frac{8}{20}E(L)-\frac{8}{20}E(S)=0.6203$$

For the Shirt Size attribute:

| Shirt Size | C0 | C1 |
| --- | --- | --- |
| E | 2 | 2 |
| L | 2 | 2 |
| M | 3 | 4 |
| S | 3 | 2 |

$$E(E)=-\frac{2}{4}\log_2\frac{2}{4}-\frac{2}{4}\log_2\frac{2}{4}=1$$

$$E(L)=-\frac{2}{4}\log_2\frac{2}{4}-\frac{2}{4}\log_2\frac{2}{4}=1$$

$$E(M)=-\frac{3}{7}\log_2\frac{3}{7}-\frac{4}{7}\log_2\frac{4}{7}=0.985228$$

$$E(S)=-\frac{3}{5}\log_2\frac{3}{5}-\frac{2}{5}\log_2\frac{2}{5}=0.97095$$

The final result is:

$$Gain(ShirtSize)=E(s)-p(E)E(E)-p(L)E(L)-p(S)E(S)-p(M)E(M)=1-\frac{4}{20}E(E)-\frac{4}{20}E(L)-\frac{5}{20}E(S)-\frac{7}{20}E(M)=0.012432$$

| Attribute | Gain |
| --- | --- |
| Gender | 0.029 |
| CarType | 0.620 |
| ShirtSize | 0.012 |
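These hand computations can be verified with a short Python sketch (the helper names `entropy` and `info_gain` are mine, not part of the assignment):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent, partitions):
    """Information gain of splitting `parent` counts into `partitions`."""
    total = sum(parent)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent) - remainder

parent = [10, 10]  # (C0, C1) counts for the whole training set
gain_gender = info_gain(parent, [[6, 4], [6, 4]])                 # ~0.029
gain_cartype = info_gain(parent, [[1, 3], [1, 7], [8, 0]])        # ~0.620
gain_shirt = info_gain(parent, [[2, 2], [2, 2], [3, 4], [3, 2]])  # ~0.012
```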

Answer B

From these results, CarType yields the largest information gain with respect to the class attribute, so it is chosen as the root node of the decision tree.

CarType then has three branches, with entropies:

| Branch | Entropy |
| --- | --- |
| Family | 0.81 |
| Luxury | 0.54 |
| Sports | 0 |

We partition the data along these three branches. Note that the Sports branch is already pure, so it needs no further splitting.

The first split is therefore:

image-20221023165024486

For the second iteration, the data subset under the Family branch is:

image-20221023165335924

The subset under the Luxury branch is:

image-20221023165343968

Iterating the computation on the Family subset, its entropy is

$$E(f)=0.81127$$

The conditional entropy after splitting on Gender is

$$E(G)=-\frac{1}{4}\log_2\frac{1}{4}-\frac{3}{4}\log_2\frac{3}{4}=0.811278$$

and after splitting on ShirtSize:

$$E(S)=\frac{1}{4}[0+0+0+0]=0$$

| Attribute | Gain |
| --- | --- |
| Gender | 0 |
| ShirtSize | 0.81127 |

We therefore split on ShirtSize. Gender can provide no further information here, so the resulting children become leaf nodes directly.

image-20221023170534079

Now consider the subset under the Luxury branch:

| Gender | C0 | C1 |
| --- | --- | --- |
| F | 1 | 6 |
| M | 0 | 1 |

$$G(Gender)=E(luxury)-\frac{1}{8}\cdot 0-\frac{7}{8}\left(-\frac{1}{7}\log_2\frac{1}{7}-\frac{6}{7}\log_2\frac{6}{7}\right)=0.026$$

| Shirt Size | C0 | C1 |
| --- | --- | --- |
| E | 0 | 1 |
| L | 1 | 1 |
| M | 0 | 3 |
| S | 0 | 2 |

$$G(ShirtSize)=E(luxury)-\frac{2}{8}\cdot 1=0.294$$

So we split on ShirtSize.
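Recomputing these subset gains exactly (a small sketch; `entropy` and `info_gain` are my helper names) gives Gain(Gender) ≈ 0.026 and Gain(ShirtSize) ≈ 0.294, so ShirtSize is the better split either way:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent, partitions):
    """Information gain of splitting `parent` counts into `partitions`."""
    total = sum(parent)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent) - remainder

luxury = [1, 7]  # (C0, C1) counts within the Luxury branch
gain_gender_lux = info_gain(luxury, [[1, 6], [0, 1]])                 # ~0.026
gain_shirt_lux = info_gain(luxury, [[0, 1], [1, 1], [0, 3], [0, 2]])  # ~0.294
```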

image-20221023171616919

The next node down is:

image-20221023171725261

At this point, this node can no longer yield any information gain, so we consider pruning it. The final decision tree is:

image-20221023172055218

Q2.

a) Design a multilayer feed-forward neural network (one hidden layer) for the data set in Q1. Label the nodes in the input and output layers. (10 points)

(b) Using the neural network obtained above, show the weight values after one iteration of the back propagation algorithm, given the training instance “(M, Family, Small)". Indicate your initial weight values and biases and the learning rate used. (10 points)


a. The feed-forward neural network is shown below:

image-20221030013510049

b. Choose initial weights.

Hidden layer:

| w11 | w21 | w31 | w41 | w51 | w61 | w71 | w81 | w91 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.1 | 0.2 | 0.3 | -0.4 | -0.1 | -0.2 | -0.3 | 0.4 | 0.5 |

| w12 | w22 | w32 | w42 | w52 | w62 | w72 | w82 | w92 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.2 | 0.4 | 0.1 | -0.2 | -0.4 | 0.3 | 0.2 | 0.4 | -0.2 |

Output layer:

| w1c | w2c |
| --- | --- |
| 0.7 | 0.5 |

The input vector, encoding the instance (M, Family, Small), is

1, 0, 1, 0, 0, 1, 0, 0, 0

Learning rate:

$$lr=0.01$$

Biases:

| $\theta_{h1}$ | $\theta_{h2}$ | $\theta_{c}$ |
| --- | --- | --- |
| -0.1 | 0.2 | 0.1 |

For each node, the net input is

$$I=\theta+\sum_{i=1}^{n}w_i v_i$$

where $\theta$ is the bias, $w_i$ the weight on an incoming branch, and $v_i$ the value on that branch.

The output is

$$O=\frac{1}{1+e^{-I}}$$

Forward-pass results:

| Unit | Net input | Output |
| --- | --- | --- |
| H1 | $-0.1+0.1\cdot1+0.3\cdot1-0.2\cdot1=0.1$ | $\frac{1}{1+e^{-0.1}}=0.525$ |
| H2 | $0.2+0.2\cdot1+0.1\cdot1+0.3\cdot1=0.8$ | $\frac{1}{1+e^{-0.8}}=0.690$ |
| C | $0.1+0.7\cdot0.525+0.5\cdot0.690=0.8125$ | $\frac{1}{1+e^{-0.8125}}\approx0.692$ |
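The forward pass can be reproduced with a few lines of Python (variable names are mine; the numbers follow the weight tables above):

```python
from math import exp

def sigmoid(x):
    """Logistic activation used at every unit."""
    return 1.0 / (1.0 + exp(-x))

# one-hot input for the instance (M, Family, Small): x1 = x3 = x6 = 1
x = [1, 0, 1, 0, 0, 1, 0, 0, 0]
w_h1 = [0.1, 0.2, 0.3, -0.4, -0.1, -0.2, -0.3, 0.4, 0.5]  # weights into H1
w_h2 = [0.2, 0.4, 0.1, -0.2, -0.4, 0.3, 0.2, 0.4, -0.2]   # weights into H2
b_h1, b_h2, b_c = -0.1, 0.2, 0.1                          # biases
w1c, w2c = 0.7, 0.5                                       # weights into C

o_h1 = sigmoid(b_h1 + sum(w * v for w, v in zip(w_h1, x)))  # ~0.5250
o_h2 = sigmoid(b_h2 + sum(w * v for w, v in zip(w_h2, x)))  # ~0.6900
o_c = sigmoid(b_c + w1c * o_h1 + w2c * o_h2)                # ~0.6926
```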

The output-layer error is

$$Err=O(1-O)(T-O)$$

where $O$ is the node output and $T$ the target value.

The hidden-layer error is

$$Err=O(1-O)\sum_{i=1}^{n}Err_i w_i$$

where $Err_i$ is the error propagated back from the layer above.

Errors at each node:

| Unit | Error |
| --- | --- |
| C | $0.692(1-0.692)(1-0.692)=0.066$ |
| H1 | $0.525(1-0.525)\cdot0.066\cdot0.7=0.012$ |
| H2 | $0.69(1-0.69)\cdot0.066\cdot0.5=0.007$ |

The bias update rule is

$$\theta_{new}=\theta_{old}+lr\cdot Err$$

and the weight update rule is

$$w_{new}=w_{old}+lr\cdot Err\cdot O$$

Update the weights and biases. Only the weights and biases on active input paths (those used in the chain-rule computation) are listed:

| Parameter | Update |
| --- | --- |
| $w_{1c}$ | $0.7+0.01\cdot(0.066\cdot0.525)=0.7003465$ |
| $w_{2c}$ | $0.5+0.01\cdot(0.066\cdot0.69)=0.5004554$ |
| $w_{11}$ | $0.1+0.01\cdot0.012\cdot1=0.10012$ |
| $w_{12}$ | $0.2+0.01\cdot0.007\cdot1=0.20007$ |
| $w_{31}$ | $0.3+0.01\cdot0.012\cdot1=0.30012$ |
| $w_{32}$ | $0.1+0.01\cdot0.007\cdot1=0.10007$ |
| $w_{61}$ | $-0.2+0.01\cdot0.012\cdot1=-0.19988$ |
| $w_{62}$ | $0.3+0.01\cdot0.007\cdot1=0.30007$ |
| $\theta_{c}$ | $0.1+0.01\cdot0.066=0.10066$ |
| $\theta_{h1}$ | $-0.1+0.01\cdot0.012=-0.09988$ |
| $\theta_{h2}$ | $0.2+0.01\cdot0.007=0.20007$ |
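The full update step can be scripted as well. This sketch starts from the rounded forward-pass outputs above and assumes the target label is $T=1$ (as implied by the $(1-O)$ factor in the error for C), so the last digits may differ slightly from unrounded arithmetic:

```python
lr = 0.01
t = 1.0  # assumed target for instance (M, Family, Small)

# rounded forward-pass outputs from the table above
o_h1, o_h2, o_c = 0.525, 0.690, 0.692
x1 = x3 = x6 = 1  # the active (one-hot) inputs

# errors: output layer first, then hidden layer
err_c = o_c * (1 - o_c) * (t - o_c)       # ~0.066
err_h1 = o_h1 * (1 - o_h1) * err_c * 0.7  # ~0.012
err_h2 = o_h2 * (1 - o_h2) * err_c * 0.5  # ~0.007

# updates: w += lr * Err * O, theta += lr * Err
w1c = 0.7 + lr * err_c * o_h1  # ~0.70034
w2c = 0.5 + lr * err_c * o_h2  # ~0.50045
w11 = 0.1 + lr * err_h1 * x1   # ~0.10011
w32 = 0.1 + lr * err_h2 * x3   # ~0.10007
theta_c = 0.1 + lr * err_c     # ~0.10066
```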

Q3

a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student? (5 points)

b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student? (5 points)

c) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke. (5 points)

image-20221030021838522
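The three parts can also be checked numerically. This is a minimal sketch using only the probabilities stated in the problem (variable names are mine):

```python
# priors and conditionals given in the problem
p_grad, p_ugrad = 0.2, 0.8
p_smoke_grad, p_smoke_ugrad = 0.23, 0.15
p_dorm_grad, p_dorm_ugrad = 0.30, 0.10

# (a) Bayes' rule: P(grad | smoke)
p_smoke = p_grad * p_smoke_grad + p_ugrad * p_smoke_ugrad  # 0.166
p_grad_given_smoke = p_grad * p_smoke_grad / p_smoke       # ~0.277

# (b) with no other evidence the prior decides: P(ugrad) = 0.8 > 0.2

# (c) assuming dorm and smoking are conditionally independent,
#     compare the unnormalized posteriors
score_grad = p_grad * p_smoke_grad * p_dorm_grad      # 0.0138
score_ugrad = p_ugrad * p_smoke_ugrad * p_dorm_ugrad  # 0.0120
# score_grad > score_ugrad -> more likely a graduate student
```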

Part II: Lab

Question 1

Assume this supermarket would like to promote milk. Use the data in “transactions” as training data to build a decision tree (C5.0 algorithm) model to predict whether the customer would buy milk or not.

  1. Build a decision tree using data set “transactions” that predicts milk as a function of the other fields. Set the “type” of each field to “Flag”, set the “direction” of “milk” as “out”, set the “type” of COD as “Typeless”, select “Expert” and set the “pruning severity” to 65, and set the “minimum records per child branch” to be 95. Hand-in: A figure showing your tree. (5 points)

  2. Use the model (the full tree generated by Clementine in step 1 above) to make a prediction for each of the 20 customers in the “rollout” data to determine whether the customer would buy milk. **Hand-in:** your prediction for each of the 20 customers. (5 points)

  3. Hand-in: rules for positive (yes) prediction of milk purchase identified from the decision tree (up to the fifth level; the root is considered as level 1). (5 points)


1. The decision tree output is as follows:

image-20221029211843694

2. The full predictions are in the attached table1.csv:

```text
pasta water biscuits coffee brioches yoghurt frozen vegetables tunny beer tomato souce coke rice juices crackers oil frozen fish ice cream mozzarella tinned meat $C-milk $CC-milk
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0.633
0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0.56
1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 1 0 0 1 0.613
1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0.566
1 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0.633
0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0.673
0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.626
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.65
1 0 1 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0 1 0.613
1 0 0 1 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0.521
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.568
1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0.548
1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0.53
1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0.629
1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0.53
1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0.633
0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0.55
1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0.633
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.633
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0.62
```

3. The output rule set is as follows:

image-20221029211817778

Q2

1. The confusion matrix of decision tree a (56):

image-20221030004711003

The confusion matrix of decision tree b (15):

image-20221030004511700

The confusion matrix of decision tree c (10):

image-20221030004759601

2. I will use decision tree b as the final prediction model; it achieves higher accuracy than trees a and c and therefore performs better.


3. The table contents are in the attached table2.csv:

| age | sex | region | income | married | children | car | save_act | current_act | mortgage | RECOMMEND PEP (Y/N) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 22 | 0 | 1 | 14000 | 0 | 3 | 0 | 1 | 1 | 0 | 0 |
| 34 | 1 | 0 | 33000 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 47 | 0 | 0 | 16700 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 54 | 1 | 1 | 43400 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 65 | 1 | 2 | 60000 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| 37 | 0 | 0 | 27700 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 44 | 0 | 0 | 38784 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 20 | 1 | 0 | 10200 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 46 | 0 | 0 | 22000 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 40 | 1 | 1 | 37400 | 1 | 2 | 0 | 1 | 1 | 0 | 1 |