HW1

Submission requirements:

Please submit your solutions to our class website.

Q1.Suppose that a data warehouse consists of four dimensions, date, spectator, location, and game, and two measures, count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.

(a) Draw a star schema diagram for the data warehouse.

(b) Starting with the base cuboid [date, spectator, location, game]，what specific OLAP operations should one perform in order to list the total charge paid by student spectators in Los Angeles?

step 1. Roll-up on date from date_key to all
step 2. Roll-up on spectator from spectator_key to status
step 3. Roll-up on location from location_key to location_name
step 4. Roll-up on game from game_key to all

step 5. Dice with "status=student" and "location_name=Los Angeles"

优点

位图索引是一种高效的索引结构，在查询、过滤等方面上，由于进行的是位运算，所以比常规的查询方式快很多。例如在本仓库中，假设对于spectator表的子列status，我们有：

spectator_key	status	gender
0	学生	男
1	成人	女
2	学生	男
3	学生	女
4	老人	女

status就可以建立以下位图索引：

1
2
3

status="学生" : 10110
status="成人" : 01000
status="老人" : 00001

gender可以建立以下位图索引：

1 2	gender="男": 10100 gender="女": 01011

例如，我们想要查询学生，只需要用10110去过滤原始数据就行。

我们想混合查询，比如同时查询status="学生"和gender="男"的数据，只需要进行并操作就行了：

1	10110 & 10100 = 10100

可以大大提高计算速度。

此外，位图索引可以在一定程度上绕开原始数据，进一步提高处理速度。例如，我们想统计满足上面条件的人数，只需要:

ans=0
x=(10110&10100)
while x:
	x&=(x-1)
	ans+=1

缺点

位图索引比较适合枚举类型，也就是离散型变量，对于连续变量，位图索引并不适用，往往需要先做离散化。比如本仓库中，phone number字段可能就不太适合(也许这个字段没有存在的必要？)

而当属性列非常多时，我们做位图索引的开销也比较大。

Q2．某电子邮件数据库中存储了大量的电子邮件。请设计数据仓库的结构，以便用户从多个维度进行查询和挖掘。

Q3. Suppose a hospital tested the age and body fat data for 18 random selected adults with the following result:

age	23	23	27	27	39	41	47	49	50	52	54	54	56	57	58	58	60	61
%fat	9.5	26.5	7.8	17.8	31.4	25.9	27.4	27.2	31.2	34.6	42.5	28.8	33.4	30.2	34.1	32.9	41.2	35.7

(a) Calculate the mean, median, and standard deviation of age and %fat.

             age       %fat
mean   46.444444  28.783333
std    13.218624   9.254395
median      51.0       30.7

(b) Draw the boxplots for age and %fat.

(d) Normalize age based on min-max normalization.

x=data["age"]
y=data['%fat']
X=(x-x.min())/(x.max()-x.min())
Y=(y-y.min())/(y.max()-y.min())
print(X,Y)

Result is:

0     0.000000
1     0.000000
2     0.105263
3     0.105263
4     0.421053
5     0.473684
6     0.631579
7     0.684211
8     0.710526
9     0.763158
10    0.815789
11    0.815789
12    0.868421
13    0.894737
14    0.921053
15    0.921053
16    0.973684
17    1.000000

(e) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two variables positively or negatively correlated?

1 2	print(np.corrcoef(x,y)) print("相关系数" ,stats.pearsonr(x,y)[0])

Result is

1
2
3

[[1.        0.8176188]
 [0.8176188 1.       ]]
相关系数 0.8176187964565874

I think they are positively correlated.

(f) Smooth the fat data by bin means, using a bin depth of 6.

def mean(x):
    return round(sum(x)/len(x),2)

N_y=sorted(y)
bins=[[]]
for j in N_y:
    bins[-1].append(j)
    if len((v:=bins[-1]))==6:
        v[:]=[mean(v)]*len(v)
        bins.append([])
for i,j  in enumerate(bins[:-1]):
    print("bin %d is :"%(i+1),j)

1
2
3

bin 1 is : [19.12, 19.12, 19.12, 19.12, 19.12, 19.12]
bin 2 is : [30.32, 30.32, 30.32, 30.32, 30.32, 30.32]
bin 3 is : [36.92, 36.92, 36.92, 36.92, 36.92, 36.92]

(g) Smooth the fat data by bin boundaries, using a bin depth of 6.

这里因为我们是对排好序的数据做处理，所以可以通过二分法进行优化，获取中间分界。

def close(x,a,b):
    # 是否靠近下界
    return (x-a)<=(b-x)

def boundary(x):
    Min=x[0]
    Max=x[-1]

    l,r=0,len(x)-1
    while l<=r:
        mid=(r-l)//2+l
        if close(x[mid],Min,Max):
            if not close(x[mid+1],Min,Max):
                l=mid
                break
            l=mid+1
        else:
            if close(x[mid-1],Min,Max):
                l=mid
                break
            r=mid-1
    return [[Min]*l+[Max]*(len(x)-l)]

N_y=sorted(y)
bins=[[]]
for j in N_y:
    bins[-1].append(j)
    if len((v:=bins[-1]))==6:
        v[:]=boundary(v)
        bins.append([])
for i,j  in enumerate(bins[:-1]):
    print("bin %d is :"%(i+1),j)

1
2
3

bin 1 is : [[7.8, 7.8, 27.2, 27.2, 27.2, 27.2]]
bin 2 is : [[27.4, 27.4, 32.9, 32.9, 32.9, 32.9]]
bin 3 is : [[33.4, 33.4, 33.4, 33.4, 42.5, 42.5]]