【数据分析】HW1
HW1
Submission requirements:
Please submit your solutions to our class website.
Q1.Suppose that a data warehouse consists of four dimensions, date, spectator, location, and game, and two measures, count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location, game],what specific OLAP operations should one perform in order to list the total charge paid by student spectators in Los Angeles?
1 | step 1. Roll-up on date from date_key to all |
© Bitmap indexing is a very useful optimization technique. Please present the pros and cons of using bitmap indexing in this given data warehouse.
优点
位图索引是一种高效的索引结构,在查询、过滤等方面上,由于进行的是位运算,所以比常规的查询方式快很多。例如在本仓库中,假设对于spectator表的子列status,我们有:
| spectator_key | status | gender |
|---|---|---|
| 0 | 学生 | 男 |
| 1 | 成人 | 女 |
| 2 | 学生 | 男 |
| 3 | 学生 | 女 |
| 4 | 老人 | 女 |
status就可以建立以下位图索引:
1 | status="学生" : 10110 |
gender可以建立以下位图索引:
1 | gender="男": 10100 |
例如,我们想要查询学生,只需要用10110去过滤原始数据就行。
我们想混合查询,比如同时查询status="学生"和gender="男"的数据,只需要进行并操作就行了:
1 | 10110 & 10100 = 10100 |
可以大大提高计算速度。
此外,位图索引可以在一定程度上绕开原始数据,进一步提高处理速度。例如,我们想统计满足上面条件的人数,只需要:
1 | ans=0 |
缺点
位图索引比较适合枚举类型,也就是离散型变量,对于连续变量,位图索引并不适用,往往需要先做离散化。比如本仓库中,phone number字段可能就不太适合(也许这个字段没有存在的必要?)
而当属性列非常多时,我们做位图索引的开销也比较大。
Q2.某电子邮件数据库中存储了大量的电子邮件。请设计数据仓库的结构,以便用户从多个维度进行查询和挖掘。
Q3. Suppose a hospital tested the age and body fat data for 18 random selected adults with the following result:
| age | 23 | 23 | 27 | 27 | 39 | 41 | 47 | 49 | 50 | 52 | 54 | 54 | 56 | 57 | 58 | 58 | 60 | 61 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| %fat | 9.5 | 26.5 | 7.8 | 17.8 | 31.4 | 25.9 | 27.4 | 27.2 | 31.2 | 34.6 | 42.5 | 28.8 | 33.4 | 30.2 | 34.1 | 32.9 | 41.2 | 35.7 |
(a) Calculate the mean, median, and standard deviation of age and %fat.
1 | age %fat |
(b) Draw the boxplots for age and %fat.
© Draw a scatter plot based on these two variables.
(d) Normalize age based on min-max normalization.
1 | x=data["age"] |
Result is:
1 | 0 0.000000 |
(e) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two variables positively or negatively correlated?
1 | print(np.corrcoef(x,y)) |
Result is
1 | [[1. 0.8176188] |
I think they are positively correlated.
(f) Smooth the fat data by bin means, using a bin depth of 6.
1 | def mean(x): |
1 | bin 1 is : [19.12, 19.12, 19.12, 19.12, 19.12, 19.12] |
(g) Smooth the fat data by bin boundaries, using a bin depth of 6.
这里因为我们是对排好序的数据做处理,所以可以通过二分法进行优化,获取中间分界。
1 | def close(x,a,b): |
1 | bin 1 is : [[7.8, 7.8, 27.2, 27.2, 27.2, 27.2]] |


