I. General

1. Concept

DM / Dimensional Modeling

The process and outcome of designing logical database schemas created to support OLAP and data warehousing solutions.

Dimensional data structure

The target of the ETL process. It includes fact tables, dimension tables, and surrogate key mapping tables.

Dimension

Descriptive attributes used to constrain and label queries, e.g. CCY (currency), region, customer, date, gender.

A dimension table holds the data that describes the facts. Dimension tables are denormalized, flat tables whose data seldom changes.

Fact

Business measures.

Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.

Metadata

All the information in the data warehouse that is not the actual data itself.

Grain / granularity / hierarchy

Fine grain: e.g. the number of deposit and withdrawal records. Coarse grain: e.g. assets and liabilities.


2. Flow

Identify the reporting grain;

Identify the dimensions that apply to each fact table;

Identify the measures that will populate the fact tables.

 

II. Dimension

1. Models

star dimension model

1 fact to many dimensions

snowflake dimension model

1 fact to many dimensions/Bridge Tables, 1 dimension/bridge to many subdimensions. (fact-dim,dim-subdim)

parent-child

Field 1 has a one-to-many relationship to field 2 within the same dimension table (seldom seen).
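To make the star model concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (dim_region, dim_ccy, fact_sales), the columns, and the sample rows are all hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_region (region_key INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE dim_ccy    (ccy_key    INTEGER PRIMARY KEY, ccy_code    TEXT);
CREATE TABLE fact_sales (
    region_key INTEGER REFERENCES dim_region(region_key),
    ccy_key    INTEGER REFERENCES dim_ccy(ccy_key),
    amount     REAL
);
INSERT INTO dim_region VALUES (1, 'East'), (2, 'West');
INSERT INTO dim_ccy    VALUES (1, 'USD'), (2, 'EUR');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 70.0), (1, 1, 30.0);
""")


def sales_by_region(connection):
    """Sum the measure, labeled and grouped by a dimension attribute."""
    return connection.execute("""
        SELECT r.region_name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_region AS r ON r.region_key = f.region_key
        GROUP BY r.region_name
        ORDER BY r.region_name
    """).fetchall()


print(sales_by_region(conn))
```

The fact table carries only foreign keys and the measure, while the descriptive label comes from the dimension table through a single join. A snowflake model would add a further join from dim_region to a sub-dimension.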

2. Points

1.      In contrast to a fact table, dimension tables are usually small and change relatively slowly (because dimensions establish one-to-many relationships with facts, and changes to a dimension force the OLAP cube to be rebuilt; so I believe SCDs, especially types 1 and 3, are mainly handled in the fact table).

2.      Dimension tables are seldom keyed to date.

3.      Solutions for a rapidly changing or large dimension: 1) split it (e.g. separate out the rapidly changing part, such as demographics); 2) treat it as a fact (no foreign key is based on it; this may be an old-fashioned solution).

4.      Not every dimension needs to be created, as that may produce too many foreign keys. It depends on the query needs.

5.      Kimball suggests using surrogate keys (1, 2, 3 …) rather than date values (20100113, 20100114 …) as the key of the time dimension.

6.      A multidimensional model is usually stored within a relational database.

3. Main types

1)     Conformed dimension

A standard dimension that cuts across many fact tables.

2)     Junk dimension

Combines several low-cardinality flags and attributes into a single dimension table rather than modeling them as separate dimensions. The attributes are not closely related.

A junk dimension is simply miscellaneous data that does not fit in any base dimension and is therefore stored in a separate table.

Characteristics: 1. A grouping of dimensions, depending on their correlation. 2. What remains after the obvious dimensions have been identified (of little value, yet not worth discarding).

Function: used to reduce the number of foreign keys in a fact table.

Reason: if all of the yes/no flags are represented as single-level hierarchy dimensions, you may end up with 30 or more foreign keys for one fact table. Clearly, this is an overly complex (cluttered) design.

E.g. comment, yes/no, true/false. To handle a situation with more than 30 dimension attributes, combine them into a junk dimension.

Low cardinality means the column has very few distinct values, e.g. gender. If an index is needed on such a column, a bitmap index is the better choice.

Ideally, we’d keep the size of the junk dimension to less than 100,000 rows.
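As a rough sketch of how a junk dimension might be built, the following Python enumerates every combination of a few low-cardinality flags and assigns each one a surrogate key. The flag names and values (is_gift, is_online, payment_type) are hypothetical; pre-building every combination is only one option, and rows can also be added as combinations are first observed.

```python
from itertools import product

# Hypothetical low-cardinality flags that would otherwise each need a foreign key.
FLAGS = {
    "is_gift": ("Y", "N"),
    "is_online": ("Y", "N"),
    "payment_type": ("cash", "card", "voucher"),
}


def build_junk_dimension(flags):
    """Return one row per combination of flag values, each with a surrogate key."""
    names = list(flags)
    rows = []
    for key, combo in enumerate(product(*(flags[n] for n in names)), start=1):
        rows.append({"junk_key": key, **dict(zip(names, combo))})
    return rows


def lookup_junk_key(junk_rows, **values):
    """Find the surrogate key for a given combination of flag values."""
    for row in junk_rows:
        if all(row[name] == value for name, value in values.items()):
            return row["junk_key"]
    raise KeyError(values)


junk_dim = build_junk_dimension(FLAGS)
print(len(junk_dim))
print(lookup_junk_key(junk_dim, is_gift="N", is_online="Y", payment_type="card"))
```

The fact table then stores a single junk_key instead of three separate foreign keys. The row count is the product of the flag cardinalities (here 2 x 2 x 3), which is why the size guideline above matters once many flags are combined.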

3)     Role-playing dimension / dimensional roles

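A dimension attached multiple times to the same fact table (e.g. COL_CLARK and COL_MGR both map to the same employee dimension). The solution is one table with multiple views.

A dimension can be referenced by multiple fact tables. In that case, should we build several dimension tables or reference the same one?

Kimball's answer is to build a single dimension table and derive multiple views from it.

E.g. a "Date" dimension can be used for "Date of Sale", as well as "Date of Delivery" or "Date of Hire".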
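The one-table, many-views approach can be sketched with Python's sqlite3 module as follows. All object names (dim_date, dim_sale_date, dim_delivery_date, fact_order) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT);

-- One physical dimension, exposed once per role through a view.
CREATE VIEW dim_sale_date AS
    SELECT date_key AS sale_date_key, calendar_date AS sale_date FROM dim_date;
CREATE VIEW dim_delivery_date AS
    SELECT date_key AS delivery_date_key, calendar_date AS delivery_date FROM dim_date;

CREATE TABLE fact_order (
    order_no          TEXT,
    sale_date_key     INTEGER,
    delivery_date_key INTEGER,
    amount            REAL
);
INSERT INTO dim_date VALUES (1, '2010-01-13'), (2, '2010-01-14');
INSERT INTO fact_order VALUES ('A001', 1, 2, 99.0);
""")


def order_dates(connection, order_no):
    """Join the fact to the same date dimension twice, once per role."""
    return connection.execute("""
        SELECT s.sale_date, d.delivery_date
        FROM fact_order AS f
        JOIN dim_sale_date     AS s ON s.sale_date_key     = f.sale_date_key
        JOIN dim_delivery_date AS d ON d.delivery_date_key = f.delivery_date_key
        WHERE f.order_no = ?
    """, (order_no,)).fetchone()


print(order_dates(conn, "A001"))
```

Renaming the columns inside each view keeps the roles distinguishable in queries and reports, while the dates themselves are maintained in one place.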

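4)     Degenerate dimension (DD)

Definition: a dimension key in the fact table that does not have its own dimension table. It holds values that we want to keep in the fact table but that are not measures.

1. A degenerate dimension supports the same operations as an ordinary dimension, such as roll-up, slice, and dice.

2. When degenerate dimensions are used, the ETL process becomes easier.

3. It can make operations such as GROUP BY faster.

E.g. order no, ticket, credit card transaction, check no.

Use it to improve performance when the dimension would have the same granularity as the fact table and would therefore require a huge join.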

5)     Slowly Changing Dimensions (SCDs)

A dimension that changes over time. There are 3 types:

SCD 1

Overwrite old data with new data; (overwrite, master table, in-place update)

SCD 2

Tracks historical data by creating multiple records in the dimension tables; (partitioning history, ACDH, row, history table) (e.g. effective date)

Surrogate Keys required! e.g.

id  effect_dt   active_ind

id1 2006-09-29 N

id1 2006-12-02 Y

SCD 3

Tracks changes using separate columns, so the history is limited by the number of columns we design. (separate column, ACDM, column, rarely used)

* The types can be combined into a hybrid SCD.

* SCDs can be handled with SQL MERGE, ROW_NUMBER, etc.

* SCD 2 is the most popular. In short: SCD 1 updates in place, SCD 2 inserts a row, SCD 3 adds a column.
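The three behaviors summarized above can be sketched in plain Python over a list of dictionaries standing in for a dimension table. The column names sk (surrogate key), city, and prev_city, as well as the sample values, are assumptions made for illustration; id, effect_dt, and active_ind follow the SCD 2 example above.

```python
from datetime import date


def scd1_overwrite(rows, natural_id, new_city):
    """SCD 1: update in place, so the old value is lost."""
    for row in rows:
        if row["id"] == natural_id:
            row["city"] = new_city


def scd2_insert_row(rows, natural_id, new_city, effect_dt):
    """SCD 2: expire the active row, then insert a new row with a new surrogate key."""
    for row in rows:
        if row["id"] == natural_id and row["active_ind"] == "Y":
            row["active_ind"] = "N"
    next_key = max((r["sk"] for r in rows), default=0) + 1
    rows.append({"sk": next_key, "id": natural_id, "city": new_city,
                 "effect_dt": effect_dt, "active_ind": "Y"})


def scd3_add_column(rows, natural_id, new_city):
    """SCD 3: keep only the previous value, in a separate column."""
    for row in rows:
        if row["id"] == natural_id:
            row["prev_city"] = row["city"]
            row["city"] = new_city


# Hypothetical starting state: one active row for id1.
customer_dim = [{"sk": 1, "id": "id1", "city": "Shanghai",
                 "effect_dt": date(2006, 9, 29), "active_ind": "Y"}]
scd2_insert_row(customer_dim, "id1", "Beijing", date(2006, 12, 2))
for row in customer_dim:
    print(row["sk"], row["id"], row["effect_dt"], row["active_ind"])
```

After the SCD 2 change, id1 has two rows with different surrogate keys, which is why surrogate keys are required: the natural key alone no longer identifies a single row.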

4. Other classifications

Big dimension

(e.g. commercial customer, i.e. the customer profile table) often has millions of records and a hundred or more fields in each record

Small dimensions

(e.g. transaction type) are often unique to each source system and thus do not need to be conformed

multilevel hierarchies

e.g. country (1st col.), state (2nd col.), city (3rd col.), postal code (4th col.)

single level hierarchy

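The opposite: a dimension with only one level.

By nature, dimensions can be classified as organization, personnel, product, time, transaction, account (ledger subject), customer, contract, and so on.

5. Examples

CCY, region, customer, date, gender, status, code, category, title, weight, area, role

Job title, organization, certificate number, code, postal code, number, grade, condition, status, category, flag, remark, name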

III. Fact

1. Main types

1)     Conformed fact

The basic fact table.

 

2)     Factless fact

Factless fact table: a fact table with no measures (it contains only dimension keys). The first type records an event, e.g. student attendance; the second type is a coverage table.

3)     Early-arriving fact

Normally the dimensions are loaded first and the facts second. An early-arriving fact is a fact that has arrived while its dimension is not yet ready.
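Both kinds of factless fact table can be sketched in a few lines of Python. The key layout (date_key, student_key, class_key) and the sample rows are hypothetical.

```python
from collections import Counter

# Event table: one row per attendance event, dimension keys only, no measure column.
attendance_fact = [
    (20100113, 1, 10),
    (20100113, 2, 10),
    (20100114, 1, 10),
]

# Coverage table: every (date, student, class) combination that could have occurred.
coverage_fact = [
    (20100113, 1, 10),
    (20100113, 2, 10),
    (20100114, 1, 10),
    (20100114, 2, 10),
]


def attendance_per_student(event_rows):
    """The 'measure' is simply the count of event rows per student key."""
    return Counter(student_key for _, student_key, _ in event_rows)


def missing_events(coverage_rows, event_rows):
    """Coverage minus events answers the question: what did not happen?"""
    return sorted(set(coverage_rows) - set(event_rows))


print(attendance_per_student(attendance_fact))
print(missing_events(coverage_fact, attendance_fact))
```

Counting rows replaces summing a measure, and the coverage table makes it possible to query for events that never took place.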
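The notes do not say how to deal with an early-arriving fact. One common approach, shown here only as an assumption-laden sketch, is to insert a placeholder ("inferred") dimension row so the fact can be loaded, then fill in that row when the real dimension record arrives. All names (sk, inferred, customer_sk) are hypothetical.

```python
def resolve_dimension_key(dim_rows, natural_id):
    """Return the surrogate key for natural_id, adding a placeholder row if it is missing."""
    for row in dim_rows:
        if row["id"] == natural_id:
            return row["sk"]
    next_key = max((r["sk"] for r in dim_rows), default=0) + 1
    dim_rows.append({"sk": next_key, "id": natural_id,
                     "name": "UNKNOWN", "inferred": True})
    return next_key


def load_fact(fact_rows, dim_rows, natural_id, amount):
    """Load a fact row even if its dimension member has not arrived yet."""
    fact_rows.append({"customer_sk": resolve_dimension_key(dim_rows, natural_id),
                      "amount": amount})


def late_dimension_arrival(dim_rows, natural_id, name):
    """Fill in the placeholder in place, keeping the surrogate key stable."""
    for row in dim_rows:
        if row["id"] == natural_id:
            row.update(name=name, inferred=False)
            return row["sk"]
    raise KeyError(natural_id)


customer_dim, sales_fact = [], []
load_fact(sales_fact, customer_dim, "c1", 10.0)    # fact arrives first
late_dimension_arrival(customer_dim, "c1", "Alice")  # dimension arrives later
print(customer_dim, sales_fact)
```

Because the surrogate key is assigned when the placeholder is created, fact rows that were already loaded need no update once the dimension data arrives.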

 

2. Other classifications

1. Transaction grain (e.g. retail sales transactions; the largest)

2. Periodic snapshot (e.g. balances at a monthly grain; surrogate keys required!)

3. Accumulating snapshot (e.g. order fulfillment; has a definite beginning and end and a large number of date foreign keys), e.g. an order that is created, committed, returned …; submitted date, approval date, processed date, settlement date …

3. Examples

quantity, count, amount, percent

IV. CDC

Change Data Capture (CDC). Several approaches are summarized below.

1. timestamp / version / status

2. record / key-words compare

3. log scanner

4. trigger on tables

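5. Tools: Attunity, Oracle GoldenGate, IBM InfoSphere Change Data Capture …

V. Recommendations

Kimball suggests putting free-form text into a separate dimension rather than carrying it on every fact record.

Besides a junk dimension, another way to merge multiple dimensions is to build a snapshot table of the dimensions, storing the information redundantly and fully refreshing it at regular intervals.

Convert snowflake models into star models wherever possible.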
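The first two approaches in the list (timestamp and record compare) can be sketched in Python as follows. The column names (id, name, updated_at) are hypothetical, and hashing each row with SHA-256 is just one way to compare records; comparing column by column would work as well.

```python
import hashlib


def changed_since(rows, last_extract_ts):
    """Timestamp-based CDC: pick rows modified after the previous extract.

    Timestamps are ISO-8601 strings here, so string comparison orders them correctly.
    """
    return [r for r in rows if r["updated_at"] > last_extract_ts]


def row_hash(row, columns):
    """Hash the compared columns so whole records can be compared cheaply."""
    payload = "|".join(str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compare_snapshots(previous, current, key, columns):
    """Record-compare CDC: classify keys as inserted, updated, or deleted."""
    prev = {r[key]: row_hash(r, columns) for r in previous}
    curr = {r[key]: row_hash(r, columns) for r in current}
    return {
        "inserted": sorted(k for k in curr if k not in prev),
        "updated": sorted(k for k in curr if k in prev and curr[k] != prev[k]),
        "deleted": sorted(k for k in prev if k not in curr),
    }


yesterday = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
today = [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}]
print(compare_snapshots(yesterday, today, key="id", columns=["name"]))
```

Timestamp-based capture is cheap but cannot see deletes and depends on the source maintaining the timestamp reliably. Snapshot comparison catches deletes at the cost of reading both full data sets.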
