COMP207笔记

SQL Basics

DBMS: Database Management System

tuples: 行

attributes:列

schema: " 表名(列名,列名,列名, ...) "

DDL: Data definition language 定义数据(框架)语句

  • Create/alter/delete database, tables and their attributes.

DML: Data Manipulation language 操作数据(内容)语句

  • Add/remove/update and query rows in tables.

DDL

CREATE TABLE 表名(

列名1 数据类型1,

列名2 数据类型2 PRIMARY KEY,

CONSTRAINT ‘外键名称’ FOREIGN KEY(列名) REFERENCES 源表(列名)

);

CONSTRAINT ‘主键名称’ PRIMARY KEY(列名1,列名2, ...);组合主键

drop database/table ... 名称; 删除某物

alter table 表名 drop column 列名; 删除某列

alter table 表名 add 列名1 数据类型1; 加入

alter table 表名 modify 列名1 数据类型2; 改变数据类型

DML:

insert into table_name (column1, column2, column3, ...) VALUES (value1, value2, value3, ...);

杂项

CREATE VIEW (CREATE VIEW <name> AS)才能创造可被再次引用的新表格

注意题目给的语句有无 “ ;

image-20220117014623990
image-20220117014741644

transactions

start transaction;

SQL statements (select...update...balabala...)

commit; / rollback;

read -> change -> write

// read: 寻址 -> ( ? ) 拷入缓存 -> 拷入变量

// write: ? 寻址 -> ( ? ) 拷入缓存

serial schedule: 串行调度——1做好再做2

ACID

A: Atomicity 原子性 a transaction 要么整个成功要么整个失败, it's an atom

C: Consistency 一致性 transaction不会使数据库变为不合法的状态,保证数据正确性和完整性。

I: Isolation 隔离性 操作可串行,互不影响

  • 1 Read uncommitted -> 2 Read committed -> 3 Repeatable read -> 4 Serializable(*Isolation)

D: Durability 持久性 数据一遭更改便无从抹去

serializable 串行化

Serial Schedules 串行调度

  • 好!但是慢

Concurrent Schedules 并行调度

  • 问题多多。但在多用户下效率高。

对于同一个变量,并行调度其实操作的是同一片区域,于是操作会互相影响,产生错误。

Serializable Schedules 可串行化调度:

A schedule S is serializable if there is a serial schedule S’that has the same effect as S on every initial database state

至少:正确且一致(Consistency),不求满足隔离性(isolation)但可以解决。

问题:难以测试可串行化与否、非数据库操作会很复杂。

Conflict-Serializable 冲突可串行化

Conflict:对同一变量,俩操作至少有一个在写,则称这俩操作是冲突(Conflict)的(顺序有意义)。

若S经过多次1对1非冲突交换可得到串行调度S',则称S是Conflict-Serializable的,S与S'是conflict-equivalent的。

判断:

  1. 对于同一变量的操作,把二者不属于统一操作的 read—>write, write—>read, write—>write,也就是所有的冲突操作从前指后连起来。
  2. 根据 1 中操作,在操作(T1,T2,T3...)之间连线,无环量则Conflict-Serializable.
杂项

abort 终止

UNDER THE DBMS CONTROL

Errors while executing transactions

  • Violation of integrity constraints, other run-time errors

Explicit request - I.e. ROLLBACK;

BEYOND THE DBMS’S CONTROL Media failures

  • The medium holding the database becomes partially or completely unreadable
  • Example: changes of bits, head crashes

Catastrophic events - The medium holding the database is destroyed - Examples: explosions, fires, etc.

System failures - Information about the active transaction’s state is lost - Examples: power failures, software errors

Lock 锁

Simple Locks 和锁的作用

  1. 当某进程对某事物操作前,会先对该事物上所锁。
  2. 该事物上已存在的锁和进程能否上某种锁相关。详见:锁的种类
  3. 如果进程无法上锁,就无法开始操作。
  4. 事物终会被解锁。
Deadlocks
  • E.g., when using two-phase locking
  • Concurrency control requests to abort transactions
2PL Two-Phase Locking 两阶段锁

In each transaction, all lock operations precede all unlocks. If a transaction start unlock, it cannot get any lock anymore.

阶段1 Phase 1: request locks

阶段2 Phase 2: unlock

使用2PL锁得到的调度不存在环。

锁的种类

sli(X) Shared Locks 读锁: 用于锁住读取操作。因为只读所以非独占。

xli(X) Exclusive Locks 写锁: 用于锁住写入操作。因为要写所以独占。

uli(X) Update Locks 更新锁: 一开始用于锁住读取操作,但可以升级成Exclusive Lock。所以可获得已被Shared Locks 锁住的事物,但是不带有分享性。

已有 Shared Update Exclusive
Shared YES YES X
Update X X X
Exclusive X X X

更新锁可以有效避免死锁,更新锁之间相互排斥所以不会出现并发循环等待。同时因为可以和读锁共享,所以效率又相对较高。

Granularity 颗粒度

锁是一种用于控制进程的工具,这个工具它控制的范围可大可小,锁内的语句可多可少。

颗粒度从从大到小:Relation(表) > Block(页) > Tuple(行)

too coarse 太粗,虽然占用内存少,但是并行程度很低,比较慢。

too fine 太细,虽然并行程度高了,但是占用内存大大增加了。

同时还要避免一些可能会产生串行化问题的情况,比如一个transaction持有某个tuple的shared lock,结果另一个transaction又持有这个tuple所在relation的exclusive lock...

为了高效解决这个问题,就有了intention locks

intention locks 意向锁: 有关Granularity的锁

算是一种修饰,毕竟颗粒度是额外的性质。

IS: Intention to request a shared lock on a sub-item IX: Intention to request an exclusive lock on a sub-item

意向锁顾名思义,代表了一个意向。如果我要对某表里面的某行动手动脚,就要和表表达一个,啊,“我要进来了!”的意向。于是就搞一个意向锁出来。这样如果出现了大锁和小锁的冲突,系统就不需要整个表里一行行搜索有没有行锁,直接看有没有意向锁就完事了。

已有 S Shared X Exclusive IS IX
S Shared YES NO YES NO
X Exclusive NO NO NO NO
IS YES NO YES YES
IX NO NO YES YES

log 日志

日志记录:“<...> 代表日志的记录内容”

undo log 回滚日志

回滚日志保证了原子性,因为做到一半如果出现了错误就给你滚回去了。

<START T>: Transaction T has started. <COMMIT T>: Transaction T has committed.*当T提交,所有变化都已记录,则要commit <ABORT T>: Transaction T was aborted. <T, X, v>: Transaction T has updated the value of database item X, and the old value of X was v. 用于恢复。

flush_log:将日志记录写入磁盘

image-20211124085959730

redo log 重做日志

New meaning of <T, X, v>: “Transaction T has updated the value of database item X & the new value of X is v.” 用于重做committed transactions

Undo/Redo Logging

<T, X, v, w>: “Transaction T has updated the value of database item X, and the old/new value of X is v/w.”

Ensuring atomicity and durability

Undo essentially ensures Atomicity,Can ensure durability using Force

  • Force:writing of updates to disk before commit

Redo essentially ensures Durability,Can ensure atomicity using No Steal

  • No Steal means that uncommitted data may not overwrite committed data on disk

NO-FORCE意味着commit的东西可能没成功被写回去,那就需要redo(保证Durability);

STEAL 意味着没commit的东西可能被写回去了,那就需要undo(维护Atomicity)

(包括以下表格和一些PPT,摘自QianPin的blog Database System: Logging and Recovery 写的很好。)

No Steal Steal
No Force redo redo + undo
Force 不需要 undo

Checkpoints

They can be used to make the size of the log file smaller

for undo:

已经全部提交到磁盘了 <- Checkpoint -> 可能会出错undo

ARIES Checkpoints(for Undo/Redo logging)

<start T1>

*uncommited transactions

<Checkpoint(T1,T2...)>

*committed transactions

<END CHECKPOINT > *没有这行相当于没有CHECKPOINTimage-20220114204941070

Recoverable 可恢复性

commit 不能乱c,可能出大问题 (durability and so on...

Recoverable schedules: T commits only if all Ts that T has read from have committed

Cascadeless Schedules:Read committed

  • Cascadeless schedules are in general not serialisable.

Strict Schedules: Reads and writes committed

Strict Two-Phase Locking (Strict 2PL): A transaction T must not release any lock (that allows T to write data) until: T has committed or aborted, and the commit/abort log record has been written to disk.

  • Strict 2PL is conflict-serialisable and strict.
  • Risk of Deadlocks 死锁依旧
  • Detect deadlocks & fix them (such as timestamp
  • Enforce deadlock-free schedules

image-20211125044659244

解决deadlock

Timeouts,Wait-for graphs(好像没什么用,只是能看环),Timestamps.

Timestamps 时间戳

Each transaction T is assigned a unique integer TS(T) upon arrival.(时间戳)

If T1 arrived earlier than T2, we require TS(T1) < TS(T2) *越小越早

  1. Wait-Die Scheme:older transactions always wait for unlocks
    • T1 wait,T2 roll back (die.
  2. Wound-Wait Scheme:older transactions never wait for unlocks
    • T1 do,T2 wait for T1

Read Time of X: RT(X) Timestamp of youngest transaction that read X *读取X的T的最大(最迟)时间戳

Write Time of X: WT(X) Timestamp of youngest transaction that wrote X *写入X的T的最大(最迟)时间戳

Requests:

  1. If T1 requests to read X: Abort & restart T1 if WT(X) > TS(T1) *说明有小辈提前写过了,值变了,要坏事。

  2. If T1 requests to write X: Abort & restart T1 if RT(X) or WT(X) > TS(T1) *说明有小辈提前读/写过了,要坏事。

Timestamp-based Scheduling

Nice properties: ◦ Enforces conflict-serialisable schedules ◦ Deadlocks don’t occur

Bad properties: ◦ Cascading rollbacks ◦ Starvation can occur (cyclic aborts & restarts of transactions)

MVCC Multiversion concurrency control:

写入不覆盖,而是创建新版本。读取最新版本。

​ For writes: Abort & restart T1 if RT(X) > TS(T1) and otherwise, grant request ​ For reads: Always granted

SQL query

Basic relational algebra:

Selection (𝜎) 𝜎condition(R): 将符合条件的行从R中选出

image-20211125090152706

Projection (𝜋) 𝜋xx(R): 将R中的xx作为列选出

image-20211125090106025

Cartesian product (×) 两边互增的合并(不相关就是列数相乘,1对n)

select * from a,b = a×b

image-20211125090339313

Union (∪) 多次操作结果叠放

image-20211125090735993

Renaming (𝜌) 𝜌xx->yy(R): 将R中的xx重命名为yy

image-20211125090125907

Natural join (⋈) 选出有相同公共属性的行合并(会受𝜌影响)

自然连接是在广义笛卡尔积R×S选出同名属性上符合相等条件元组再进行投影,去掉重复的同名属性组成新的关系。

但如果没有公共属性则变为笛卡尔积。

image-20211125090523872

Semijoin (⋉) Natural join,但是只返回左边表的值。

image-20211125091647771

各类连接总结 https://www.cnblogs.com/wishyouhappy/p/3678852.html

index

Forms of Indexes:

B+ Trees

  • Good if selection condition specifies a range

    Most widely used

Hash tables

  • Good if selection involvesequality only

B+ Tree

从叶子朝向根节点,还有一些线性扫描之类的……按照直觉来就完事了

老师:性质全按网站来:https://www.cs.usfca.edu/~galles/visualization/BPlusTree.html

Optimising query plans

PPT 不长 https://liverpool.instructure.com/courses/46572/pages/optimising-query-plans?module_item_id=1218348

data 数据

structured,semi-structured,unstructured data、

structured 结构化

  • Data has to fit to schema 存成表了
  • highly efficient query processing 好查,快速

unstructured 非结构化

  • Completely free on how to organise data 形式随心所欲
  • No description of structure or data: programs have to know how to read & interpret the data 无结构,读取需引入读取规则

semi-structured 半结构化

  • Self-describing: no schema required 有结构,不是表
  • Flexible: e.g., can add & remove attributes on demand 较为灵活

半结构化数据一般都是一种树形结构

  • Leaf nodes have associated data
  • Inner nodes have edges going to other nodes
  • Each edge has a label *关系
  • root:No incoming edges;Each node reachable from the root

例子:XML, JSON,Graphs...

XML Extensible MarkUp Language 可扩展标记语言

网上教程,可做参考

XML语法:

树:

  • 一个element(元素)东西:<东西>...</东西>

    • ... 中包含了该"东西"节点可访问到的所有子结构
  • 东西若有(Attributes)属性<东西 大小=“这么大” 材料=“你妈”>

Attributes 和 Element 的区别:

一个Element可以有多个相同(指child名)的子Element,但Attributes不会重复。

其他:

  • 头部声明:<?xml version="1.0" encoding="utf-8" standalone="yes" ?>

XML结构:

1
2
3
4
5
6
<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<root>
<child Attributes>
<subchild Attributes>.....</subchild>
</child>
</root>

XML遍历:平面迷宫,逆时针绕一圈回到root

实现xml结构有两种方式(规范),DTD 和 XML Schema

DTD: Document Type Definitions

定义方式:

1
2
3
<!DOCTYPE root tag name [
The DTD is included here
]>

例子:(#PCDATA=Parsed Character Data 可解析文本,不含有子元素)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
<?xml version="1.0" encoding="gb2312" ?>
<!DOCTYPE bookinfo [
<!ELEMENT bookinfo (title,author,publish,price)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publish (publisher,ISBN,pubdate)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT pubdate (#PCDATA)>
<!ELEMENT price (#PCDATA)>
]>
<bookinfo>
<title>计算机导论</title>
<author>丁跃潮等</author>
<publish>
<publisher>高等教育出版社</publisher>
<ISBN>7-04-014768-8</ISBN>
<pubdate>2004.6</pubdate>
</publish>
<price>19.7</price>
</bookinfo>

可见,DTD是写在xml主体前的一种类似于头文件的东西,我们先读取DTD,这样我们就知道底下那个XML结构里面的东西都是些啥了。

一些数量规则表示 (name, phone?, email?, teaches*, ... ) 这种 *: indicates zero or more occurrences for an element; 出现0~n次 +: indicates one or more occurrences for an element; 出现1~n次 ?: indicates either zero occurrences or exactly one occurrence for an element. 出现0~1次

Attribute Data types 数据类型

CDATA: character data, containing any text.

ID: used to identify individual elements in document

  • You can not define what type you are pointing at exactly, but just that it is some ID for some object

List of names: values that attribute can hold (enumerated type)

  • Format: (en1|en2|...) meaning the value can be either en1 or en2 or ...

IDREF/IDREFS: must correspond to value of ID attribute(s) for some element in document

About ID and IDREF/IDREFS

就像数据库里的 primary key 和 foreign key.

image-20220106092537712

Two levels of document processing

well-formed and valid. 格式良好和有效的

Non-validating processor 只会检查文件是否满足 XML document 格式的要求,比如所有元素都在一个根元素里。这是 well-formed

Validating processor 还会检查DTD/XML Schema, 确保这个文件能用——valid.

XML Schema

考试不作要求,待会儿再看。

XPath

测试:https://www.videlibri.de/cgi-bin/xidelcgi

对于一个已存在xml文件,每一个元素/属性都有其地址用于查找。我们可以像windows系统文件的地址一样作表示。

Attributes: replace the last tag name by @Attribute name

考虑这样一个XML:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
<lecturers>									XPath:
<lecturer id="123"> /lecturers/lecturer/@id
<name>J.Fearnley</name> /lecturers/lecturer/name
<phone>7954···</phone> /lecturers/lecturer/phone
<memberOf>Economics...</memberOf> /lecturers/lecturer/memberOf
<teaches>
<code>COMP105</code> /lecturers/lecturer/teaches/code
<title>Progr. Paradigms</title> /lecturers/lecturer/teaches/title
</teaches>
</lecturer>
<lecturer tall="180"> /lecturers/lecturer/@tall
<name>M. Gairing</name> /lecturers/lecturer/name
<email type="work"> /lecturers/lecturer/email/@type
gairing@... /lecturers/lecturer/email
</email>
<teaches>
<code>COMP211</code> /lecturers/lecturer/teaches/code
<title>Internet...</title> /lecturers/lecturer/teaches/title
<year>2</year> /lecturers/lecturer/teaches/year
</teaches>
</lecturer>
</lecturers>

Wildcards 通配符

A wildcard (*****) can be used to stand for any tag name or attribute name

以上述XML为例,

/lecturers/lecturer/@* returns ”123”, “180”

/lecturers/lecturer/* return all elements directly below lecturer elements

More advanced

对于任意的XPath,满足以下格式:

/axis1::E1/axis2::E2/axis3::E3/.../axisn::En

" axis i :: " : 类型标识符,例如E2是一个Attribute, 则 “axis2::” = “@”

image-20220108003752287

image-20220108004226906

Conditions 条件

/axis1::E1[C1]/axis2::E2[C2]/axis3::E3[C3]/.../axisn::En[Cn]

example : //book[category=“CS”]/title

All titles of books in category “CS”

...E[i]: true if current node is the i-th E-child of its parent

//*[category="CS" or category="Scifi"][1]

//*[category="CS" or category="Scifi"][last()]

...E[string]: true iff string is not ”” //book[@price/data()] returns the books with a non-empty price attribute

...Expath: true iff the xpath gives a non-empty node-set //book[category]/title returns the titles of all books that has a category //book[@price] returns a book if it has a price attribute

XQuary 菜鸟教程

测试:https://www.videlibri.de/cgi-bin/xidelcgi

对于一个已知的XML文件,doc() 函数用于打开 "balabala.xml" 文件——doc(example.xml)

我们在doc()后面添加路径来查找元素:doc(example.xml) XPath 例:doc(example.xml)/element1/element2

“$”用于定义变量

let variable := XQuery expression

在地址前加$, 相当于把返回结果当作变量,于是可以比较:$s/year >= 2

XQuery 有FLWOR, 是 “for、let、where、order by、return” 的缩写。

这都是可用于 XQuery 表达式以便得到更精确结果的子句。

XPath/XQuery expressions can be used as conditions, Interpreted as true if the result is non-empty

... where $s/module return true if s/module exist.

还可以加 some/every $var. 例:

let $uni_doc := doc(”mydoc.xml") for $l in $uni_doc/university/lecturer where every $m in $l/teaches satisfies $m/year <= 2 return $l/name

More advanced

Goal: return all pairs of title and author, sorted by author name descending (ascending is default)

let $doc := doc("mydoc.xml") for $b in $doc/books/book for $author in $b/author order by $author descending return <pair>{$b/title}, {$author}</pair>

image-20220109062246126

image-20220108052224581

image-20220108052210509

noSQL 非关系型数据库

NoSQL databases are often distributed 分布式的

Designed to guarantee: Availability: every non-failing node always executes queries Consistency: every read receives the most recent write or an error Scalability: more capacity (users, storage, ...) by adding new nodes High performance: often achieved by very simple interface(e.g., support lookups/inserts of keys, but do this fast) Partition-tolerance: even if nodes (or messages) fail, the remaining subnetworks can continue their work

The CAP Theorem:

image-20220109084309283

BASE

NoSQL databases drop ACID and often provide the weaker guarantees of BASE:

Basically Available Soft state Eventually consistent

Soft state & eventual consistency: the database state might occasionally be inconsistent... but it will eventually be made consistent

NoSQL的各种类型:

image-20220109085020287

Key-Value Stores

find(k): returns value for key k write(k, v): inserts value v under key k

可以理解为key是数组下标

PPT里的解释和图有点跳步骤,我的建议是看 DynamoDB实现原理分析

Data analysis

我们平时使用的一些数据库的操作,如果放在那种现实中的,大量数据的公司级别数据库应用里,其规模效应会造成很多问题。

Queries in these applications tend to... ◦ be complex: use aggregates & other advanced features ◦ Examine large parts of the data

这种操作常常需要大面积的检索,应该避免在多用户DBMS上出现

Data Warehouses:为了支持Data analysis而建立,通常集成了很多数据源,作为操作者和数据库的中转站。那些日常操作就需要通过它实现。

OLTP (Online Transaction Processing,联机事务处理) 面向用户,即顾客,被用于日常事务的查询和处理。

  • queries and updates that can be executed fast
  • affect a small portion of a database
  • consists of many queries at once, but where each are typically simple

OLAP (Online Analytic Processing,联机分析处理) 面向市场/面向分析,使用者是数据行业从业者(knowledge worker)

  • Refers to the process of analysing complex data stored in a data warehouse
  • Often involving GROUP BY and some WHERE clause and can depend on very much data
Star Schemas 星形模式

参考:数据仓库与OLAP

Star Schemas 是 One of the most common data warehouse architectures

它有多个Dimension tables: describe values along each axis

它有一个Unique 的 fact table: contains points in the data cube

  • data cube:fact table 包含所有维表的主键,这些主键形成了一个n维空间,也就是data cube,主键坐标确定了每行Dependent Attributes在cube中的位置。

用sql的话说,所有的foreign key共同组成了fact table的primary key。(再加上汇总数据Dependent Attributes,通常是numeric value.

image-20220110023506336

img

Characteristics of Star Schemas

Key feature: denormalised schema 非正规模式 ◦ Main data in one table (fact table) ◦ Rest of the data can be joined with fact table very quickly

Gains: ◦ Queries don’t require many joins ◦ Performance gains, especially when processing queries that would require many joins in a normalised database ◦ Faster and easier aggregation of data

image-20220110024312013

img

Data-Mining

基于OLAP。不仅仅是单纯的query,而是找到数据之间的关系。

image-20220110033949819

Types of Discovered Knowledge 挖掘出的东西:

Association rules 相互关联的规则

Classification hierarchies 分类分层

Sequential patterns 序列模式(因果关系)

Clustering 聚类

market-basket model 购物篮模型

Data that can be described by: ◦ A set of items I ◦ A set of baskets B: each basket bB is a subset of I

\(I =\{o_1,o_2,o_3,\cdots\}\) , \(b_i⊆I\) , \(B=\{b_1,b_2,b_3,\cdots\,b_n\}\)

Support of an Itemset

考虑一个Itemset \(J\), \(J⊆I\) ,则:

Support of \(J\) = \(\sum\{b|b\cap J=J\}/n\)

更复杂的购物篮:

image-20220110042539668

Support of {milk, bread}: ◦ With respect to the purchases = 2/4 = 0.5 ◦ With respect to the customers = 1/3 ≈0.33

With support threshold s = 0.5 ◦ {milk, bread} is frequent with respect to the purchases ◦ {milk, bread} is not frequent with respect to the customers

image-20220110042714986

Association Rules 挖掘

Confidence: percentage of baskets for \(J_1\) containing \(J_2\)

General form: \(J_1 ⇒J_2\) , 有: \[ J_1 ⇒J_2 = \frac{Support\ of\ (J_1 \cup J_2)}{Support\ of\ J_1} \] 我们会认为设定一个 threshold 阈值 s

\(J_1 ⇒J_2 ≥ s\) 则称其是 high enough

A-Priori Algorithm 先验算法

If \(J\) has support ≥ s, then all subsets of \(J\) have support ≥ s.