COMP9315: Storage: Devices, Files, Pages, Tuples, Buffers, Catalogs
Storage Management |
Storage Management | 2/234 |
Aims of storage management in DBMS:
... Storage Management | 3/234 |
The storage manager provides mechanisms for:
DB
Rel
Page
Tuple
PageId
TupleId
... Storage Management | 4/234 |
Examples of references (addresses) used in DBMSs:
PageID
PageID = FileID + Offset
Offset
TupleID
TupleID = PageID + Offset
Offset
Offset
... Storage Management | 5/234 |
Levels of DBMS related to storage management:
... Storage Management | 6/234 |
Topics to be considered:
Views of Data | 7/234 |
Users and top-level query evaluator see data as
... Views of Data | 8/234 |
Relational operators and access methods see data as
... Views of Data | 9/234 |
File manager sees both DB objects and file store
... Views of Data | 10/234 |
Disk manager sees data as
On typical modern databases, handled by operating system filesystem.
Storage Manager Interface | 11/234 |
The storage manager provides higher levels of system
select name from Employee
is implemented as something like
DB db = openDatabase("myDB"); Rel r = openRel(db,"Employee"); Scan s = startScan(r); Tuple t; while ((t = nextTuple(s)) != NULL) { char *name = getField(t,"name"); printf("%s\n", name); }
... Storage Manager Interface | 12/234 |
The above shows several kinds of operations/mappings:
Data Flow in Query Evaluation | 13/234 |
... Data Flow in Query Evaluation | 14/234 |
Notes on typical implementation strategies:
int
PageId = (FileNum<24)||PageNum
get_page(r,i,buf)
get_page(pid,buf)
DB
Rel
structs
struct RelRec { int fd; int npages; int blksize; } typedef struct RelRec *Rel;
Files in DBMSs | 15/234 |
Data sets can be viewed at several levels of abstraction in DBMSs.
Logical view: a file is a named collection of data items (e.g. a table of tuples)
Abstract physical view: a file is a sequence of fixed-size data blocks.
Physical realisation: a collection of sectors scattered over ≥1 disks.
The abstraction used for managing this: PageId
... Files in DBMSs | 16/234 |
... Files in DBMSs | 17/234 |
Two possibilities for DBMS disk managers to handle data:
File System Interface | 18/234 |
Most access to data on disks in DBMSs is via a file system.
Typical operations provided by the operating system:
fd = open(fileName,mode)// open a named file for reading/writing/appending close(fd)// close an open file, via its descriptor nread = read(fd, buf, nbytes)// attempt to read data from file into buffer nwritten = write(fd, buf, nbytes)// attempt to write data from buffer to file lseek(fd, offset, seek_type)// move file pointer to relative/absolute file offset fsync(fd)// flush contents of file buffers to disk
Storage Technology | 19/234 |
At this point in memory technology development:
Computational Storage | 20/234 |
Characteristics of main memory (RAM):
load reg,byte_address store reg,byte_address
Cache memory has similar characteristics to RAM, but is
Bulk Data Storage | 21/234 |
Requirements for bulk data storage:
... Bulk Data Storage | 22/234 |
Several kinds of bulk data storage technology currently exist:
Magnetic Disks | 23/234 |
Classical/dominant bulk storage technology.
Characteristics:
Modest increase in speed; good reduction in cost.
Optical Disks | 24/234 |
Optical disks provides an alternative spinning disk storage technology.
Several varieties: CD-ROM, CD-R, CD-RW, DVD-RW
Compared to magnetic disks, CD's have
Flash Memory | 25/234 |
Flash memory is a non-mechanical alternative to disk storage.
Compared to disks, flash memory has
... Flash Memory | 26/234 |
Properties of flash memory require specialised file system
Example: updating data in flash storage
Disk Management |
Disk Manager | 28/234 |
Aim:
... Disk Manager | 29/234 |
Basic disk management interface is simple:
void get_page(PageId p, Page buf)
PageId
Page
void put_page(PageId p, Page buf)
Page
PageId
PageId allocate_pages(int n)
n
void deallocate_page(PageId p, int n)
n
PageId
Disk Technology | 30/234 |
Disk architecture:
... Disk Technology | 31/234 |
Characteristics of disks:
read block at address (p,t,s) write block at address (p,t,s)
Disk Access Costs | 32/234 |
Access time includes:
... Disk Access Costs | 33/234 |
Example disk #1 characteristics:
... Disk Access Costs | 34/234 |
Time Tr to read one random block on disk #1:
Minimum Tr = 0 + 0 + 0.13 = 0.13 ms
Maximum Tr = 50 + 16.7 + 0.13 = 66.8 ms
Average Tr = 25 + (16.7/2) + 0.13 = 33.5 ms
... Disk Access Costs | 35/234 |
If operating system deals in 4KB blocks:
Tr(4-blocks) = 25 + (16.7/2) + 4×0.13 + 3×0.01 = 33.9 ms
Tr(1-block) = 25 + (16.7/2) + 0.13 = 33.5 ms
Note that the cost of reading 4KB is comparable to reading 1KB.
Sequential access reduces average block read cost significantly, but
... Disk Access Costs | 36/234 |
Example disk #2 characteristics:
If using 32-bit addresses, this leaves 8 bits (28=256 items/block).
Disk Characteristics | 37/234 |
Three important characteristics of disk subsystems:
... Disk Characteristics | 38/234 |
Increasing capacity:
Increasing Disk Capacity | 39/234 |
Compress data (e.g. LZ encoding)
+ more data fits on disk
- compression/expansion overhead
For large compressible data (e.g. text
For most relational data (e.g. int
char(8)
For high-performance memory caching, may never want to expand
(there is current research working on "computable" compressed data formats).
Improving Disk Access Costs | 40/234 |
Approach #1: Use knowledge of data access patterns.
E.g. two records frequently accessed together
⇒ put them in the same block (clustering)
E.g. records scanned sequentially
⇒ place them in "staggered" blocks, double-buffer
Arranging data to match access patterns can improve throughput by 10-20 times.
Approach #2: Avoid reading blocks for each item access.
E.g. buffer blocks in memory, assume likely re-use
Scheduled Disk Access | 41/234 |
Low-level disk manager (driver, controller):
Disk Layout | 42/234 |
If data sets are going to be frequently accessed in a pre-determined manner, arrange data on disk to minimise access time.
E.g. sequential scan
Modern systems generally don't, because of programmer complexity.
Unix has raw disk partitions: no file system, you write driver to manage disk.
Improving Writes | 43/234 |
Nonvolatile write buffers
Double Buffering | 44/234 |
Double-buffering exploits potential concurrency between disk and memory.
While reads/writes to disk are underway, other processing can be done.
With at least two buffers, can keep disk working full-time.
... Double Buffering | 45/234 |
Example: select sum(salary) from Employee
read A into buffer then process buffer content read B into buffer then process buffer content read C into buffer then process buffer content ...
Costs:
... Double Buffering | 46/234 |
Double-buffering approach:
read A into buffer1 process A in buffer1 and concurrently read B into buffer2 process B in buffer2 and concurrently read C into buffer1 ...
Costs:
General observation: use of multiple buffers can lead to substantial
cost savings.
We will see numerous examples where multiple memory buffers
are exploited.
Multiple Disk Systems | 47/234 |
Various strategies can be employed to improve capacity, performance and reliability when multiple disks are available.
RAID (redundant arrays on independent disks) defines a standard set of such techniques.
Essentially, multiple disks allow
(although there is obviously a trade-off between increased capacity and increased reliability via redundancy)
RAID Level 0 | 48/234 |
Uses striping to partition data for one file over several disks
E.g. for n disks, block i in the file is written to disk (i mod n)
Example: file with 6 data blocks striped onto 4 disks using (pid mod 4)
Increases capacity, improves data transfer rates, reduces reliability.
... RAID Level 0 | 49/234 |
The disk manager and RAID controller have to perform a mapping something like:
writePage(PageId)
to
disk = diskOf(PageId,ndisks) cyln = cylinderOf(PageId) plat = platterOf(PageId) sect = sectorOf(PageId) writeDiskPage(disk, cyln, plat, sect)
(We discuss later how the pid
RAID Level 1 | 50/234 |
Uses mirroring (or shadowing) to store multiple copies of each block.
Since disks can be read/written in parallel, transfer cost unchanged.
Multiple copies allows for single-disk failure with no data loss.
Example: file with 4 data blocks mirrored on two 2-disk partitions
Reduces capacity, improves reliability, no effect on data transfer rates.
... RAID Level 1 | 51/234 |
The disk manager and RAID controller have to perform a mapping something like:
writePage(PageId)
to
n = ndisksInPartition disk = diskOf(PageId,n) cyln = cylinderOf(PageId) plat = platterOf(PageId) sect = sectorOf(PageId) writeDiskPage(disk, cyln, plat, sect) writeDiskPage(disk+n, cyln, plat, sect)
RAID levels 2-6 | 52/234 |
The higher levels of raid incorporate various combinations of:
The differences are primarily in:
RAID level 6 can recover from smultaneous failures in two disks.
Disk Media Failure | 53/234 |
Rarely, a bit will be transferred to/from the disk incorrectly.
Error-correcting codes can check for and recover from this.
If recovery is not possible, the operation can simply be repeated.
If repeated reads/writes on the same block fail:
Database Objects | 54/234 |
DBMSs maintain various kinds of objects/information:
can be viewed as an super-object for all others | ||
global configuration information | ||
meta-information describing database contents | ||
named collections of tuples | ||
collections of typed field values | ||
access methods for efficient searching | ||
for handling rollback/recovery | ||
active elements |
... Database Objects | 55/234 |
The disk manager implements how DB objects are mapped to file system.
References to data objects typically reduce to e.g.
Offset
PageId
RecordID
PageId+Offset
Single-file DBMS | 56/234 |
One possible storage organisation is a single file for the entire database.
All objects are allocated to regions of this file.
Objects are allocated to regions (segments) of the file.
If an object grows too large for allocated segment, allocate an extension.
What happens to allocated space when objects are removed?
... Single-file DBMS | 57/234 |
Allocating space in Unix files is easy:
Single-file Disk Manager | 58/234 |
Simple disk manager for a single-file database:
// Disk Manager data/functions #define PAGESIZE 2048// bytes per page typedef int PageId;// PageId is block index typedef struct DBrec { char *dbname;// copy of database name int fd;// the database file SpaceTable map;// map of free/used areas NameTable names;// map names to areas + sizes } *DB; typedef struct Relrec { char *relname;// copy of table name int start;// page index of start of table data int npages;// number of pages of table data ... } * Rel;
... Single-file Disk Manager | 59/234 |
DB openDatabase(char *name) { DB db = new(struct DBrec); db->dbname = strdup(name); db->fd = open(name,O_RDWR); db->map = readSpaceTable(DBfd); db->names = readNameTable(DBfd); return db; }// stop using DB and update all meta-data void closeDatabase(DB db) { writeSpaceTable(db->fd,db->map); writeNameTable(db->fd,db->map); fsync(db->fd); close(db->fd); free(db); }
... Single-file Disk Manager | 60/234 |
// set up struct describing relation Rel openRelation(DB db, char *rname) { Rel r = new(struct Relrec); r->relname = strdup(rname);// get relation data from map tables r->start = ...; r->npages = ...; return r; }// stop using a relation void closeRelation(Rel r) { free(r); } #define nPages(r) (r->npages) #define makePageId(r,i) (r->first + i)
... Single-file Disk Manager | 61/234 |
// assume that Page = byte[PageSize] // assume that PageId = block number in file // read page from file into memory buffer void get_page(DB db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); read(db->fd, buf, PAGESIZE); }// write page from memory buffer to file void put_page(Db db, PageId p, Page buf) { lseek(db->fd, p*PAGESIZE, SEEK_SET); write(db->fd, buf, PAGESIZE); }
... Single-file Disk Manager | 62/234 |
// managing contents of mapping table is complex // assume a list of (offset,length,status) tuples // allocate n new pages at end of file PageId allocate_pages(int n) { int endfile = lseek(db->fd, 0, SEEK_END); addNewEntry(db->map, endfile, n);// note that file itself is not changed }// drop n pages starting from p void deallocate_pages(PageId p, int n) { markUnused(db->map, p, n);// note that file itself is not changed }
Example: Scanning a Relation | 63/234 |
With the above disk manager, our example:
select name from Employee
might be implemented as something like
DB db = openDatabase("myDB"); Rel r = openRelation(db,"Employee"); int npages = nPages(r); Page buffer = malloc(PAGESIZE*sizeof(char)); for (int i = 0; i < npages; i++) { PageId pid = makePageId(r,i); get_page(db, pid, buffer); foreach tuple in buffer { get tuple data and extract name } }
Multiple-file Disk Manager | 64/234 |
Most DBMSs don't use a single large file for all data.
They typically provide:
... Multiple-file Disk Manager | 65/234 |
Structure of PageId
If system uses one file per table, PageId
PageId
Oracle File Structures | 66/234 |
Oracle uses five different kinds of files:
catalogue, tables, procedures | ||
update logs | ||
record system events | ||
configuration info | ||
off-line collected updates |
... Oracle File Structures | 67/234 |
There may be multiple instances of each kind of file:
SYSTEM
... Oracle File Structures | 68/234 |
Tablespaces are logical units of storage (cf directories).
Every database object resides in exactly one tablespace.
Units of storage within a tablespace:
fixed size unit of storage (cf 2KB page) | ||
specific number of contiguous data blocks | ||
set of extents allocated to a single database object |
Segments can span multiple data files; extents cannot.
To be confusing, tables are called datafiles internally in Oracle.
... Oracle File Structures | 69/234 |
Layout of data within Oracle file storage:
PostgreSQL Storage Manager | 70/234 |
PostgreSQL uses the following file organisation ...
... PostgreSQL Storage Manager | 71/234 |
Components of storage subsystem:
RelFileNode
storage/smgr
storage/smgr/md.c
storage/file
smgr
Relations as Files | 72/234 |
PostgreSQL identifies relation files via their OIDs.
The core data structure for this is RelFileNode
typedef struct RelFileNode { Oid spcNode;// tablespace Oid dbNode;// database Oid relNode;// relation } RelFileNode;
Global (shared) tables (e.g. pg_database
spcNode == GLOBALTABLESPACE_OID
dbNode == 0
... Relations as Files | 73/234 |
The relpath
RelFileNode
char *relpath(RelFileNode rnode)// simplified { char *path = malloc(ENOUGH_SPACE); if (rnode.spcNode == GLOBALTABLESPACE_OID) {/* Shared system relations live in PGDATA/global */ Assert(rnode.dbNode == 0); sprintf(path, "%s/global/%u", DataDir, rnode.relNode); } else if (rnode.spcNode == DEFAULTTABLESPACE_OID) {/* The default tablespace is PGDATA/base */ sprintf(path, "%s/base/%u/%u", DataDir, rnode.dbNode, rnode.relNode); } else {/* All other tablespaces accessed via symlinks */ sprintf(path, "%s/pg_tblspc/%u/%u/%u", DataDir rnode.spcNode, rnode.dbNode, rnode.relNode); } return path; }
File Descriptor Pool | 74/234 |
Unix has limits on the number of concurrently open files.
PostgreSQL maintains a pool of open file descriptors:
open()
typedef char *FileName
Open files are referenced via: typedef int File
A File
... File Descriptor Pool | 75/234 |
Interface to file descriptor (pool):
File FileNameOpenFile(FileName fileName, int fileFlags, int fileMode);// open a file in the database directory ($PGDATA/base/...) File PathNameOpenFile(FileName fileName, int fileFlags, int fileMode);// open a file given a full file path File OpenTemporaryFile(bool interXact);// open temp file; flag: close at end of transaction? void FileClose(File file); void FileUnlink(File file); int FileRead(File file, char *buffer, int amount); int FileWrite(File file, char *buffer, int amount); int FileSync(File file); long FileSeek(File file, long offset, int whence); int FileTruncate(File file, long offset);
... File Descriptor Pool | 76/234 |
Virtual file descriptors (Vfd
VfdCache[0]
... File Descriptor Pool | 77/234 |
Virtual file descriptor records (simplified):
typedef struct vfd { s_short fd;// current FD, or VFD_CLOSED if none u_short fdstate;// bitflags for VFD's state File nextFree;// link to next free VFD, if in freelist File lruMoreRecently;// doubly linked recency-of-use list File lruLessRecently; long seekPos;// current logical file position char *fileName;// name of file, or NULL for unused VFD // NB: fileName is malloc'd, and must be free'd when closing the VFD int fileFlags;// open(2) flags for (re)opening the file int fileMode;// mode to pass to open(2) } Vfd;
File Manager | 78/234 |
The "magnetic disk storage manager"
PageId
PageId
typedef struct { RelFileNode rnode;// which relation ForkNumber forkNum;// which fork BlockNumber blockNum;// which block } BufferTag;
... File Manager | 79/234 |
Access to a block of data proceeds as follows:
offset = BlockNumber * BLCKSZ fileID = RelFileNode+ForkNumber if (fileID is already in Vfd pool) { if (offset is in this file) fd = use Vfd from pool else fd = allocate new Vfd for next part of file } else { fd = allocate new Vfd for this file } seek to offset in fd read/write data page (BLCKSZ bytes)
BLCKSZ
Buffer Pool |
Buffer Manager | 81/234 |
Aim:
Buffer pool
... Buffer Manager | 82/234 |
Buffer pool interposed between access methods and disk manager
Access methods/page manager normally work via get_page()
now work via calls to get_page_via_buffer_pool()
... Buffer Manager | 83/234 |
Basic buffer pool interface
Page request_page(PageId p);
p
void release_page(PageId p);
p
void mark_page(PageId p);
p
void flush_page(PageId p);
p
void hold_page(PageId p);
p
allocate_page
deallocate_page
Buffer Pool | 84/234 |
... Buffer Pool | 85/234 |
Buffer pool data structures:
... Buffer Pool | 86/234 |
In subsequent discussion, we assume:
Requesting Pages | 87/234 |
Call from client: request_page(pid)
If page pid
If page pid
... Requesting Pages | 88/234 |
Advantages:
Releasing Pages | 89/234 |
The release_page
If the page has been modified, must be written to disk before replaced.
Possible problem: changes not immediately reflected on disk
... Releasing Pages | 90/234 |
Advantages:
Buffer Manager Example #1 | 91/234 |
Self join: an example where buffer pool achieves major efficiency gains.
Consider a query to find pairs of employees with the same birthday:
select e1.name, e2.name from Employee e1, Employee e2 where e1.id < e2.id and e1.birthday = e2.birthday
This might be implemented inside the DBMS via nested loops:
for each tuple t1 in Employee e1 { for each tuple t2 in Employee e2 { if (t1.id < t2.id && t1.birthday == t2.birthday) append (t1.name,t2.name) to result set } }
... Buffer Manager Example #1 | 92/234 |
In terms of page-level operations, the algorithm looks like:
DB db = openDatabase("myDB"); Rel emp = openRel(db,"Employee"); int npages = nPages(emp); for (int i = 0; i < npages; i++) { PageId pid1 = makePageId(emp,i); Page p1 = request_page(pid1); for (int j = 0; j < npages; i++) { PageId pid2 = makePageId(emp,j); Page p2 = request_page(pid2);// compare all pairs of tuples from p1,p2 // construct solution set from matching pairs release_page(pid2); } release_page(pid1); }
... Buffer Manager Example #1 | 93/234 |
Consider a buffer pool with 200 frames and a relation with b ≤ 200 pages:
p1
p2
p2
Employee
... Buffer Manager Example #1 | 94/234 |
Now consider a buffer pool with 2 frames (the minimum required for the join):
p1
p2
p2
p2
... Buffer Manager Example #1 | 95/234 |
(... continued)
p2
p2
p1
p2
p1
Cf. 200-frame buffer vs 2-frame buffer ... if b=100, 100 reads vs 10000 reads.
The request_page | 96/234 |
Method:
dirty=False
pinCount=0
Page
Other Buffer Operations | 97/234 |
The release_page
The mark_page
operation:
The flush_page
write_page
Page Replacement Policies | 98/234 |
Several schemes are commonly in use:
For DBMS, we can predict patterns of page access better
(from our knowledge of how the relational operations are implemented)
... Page Replacement Policies | 99/234 |
The cost benefits from a buffer pool (with n frames) is determined by:
First scan costs b reads; subsequent scans are ``free''.
Example (b): sequential scan, MRU, n < b, no competition
First scan costs b reads; subsequent scans cost b - n reads.
Example (c): sequential scan, LRU, n < b, no competition
All scans cost b reads; known as sequential flooding.
Page Access Times | 100/234 |
How to determine when a page in the buffer was last accessed?
Could simply use the time of the last request_page
PageId
But this doesn't reflect real accesses to page.
For more realism, could use last request_page
release_page
Or could introduce operations for examining and modifying pages in pool:
examine_page(PageId, TupleId)
modify_page(PageId, TupleId, Tuple)
Buffer Manager Example #2 | 101/234 |
Standard join: an example where replacement policy can have large impact.
Consider a query to find customers who are also employees:
select c.name from Customer c, Employee e where c.ssn = e.ssn;
This might be implemented inside the DBMS via nested loops:
for each tuple t1 in Customer { for each tuple t2 in Employee { if (t1.ssn == t2.ssn) append (t1.name) to result set } }
... Buffer Manager Example #2 | 102/234 |
Assume that:
Customer
Employee
... Buffer Manager Example #2 | 103/234 |
Works well with MRU strategy:
Customer
Employee
Customer
Employee
Note: assumes that both request_page
release_page
... Buffer Manager Example #2 | 104/234 |
Works less well with LRU strategy:
Customer
Employee
Employee
Customer
Employee
PostgreSQL Buffer Manager | 105/234 |
PostgreSQL buffer manager:
Buffers are located in a large region of shared memory.
Functions: src/backend/storage/buffer/*.c
Definitions: src/include/storage/buf*.h
... PostgreSQL Buffer Manager | 106/234 |
Buffer pool consists of:
Nbuffers
BufferDesc
Nbuffers
Buffer
BufferDesc
Buffer
shared_buffers = 16MB# min 128KB, at least max_connections*2, 8KB each
... PostgreSQL Buffer Manager | 107/234 |
... PostgreSQL Buffer Manager | 108/234 |
Definitions related to buffer manager:
include/storage/buf.h
Buffer
include/storage/bufmgr.h
include/storage/buf_internals.h
BufferDesc
backend/storage/buffer/
Buffer Pool Data Objects | 109/234 |
BufferDescriptors
Buffer
BufferDescriptors
BufMgrLock
BufferTag
... Buffer Pool Data Objects | 110/234 |
Buffer manager data types:
BufFlags: BM_DIRTY, BM_VALID, BM_TAG_VALID, BM_IO_IN_PROGRESS, ... typedef struct buftag { RelFileNode rnode;/* physical relation identifier */ ForkNumber forkNum; BlockNumber blockNum;/* relative to start of reln */ } BufferTag; typedef struct sbufdesc {(simplified) BufferTag tag;/* ID of page contained in buffer */ BufFlags flags;/* see bit definitions above */ uint16 usage_count;/* usage counter for clock sweep */ unsigned refcount;/* # of backends holding pins */ int buf_id;/* buffer's index number (from 0) */ int freeNext;/* link in freelist chain */ ... } BufferDesc;
Buffer Pool Functions | 111/234 |
Buffer manager interface:
Buffer ReadBuffer(Relation r, BlockNumber n)
r
Buffer
ForkNumber
ReadBuffer_Common
... Buffer Pool Functions | 112/234 |
Buffer manager interface (cont):
void ReleaseBuffer(Buffer buf)
void MarkBufferDirty(Buffer buf)
... Buffer Pool Functions | 113/234 |
Additional buffer manager functions:
Page BufferGetPage(Buffer buf)
BufferIsPinned(Buffer buf)
CheckPointBuffers
... Buffer Pool Functions | 114/234 |
Important internal buffer manager function:
BufferDesc *BufferAlloc(
Relation r, ForkNumber f,
BlockNumber n, bool *found)
ReadBuffer
ReadBuffer
Clock-sweep Replacement Strategy | 115/234 |
PostgreSQL page replacement strategy: clock-sweep
NextVictimBuffer
usage_count
NextVictimBuffer
Record/Tuple Management |
Views of Data | 117/234 |
The disk and buffer manager provide the following view:
PageId
RecordId
RID
... Views of Data | 118/234 |
The abstract view of a relation:
... Views of Data | 119/234 |
We use the following low-level abstractions:
RecPage
byte[]
Record
... Views of Data | 120/234 |
We use the following high-level abstractions:
Relation
Tuple
Records vs Tuples | 121/234 |
A table is defined by a collection of attributes (schema), e.g.
create table Employee ( id# integer primary key, name varchar(20), -- or char(20) job varchar(10), -- or char(10) dept number(4) );
A tuple is a collection of attribute values for such a schema, e.g.
(33357462, 'Neil Young', 'Musician', 0277)
A record is a sequence of bytes, containing data for one tuple.
Record Management | 122/234 |
Aim:
Tuple
Record
RecordId
Tuple
RecPage
Page-level Operations | 123/234 |
Operations to access records from a page ...
Record get_record(RecordId rid)
rid
Record
Record first_record()
Record
Record next_record()
Record
null
... Page-level Operations | 124/234 |
Operations to make changes to records in a page ...
void update_record(RecordId rid, Record rec)
rid
rec
RecordId insert_record(Record rec)
rid
void delete_record(RecordId rid)
rid
Tuple-level Operations | 125/234 |
Typ get
Field(int fno)
fno
Tuple
getIntField(1)
getStringField(2)
void
set
Field(int fno,
val)
fno
Tuple
val
setIntField(1,42)
setStringField(2,"abc")
Also need operations to convert between Record
Tuple
Relation-level Operations | 126/234 |
Tuple get_tuple(RecordId rid)
rid
Tuple
Tuple first_tuple()
Tuple
Tuple next_tuple()
Tuple
null
Tuples
Tuples
Records
Example Query | 127/234 |
Recall previous example of simple scan of a relation:
select name from Employee
implemented as:
DB db = openDatabase("myDB"); Rel r = openRel(db,"Employee"); Scan s = startScan(r); Tuple t; while ((t = nextTuple(s)) != NULL) { char *name = getField(t,"name"); printf("%s\n", name); }
... Example Query | 128/234 |
Conceptually, the scanning implementation is simple:
// maintain "current" state of scan struct ScanRec { Rel curRel; RecId curRec }; typedef struct ScanRec *Scan; Scan startScan(Rel r) { Scan s = malloc(sizeof(struct ScanRec)); s->curRec = firstRecId(r); return s; } Tuple nextTuple(Scan s) { Tuple t = fetchTuple(s->curRec); s->curRec = nextRecId(r,s->curRec); return t; }
... Example Query | 129/234 |
The real implementation relies on the buffer manager:
struct ScanRec { Rel curRel; PageId curPID; RecPage curPage; }; typedef struct ScanRec *Scan; Scan startScan(Rel r) { Scan s = malloc(sizeof(struct ScanRec)); s->curPID = firstPageId(r); Buffer page = request_page(s->curPage); s->curPage = start_page_scan(page); return s; }
... Example Query | 130/234 |
And similarly the nextTuple()
Tuple nextTuple(Scan s) {// if more records in the current page Tuple t; if (t = next_rec_in_page(s->curPage)) != NULL) return t; while (t == null) {// current page finished release_page(s->curPID);// release current page s->curPID = next_page_id(s->curRel, s->curPID);// ... and if no more pages, then finished if (s->curPID == NULL) return NULL; Buffer page = request_page(s->curPID); s->curPage = start_page_scan(page); t = next_rec_in_page(s->curPage); } return t; }
Record Identifiers | 131/234 |
The implementation of RecordID
A RecordId
PageId
Some DBMSs provide ROWID
PostgreSQL provides a unique OID for every row in the database.
... Record Identifiers | 132/234 |
RecordID
0x0000, 0x1000, 0x2000, 0x3000, ...
RecordId
Example RecordId | 133/234 |
Consider a DBMS like Oracle which uses a small number of large files.
Suitable RecordId
(Note: however you partition the bits, you can address at most 4 billion records)
... Example RecordId | 134/234 |
Consider a DBMS like MiniSQL, which uses one data file per relation.
One possibility is a variation on the Oracle approach:
RecordId
rid
Manipulating RecordId | 135/234 |
Functions for constructing/interrogating RecordId
typedef unsigned int RecordId; RecordId makeRecordId(int file, int page, int slot) { return (file << 28) | (page << 8) | (slot); } int fileNo(RecordId rid) { return (rid >> 28) & 0xF; } int pageNo(RecordId rid) { return (rid >> 8) & 0xFFFFF; } int slotNo(RecordId rid) { return rid & 0xFF; }
... Manipulating RecordId | 136/234 |
Alternative implementation if details of file/page are hidden within PageId
typedef unsigned int PageId;//only uses 24-bits typedef unsigned int RecordId; RecordId makeRecordId(PageId pid, int slot) { return (pid << 8) | (slot); } int pageId(RecordId rid) { return (rid >> 8) & 0xFFFFFF; } int slotNo(RecordId rid) { return rid & 0xFF; }
Record Formats | 137/234 |
Records are stored within fixed-length pages.
Records may be fixed-length:
Fixed-length Records | 138/234 |
Encoding scheme for fixed-length records:
Since record format is frequently consulted at query time, it should be memory-resident.
... Fixed-length Records | 139/234 |
Advantages of fixed-length records:
... Fixed-length Records | 140/234 |
Handling attempts to insert values larger than available fields:
RecordId
varchar
Variable-length Records | 141/234 |
Some encoding schemes for variable-length records:
... Variable-length Records | 142/234 |
More encoding schemes for variable-length records:
<employee> <id#>33357462</id#> <dept>0277</dept> <name>Neil Young</name> <job>Musician</job> </employee>
Tuple
Record
... Variable-length Records | 143/234 |
Advantages of variable-length records:
Spanned Records | 144/234 |
How to handle record that does not fit into free space in page?
Two approaches:
... Spanned Records | 145/234 |
Advantages of spanned records:
Converting Records to Tuples | 146/234 |
A Record
byte[]
Tuple
... Converting Records to Tuples | 147/234 |
DBMSs typically define a fixed set of field types for use in schema.
E.g. DATE
FLOAT
INTEGER
NUMBER(
)
VARCHAR(
)
This determines the primitive types to be handled in the implementation:
DATE |
time_t |
|
FLOAT |
float,double |
|
INTEGER |
int,long |
|
NUMBER( ) |
int[] |
|
VARCHAR( ) |
char[] |
Defining Tuple | 148/234 |
To convert a Record
Tuple
FieldDesc
RelnDesc
typedef struct { short offset;// index of starting byte short length;// number of bytes Types type;// reference to Type data } FieldDesc; typedef struct { char *relname;// relation name ushort nfields;// # of fields FieldDesc fields[];// field descriptors } RelnDesc;
... Defining Tuple | 149/234 |
For the example relation:
FieldDesc fields[] = malloc(4*sizeof(FieldDesc); fields[0] = FieldDesc(0,4,INTEGER); fields[1] = FieldDesc(4,20,VARCHAR); fields[2] = FieldDesc(24,10,CHAR); fields[3] = FieldDesc(34,4,NUMBER);
This defines the schema
... Defining Tuple | 150/234 |
A Tuple
Record
typedef struct { Record data;// pointer to data ushort nfields;// # fields FieldDesc fields[];// field descriptions } Tuple;
... Defining Tuple | 151/234 |
A Tuple
Record
RelnDesc
It also necessary to know how the Record
Assume the following Record
Assume also that lengths are 1-byte quantities (no field longer than 256-bytes).
... Defining Tuple | 152/234 |
How the Record
Tuple
Tuple mkTuple(RelnDesc schema, Record record) { int i, pos = 0; int size = sizeof(Tuple) + (nfields-1)*sizeof(FieldDesc); Tuple *t = malloc(size); t->data = record; t->nfields = schema.nfields; for (i=0; i < schema.nfields; i++) { int len = record[pos++]; t->fields[i].offset = pos; t->fields[i].length = len;// could add checking for over-length fields, etc. t->fields[i].type = schema.fields[i].type; pos += length; } return t; }
PostgreSQL Tuples | 153/234 |
Definitions: src/include/access/*tup*.h
Functions: src/backend/access/common/*tup*.c
PostgreSQL defines tuples via:
Datum
... PostgreSQL Tuples | 154/234 |
Tuple structure:
... PostgreSQL Tuples | 155/234 |
Tuple-related data types:
// representation of a data value // may be the actual value, or may be a pointer to it typedef unitptr_t Datum;
The actual data value:
Datum
int
... PostgreSQL Tuples | 156/234 |
Tuple-related data types: (cont)
typedef struct HeapTupleFields// simplified { TransactionId t_xmin;// inserting xact ID TransactionId t_xmax;// deleting or locking xact ID CommandId t_cid;// inserting/deleting command ID, or both } HeapTupleFields; typedef struct HeapTupleHeaderData// simplified { HeapTupleFields t_heap; ItemPointerData t_ctid;// current TID of this or newer tuple uint16 t_infomask2;// number of attributes + flags uint16 t_infomask;// flags e.g. has_null, has_varwidth uint8 t_hoff;// sizeof header incl. bitmap+padding // above is fixed size (23 bytes) for all heap tuples bits8 t_bits[1];// bitmap of NULLs, variable length // actual data follows at end of struct } HeapTupleHeaderData;
... PostgreSQL Tuples | 157/234 |
typedef struct tupleDesc { int natts;// number of attributes in the tuple Form_pg_attribute *attrs;// array of pointers to attr descriptors TupleConstr *constr;// constraints, or NULL if none Oid tdtypeid;// composite type ID for tuple type int32 tdtypmod;// typmod for tuple type bool tdhasoid;// tuple has oid attribute in its header int tdrefcount;// reference count, -1 if not counting } *TupleDesc;
... PostgreSQL Tuples | 158/234 |
Operations on Tuples:
// create Tuple from values HeapTuple heap_form_tuple(TupleDesc tupDesc, Datum *values, bool *isnull)// return Datum given Tuple, attr and descriptor // sets isnull to true if value is NULL #define heap_getattr(tup, attnum, tupleDesc, isnull) ...// returns true if attribute has no value bool heap_attisnull(HeapTuple tup, int attnum) ...// produce a modified tuple from an existing one HeapTuple heap_modify_tuple(HeapTuple tuple, TupleDesc tupleDesc, Datum *replValues, bool *replIsnull, bool *doReplace)
Page Formats | 159/234 |
Ultimately, a Page
byte[]
We want to interpret/manipulate it as a collection of Record
Typical operations on Page
get(rid)
TupleId
first()
Page
next()
Page
insert(rec)
Page
update(rid,rec)
delete(rid)
Page
... Page Formats | 160/234 |
Factors affecting Page
Page
Page
Page
Page
Page
... Page Formats | 161/234 |
For fixed-length records, use record slots.
Insertion: place new record in first available slot.
Deletion: two possibilities for handling free record slots:
... Page Formats | 162/234 |
Problem with packed format and no slot directory
Problem with unpacked/bitmap format
... Page Formats | 163/234 |
For variable-length records, use slot directory.
Possibilities for handling free-space within block:
... Page Formats | 164/234 |
Compacted free space:
Note: "pointers" are implemented as word offsets within block.
... Page Formats | 165/234 |
Fragmented free space:
Storage Utilisation | 166/234 |
How many records can fit in a page? (How long is a piece of string?)
Depends on: page size, (avg) record size, slot directory, ...
For a typical DBMS application
... Storage Utilisation | 167/234 |
Example of determining space utilisation ...
Assumptions:
(integer,varchar(20),char(10),number(4))
char(10)
PageId
... Storage Utilisation | 168/234 |
Max record size = 4(offsets) + 4 + 20 + 12 + 4 = 44 bytes
Minimum number of records = 1024/44 = 23 (assume all max size and no directory)
Average number of records = 1024/40 = 25 (assume no directory)
So, allow 32 directory slots (5-bit slot indexes), and 32 bytes for directory.
Number of records = Nr, where 44 × Nr + 32+4 ≤ 1024
Aim to maximise Nr, so Nr = 22
Notes: because there are 32 slots, could have up to 32 (small) records
... Storage Utilisation | 169/234 |
If we switched to 8KB pages, then
So, allow 256 slots (8-bit slot indexes), and 352 bytes for directory (256*11bits)
Number of records = Nr, where 44 × Nr + 352 ≤ 8192
Aim to maximise Nr, so Nr = 178
Could reduce size of directory to allow more records ... but only so far.
Note: 11-bit directory entries also means that it's costly to access them.
Overflows | 170/234 |
Sometimes, it may not be possible to insert a record into a page:
If there is still insufficient space, we have one of the other cases.
... Overflows | 171/234 |
How the other cases are handled depends on the file organisation:
... Overflows | 172/234 |
Overflow files for very large records and BLOBs:
... Overflows | 173/234 |
Page-based handling of overflows:
PageId
Useful for scan-all-records type operations.
... Overflows | 174/234 |
Record-based handling of overflows:
Useful for locating specific record via rid.
PostgreSQL Page Representation | 175/234 |
Functions: src/backend/storage/page/*.c
Definitions: src/include/storage/bufpage.h
Each page is 8KB (default BLCKSZ
... PostgreSQL Page Representation | 176/234 |
PostgreSQL tuple page layout:
... PostgreSQL Page Representation | 177/234 |
Page-related data types:
// a Page is simply a pointer to start of buffer typedef Pointer Page;// indexes into the tuple directory typedef unit16 LocationIndex;// entries in tuple directory (line pointer array) typedef struct ItemIdData { unsigned lp_off:15,// tuple offset from start of page lp_flags:2,// state of item pointer lp_len:15;// byte length of tuple } ItemIdData;
... PostgreSQL Page Representation | 178/234 |
Page-related data types: (cont)
typedef struct PageHeaderData { ... uint16 pd_flags;// flag bits (e.g. free, full, ... LocationIndex pd_lower;// offset to start of free space LocationIndex pd_upper;// offset to end of free space LocationIndex pd_special;// offset to start of special space uint16 pd_pagesize_version; ... ItemIdData pd_linp[1];// beginning of line pointer array } PageHeaderData; typedef PageHeaderData *PageHeader;
... PostgreSQL Page Representation | 179/234 |
Operations on Page
void PageInit(Page page, Size pageSize, ...)
Page
pd_lower
pd_upper
OffsetNumber
PageAddItem(Page page, Item item, Size size, ...)
Page
void PageRepairFragmentation(Page page)
... PostgreSQL Page Representation | 180/234 |
PostgreSQL has two kinds of pages:
One important difference:
Representing Database Objects |
Database Objects | 182/234 |
RDBMSs manage different kinds of objects
How are the different types of objects represented?
How do we go from a name (or OID) to bytes stored on disk?
... Database Objects | 183/234 |
Top-level "objects" in typical SQL standard databases:
catalog ... SQL terminology for a database
Schema.Relation
... Database Objects | 184/234 |
Consider what information the RDBMS needs about relations:
All of this information is stored in the system catalog.
(The "system catalog" is also called "data dictionary" or "system view")
In most RDBMSs, the catalog itself is also stored as tables.
... Database Objects | 185/234 |
Standard for catalogs in SQL:2003: INFORMATION_SCHEMA
Schemata(catalog_name, schema_name, schema_owner, ...) Tables(table_catalog, table_schema, table_name, table_type, ...) Columns(table_catalog, table_schema, table_name, column_name, ordinal_position, column_default, is_nullable, data_type, ...) Views(table_catalog, table_schema, table_name, view_definition, check_option, is_updatable, is_insertable_into) Role_table_grants(grantor, grantee, privilege_type, is_grantable, table_catalog, table_schema, table_name, ...)etc. etc.
For complete details, see Section 30 of the PostgreSQL 8.0.3 documentation.
System Catalog | 186/234 |
Most DBMSs also have their own internal catalog structure.
Would typically contain information such as:
Users(id:int, name:string, ...) Databases(id:int, name:string, owner:ref(User), ...) Schemas(id:int, name:string, owner:ref(User), ...) Types(id:int, name:string, defn:string, size:int, ...) Tables(id:int, name:string, owner:ref(User), inSchema:ref(Schema), ...) Attributes(id, name:string, table:ref(Table), type:ref(Type), pkey:bool, ...)etc. etc.
Standard SQL INFORMATION_SCHEMA
... System Catalog | 187/234 |
The catalog is manipulated by a range of SQL operations:
create
as
drop
alter
grant
on
E.g. consider an SQL DDL operation such as:
create table ABC ( x integer primary key, y integer );
... System Catalog | 188/234 |
This would produce a set of catalog changes something like ...
userID := current_user(); schemaID := current_schema(); tabID := nextval('tab_id_seq'); select into intID id from Types where name='integer'; insert into Tables(id,name,owner,inSchema,...) values (tabID, 'abc', userID, schema, ...) attrID := nextval('attr_id_seq'); insert into Attributes(id,name,table,type,pkey,...) values (attrID, 'x', tabID, intID, true, ...) attrID := nextval('attr_id_seq'); insert into Attributes(id,name,table,type,pkey,...) values (attrID, 'y', tabID, intID, false, ...)
... System Catalog | 189/234 |
In PostgreSQL, the system catalog is available to users via:
psql
\d
information_schema
select * from information_schema.tables;
pg_catalog
pg_tables
PostgreSQL Catalog | 190/234 |
The \d?
psql
\dt |
list information about tables | |
\dv |
list information about views | |
\df |
list information about functions | |
\dp |
list table access privileges | |
\dT |
list information about data types | |
\dd |
shows comments attached to DB objects |
... PostgreSQL Catalog | 191/234 |
A PostgreSQL installation typically has several databases.
Some catalog information is global, e.g.
PGDATA/pg_global
... PostgreSQL Catalog | 192/234 |
Global installation data is recorded in shared tables
... PostgreSQL Catalog | 193/234 |
PostgreSQL tuples contain
create table
oid |
unique identifying number for tuple (optional) | |
tableoid |
which table this tuple belongs to | |
xmin/xmax |
which transaction created/deleted tuple (for MVCC) |
Representing Users/Groups | 194/234 |
In version 8, PostgreSQL merged notions of users/groups into roles.
Represented by two base tables: pg_authid
pg_auth_members
View pg_shadow
pg_authid
View pg_user
pg_shadow
CREATE
ALTER
DROP
USER
pg_authid
CREATE
ALTER
DROP
GROUP
pg_auth_members
Both tables are global (shared across all DBs in a cluster).
... Representing Users/Groups | 195/234 |
pg_authid
oid |
unique integer key for this role | |
rolname |
symbolic name for role (PostgreSQL identifier) | |
rolpassword |
plain or md5-encrypted password | |
rolcreatedb |
can create new databases | |
rolsuper |
is a superuser (owns server process) | |
rolcatupdate |
can update system catalogs |
etc. etc.
... Representing Users/Groups | 196/234 |
pg_shadow
usename |
symbolic user name (e.g. 'jas' |
|
usesysid |
integer key to reference user (pg_authid.oid |
|
passwd |
plain or md5-encrypted password | |
usecreatdb |
can create new databases | |
usesuper |
is a superuser (owns server process) | |
usecatupd |
can update system catalogs |
etc. etc.
... Representing Users/Groups | 197/234 |
pg_group
groname |
group name (e.g. 'developers' |
|
grosysid |
integer key to reference group | |
grolist[] |
array containing group members (vector of refs to pg_authid.oid |
Note the use of multi-valued attribute (PostgreSQL extension)
Representing High-level Objects | 198/234 |
Above the level of individual DB schemata, we have:
pg_database
pg_namespace
pg_tablespace
Keys are names (strings) and must be unique within cluster.
... Representing High-level Objects | 199/234 |
pg_database
datname |
database name (e.g. 'mydb' |
|
datdba |
database owner (refs pg_authid.oid |
|
datpath |
where files for database are stored (if not in the PGDATA directory) |
|
datacl[] |
access permissions | |
datistemplate |
can be used to clone new databases (e.g. template0 template1 |
etc. etc.
... Representing High-level Objects | 200/234 |
Digression: access control lists (acl
PostgreSQL represents access via an array of access elements.
Each access element contains:
UserName=Privileges/Grantor group GroupName=Privileges/Grantor
where Privileges is a string enumerating privileges, e.g.
jas=arwdRxt/jas,fred=r/jas,joe=rwad/jas
... Representing High-level Objects | 201/234 |
pg_namespace
nspname |
namespace name (e.g. 'public' |
|
nspowner |
namespace owner (refs pg_authid.oid |
|
nspacl[] |
access permissions |
Note that nspname
... Representing High-level Objects | 202/234 |
pg_tablespace
spcname |
tablespace name (e.g. 'disk5' |
|
spcowner |
tablespace owner (refs pg_authid.oid |
|
spclocation |
full filepath to tablespace directory | |
spcacl[] |
access permissions |
Two pre-defined tablespaces:
pg_default
pg_global
Representing Tables | 203/234 |
Entries in multiple catalog tables are required for each user-level table.
Due to O-O heritage, base table for tables is called pg_class
The pg_class
CREATE TYPE AS
pg_class
Tuples in pg_class
... Representing Tables | 204/234 |
pg_class
relname |
name of table (e.g. employee |
|
relnamespace |
schema in which table defined (refs pg_namespace.oid |
|
reltype |
data type corresponding to table (refs pg_type.oid |
|
relowner |
owner (refs pg_authid.oid |
|
reltuples |
# tuples in table | |
relacl |
access permissions |
... Representing Tables | 205/234 |
pg_class
relkind |
what kind of object 'r' = ordinary table, 'i' = index, 'v' = view 'c' = composite type, 'S' = sequence, 's' = special |
|
relnatts |
# attributes in table (how many entries in pg_attribute |
|
relchecks |
# of constraints on table (how many entries in pg_constraint |
|
relhasindex |
table has/had an index? | |
relhaspkey |
table has/had a primary key? |
etc.
... Representing Tables | 206/234 |
pg_type
typname |
name of type (e.g. 'integer' |
|
typnamespace |
schema in which type defined (refs pg_namespace.oid |
|
typowner |
owner (refs pg_authid.oid |
|
typtype |
what kind of data type 'b' = base type, 'c' = complex (row) type, ... |
Note: a complex type is automatically created for each table
(defines "type" for each tuple in table; also, type for functions returning SETOF
... Representing Tables | 207/234 |
pg_type
typlen |
how much storage used for values (-1 for variable-length types, e.g. text |
|
typalign |
memory alignment for values ('c' = byte-boundary, 'i' = 4-byte-boundary, ...) |
|
typrelid |
table associated with complex type (refs pg_class.oid |
|
typstorage |
where/how values are stored ('p' = in-tuple, 'e' = in external table, compressed?) |
(We discuss more details of the pg_type
... Representing Tables | 208/234 |
pg_attribute
attname |
name of attribute (e.g. 'empname' |
|
attrelid |
table this attribute belongs to (refs pg_class.oid |
|
attnum |
attribute position (1..n, sys attrs are -ve) | |
atttypid |
data type of this attribute (refs pg_type.oid |
(attrelid,attnum)
... Representing Tables | 209/234 |
pg_attribute
attlen |
storage space required by attribute (copy of pg_type.typlen |
|
atttypmod |
storage space for var-length attributes (e.g. 6+ATTR_HEADER_SIZE for char(6) |
|
attalign |
memory-alignment info (copy of pg_type.typalign |
|
attndims |
number of dimensions if attr is an array |
... Representing Tables | 210/234 |
pg_attribute
attnotnull |
attribute may not be null? | |
atthasdef |
attribute has a default values (value is held in pg_attrdef |
|
attisdropped |
attribute has been dropped from table |
Also has notion of large data being stored in a separate table (so-called "TOAST" table).
... Representing Tables | 211/234 |
An SQL DDL statement like
create table MyTable ( a int unique not null, b char(6) );
will cause entries to be made in the following tables:
pg_class
pg_attribute
pg_type
... Representing Tables | 212/234 |
The example leads to a series of database changes like
rel_oid := new_oid(); user_id = current_user(); insert into pg_class(oid,name,owner,kind,pages,tuples,...) values (rel_oid, 'mytable', user_id, 'r', 0, 0, ...) select oid,typlen into int_oid,int_len from pg_type where typname = 'int'; insert into pg_attribute(relid,name,typid,num,len,typmod,notnull...) values (rel_oid, 'a', int_oid, 1, int_len, -1, true, ...) select oid,typlen into char_oid,char_len from pg_type where typname = 'char'; insert into pg_attribute(relid,name,typid,num,len,typmod,notnull...) values (rel_oid, 'b', char_oid, 2, -1, 6+4, false, ...) insert into pg_type(name,owner,len,type,relid,align,...) values ('mytable', user_id, 4, 'c', rel_oid, 'i', ...)
... Representing Tables | 213/234 |
pg_attrdef
adrelid |
table that column belongs to (refs pg_class.oid |
|
adnum |
which column in the table (refs pg_attribute.attnum |
|
adsrc |
readable representation of default value | |
adbin |
internal representation of default value |
... Representing Tables | 214/234 |
pg_constraint
conname |
name of constraint (not unique) | |
connamespace |
schema containing this constraint | |
contype |
kind of constraint 'c' = check, 'u' = unique, 'p' = primary key, 'f' = foreign key |
|
conrelid |
which table (refs pg_class.oid |
|
conkey |
which attributes (vector of values from pg_attribute.attnum |
|
consrc |
check constraint expression |
(Names are automatically generated from context (fkey,check
... Representing Tables | 215/234 |
For foreign-key constraints, pg_constraint
confrelid |
referenced table for foreign key | |
confkey |
key attributes in foreign table | |
conkey |
corresponding attributes in local table |
Foreign keys also introduce triggers to perform checking.
For column-specific constraints:
consrc |
readable check constraint expression | |
conbin |
internal check constraint expression |
... Representing Tables | 216/234 |
An SQL DDL statement like
create table MyOtherTable ( x int check (x > 0), y int references MyTable(a), z int default -1 );
will cause similar entries as before in catalogs, plus
pg_constraint
x
y
pg_attrdef
z
... Representing Tables | 217/234 |
The example leads to a series of database changes like
rel_oid := new_oid(); user_id = current_user(); insert into pg_class(oid,name,owner,kind,pages,tuples,...) values (rel_oid, 'myothertable', user_id, 'r', 0, 0, ...) select oid,typlen into int_oid,int_len from pg_type where typname = 'int'; select oid into old_oid from pg_class where relname='mytable';-- pg_attribute entries for attributes x=1, y=2, z=3 insert into pg_attrdef(relid,num,src,bin) values (rel_oid, 3, -1, {CONST :...}) insert into pg_constraint(type,relid,key,src,...) values ('c', rel_oid, {1}, '(x > 0)', ...) insert into pg_constraint(type,relid,key,frelid,fkey,...) values ('f', rel_oid, {2}, old_oid, {1}, ...)
Representing Functions | 218/234 |
Stored procedures (functions) are defined as
create function power(int x, int y) returns int as $$ declare i int; product int := 1; begin for i in 1..y loop product := product * x; end loop; return product; end; $$ language plpgsql;
Stored procedures are represented in the catalog via
pg_proc
pg_type
... Representing Functions | 219/234 |
pg_proc
proname |
name of function (e.g. substr |
|
pronamespace |
schema in which function defined (refs pg_namespace.oid |
|
proowner |
owner (refs pg_authid.oid |
|
proacl[] |
access permissions |
etc.
... Representing Functions | 220/234 |
pg_proc
pronargs |
how many arguments | |
prorettype |
return type (refs pg_type.oid |
|
proargtypes[] |
argument types (ref pg_type.oid |
|
proreset |
returns set of values of prorettype |
|
proisagg |
is function an aggregate? | |
proisstrict |
returns null if any arg is null | |
provolatile |
return value depends on side-effects? ('i' = immutable, 's'= stable, 'v' = volatile) |
... Representing Functions | 221/234 |
pg_proc
prolang |
what language function written in | |
prosrc |
source code if interpreted (e.g. PLpgSQL) | |
probin |
additional info on how to invoke function (interpretation is language-specific) |
... Representing Functions | 222/234 |
Consider two alternative ways of defining a x2 function.
sq.c int square_in_c(int x) { return x * x; } create function square(int) returns int as '/path/to/sq.o', 'square_in_c' language 'C';
or
create function square(int) returns int as $$ begin return $1 * $1; end; $$ language plpgsql;
... Representing Functions | 223/234 |
The above leads to a series of database changes like
user_id := current_user(); select oid,typlen into int_oid,int_len from pg_type where typname = 'int'; insert into pg_proc(name,owner,rettype,nargs,argtypes, prosrc,probin...) values ('square', user_id, int_oid, 1, {int_oid}, 'square_in_c', '/path/to/sq.o', ...)-- or insert into pg_proc(name,owner,rettype,nargs,argtypes, prosrc,probin...) values ('square', user_id, int_oid, 1, {int_oid}, 'begin return $1 * $1; end;', '-', ...)
... Representing Functions | 224/234 |
Users can define their own aggregate functions (like max()
Requires definition of three components:
pg_aggregate
The aggregate's name is stored in the pg_proc
... Representing Functions | 225/234 |
Consider defining your own average()
Need to define a new aggregate:
create aggregate average ( basetype = integer, sfunc = int_avg_accum, stype = int[], finalfunc = int_avg_result, initcond = '{0,0}' );
and need to define functions to support aggregate ...
... Representing Functions | 226/234 |
create function int_avg_accum(state int[], int) returns int[] as $$ declare res int[2]; begin res[1] := state[1] + $2; res[2] := res[2] + 1; return res; end; $$ language plpgsql; create function int_avg_result(state int[]) returns int as $$ begin if (state[2] = 0) then return null; end if; return (state[1] / state[2]); end; $$ language plpgsql;
... Representing Functions | 227/234 |
Users can define their own operators to use in expressions.
Operators are syntactic sugar for unary/binary functions.
Consider defining an operator for the power(x,y)
create operator ** ( procedure = power, leftarg = int, rightarg = int );-- which can be used as select 4 ** 3;-- giving a result of 64
Operator definitions are stored in pg_operator
Representing Types | 228/234 |
Users can also define new data types, which includes
create type point3d ( input = point3d_in,-- function to parse values output = point3d_out,-- function to display values internallength = 24,-- space for three float8's alignment = double-- align tuples properly );
... Representing Types | 229/234 |
pg_type
typinput |
text input conversion function | |
typoutput |
text output conversion function | |
typreceive |
binary input conversion function | |
typsend |
binary output conversion function |
All attributes are references to pg_proc.oid
... Representing Types | 230/234 |
All data types need access methods for querying.
The following catalogs tables are involved in this:
pg_am
pg_opclass
pg_amop
pg_amproc
... Representing Types | 231/234 |
pg_am
amname |
name of access method (e.g. btree |
|
amowner |
owner (refs pg_authid.oid |
|
amorder strategy |
operator for determining sort order (0 if unsorted) |
|
amcanunique |
does AM support unique indexes? | |
ammulticol |
does AM support multicolumn indexes? | |
amindexnulls |
does AM support NULL index entries? | |
amconcurrent |
does AM support concurrent updates? |
... Representing Types | 232/234 |
pg_am
amgettuple |
"next valid tuple" function | |
ambeginscan |
"start new scan" function | |
amrescan |
"restart this scan" function | |
amendscan |
"end this scan" function | |
amcostestimate |
estimate cost of index scan |
All attributes are references to pg_proc.oid
Functions drive the query evaluation process.
... Representing Types | 233/234 |
pg_am
aminsert |
"insert this tuple" function | |
ambuild |
"build new index" function | |
ambulkdelete |
bulk delete function | |
amvacuum |
post-vacuum cleanup function |
All attributes are references to pg_proc.oid
Functions implement different aspects of updating data/index files.
... Representing Types | 234/234 |
Built-in access methods:
heap
btree
hash
rtree
GiST
SP-GiST
GIN
Produced: 29 Feb 2016