PostgreSQL natively supplies a rich diversity of data types supporting many practical use cases. This article introduces the special implementation of serial data types typically used for creation of synthetic primary keys.
A foundational precept of database design theory is that each tuple (i.e., row) of a relation (i.e., table) must be uniquely identified from other tuples. The attributes, or columns, that together distinctly identify one tuple from all the others are called a “key”. Some purists maintain that any modeled object or concept inherently possesses an attribute or set of attributes that can serve as a key and that it is important to identify this set of key attributes and utilize them for the unique selection of tuples.
But as a practical matter, identifying a sufficiently large set of attributes assurring uniqueness for a modeled object may be impractical, and so for real-world implementations, developers often turn to synthetic keys as a surrogate. That is, rather than relying on some combination of actual attributes, a value internal to the database, typically incremented integer values, and otherwise having no physical meaning is defined as a key. In additional to the simplicity of a single column key, the fact that there is no real-world dependency means that external factors can never force a need to change the value, such as for instance, might be the case if a person’s name where used as a key … and then the person married or entered a federal government witness protection program and changed their name. Even some values commonly thought by laypersons to be unique and immutable, such as the U.S. social security number, are neither: a person can obtain a new SSN, and SSN’s sometimes are re-used.
Declaring a Serial Data Type
PostgreSQL provides a special datatype declaration to satisfy this need for synthetic keys. Declaring a database table column as type SERIAL satisfies the requirement for synthetic keys by supplying unique integers upon inserts of new tuples. This pseudo-datatype implements an integer data type column with an associated default value derived via a function call that supplies incremented integer values. Executing the following code to create a simple table with an id column of type serial:
CREATE TABLE person (id serial, full_name text); actually executes the following DDL CREATE TABLE person ( id integer NOT NULL, full_name text ); CREATE SEQUENCE person_id_seq START WITH 1 INCREMENT BY 1 NO MINVALUE NO MAXVALUE CACHE 1; ALTER SEQUENCE person_id_seq OWNED BY person.id; ALTER TABLE ONLY person ALTER COLUMN id SET DEFAULT nextval('person_id_seq'::regclass);
That is, the keyword “serial” as a datatype specification implies execution of DDL statements creating an integer type column with a NOT NULL constraint, a SEQUENCE, and then the column default is ALTERED to call a built-in function accessing that SEQUENCE.
The built-in function nextval performs an autoincrement service: each time nextval is called it increments the specified sequence counter and returns the the newly-incremented value.
You can see the result of this effect by examining the table definition:
postgres=# d person Table "public.person" Column | Type | Modifiers -----------+---------+----------------------------------------- Id | integer | not null default nextval('person_id_seq'::regclass) full_name | text |
Inserting Serial Values
To make use of the auto-increment functionality, we simply insert rows, relying on the default value for the serial column:
INSERT INTO person (full_name) VALUES ('Alice'); SELECT * FROM person; id | full_name ----+----------- 1 | Alice (1 row)
We see that a value for the id column corresponding to the new “Alice” row has been automatically generated. Alternatively, one can make use of the DEFAULT keyword if explicitly listing all column names is desired:
INSERT INTO person (id, full_name) VALUES (DEFAULT, 'Bob'); SELECT * FROM person; id | full_name ----+----------- 1 | Alice 2 | Bob (2 rows)
where we see the auto-increment functionality more apparently, assigning the serially-next value to the new row for the second insert of “Bob”.
Inserting multiple rows even works:
INSERT INTO person (full_name) VALUES ('Cathy'), ('David'); SELECT * FROM person; id | full_name ----+----------- 1 | Alice 2 | Bob 3 | Cathy 4 | David (4 rows)
Missing Serial Values
The built-in nextval function is optimized for non-blocking, high-concurrency applications and so does not respect rollback. Consequently, this means there may be missing values in the sequence. Below, we rollback an insert, but then see a subsequent insert gets an new value that skips over the value that would have been associated with the aborted transaction:
BEGIN TRANSACTION; INSERT INTO person (full_name) VALUES ('Eve'); SELECT * FROM person; id | full_name ----+----------- 1 | Alice 2 | Bob 3 | Cathy 4 | David 5 | Eve (5 rows) ROLLBACK; INSERT INTO person (full_name) VALUES ('Fred'); SELECT * FROM person; id | full_name ----+----------- 1 | Alice 2 | Bob 3 | Cathy 4 | David 6 | Fred (5 rows)
The advantage of not respecting rollbacks is that other sessions attempting concurrent inserts are not blocked by other inserting sessions.
Another way to end up with missing values is if rows are deleted:
DELETE FROM person WHERE full_name = 'Cathy'; SELECT * FROM person; id | full_name ----+----------- 1 | Alice 2 | Bob 4 | David 6 | Fred (4 rows)
Note that even after deleting the most recently inserted row corresponding to the largest auto-increment id column value, the sequence counter does not revert, i.e., even though after deleting the row corresponding to ‘Fred’, for subsequent inserts the sequence counter still retains the previously-known largest value and increments from there:
DELETE FROM person WHERE full_name = 'Fred'; INSERT INTO person (full_name) VALUES ('Gina'); SELECT * FROM person; id | full_name ----+----------- 1 | Alice 2 | Bob 4 | David 7 | Gina (4 rows)
Gaps or missing values as shown above are reportedly viewed as a problem by some application developers because on the PostgreSQL General mailing list, there is a slow-but-steady reiteration of the question how to avoid sequence gaps when employing the serial pseudo-datatype. Sometimes there is no actual underlying business requirement, it’s just matter of personal aversion to missing values. But there are circumstances when preventing missing numbers is a real need, and that is the subject for a subsequent article.
NO YOU CAN’T – YES YOU CAN!
The NOT NULL constraint imputed by the serial pseudo-datatype protects against the insertion of NULL for the id column by rejecting such insert attempts:
INSERT INTO person (id, full_name) VALUES (NULL, 'Henry'); ERROR: null value in column "id" violates not-null constraint DETAIL: Failing row contains (null, Henry).
Thus, we are assured of having a value for that attribute.
However, a problem some people encounter is that, as declared above, nothing prevents explicit insertion of values, bypassing the default autoincrement value derived via invocation of the nextval function:
INSERT INTO person (id, full_name) VALUES (9, 'Ingrid'); SELECT * FROM person; id | full_name ----+----------- 1 | Alice 2 | Bob 4 | David 7 | Gina 9 | Ingrid (5 rows)
But then two inserts later using the default produces a duplicate value for the id column if there is no constraint check of column values against the sequence value:
INSERT INTO person (full_name) VALUES ('James'); INSERT INTO person (full_name) VALUES ('Karen'); SELECT * FROM person; id | full_name ----+----------- 1 | Alice 2 | Bob 4 | David 7 | Gina 9 | Ingrid 8 | James 9 | Karen (7 rows)
If we were in fact using the serial id column as a key, we would have declared it as a PRIMARY KEY or at least created a UNIQUE INDEX. Had we done that, then the ‘Karen’ insert above would have failed with a duplicate key error. The most recent release of PostgreSQL includes a new constraint declaration syntax ‘generated by default as identity’ which avoids this pitfall and some other legacy issues related to the serial pseudo-datatype.
Sequence Manipulation Functions
In addition to the nextval function we already mentioned which advances the sequence and returns the new value, there are a few other functions for querying and setting the sequence values: the currval function returns value most recently obtained with nextval for specified sequence, the lastval function returns value most recently obtained with nextval for any sequence, and the setval function sets a sequence’s current value. These functions are called with simple queries., for example
SELECT currval('person_id_seq'); currval --------- 9 (1 row)
And note that if a call is made to the nextval function independently of actually performing an insert, it does increment the sequence, and that will be reflected in subsequent inserts:
SELECT nextval('person_id_seq'); nextval --------- 10 (1 row) INSERT INTO person (full_name) VALUES ('Larry'); SELECT * FROM person; id | full_name ----+----------- 1 | Alice 2 | Bob 4 | David 7 | Gina 9 | Ingrid 8 | James 9 | Karen 11 | Larry (8 rows)
We have introduced a basic understanding of the PostgreSQL SERIAL pseudo-datatype for auto-incremented synthetic keys. For illustration in this article, we used the SERIAL type declaration, which creates a 4-byte integer column. PostgreSQL accommodates different range needs with the SMALLSERIAL and BIGSERIAL pseudo-datatypes for, respectively, 2-byte and 8-byte column sizes. Look for a future article on one means of addressing the need for sequences with no missing values.