Heap Tables
HEAP
Table and Index are Separately Stored:
-
In a heap table, the actual data (rows) is stored in an unordered format, which means there is no inherent organization to how data is stored on the disk. The table itself only contains the raw data, with no sorting or indexing applied to it.
-
The index is a separate structure that is used to facilitate faster data retrieval. It maintains a specific order based on the columns defined in the index, allowing the database to quickly locate the corresponding data in the heap.
Table Only Stores Data:
- The heap table is solely responsible for storing the actual data records. This includes all the columns defined in the table schema but does not include any additional organization or sorting.
- When new data is inserted into a heap, it simply goes to the end of the table without any particular order, leading to a potentially fragmented structure over time, especially with frequent insertions and deletions.
Index Only for Querying Data:
- The primary purpose of the index is to optimize query performance. It serves as a roadmap to quickly locate rows based on the indexed columns, significantly speeding up data retrieval operations.
- When a query is executed, the database first checks the index to find the relevant entries. Once it identifies the correct index entries, it uses the pointers (Row Identifiers, or RIDs) to access the actual data in the heap table.
Stored as Different Data Objects:
- The table and index exist as distinct data objects within the database management system (DBMS). They are stored separately in the database file system, with the heap table storing the data rows and the index storing its own data structure (which may include key values and pointers).
- This separation allows for flexibility in how data is accessed and managed. The index can be modified or rebuilt independently of the table data, which is beneficial for maintenance and optimization.
Advantages of Heap Tables
1. Data Not Affected by Index:
- The organization of data in a heap table is independent of any indexes created. This means that data can be written and stored in the available space without concern for sorting or organizing it in a specific order.
- As a result, insertions can be faster since data can simply be appended to the end of the heap.
2. Reduce Cost:
Because data is stored without regard to an index, the cost of writing data (in terms of time and resources) is reduced. The simplicity of the heap structure allows for efficient data handling.
3. Data Stored to Available Space, Without Sequence:
Heap tables take advantage of available space without needing to maintain a specific sequence, which simplifies the storage mechanism and speeds up data insertion.
4. To Build Index Later:
Since heap tables allow for flexible data storage, indexes can be built later when necessary. This means you can prioritize fast data insertion and later optimize query performance by creating indexes when needed.
5. Thus, Write is Fast!:
The ability to write data quickly without needing to maintain an order or update an index immediately makes heap tables suitable for scenarios where rapid data ingestion is crucial.
For many large datasets, especially where performance is focused on data input rather than immediate retrieval, heap tables provide an efficient method for managing large volumes of data due to their straightforward storage method.
Querying Data in Heap Tables
Full Table Scan
Since HEAP tables are unsorted and lack a clustered index, retrieving specific rows or ranges of data may require a full table scan. A full table scan means the database engine must examine every row in the table to find records that match the query criteria, as there’s no structured path to follow.
Drawbacks of Full Table Scans in HEAP Tables
Full table scans in HEAPs are usually inefficient because they involve scanning every row in the table. As the table grows, this process becomes increasingly resource-intensive, impacting CPU, memory, and disk I/O. Frequent full table scans can reduce database performance, especially if the table is large and the queries are complex or executed repeatedly.
Example of a HEAP Table
Suppose we have a Customers table, defined as a HEAP, with the following structure and some sample data:
SQL> CREATE TABLE Customers ( CustomerID INT, Name VARCHAR(100), Age INT, City VARCHAR(50) ); SQL> INSERT INTO Customers (CustomerID, Name, Age, City) VALUES (1, 'Alice', 24, 'New York'), (2, 'Bob', 35, 'Chicago'), (3, 'Carol', 29, 'San Francisco'), (4, 'David', 42, 'New York'), (5, 'Eve', 27, 'Boston'), (6, 'Frank', 50, 'Chicago'); SQL> SELECT * FROM Customers WHERE Age > 30; CUSTOMERID NAME AGE CITY ---------- -------------- ----- -------- 2 Bob 35 Chicago 4 David 42 New York 6 Frank 50 Chicago
Since Customers is a HEAP table with no clustered index, this query will trigger a full table scan. Here’s how the scan would work on this data:
- The database engine starts at the beginning of the table and reads each row sequentially.
- It checks each row's Age value to see if it meets the condition (Age > 30).
- Rows 2, 4, and 6 meet the condition, so they are included in the result set:
(2, 'Bob', 35, 'Chicago')
(4, 'David', 42, 'New York')
(6, 'Frank', 50, 'Chicago')
In a HEAP table, these records are stored in the order of insertion on the data page. Let's assume that they are stored like this initially:
Page | Row Sequence | CustomerID | Name | Age | City |
---|---|---|---|---|---|
P1 | Row 1 | 1 | Alice | 24 | New York |
P1 | Row 2 | 2 | Bob | 35 | Chicago |
P1 | Row 3 | 3 | Carol | 29 | San Francisco |
P1 | Row 4 | 4 | David | 42 | New York |
P1 | Row 5 | 5 | Eve | 27 | Boston |
P1 | Row 6 | 6 | Frank | 50 | Chicago |
Deletion Example
Suppose we delete Eve's row:
SQL> DELETE FROM Customers WHERE CustomerID = 5;
Page | Row Sequence | CustomerID | Name | Age | City |
---|---|---|---|---|---|
P1 | Row 1 | 1 | Alice | 24 | New York |
P1 | Row 2 | 2 | Bob | 35 | Chicago |
P1 | Row 3 | 3 | Carol | 29 | San Francisco |
P1 | Row 4 | 4 | David | 42 | New York |
P1 | Gap | (5) | (Eve) | (27) | (Boston) |
P1 | Row 6 | 6 | Frank | 50 | Chicago |
New Insertion Example
If we insert a new record, it might fill the gap left by Eve's deletion, depending on the DBMS's allocation strategy:
INSERT INTO Customers (CustomerID, Name, Age, City) VALUES (7, 'Grace', 34, 'Miami');
The new data might go into Eve's former slot:
Page | Row Sequence | CustomerID | Name | Age | City |
---|---|---|---|---|---|
P1 | Row 1 | 1 | Alice | 24 | New York |
P1 | Row 2 | 2 | Bob | 35 | Chicago |
P1 | Row 3 | 3 | Carol | 29 | San Francisco |
P1 | Row 4 | 4 | David | 42 | New York |
P1 | Row 5 (New) | 7 | Grace | 34 | Miami |
P1 | Row 6 | 6 | Frank | 50 | Chicago |
Update Example
Suppose Carol's information is updated with a longer city name, changing her city from 'San Francisco' to 'Los Angeles'. If this update causes the row size to increase and it no longer fits in the original space, it might be relocated to a different page. Let’s assume Carol is moved to a new page, with a forwarding pointer left in her original slot:
Page | Row Sequence | CustomerID | Name | Age | City |
---|---|---|---|---|---|
P1 | Row 1 | 1 | Alice | 24 | New York |
P1 | Row 2 | 2 | Bob | 35 | Chicago |
P1 | Forwarding | 3 | (Carol) | (29) | (Los Angeles) |
P1 | Row 4 | 4 | David | 42 | New York |
P1 | Row 5 | 7 | Grace | 34 | Miami |
P1 | Row 6 | 6 | Frank | 50 | Chicago |
Page | Row Sequence | CustomerID | Name | Age | City |
---|---|---|---|---|---|
P2 | Row 1 (New) | 3 | Carol | 29 | Los Angeles |
Drawbacks of Full Table Scans in HEAP Tables
A full table scan in a HEAP table like Customers requires reading every row, even if only a few rows meet the query condition. This process is time-consuming and resource-intensive, especially as the table grows. For tables with thousands or millions of rows, this approach can drastically reduce performance.
Mitigation: Adding a non-clustered index on the Age column could help the database engine locate relevant rows without scanning the entire table, improving query performance.