Hashing in Direct File Organization

Database / Friday, November 29th, 2019


Hashing is a technique for fast searches and retrievals. It increases access speed, simplifies data transfer, and reduces overhead. Thus, the objectives of hashing are to optimize disk access and packing density.

The packing density, roughly equivalent to a load factor, is simply how much of the table is filled (70% is normally considered sufficient). The ultimate goal is to minimize disk space and access time, so that inserting or retrieving a record from the table takes only one seek. One way to minimize this time is to use a small table size (say, m < 10).
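As a quick illustration of the packing-density figure above (the numbers here are hypothetical), the density is simply the records stored divided by the total slot capacity:

```python
# Packing density (load factor) = records stored / total slots reserved.
# The numbers below are hypothetical, chosen to land at the 70% figure
# the text mentions as normally sufficient.
def packing_density(records_stored, bucket_count, bucket_capacity):
    """Return the fraction of the reserved file space that is occupied."""
    return records_stored / (bucket_count * bucket_capacity)

density = packing_density(records_stored=700, bucket_count=100, bucket_capacity=10)
print(f"{density:.0%}")  # 700 of 1000 slots filled -> 70%
```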

Components of Hashing

There are four key components involved in hashing:

1. The hash table.
2. The hashing method.
3. Collisions.
4. Collision resolution techniques.
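The four components above can be sketched together in a minimal, illustrative hash table (an in-memory toy, not a disk file) that resolves collisions by chaining:

```python
TABLE_SIZE = 7  # 1. The hash table: a fixed number of slots.

def h(key):
    """2. The hashing method: map a numeric key to a slot number."""
    return key % TABLE_SIZE

table = [[] for _ in range(TABLE_SIZE)]  # each slot holds a chain of records

def insert(key, record):
    slot = table[h(key)]
    # 3. A collision occurs when the slot already holds other keys;
    # 4. chaining resolves it by appending to the slot's list.
    slot.append((key, record))

def lookup(key):
    for k, record in table[h(key)]:
        if k == key:
            return record
    return None

insert(10, "rec-A")
insert(17, "rec-B")  # 17 % 7 == 3 == 10 % 7, so this collides with key 10
print(lookup(17))  # rec-B
```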

Files which are stored on a direct access storage medium are called direct access files as they have a unique address through which they can be accessed directly.

A portion of disk space is reserved.

A hashing algorithm computes each record’s address within that space.

[Figure: block diagram of how hashing is used in direct file organization]

The Advantages of Hashing

One advantage of hashing is that no index storage space is required, whereas other structures, such as trees, normally require a separate index.

A second advantage of hashing is that it provides rapid updates.

The Disadvantages of Hashing

One disadvantage of using hashing as an insertion and retrieval tool is that it usually lacks locality and does not support sequential retrieval by key. Consequently, insertion and retrieval are effectively random.

A second disadvantage is the inability to use duplicate keys. This becomes a problem when key values are very small (e.g., one or two digits).

In direct file organization, the key value is mapped directly to the storage location. The usual method of direct mapping is to perform some arithmetic manipulation of the key value.

This process is called hashing. Consider a hash function h that maps the key value k to the value h(k), which is used as an address. For our application we require that this value fall within some range: if the addresses for the records lie between S1 and S2, then for all values of k the hash function should generate values between S1 and S2.
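A hash function confined to a given address range can be sketched as follows; the values of S1 and S2 here are illustrative, not taken from the text:

```python
S1, S2 = 100, 199  # hypothetical first and last record addresses

def h(k):
    """Map any numeric key k into the address range [S1, S2]."""
    # k mod (range size) gives 0..S2-S1; adding S1 shifts it into range.
    return S1 + (k % (S2 - S1 + 1))

# Every key, however large, lands between S1 and S2.
for k in (5, 12345, 999999):
    addr = h(k)
    assert S1 <= addr <= S2
print(h(12345))  # 100 + (12345 % 100) = 145
```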

Role of Hashing in Direct File Organization

It should be clear that a hash function that maps many different key values to a single address, or one that does not map key values uniformly, is a bad hash function. A collision is said to occur when two distinct key values are mapped to the same storage location. Collisions are handled in a number of ways: the colliding records may be assigned to the next available space, or they may be assigned to an overflow area. Note that with hashing schemes there are no indexes to traverse. With a well-designed hash function that produces few collisions, this is a great advantage.

Another problem that we have to resolve is deciding what address is represented by h(k). Let the addresses generated by the hash function be the addresses of buckets in which the (key, address) pairs or records are stored. The figure shows buckets containing (key, address) pairs, which allow the actual data file to be reorganized, and actual record addresses to change, without affecting the hash function. A limited number of collisions can be handled automatically by using buckets of sufficient capacity. The space required for the buckets will, in general, be much smaller than the actual data file, so reorganizing them is not very expensive. Once the bucket address is generated from the key by the hash function, a search within the bucket is still required to locate the address of the required record. However, since the bucket size is small, this overhead is small.

The use of buckets reduces the problems associated with collisions. Even so, a bucket may overflow; overflow can be handled by providing overflow buckets and using a pointer from the normal bucket to an entry in the overflow bucket. All such overflow entries are linked. Multiple overflows from the same bucket result in a long list and slow down the retrieval of those records. In an alternative scheme, the address generated by the hash function is a bucket address, and the bucket is used to store the records directly instead of holding a pointer to the block containing the record.
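The overflow scheme described above can be pictured with a small sketch (an in-memory toy under assumed names, not a disk-level implementation): each primary bucket has a fixed capacity, and records that do not fit are chained into a linked overflow bucket.

```python
class Bucket:
    """A fixed-capacity bucket with an optional linked overflow bucket."""
    CAPACITY = 2  # hypothetical capacity, kept tiny to force an overflow

    def __init__(self):
        self.records = []      # (key, record) pairs held in this bucket
        self.overflow = None   # pointer to an overflow Bucket, if needed

    def insert(self, key, record):
        if len(self.records) < self.CAPACITY:
            self.records.append((key, record))
        else:
            if self.overflow is None:
                self.overflow = Bucket()
            self.overflow.insert(key, record)  # chain into the overflow area

    def lookup(self, key):
        for k, record in self.records:
            if k == key:
                return record
        # Long overflow chains slow this step down, as noted in the text.
        return self.overflow.lookup(key) if self.overflow else None

b = Bucket()
for k in (1, 2, 3):          # the third insert overflows the primary bucket
    b.insert(k, f"rec-{k}")
print(b.lookup(3))  # rec-3, found in the overflow bucket
```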

[Figure: buckets of (key, address) pairs pointing to blocks of records]

Let S represent the value:

S = upper bucket address value – lower bucket address value + 1

Here, S gives the number of buckets. Assume that we have some mechanism to convert key values to numeric ones. Then a simple hash function is

h(k) = k mod S

where k is the numeric representation of the key and h(k) produces a bucket address.
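Putting the pieces together: the sketch below applies h(k) = k mod S to a string key. The byte-sum conversion is just one illustrative choice of "mechanism to convert key values to numeric ones"; S = 13 is likewise an arbitrary bucket count.

```python
S = 13  # number of buckets (hypothetical value)

def numeric_key(key):
    """Convert a string key to a number, e.g. by summing its byte values."""
    return sum(key.encode())

def h(key):
    """h(k) = k mod S: produces a bucket address in the range 0..S-1."""
    return numeric_key(key) % S

# "SMITH" -> 83+77+73+84+72 = 389, and 389 mod 13 = 12
print(h("SMITH"))  # 12
```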


