9.2 Database Cache Performance

When matrix algorithms are implemented in a relational environment, database access requirements can play a significant (if not dominating) role in an algorithm’s time complexity. The current section examines the theoretical and empirical performance characteristics of a Blink tree (see Lehman and Yao [13]) supported by an in-core cache with an LRU paging discipline. The operators described in the previous section form the basis of the discussion.

9.2.1 Sequential Matrix Element Retrieval

The time complexity of next is O( 1 ), since next just looks up a virtual address (no search is involved). A next operation requires one cache access. No key comparisons are necessary. An additional cache access is required to obtain the non-key portion of the tuple.
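As a concrete illustration, consider the following minimal sketch (hypothetical types, not the system described here) of a next cursor over linked leaf pages: advancing the cursor is a pointer dereference into the cache, with no key comparisons.

```python
# Minimal sketch of an O(1) "next" cursor over cached leaf pages.
# LeafPage and Cursor are illustrative types, not the paper's code.

class LeafPage:
    def __init__(self, tuples, link=None):
        self.tuples = tuples   # list of (key, non_key_address) pairs
        self.link = link       # right-sibling pointer (the Blink "link")

class Cursor:
    def __init__(self, first_page):
        self.page = first_page
        self.slot = 0

    def next(self):
        """Return the next (key, address) pair, or None at end of relation."""
        # No search: either take the next slot on this page, or follow
        # the sibling link (one cache access per page change).
        while self.page is not None and self.slot >= len(self.page.tuples):
            self.page = self.page.link
            self.slot = 0
        if self.page is None:
            return None
        t = self.page.tuples[self.slot]
        self.slot += 1
        return t

# usage: two leaf pages holding three matrix elements
p2 = LeafPage([((1, 2), 0xB2)])
p1 = LeafPage([((0, 0), 0xA0), ((0, 1), 0xA1)], link=p2)
cur = Cursor(p1)
```

A second cache access, not shown here, would dereference the stored address to fetch the non-key portion of the tuple.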

9.2.2 Arbitrary Matrix Element Retrieval

The time complexity of get is O( log n ) where n is the number of tuples in the relation. A get operation is implemented by a current tree search. When the key to a sparse matrix is four bytes long, the current tree of the matrix is a 31–61 tree. Interpolating the tables found in Gonnet [11] yields:

  • The expected height of a 10,000 key current tree (i.e. number of nodes searched) is three.

  • The average node to key ratio of a 10,000 key current tree is 0.02904. Hence, the average node will contain 34.43 keys.

Each descending branch in a current tree search is determined through a binary search of a tree node. The maximum number of key comparisons needed to search a node in this manner is ⌈log₂ k⌉, where k is the number of keys in the node. Therefore, it takes approximately log₂ 35 ≈ 5.1 comparisons to locate the appropriate tree branch in an average 35-key node.

These observations imply that no more than 3 cache lookups and 16 key comparisons are required to locate an entry in a 10,000 key current tree. An additional cache access is required to obtain the non-key portions of the tuple.
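The figures above can be checked with a short calculation. The constants are the interpolated values from Gonnet quoted in the text; everything else is arithmetic.

```python
import math

# Back-of-the-envelope check of the current-tree search costs.
n_keys = 10_000
node_to_key_ratio = 0.02904          # nodes per key, interpolated from Gonnet
avg_keys_per_node = 1 / node_to_key_ratio
height = 3                           # expected tree height = cache lookups

# A binary search of a k-key node costs about log2(k) comparisons.
cmp_per_node = math.log2(avg_keys_per_node)
total_cmp = math.ceil(height * cmp_per_node)

print(round(avg_keys_per_node, 1))   # ~34.4 keys in the average node
print(round(cmp_per_node, 1))        # ~5.1 comparisons per node
print(total_cmp)                     # 16 comparisons per search, at most
```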

9.2.3 Arbitrary Matrix Element Update

If the address of the non-key portion of a tuple is known, put is an O( 1 ) operation requiring a single cache access. If the address is not known, put is equivalent to a get (O( log n )) followed by a direct cache access.

9.2.4 Matrix Element Insertion

The time complexity of insert is O( log n ) where n is the number of tuples in the relation. An insert operation is equivalent to a put operation unless the tree splits. Interpolating the tables in Gonnet [11] yields:

  • The average number of splits for the (n+1)st insertion into a 10,000 key 31–61 tree is approximately 0.02933, i.e. the tree splits, on the average, once for every 1/0.02933 ≈ 34.1 insertions.

Splitting increases the constant associated with the growth rate slightly. It does not increase the growth rate per se.
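The split frequency quoted above follows directly from the interpolated Gonnet figure, as a one-line calculation confirms.

```python
# Expected interval between node splits, from the interpolated
# splits-per-insertion figure quoted in the text.
splits_per_insert = 0.02933
inserts_per_split = 1 / splits_per_insert
print(round(inserts_per_split, 1))   # ~34.1 insertions between splits
```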

9.2.5 Matrix Element Deletion

Deleting a key from a Blink tree is analogous to inserting one, except that tree nodes occasionally combine instead of splitting. Therefore, the time complexity of delete is O( log n ), like that of insertion.

9.2.6 Empirical Performance Measurements

The measurements in Table 1 provide empirical performance statistics for various database operations. The measurements were made on a 16 MHz IBM PS/2 Model 70 with a 16 MHz 80387 coprocessor and a 27 msec, 60 Mbyte fixed disk drive. You should note two characteristics of the measurements:

  • The cache was large enough to hold the entire Blink tree of relations A, B, and C. There are no cache faults to disk in these measurements. The relation D was too big to fit in core. Its next times reflect numerous cache faults.

  • The get operation looked for the same item during each repetition. This explains the lack of cache faults while relation D was processed. Once the path to the item was in core, it was never paged out.

Table 1: Database Cache Benchmarks

                              Repetitions
                   30k       50k       100k      200k     Average
  Operation     (seconds) (seconds) (seconds) (seconds)   (μsec)
  ---------------------------------------------------------------
  Relation A (1)
    Next            n/a        12        25        51        255
    Get             n/a        26        52       103        515
  Relation B (2)
    Next              7        12        24        47        235
    Get              34        57       114       228      1,140
  Relation C (3)
    Next              7        12        24        49        245
    Get              65       108       216       433      2,165
  Relation D (4)
    Next             48        82       164       n/a      1,640
    Cache faults  2,058     3,541     7,095
    Get              76       126       252       n/a      2,520
    Cache faults      1         1         1

  (1) 112 tuples, 2-byte key.
  (2) 463 tuples, 4-byte key.
  (3) 673 tuples, 22-byte key.
  (4) 10,122 tuples, 22-byte key.

Neglecting cache faults, the time required to find a tuple is a function of two variables: the size of the relation and the size of the key. The number of tuples in a relation determines the number of comparisons that are made. The size of the key affects the amount of work required to perform each comparison. Comparing the “get relation D” measurements of Table 1 to the “get relation C” measurements provides an indication of the actual effect of a relation’s size on search time (since both relations have the same size key). Note that

    2,520 μsec / 2,165 μsec ≈ 1.164

which is below the theoretical bound

    log₂ 10,122 / log₂ 673 ≈ 13.31 / 9.40 ≈ 1.416

Comparing “get relation C” to “get relation B” gives a good feel for the impact of key size. The size of the relation should not have much impact on the search time discrepancies since

    log₂ 673 / log₂ 463 ≈ 9.40 / 8.87 ≈ 1.060
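The two comparisons above (relation size for D versus C, key size for C versus B) can be reproduced numerically from the Table 1 averages.

```python
import math

# Average get times in microseconds, from Table 1.
get_D, get_C, get_B = 2_520, 2_165, 1_140

# Observed effect of relation size (D vs C share a 22-byte key):
print(round(get_D / get_C, 3))                        # ~1.164

# Theoretical bound implied by the O(log n) search cost:
print(round(math.log2(10_122) / math.log2(673), 3))   # ~1.416

# The relation-size ratio for C vs B is near 1, so the remaining
# difference (2,165 vs 1,140 usec) is attributable to key length:
print(round(math.log2(673) / math.log2(463), 2))      # ~1.06
```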

The next operation was metered by repeatedly scanning a relation until the desired number of operations was achieved. Each time the end of a relation was encountered the scan was restarted. The high next measurement for relation A probably reflects the overhead of many loop starts and stops (since there were only 112 tuples in the relation). The elevated next time of relation C is probably due to key length. After all the keys on a cache page are processed, a relatively expensive page fault occurs. Larger keys cause the cache page to change more frequently. Given that the in-core next observations have a mean of 245 μsec and a standard deviation of 10 μsec, the effects of these peripheral processes appear to be relatively insignificant.

In summary, Table 1 shows that the O( 1 ) operations have an empirical time advantage that ranges from 1.54/1 to 8.84/1 over the O( log n ) operations. This observation underscores the importance of tailoring algorithmic implementations to take advantage of the O( 1 ) operators. In practical terms, this means that sequential access and stack operations are preferred to direct random access.
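The quoted range of advantage can be recomputed directly from the Table 1 averages (get time divided by next time for each relation):

```python
# get/next ratios per relation, from the Table 1 averages (microseconds).
avg = {            # relation: (next, get)
    "A": (255, 515),
    "B": (235, 1_140),
    "C": (245, 2_165),
    "D": (1_640, 2_520),   # next time here includes many cache faults
}
ratios = {r: g / n for r, (n, g) in avg.items()}
print(round(min(ratios.values()), 2))   # 1.54 (relation D)
print(round(max(ratios.values()), 2))   # 8.84 (relation C)
```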