Experimental determination of cache memory characteristics. Associative and direct-mapped cache memory.

Cache architecture.

Depending on how cache lines are matched with areas of main memory, three cache architectures are distinguished: the direct-mapped cache, the fully associative cache, and the partially associative (set-associative) cache.

Direct mapped cache.

In a direct-mapped cache, the memory address being accessed uniquely determines the cache line in which the required block may reside. We will explain the principle of operation of such a cache using the example of a 256 KB non-sectored cache with a 32-byte line size and 64 MB of cacheable main memory, a typical motherboard cache for Pentium systems. The structure of the cache memory in such a system is illustrated in Fig. 1. The cacheable main memory is conventionally divided into pages whose size (here 256 KB) equals the size of the cache. The cache (and, conventionally, each page of main memory) is divided into lines (256 KB / 32 bytes = 8 K lines). The direct-mapping architecture implies that each cache line can hold, from any page of cacheable memory, only the line in the same position. Since main memory is much larger than the cache, each cache line is contended for by many memory blocks that share the same low-order part of the address, called the offset within the page. At any given moment a line can hold a copy of only one of these blocks. The line number serves as the address of the line in the cache, while the tag records which block currently occupies that line (the tag is the high-order part of the address issued by the processor, in other words the page number). The tag memory must have as many cells as there are cache lines, and its width must be sufficient to hold the high-order bits of the cacheable memory address that are not carried on the cache address bus. In addition to the address part of the tag, each cache line has associated validity and modification (dirty) bits. At the start of each access to RAM, the cache controller first reads the tag-memory cell whose address is given by the line-number bits A17-A5, compares the contents of that cell with bits A25-A18 of the address issued by the processor, and checks the validity bit.

Fig. 1. Direct-mapped cache.

This analysis is performed in a special tracking cycle, sometimes called an inquiry cycle. If the analysis shows that the required block is not in the cache, a main-memory access cycle is generated (or continues): a cache miss. On a hit, the request is serviced by the cache. On a miss, after the data has been read from main memory by its consumer, the new data is placed in the cache line (if it is clean), the high-order bits of the address are written into its tag, and the validity bit is set. Regardless of how much data was actually requested, the entire line is copied into the cache from main memory (since the validity bit applies to all of its bytes). If the cache controller implements read-ahead, then on subsequent free bus cycles the next line is updated as well (if it was clean). Reading "in reserve" makes it possible, when needed, to perform a burst read from the cache across a line boundary. Such a cache has the simplest hardware implementation and is used in the secondary cache of most motherboards. However, it has a serious shortcoming: if, during execution, the program alternately accesses the same line address but with different or cyclically changing tags, cache misses occur constantly and the cache, instead of speeding up the exchange with memory, slows it down. Page switching in multitasking operating systems (OS) also reduces the hit rate, which affects system performance. Increasing the cache size while keeping the direct-mapped architecture does not help much, since different tasks will still claim the same cache lines. Without increasing the cache size, caching efficiency can be raised by changing its structure, which is discussed below.
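To make the address arithmetic of this example concrete, here is a small illustrative sketch in C (an assumption of this text, not controller code; a real controller does all of this in hardware) showing how an address is split into the offset (A4-A0), line number (A17-A5) and tag (A25-A18) for the 256 KB / 32-byte-line / 64 MB configuration, and how two addresses 256 KB apart collide on the same line.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative model only: 256 KB direct-mapped cache, 32-byte lines,
 * 64 MB of cacheable memory.
 *   A4-A0   - offset within the 32-byte line
 *   A17-A5  - line number (8192 lines)
 *   A25-A18 - tag (page number)                                      */
#define NUM_LINES 8192u                 /* 256 KB / 32 B */

struct line_state {
    uint32_t tag;                       /* page number of the cached block */
    bool     valid;                     /* validity bit                    */
};

static struct line_state tag_ram[NUM_LINES];

static bool direct_mapped_hit(uint32_t addr)
{
    uint32_t line = (addr >> 5)  & (NUM_LINES - 1u);  /* A17-A5  */
    uint32_t tag  = (addr >> 18) & 0xFFu;             /* A25-A18 */
    return tag_ram[line].valid && tag_ram[line].tag == tag;
}

int main(void)
{
    /* Two addresses exactly 256 KB apart map to the same line but carry
     * different tags, so in a direct-mapped cache they evict each other. */
    printf("line of 0x0012340: %u\n", (0x0012340u >> 5) & (NUM_LINES - 1u));
    printf("line of 0x0052340: %u\n", (0x0052340u >> 5) & (NUM_LINES - 1u));
    printf("hit for 0x0012340 in an empty cache: %d\n",
           direct_mapped_hit(0x0012340u));
    return 0;
}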

In a sectored direct-mapped cache, the concept of a set is sometimes used instead of a line, and a sector is then called a line. A set (like a line of a non-sectored cache) has tag information associated with all elements of the set (lines or sectors). In addition, each element of the set (line or sector) has its own validity bit (Fig. 2).

Fig. 2. Sectored direct-mapped cache

Set-associative (partially associative) cache.

The set-associative cache architecture (Fig. 3) allows each block of cacheable memory to claim one of several cache lines combined into a set. In effect there are several parallel, coordinated direct-mapping channels, and the cache controller must decide in which line of the set to place the next block of data. In the simplest case, for example, each memory block may be placed in one of four lines of a set (a four-way set-associative cache). The number of the set in which the requested block may be mapped is uniquely determined by the middle part of the address (just like the line number in a direct-mapped cache). The line of the set that maps the required block is determined by a tag comparison (as in an associative cache) performed in parallel over the tag memory of all lines of the set (all channels of the cache). In addition, each set has an associated flag that determines which line of the set is to be replaced by a new block of data in the event of a cache miss (in the figure, the arrow points to it).

Fig. 3. Four-way set-associative cache

The candidate for replacement is usually the line that has not been accessed for the longest time (the LRU algorithm, Least Recently Used). With a relatively large number of ways (lines per set), a simplification is used: the pseudo-LRU algorithm for four lines allows the decision to be made using only 3 bits.

It is also possible to use a FIFO (first in, first out) replacement algorithm or even random replacement, which is simpler but less efficient. The set-associative architecture is widely used for the primary cache of modern processors.
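As an illustration of the pseudo-LRU scheme mentioned above, here is a hedged sketch of tree pseudo-LRU for one four-way set using 3 bits. The bit semantics chosen here are one common convention and an assumption of this text, not necessarily the scheme of any particular processor.

#include <stdint.h>
#include <stdio.h>

/* Tree pseudo-LRU for a 4-way set, 3 bits per set:
 *   b0 points at the less recently used pair ({0,1} or {2,3}),
 *   b1 points at the less recently used way within {0,1},
 *   b2 points at the less recently used way within {2,3}.             */
typedef struct { uint8_t b0, b1, b2; } plru4_t;

/* After a hit on `way`, point the tree bits away from that way. */
static void plru4_touch(plru4_t *s, int way)
{
    if (way < 2) { s->b0 = 1; s->b1 = (way == 0); }  /* pair {2,3} is now LRU side */
    else         { s->b0 = 0; s->b2 = (way == 2); }  /* pair {0,1} is now LRU side */
}

/* On a miss, follow the bits to pick the (pseudo) least recently used way. */
static int plru4_victim(const plru4_t *s)
{
    if (s->b0 == 0) return s->b1 ? 1 : 0;
    else            return s->b2 ? 3 : 2;
}

int main(void)
{
    plru4_t set = {0, 0, 0};
    plru4_touch(&set, 0);
    plru4_touch(&set, 2);
    plru4_touch(&set, 1);
    /* Way 3 was never touched, so it is chosen as the victim. */
    printf("victim = way %d\n", plru4_victim(&set));
    return 0;
}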

Associative cache.

Unlike the previous architectures, in a fully associative cache any line can map any block of memory, which significantly improves the utilization of the limited cache volume. All bits of the address of the cached block, except those that determine the position (offset) of the data within the line, are stored in the tag memory.

Fig. 4. Fully associative cache memory.

In such an architecture (Fig. 4), determining whether the requested data is present in the cache requires comparing the upper part of the address with the tags of all lines of the tag memory, not just one or several as in the direct-mapped or set-associative architectures. Sequential scanning of the tag memory cells is ruled out (it takes too much time), so what remains is a parallel comparison of the tags of all cells, and that is a complex hardware task which is acceptable only for small primary caches (as in the Cyrix M2 processor).
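The following sketch models the fully associative lookup in software: the requested tag has to be compared with the tag of every line. The loop is only a model; in hardware all the comparisons are done in parallel, one comparator per line, which is exactly why large fully associative caches are impractical. The structure and sizes here are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define LINE_SIZE  32u
#define NUM_LINES  128u                 /* small, as is typical for such caches */

struct fa_line { uint32_t tag; bool valid; };
static struct fa_line fa_cache[NUM_LINES];

/* Model of a fully associative lookup: compare the requested tag with the
 * tag of EVERY line (in hardware this is one comparator per line, all in
 * parallel; here it is just a loop).                                      */
static bool fa_lookup(uint32_t addr, size_t *hit_line)
{
    uint32_t tag = addr / LINE_SIZE;    /* everything except the line offset */
    for (size_t i = 0; i < NUM_LINES; i++) {
        if (fa_cache[i].valid && fa_cache[i].tag == tag) {
            *hit_line = i;
            return true;
        }
    }
    return false;
}

int main(void)
{
    fa_cache[7] = (struct fa_line){ .tag = 0x1234, .valid = true };
    size_t line;
    return fa_lookup(0x1234u * LINE_SIZE + 5u, &line) ? 0 : 1;
}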

The developers of cache memory faced the problem that any cell of the huge main memory can potentially end up in the cache. If the working set of data used by a program is large enough, this means that many fragments of main memory compete for each place in the cache. As noted earlier, a ratio of 1 to 1000 between cache and main memory sizes is not uncommon.

3.3.1 Associativity

It would be possible to implement a cache in which each cache line can store a copy of any memory location; such a cache is called fully associative. To access a cache line, the processor core would have to compare the tag of every single cache line with the tag of the requested address. The tag would have to store the entire part of the address that is not used as the offset within the cache line (which means that the value S shown in the figure in Section 3.2 is zero).

There are caches implemented this way, but looking at the sizes of the L2 caches in use today, it is impractical. Note that a 4 MB cache with 64 B cache lines would have 65,536 entries. To achieve adequate performance, the cache logic would have to be able to select, within a few cycles, the entry matching a given tag from all of these entries. The effort to implement such a scheme would be enormous.

Figure 3.5: Schematic representation of a fully associative cache

Each cache line requires a comparator to perform the large tag comparison (note that S is zero). The letter next to each connection indicates its width in bits; if none is given, the width is one bit. Each comparator must compare two values of T bits each. Then, based on the result, the contents of the matching cache line must be selected and made available. To do this, as many sets of O data lines must be merged as there are cache buckets. The number of transistors needed to implement a single comparator is large, in part because the comparator must be very fast. An iterative comparator cannot be used. The only way to save on comparators would be to reduce their number by comparing the tags iteratively, but that does not work for the same reason iterative comparators do not: it would take too long.

A fully associative cache is practical for small caches (for example, the TLB on some Intel processors is fully associative), but those caches have to be small, really small. We are talking about a few dozen entries at most.

L1i, L1d, and the higher-level caches need a different approach. All that can be done is to restrict the search. In the most extreme case, each tag maps to exactly one cache entry. The calculation is simple: for a 4MB/64B cache with 65,536 entries we can address each entry directly, using bits 6 to 21 of the address (16 bits). The low 6 bits are the index into the cache line (the byte offset within it).


Figure 3.6: Schematic representation of a direct-mapped cache

As can be seen from Figure 3.6, the implementation of such a direct-mapped cache can be fast and simple. It requires exactly one comparator, one multiplexer (two are shown in the diagram because tag and data are separated, but this is not a hard design requirement), and some logic to select only valid cache line content. The comparator is complex because of the speed requirements, but since there is now only one, more effort can be spent on making it fast. The real complexity of this approach lies in the multiplexers. The number of transistors in a simple multiplexer grows as O(log N), where N is the number of cache lines. This is tolerable, but the multiplexer may end up slow, in which case speed can be bought by spending transistors on parallelizing some of its work. The total number of transistors grows slowly with the cache size, which makes this solution very attractive. But the approach has a drawback: it works well only if the addresses used by the program are evenly distributed with respect to the bits used for direct mapping. If they are not, and this is usually the case, some cache entries are heavily used and therefore evicted over and over again, while others are hardly used at all or remain empty.


Figure 3.7: Schematic representation of a cache with multiple associativity

This problem can be solved by a set-associative cache. A set-associative cache combines features of the fully associative and direct-mapped caches and largely avoids the weaknesses of both designs. Figure 3.7 shows the layout of a set-associative cache. The tag and data storage are divided into sets, one of which is selected by the address; this is like a direct-mapped cache. But instead of a single element per set, the same set is used to cache a small number of values. The tags for all elements of the set are compared in parallel, which is similar to the operation of a fully associative cache.

The result is a cache that is reasonably resistant to misses caused by unfortunate or deliberate selection of addresses with the same set number, and at the same time the cache size is not limited by the number of comparators that can work in parallel. If the cache grows (see the figure), only the number of columns increases, not the number of rows; the number of rows increases only if the associativity of the cache increases. Today processors use associativity levels of 16 and higher for L2 caches, while L1 caches usually make do with 8.

Table 3.1: Effects of Cache Size, Associativity, and Cache Line Size

Cache size   Direct-mapped            2-way                    4-way                    8-way
(L2)         CL=32       CL=64        CL=32       CL=64        CL=32       CL=64        CL=32       CL=64
512k         27 794 595  20 422 527   25 222 611  18 303 581   24 096 510  17 356 121   23 666 929  17 029 334
1M           19 007 315  13 903 854   16 566 738  12 127 174   15 537 500  11 436 705   15 162 895  11 233 896
2M           12 230 962   8 801 403    9 081 881   6 491 011    7 878 601   5 675 181    7 391 389   5 382 064
4M            7 749 986   5 427 836    4 736 187   3 159 507    3 788 122   2 418 898    3 430 713   2 125 103
8M            4 731 904   3 209 693    2 690 498   1 602 957    2 207 655   1 228 190    2 111 075   1 155 847
16M           2 620 587   1 528 592    1 958 293   1 089 580    1 704 878     883 530    1 671 541     862 324

If we have a 4MB/64B cache with 8-way associativity, the cache has 8,192 sets and only 13 bits of the address are needed to select a set. To determine which of the entries (if any) in a set contains the addressed cache line, 8 tags have to be compared, which can be done very quickly. As practice shows, this makes sense.

Table 3.1 shows the number of L2 cache misses for one program (in this case gcc, which the Linux kernel developers consider the most important benchmark) as the cache size, cache line size, and associativity are varied. In Section 7.2 we will introduce the cache simulation tool required for this test.

To state the relationship between these values explicitly: the cache size is

cache line size x associativity x number of sets

The mapping of an address onto the cache is computed as

O = log2 (cache line size)

S = log2 (number of sets)

in accordance with the figure in Section 3.2.
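A short sketch that evaluates these relations for the 4MB/64B, 8-way example from the text. A 32-bit address width is assumed here purely for illustration; it recovers the number of sets, O, S, and the remaining tag width T.

#include <stdio.h>
#include <stdint.h>

/* cache size = line size x associativity x number of sets
 * O = log2(line size), S = log2(number of sets), T = address bits - S - O. */

static unsigned log2u(uint64_t x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void)
{
    uint64_t cache_size = 4u << 20;   /* 4 MB   */
    uint64_t line_size  = 64;         /* 64 B   */
    uint64_t assoc      = 8;          /* 8-way  */
    uint64_t sets       = cache_size / (line_size * assoc);

    unsigned O = log2u(line_size);    /* offset bits          */
    unsigned S = log2u(sets);         /* set-index bits       */
    unsigned T = 32 - S - O;          /* remaining tag bits   */

    /* prints: sets = 8192, O = 6, S = 13, T = 13 */
    printf("sets = %llu, O = %u, S = %u, T = %u\n",
           (unsigned long long)sets, O, S, T);
    return 0;
}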


Fig.3.8: Cache size and associativity level (CL=32)

Fig. 3.8 makes the table data easier to understand. The figure shows the data for a fixed cache line size of 32 bytes. Looking at the numbers for a given cache size, one can see that associativity really can reduce the number of cache misses significantly. For an 8 MB cache, going from direct mapping to 2-way associativity saves almost 44% of the cache misses. With a set-associative cache, the processor can keep a larger working set in the cache than with a direct-mapped cache.

One can sometimes read in the literature that introducing associativity has the same effect as doubling the cache size. This is true in some extreme cases, as can be seen in the jump from the 4 MB to the 8 MB cache, but it certainly does not hold for further increases in associativity. As the data show, successive increases in associativity give significantly smaller gains. We should not completely discount the effect, though. In the example program the peak memory use is 5.6 MB, so with an 8 MB cache it is unlikely that many cache sets are claimed more than twice. With a larger working set the savings can be higher, since, as we see, the benefit of associativity is greater for smaller cache sizes.

In general, increasing the associativity of a cache above 8 seems to have little effect on a single-threaded workload. With the arrival of multi-core processors that share an L2 cache the situation changes: now there are essentially two programs hitting the same cache, which in practice roughly doubles the effect of associativity (or quadruples it for quad-core processors). So it can be expected that, as the number of cores grows, the associativity of shared caches should grow too. Once this becomes impossible (16-way associativity is already hard to implement), processor designers will start using shared L3 caches and beyond, while the L2 cache is shared by a subset of the cores.

Another effect visible in Figure 3.8 is how increasing the cache size improves performance. This data cannot be interpreted without knowing the working set size. Obviously, a cache as large as main memory would give better results than a smaller cache, so in general there is no limit to the cache size that still brings measurable benefits.

As mentioned above, the working set size peaks at 5.6 MB. This value does not let us compute the memory size that would bring the maximum benefit, but it lets us estimate it. The problem is that not all of the memory is used continuously, so there are conflicts even with a 16 MB cache and a 5.6 MB working set (recall the advantage of the 2-way set-associative 16 MB cache over the direct-mapped version). But it is safe to say that, under this load, the benefit of a 32 MB cache would be negligible. On the other hand, who says the working set stays constant? Workloads grow over time, and so should the cache size. When buying machines and deciding how much cache is worth paying for, it pays to measure the working set size. Why this matters can be seen in Fig. 3.10.


Figure 3.9: Memory layout used in testing

Two types of tests are run. In the first, the elements are processed sequentially: the test program follows the pointer n, and the array elements are linked so that they are traversed in the order in which they lie in memory. This variant is shown at the bottom of Figure 3.9; there is one back reference from the last element. In the second test (the upper part of the figure), the array elements are traversed in random order. In both cases the array elements form a cyclic singly linked list.
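A minimal sketch of this kind of test layout, assuming an element type with a next pointer n and some padding (the exact element size and field names of the original test are not given here, so treat them as assumptions): the same array is chained either in memory order or in shuffled order, and the walk is simply p = p->n.

#include <stdio.h>
#include <stdlib.h>

#define N 1024

struct elem {
    struct elem *n;          /* pointer to the next element                     */
    long         pad[7];     /* padding; makes the element 64 bytes on a 64-bit
                                machine (an assumption for this sketch)         */
};

static struct elem list[N];

/* Chain the elements in memory order: 0 -> 1 -> ... -> N-1 -> 0. */
static void link_sequential(void)
{
    for (int i = 0; i < N; i++)
        list[i].n = &list[(i + 1) % N];
}

/* Chain the elements into one cycle that visits them in random order. */
static void link_random(void)
{
    int idx[N];
    for (int i = 0; i < N; i++) idx[i] = i;
    for (int i = N - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
        int j = rand() % (i + 1);
        int t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (int i = 0; i < N; i++)
        list[idx[i]].n = &list[idx[(i + 1) % N]];
}

int main(void)
{
    link_sequential();
    link_random();
    /* Walking the list: repeat p = p->n many times, timing each pass. */
    volatile struct elem *p = &list[0];
    for (long k = 0; k < 10L * N; k++) p = p->n;
    printf("done, ended at element %td\n", p - list);
    return 0;
}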

The known ways of mapping main memory onto the cache can be reduced to three types: direct, fully associative, and partially associative.

With direct mapping, the number i of the cache line to which block j of main memory (OP) can be mapped is uniquely determined by the expression i = j mod m, where m is the total number of lines in the cache. In other words, cache line i receives every m-th block of main memory, counting from the block whose number equals i.

Direct mapping is a simple and inexpensive method to implement. Its main drawback is the rigid assignment of one cache line to particular blocks of main memory: if a program alternately accesses words from two different blocks mapped to the same cache line, that line is constantly updated and the hit probability is low.

Fully associative mapping overcomes this drawback of direct mapping by allowing any block of RAM to be loaded into any cache line. Associative mapping gives flexibility in choosing a line for a newly written block. Its fundamental disadvantage is the need to check all cache lines.

Partially associative (set-associative) mapping is one of the possible compromises, combining the advantages of the direct and associative methods and, to a certain extent, free of their shortcomings. The cache is divided into v sets, each containing k lines (a set is said to have k ways). The relationship between a set and the blocks of main memory is the same as in direct mapping: only well-defined blocks of main memory can be mapped onto the lines of set i, according to the relation i = j mod v, where j is the number of the main-memory block. Within a set, however, a block can be placed in any line, and the associative principle is used to find the required line within the set.

In the extreme cases, when v = m and k = 1, set-associative mapping reduces to direct mapping, and when v = 1 and k = m, it reduces to fully associative mapping.
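The mapping relations and their two degenerate cases can be written down directly; the following small C sketch is only an illustration of the formulas above.

#include <stdio.h>

/* Direct mapping: block j of main memory goes to cache line i = j mod m.
 * Set-associative: block j goes to set i = j mod v and may occupy any of
 * the k lines of that set (m = v * k lines in total).                    */
static unsigned direct_line(unsigned j, unsigned m) { return j % m; }
static unsigned set_index  (unsigned j, unsigned v) { return j % v; }

int main(void)
{
    unsigned m = 8192;                 /* total cache lines       */
    unsigned k = 4;                    /* lines per set (4-way)   */
    unsigned v = m / k;                /* number of sets          */
    unsigned j = 123456;               /* some main-memory block  */

    printf("direct-mapped line : %u\n", direct_line(j, m));
    printf("4-way set index    : %u\n", set_index(j, v));

    /* Degenerate cases: v = m, k = 1 gives direct mapping;
     *                   v = 1, k = m gives a fully associative cache. */
    printf("v = m -> %u (same as direct mapping)\n", set_index(j, m));
    printf("v = 1 -> %u (the single set; any line may be used)\n", set_index(j, 1));
    return 0;
}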

Depending on the method of determining the correspondence between a cache line and a main memory block, three cache architectures are distinguished:

· fully associative cache;

· direct-mapped cache;

· set-associative (partially or multiple-associative) cache.

In a fully associative cache, any block of main memory can reside in any cache line, or, equivalently, any cache line can map any block of main memory. In this case the upper bits of the address of the cached data, minus the bits that determine the position (offset) of the data within the line (block), are entered into the directory and used as the tag. In such an architecture, to determine whether the data at a particular address is in the cache, the upper bits of that address must be compared with the tags of all lines in the cache directory. If this comparison were done sequentially it would take too long, and the cache would be pointless because of low performance, so the comparison must be performed in parallel for all tags. This requirement is best met by associative memory; that is, the tags must be stored in the cache's associative tag memory.



Such an organization of the cache memory is a complex hardware problem that can be solved only for small volumes; that is, a fully associative cache, because of its complexity, cannot be large and is used, as a rule, for auxiliary purposes. For example, in Intel processors a fully associative cache is used in the paging unit to build the translation lookaside buffer (TLB), which speeds up access to heavily used pages.

The opposite architecture is the direct-mapped cache. In a direct-mapped cache, a given block of main memory can reside only in one well-defined cache line. Main memory is conventionally divided into pages whose size matches the size of the cache. The direct-mapping architecture means that each cache line can map only the correspondingly positioned block from any page of main memory; blocks with the same number in all pages fall into the same cache line. Each cache line is therefore claimed by many main-memory blocks that have the same number within their page, but at any moment a line can contain a copy of only one of them. The tag is the number of the page whose block occupies the given cache line. In such an architecture, to determine whether the data at a specific address is in the cache, it is enough to compare the number of the page containing that address with the tag of the single directory line corresponding to the block's position within the page; that is, only one comparison is needed.

The direct-mapped cache has the simplest hardware implementation, since the cache has the structure of ordinary directly addressed memory and only one comparator is needed. Such a cache can therefore be large.

Intermediate between the fully associative and direct-mapped caches is the set-associative cache, which is the one mainly used in modern microprocessors. In a set-associative cache, unlike a direct-mapped one, each block of main memory can claim one of several cache lines combined into a set, which increases the probability of a hit. Simplifying, a set-associative cache can be viewed as several parallel, coordinated direct-mapping channels in which the lines with the same number form the corresponding set. The line of the set that maps the required block of main memory is determined by a tag comparison (as in an associative cache) performed in parallel for all channels of the cache. Each set has an associated flag that specifies the line of the set to be replaced by a new block in the event of a cache miss. The candidate for replacement is usually the line that has not been accessed for the longest time (the LRU algorithm, Least Recently Used). A FIFO or even random replacement algorithm may also be used; these are simpler but less efficient.

Why do we still observe a steady increase in the performance of single-threaded programs? We are at a stage in the development of microprocessor technology where speedups for single-threaded applications come only from memory. The number of cores keeps growing, but clock frequency has stalled at around 4 GHz and no longer gives a performance boost.

The speed and frequency of memory are the main reason we still get "our piece of the free pie" (link). That is why it is important to use memory as efficiently as possible, and all the more so memory as fast as the cache. To optimize a program for a specific computer, it is useful to know the characteristics of the processor's cache memory: the number of levels, the size, and the line length. This is especially important for high-performance code, such as system kernels and mathematical libraries.

How can the cache characteristics be determined automatically? (Parsing cpuinfo does not count, if only because in the end we would like an algorithm that can easily be implemented on other operating systems. Convenient, isn't it?) That is what we will do now.

A bit of theory

Three types of cache currently exist and are widely used: the direct-mapped cache, the fully associative cache, and the multiple-associative (set-associative) cache.
Direct-mapped cache
- a given line of RAM can be mapped to only one cache line, while many different RAM lines may map to each cache line.
Fully associative cache
- any line of RAM can be mapped to any cache line.
Multiple-associative (set-associative) cache
- the cache memory is divided into several "banks", each of which functions as a direct-mapped cache, so a line of RAM can be mapped not to a single possible cache entry (as with direct mapping) but to one of several banks; the line to evict is chosen by LRU or some other policy for each cached set.

LRU - eviction of the cache line that has gone unused the longest.

Idea

To determine the number of cache levels, we need to arrange the memory accesses so that the transitions between levels are clearly visible. Cache levels differ primarily in memory response time. On an L1 cache miss, the data is searched for in the next levels of the memory hierarchy; if the data size is larger than L1 but smaller than L2, the memory response time will be the L2 response time, and the same holds in general for the other levels.

It is clear that we need to choose a test on which we will clearly see cache misses and test it on various data sizes.

Knowing the logic of set-associative caches with LRU replacement, it is not hard to come up with an access pattern on which the cache breaks down; nothing tricky, just a pass over an array. The efficiency criterion is the time of a single memory access. Naturally, all elements of the array must be accessed sequentially, and the pass repeated many times to average the result. Otherwise there are cases where the array fits in the cache but on the first pass it is loaded from RAM, giving a completely unrepresentative time.

We would like to see something like a staircase when passing over arrays of different lengths. To understand the shape of the steps, consider a pass over an array for a direct-mapped cache and for a fully associative cache; the set-associative case lies between the two.

Associative cache

As soon as the data size exceeds the cache size, a fully associative cache "misses" on every memory access.

Direct-mapped cache

Consider arrays of different sizes and the maximum number of misses the processor incurs when accessing the array elements on the next pass over the array.

As the amount of data grows, the memory access time increases not abruptly but gradually, and as soon as the data size exceeds the cache size there are misses on every memory access.

Therefore, for a fully associative cache the step is vertical, while for a direct-mapped cache it rises gradually up to twice the size of the cache. A set-associative cache is an intermediate case, a "step", if only because the access time cannot be better than that of the direct-mapped cache.

If we talk about memory, the cache is the fastest, then comes RAM, and the slowest is swap, which we will not discuss further. Different cache levels (as a rule, today's processors have 2-3 levels of cache) have, in turn, different response speeds: the higher the level, the slower the response. Therefore, if an array fits into the first level of the cache (which, by the way, is fully associative), the response time will be shorter than for an array significantly larger than the L1 cache. So the graph of memory response time versus array size will contain several plateaus: the plateau* of the RAM response and the plateaus caused by the different cache levels.

*A plateau of a function f: a set of points { x_i : |f(x_i) - f(x_{i+1})| < eps }, eps → 0.

Let's start implementing

We will use C (ANSI C99) for implementation.

The code was written quickly: the usual pass over arrays of different lengths, up to 10 MB, performed repeatedly. (Only the fragments of the program that carry the semantic load are shown.)

for (i = 0; i < 16; i++) {
    for (j = 0; j < L_STR; j++)
        A[j]++;
}

We look at the graph and see one big step, although in theory everything should have come out nicely. So we need to understand why this happens and how to fix it.

Obviously, this can happen for two reasons: either the processor has no cache memory, or the processor is very good at guessing memory accesses. Since the first option borders on fiction, the reason is good prediction of accesses.

The point is that nowadays even far-from-top-end processors, besides exploiting the principle of spatial locality, also predict accesses that follow an arithmetic progression (a constant stride). Therefore memory accesses need to be randomized.

The length of the randomized array should be comparable to the length of the main array, to avoid too coarse a granularity of accesses; it should also not be a power of two, since that caused "overlaps" (aliasing) which could show up as outliers. It is best to make the stride constant, and if the stride is a prime number, aliasing effects can be avoided. The length of the randomized array is a function of the length of the main array.
for (i = 0; i < j; i++) {
    for (m = 0; m < L; m++) {
        for (x = 0; x < M; x++) {
            v = A[ random[x] + m ];
        }
    }
}

After that, we saw the long-awaited "picture" we spoke about at the beginning.

The program is divided into two parts: the test itself and the data processing. Whether to write a three-line script to run both, or to run them by hand twice, is up to you.

Listing of the size.c file (Linux)

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define T char
#define MAX_S 0x1000000
#define L 101

volatile T A[MAX_S];        /* array walked during the test; the array sizes and
                               header names were lost in the original listing and
                               are restored here as assumptions                   */
int m_rand[MAX_S / L + 1];  /* randomized starting offsets of the L-byte strides  */

int main ()
{
    static struct timespec t1, t2;
    memset ((void*)A, 0, sizeof (A));
    srand(time(NULL));
    int v, M;
    register int i, j, k, m, x;

    for (k = 1024; k < MAX_S;) {
        M = k / L;
        printf("%g\t", (k + M*4)/(1024.*1024));

        /* fill and shuffle the index array */
        for (i = 0; i < M; i++) m_rand[i] = L * i;
        for (i = 0; i < M/4; i++) {
            j = rand() % M;
            x = rand() % M;
            m = m_rand[j]; m_rand[j] = m_rand[i]; m_rand[i] = m;
        }

        /* fewer repetitions for larger arrays */
        if (k < 100*1024) j = 1024;
        else if (k < 300*1024) j = 128;
        else j = 32;

        clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &t1);
        for (i = 0; i < j; i++)
            for (m = 0; m < L; m++)
                for (x = 0; x < M; x++)
                    v = A[ m_rand[x] + m ];
        clock_gettime (CLOCK_PROCESS_CPUTIME_ID, &t2);

        /* average time of one access, in nanoseconds */
        printf ("%g\n", 1000000000. *
                ((t2.tv_sec + t2.tv_nsec * 1.e-9) - (t1.tv_sec + t1.tv_nsec * 1.e-9))
                / (double)(L*M*j));

        if (k > 100*1024) k += k/16;
        else k += 4*1024;
    }
    return 0;
}

Listing of the size.c file (Windows)

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <iostream>
using namespace std;

#define T char
#define MAX_S 0x1000000
#define L 101

volatile T A[MAX_S];        /* as in the Linux listing, array sizes and header
                               names are restored here as assumptions           */
int m_rand[MAX_S / L + 1];

int main ()
{
    LARGE_INTEGER freq;
    LARGE_INTEGER time1;
    LARGE_INTEGER time2;
    QueryPerformanceFrequency(&freq);
    memset ((void*)A, 0, sizeof (A));
    srand(time(NULL));
    int v, M;
    register int i, j, k, m, x;

    for (k = 1024; k < MAX_S;) {
        M = k / L;
        printf("%g\t", (k + M*4)/(1024.*1024));

        for (i = 0; i < M; i++) m_rand[i] = L * i;
        for (i = 0; i < M/4; i++) {
            j = rand() % M;
            x = rand() % M;
            m = m_rand[j]; m_rand[j] = m_rand[i]; m_rand[i] = m;
        }

        if (k < 100*1024) j = 1024;
        else if (k < 300*1024) j = 128;
        else j = 32;

        QueryPerformanceCounter(&time1);
        for (i = 0; i < j; i++)
            for (m = 0; m < L; m++)
                for (x = 0; x < M; x++)
                    v = A[ m_rand[x] + m ];
        QueryPerformanceCounter(&time2);

        time2.QuadPart -= time1.QuadPart;
        double span = (double) time2.QuadPart / freq.QuadPart;
        printf ("%g\n", 1000000000. * span / (double)(L*M*j));

        if (k > 100*1024) k += k/16;
        else k += 4*1024;
    }
    return 0;
}

In general, I think everything is clear, but a few points deserve comment.

The array A is declared volatile: this qualifier guarantees that the accesses to A are always performed, i.e. they will not be "cut out" by the optimizer or the compiler. It is also worth mentioning that all the preparatory work (filling and shuffling the index array) is done before the time measurement, which reduces background effects.

The file was compiled to assembly on Ubuntu 12.04 with gcc 4.6 to verify this: the loops are preserved.

Data processing

It is logical to use derivatives to process the data. Although the noise grows as the order of differentiation increases, the second derivative and its properties will be used; however noisy the second derivative is, we are only interested in its sign.

We find all the points at which the second derivative is greater than zero (with some tolerance, because the second derivative, besides being computed numerically, is very noisy). We then define a function of the cache (array) size that reflects the sign of the second derivative: it takes the value 1 at the points where the sign of the second derivative is positive, and 0 where the second derivative is less than or equal to zero.

The take-off points mark the beginning of each step. Also, before processing, single outliers that do not change the meaning of the data but create noticeable noise must be removed.

Listing of the data_pr.c file
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

double round (double x)
{
    int mul = 100;
    if (x > 0) return floor(x * mul + .5) / mul;
    else       return ceil (x * mul - .5) / mul;
}

/* The header names and the array indices below (of the form [i-1], [i+1], ...)
   were lost in the original listing; they are restored here from context and
   should be treated as assumptions.                                            */
float size[128], time[128], der_1[128], der_2[128];

int main()
{
    size[0] = 0; time[0] = 0; der_1[0] = 0; der_2[0] = 0;
    int i, z = 110;

    for (i = 1; i < 110; i++)
        scanf("%g%g", &size[i], &time[i]);

    /* first derivative */
    for (i = 1; i < z; i++)
        der_1[i] = (time[i] - time[i-1]) / (size[i] - size[i-1]);

    /* 1 where the response time takes off, 0 elsewhere */
    for (i = 1; i < z; i++)
        if ((time[i] - time[i-1]) / (size[i] - size[i-1]) > 2)
            der_2[i] = 1;
        else
            der_2[i] = 0;

    /* "comb": remove isolated outliers one, two and three points wide */
    for (i = 0; i < z; i++)
        if (der_2[i] == der_2[i+2] && der_2[i+1] != der_2[i])
            der_2[i+1] = der_2[i];
    for (i = 0; i < z-4; i++)
        if (der_2[i] == der_2[i+3] && der_2[i] != der_2[i+1] && der_2[i] != der_2[i+2]) {
            der_2[i+1] = der_2[i];
            der_2[i+2] = der_2[i];
        }
    for (i = 0; i < z-4; i++)
        if (der_2[i] == der_2[i+4] && der_2[i] != der_2[i+1] &&
            der_2[i] != der_2[i+2] && der_2[i] != der_2[i+3]) {
            der_2[i+1] = der_2[i];
            der_2[i+2] = der_2[i];
            der_2[i+3] = der_2[i];
        }

    /* print the beginning of each step as a cache level */
    int k = 1;
    for (i = 0; i < z-4; i++) {
        if (der_2[i] == 1)
            printf("L%d = %g\tMb\n", k++, size[i]);
        while (der_2[i] == 1) i++;
    }
    return 0;
}

Tests

The CPU, OS, kernel version, compiler, and compilation flags are specified for each test.

  • Intel Pentium CPU P6100 @2.00GHz / Ubuntu 12.04 / 3.2.0-27-generic / gcc -Wall -O3 size.c -lrt

    L1 = 0.05 Mb
    L2 = 0.2 Mb
    L3 = 2.7 Mb

  • I will not list all the successful tests; instead, let's talk about the "rake".

Let's talk about the "rake"

The rake was encountered while processing data from a server processor: Intel Xeon 2.4 GHz / L2 = 512 KB / Windows Server 2008.

The problem lies in the small number of measurement points that fall within the interval where the plateau is reached; as a result, the jump in the second derivative is imperceptible and is taken for noise.

This problem can be solved with the least-squares method, or by running additional tests while the plateau zones are being determined.

I would like to hear your suggestions on how to solve this problem.

Bibliography

  • Makarov A.V. Computer architecture and low-level programming.
  • Ulrich Drepper. What Every Programmer Should Know About Memory.

The following mapping options are distinguished:

· direct-mapped (placement) cache;

· fully associative cache;

· multiple-associative (partially associative) cache.

A direct-mapped (placement) cache is the simplest type of buffer. The memory address uniquely identifies the cache line into which the block of information will be placed. It is assumed that RAM is divided into blocks and that each block is assigned exactly one line in the buffer. This is a simple and inexpensive mapping method to implement. Its main drawback is the rigid assignment of one cache line to particular main-memory blocks. Therefore, if a program alternately accesses words from two different blocks mapped to the same cache line, that line will be constantly updated and the hit probability will be low.

A fully associative cache overcomes the drawback of the direct-mapped one by allowing any block of RAM to be loaded into any cache line. The control logic distinguishes two fields in the main-memory address: the tag field and the word field. The tag field corresponds to the address of the main-memory block. To check whether a copy of a block is in the cache, the control logic must simultaneously compare the tags of all lines with the tag field of the address. Associative mapping gives flexibility in choosing a line for a newly written block. Its fundamental disadvantage is the need to use expensive associative memory.

Multiple-associative (partially associative) mapping is one of the possible compromises, combining the advantages of the direct and associative methods. The cache memory (both tags and data) is divided into a number of modules. The correspondence between a module and the blocks of main memory is as rigid as in direct mapping, but a block may be placed in any line of the module, and the associative principle is used to find the required line within the module. This mapping method is the most widely used in modern microprocessors.

Mapping of main-memory sectors onto the cache.

This type of mapping is used in all modern computers and consists in dividing the entire main memory into sectors made up of a fixed number of consecutive blocks. The cache is likewise divided into sectors containing the same number of lines, and the arrangement of blocks within a main-memory sector and within a cache sector is identical. A sector is mapped into the cache associatively, so any sector from RAM can be placed in any sector of the cache. Thus, when, during operation, the processor turns to main memory for the next instruction and the cache does not contain the block with that instruction, a whole sector of information is loaded from main memory into the cache; thanks to the principle of locality, this gives a significant increase in system speed.

Hierarchical cache model

As a rule, cache memory has a multi-level architecture. For example, in a computer with 32 KB of internal (in the CPU core) and 1 MB of external (in the CPU package or on the motherboard) cache, the former is considered the level-1 cache (L1) and the latter the level-2 cache (L2). In modern server systems the number of cache levels can reach four, although two or three levels are used most often.

In some processor architectures the L1 cache is split into an instruction cache (Instruction Cache, I-cache) and a data cache (Data Cache, D-cache), not necessarily of the same size. From the circuit-design point of view, separate I-cache and D-cache are simpler and cheaper to design: instruction fetch is performed by the I-box, while data are fetched by the E-box and F-box, although the A-box and C-box are involved in both cases. All of these units are large, and giving them simultaneous fast access to a single cache is problematic. In addition, this would inevitably require more access ports, which also complicates the design.

Since the I-cache and D-cache must provide very low access latency (this is true of any L1 cache), size has to be sacrificed: they are usually 16 to 32 KB. After all, the smaller the cache, the easier it is to achieve low access latency.

The level-2 cache is, as a rule, unified, i.e. it can hold both instructions and data. If it is built into the CPU core it is called the S-cache (Secondary Cache); otherwise, the B-cache (Backup Cache). In modern server CPUs the S-cache ranges from one to several megabytes, and the B-cache up to 64 MB. If the CPU design provides a built-in level-3 cache, it is called the T-cache (Ternary Cache). As a rule, each successive cache level is slower but larger than the previous one. If a B-cache is present in the system (as the last level of the cache hierarchy), it can be controlled either by the CPU or by the system logic (chipset).

If, at the moment a certain instruction executes, the data it needs are not in the registers, they are requested from the nearest cache level, i.e. the D-cache. If they are not in the D-cache, the request goes to the S-cache, and so on. In the worst case the data are delivered directly from memory. An even sadder scenario is possible, however, when the operating system's (OS) virtual-memory subsystem has pushed them out to the page file on the hard disk. Delivery from RAM costs from tens to hundreds of CPU cycles; if the data are on the hard disk, the cost is already measured in millions of cycles.

Cache Associativity

One of the fundamental characteristics of cache memory, its associativity, reflects its logical segmentation. Sequentially scanning all cache lines in search of the required data would take tens of cycles and would cancel out the whole gain from the memory built into the CPU. Therefore RAM cells are hard-wired to particular cache lines (each line can hold data only from a fixed set of addresses), which greatly reduces the search time. More than one cache line can be associated with each RAM cell: n-way associativity means that information at a given RAM address can be stored in n places in the cache.

The place can be chosen by various algorithms, most often based on the LRU (Least Recently Used) or LFU (Least Frequently Used) replacement principles, although modifications of these principles exist. One extreme is the fully associative cache, in which information located at any address in RAM can be placed in any line. The opposite extreme is direct mapping, in which information located at any address in RAM can be placed in only one location in the cache. Naturally, direct mapping gives the fastest lookup, since the controller has to "look" into only one cache line when checking for the data, but it is also the least efficient, since the controller cannot choose an "optimal" place when writing. For the same cache size, the fully associative scheme is the slowest but the most efficient.

A fully associative cache is encountered in practice, but as a rule it is very small. For example, the Cyrix 6x86 CPU used a 256-byte fully associative instruction cache in front of a unified 16 or 64 KB L1 cache. A fully associative scheme is also often used in TLBs (discussed below), branch-target address caches, read-write buffers, and so on. As a rule, the associativity of the I-cache and D-cache is quite low (up to four ways): raising it is inexpedient, since it increases access delays and thus hurts performance. As a compensation, the associativity of the S-cache is increased (usually up to 16 ways), since the delays in accessing that cache are less critical. For example, according to a study of commonly used integer tasks, in the Intel Pentium III the 16 KB 4-way D-cache covered about 93% of requests, and the 16 KB 4-way I-cache about 99%.

Cache line and tag size

An important characteristic of cache memory is the line size. As a rule, there is one address record (the so-called tag) per line, indicating which address in RAM the line corresponds to. Tagging individual bytes would obviously be pointless, since the overhead information in the cache would then exceed the data itself several times over. Therefore one tag usually corresponds to one line, which is typically 32 or 64 bytes long (the practical maximum is 1024 bytes) and equivalent to four (sometimes eight) widths of the system data bus. In addition, each cache line is accompanied by some information for fault tolerance: one or more parity bits, or eight or more bytes of error checking and correcting (ECC) code, although mass-market solutions often use neither.

The size of a cache tag depends on three main factors: the cache size, the maximum cacheable amount of RAM, and the cache associativity. Mathematically, this size is calculated by the formula

Stag = log2 (Smem * A / Scache),

where Stag is the size of one cache tag in bits; Smem is the maximum cacheable amount of RAM in bytes; Scache is the cache size in bytes; and A is the cache associativity in ways.

It follows that a system with 1 GB of cacheable RAM and a 1 MB cache with 2-way associativity needs 11 bits per tag. Note that the cache line size itself does not affect the tag size, only the number of tags. It makes no sense to make the cache line smaller than the width of the system data bus, but increasing it many times over clogs the cache with unneeded information and overloads the system and memory buses. Also, the maximum cacheable amount of memory need not match the maximum amount of RAM that can be installed in the system. If there is more RAM than can be cached, the cache will hold information only from the lower segment of RAM. This was exactly the situation with the Socket 7/Super 7 platform: its chipsets allowed large amounts of RAM (from 256 MB to 1 GB), while the cacheable amount was often limited to the first 64 MB (this refers to the B-cache on the motherboard) because of the use of cheap 8-bit tag SRAM chips (2 bits of which were reserved for the validity and modification flags). This led to a noticeable drop in performance.
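As a quick check of the formula, here is a minimal sketch that evaluates Stag for the example in the text (1 GB of cacheable RAM, 1 MB cache, 2-way associativity) and prints 11 bits.

#include <stdio.h>
#include <stdint.h>

/* Stag = log2(Smem * A / Scache), evaluated for the example above. */
static unsigned log2u(uint64_t x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void)
{
    uint64_t Smem   = 1ull << 30;   /* 1 GB cacheable RAM  */
    uint64_t Scache = 1ull << 20;   /* 1 MB cache          */
    uint64_t A      = 2;            /* 2-way associativity */

    printf("Stag = %u bits\n", log2u(Smem * A / Scache));  /* prints 11 */
    return 0;
}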