How much cache memory do you need? A new approach to processor caching. Packing and unpacking

While performing its tasks, your computer's processor fetches the blocks of data it needs from RAM. After processing them, the CPU writes the results of its calculations back to memory and fetches the next blocks of data for processing. This continues until the task is completed.

These exchanges happen at a very high speed, yet even the fastest RAM is significantly slower than the weakest processor: every operation, whether writing data to memory or reading from it, takes a comparatively long time. The speed of RAM is roughly an order of magnitude lower than the speed of the processor.

Despite this difference in speed, the PC's processor does not sit idle waiting for the RAM to deliver or accept data. The processor keeps working all the time, and it can do so thanks to the cache memory built into it.

Cache is a special kind of RAM. The processor uses cache memory to store those copies of information from the computer's main RAM that are likely to be accessed in the near future.

In essence, the cache memory acts as a high-speed memory buffer that stores information that the processor may need. Thus, the processor receives the necessary data ten times faster than when reading them from RAM.

The main difference between cache memory and an ordinary buffer is the built-in logic. A buffer stores arbitrary data that is usually handled on a "first in, first out" (FIFO) or "last in, first out" (LIFO) basis. The cache holds data that is likely to be accessed in the near future. Thanks to this "smart" behavior, the processor can run at full speed instead of waiting for data to be retrieved from slower RAM.
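As a rough illustration (my own sketch, not part of the original text): a plain buffer simply hands data back in arrival order, while a cache keeps the entries that are most likely to be reused and evicts the least recently used ones.

using System;
using System.Collections.Generic;

// A plain buffer: items leave in arrival order (FIFO), with no reuse logic.
var buffer = new Queue<string>();
buffer.Enqueue("block A");
buffer.Enqueue("block B");
Console.WriteLine(buffer.Dequeue());   // always the oldest item

// A tiny LRU-style cache: recently used entries stay; when capacity is
// exceeded, the least recently used entry is evicted.
const int capacity = 2;
var cache = new Dictionary<string, string>();
var usageOrder = new LinkedList<string>();   // most recently used at the front

void Put(string key, string value)
{
    if (!cache.ContainsKey(key) && cache.Count >= capacity)
    {
        string victim = usageOrder.Last.Value;   // least recently used entry
        usageOrder.RemoveLast();
        cache.Remove(victim);
    }
    cache[key] = value;
    usageOrder.Remove(key);                      // no-op if the key is new
    usageOrder.AddFirst(key);                    // mark as most recently used
}

Put("page 1", "data");
Put("page 2", "data");
Put("page 3", "data");                           // evicts "page 1"
Console.WriteLine(string.Join(", ", cache.Keys)); // the cache now holds page 2 and page 3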

Main types and levels of cache memory: L1, L2, L3

Cache memory is made in the form of static random access memory (SRAM) chips that are installed on the system board or built into the processor. Compared to other types of memory, static memory can operate at very high speeds.

The speed of a cache depends on its size: the larger a memory block is, the harder it is to make it fast. For this reason, the processor cache is built as several small blocks called levels. The most common arrangement today is the three-level scheme L1, L2, L3:

The first-level cache (L1) is the smallest (only a few tens of kilobytes), but it is the fastest and the most important. It holds the data the processor uses most often and operates without delays. Typically, the number of L1 blocks equals the number of processor cores, and each core accesses only its own L1 block.

The second-level cache (L2) is slower than L1 but larger, with a capacity measured in hundreds of kilobytes. It temporarily stores important data that is less likely to be accessed than the data kept in the L1 cache.

The third-level cache (L3) has the largest capacity of the three (it can reach tens of megabytes) but is also the slowest, although still significantly faster than RAM. The L3 cache is shared by all processor cores. It temporarily stores important data whose probability of being accessed is slightly lower than that of the data held in L1 and L2, and it also handles the exchange of data between the processor cores.

Some processor models have only two levels of cache memory, in which case the L2 cache combines the functions of L2 and L3.

When a large amount of cache is useful

You will feel a significant effect from a large cache when using archiving programs, in 3D games, and during video processing and encoding. In relatively "light" programs and applications (office suites, media players, and so on) the difference is practically unnoticeable.

What is the dirtiest place on a computer? The Recycle Bin? The user folders? The cooling system? Wrong! The dirtiest place is the cache! After all, it constantly has to be cleaned!

In fact, there are many caches on a computer, and they serve not as a waste dump but as accelerators for hardware and applications. Where, then, does their reputation as the system's "garbage chute" come from? Let's look at what a cache is, what kinds of caches there are, how they work, and why they need to be cleared from time to time.

The concept and types of cache memory

Cache memory is a special store of frequently used data that can be accessed tens, hundreds or even thousands of times faster than RAM or other storage media.

Applications (web browsers, audio and video players, database editors and so on), operating system components (the thumbnail cache, the DNS cache) and hardware (the CPU's L1-L3 caches, the graphics chip's frame buffer, drive buffers) all have their own cache memory. It is implemented in different ways - in software or in hardware.

  • The application-level cache is simply a separate folder or file into which, for example, pictures, menus, scripts, multimedia and other content of visited sites are downloaded. This is the folder the browser dives into first when you open a web page again: pulling a piece of content from local storage speeds up its loading (a minimal sketch of this idea follows this list).

  • In hard drives, the cache is a separate RAM chip with a capacity of 1-256 MB located on the drive's electronics board. It receives data read from the magnetic platters that has not yet been loaded into RAM, as well as the data that the operating system requests most often.

  • A modern central processor contains 2-3 main levels of cache memory (also called scratchpad memory) implemented as hardware modules on the same die. The fastest and smallest (32-64 KB) is the Level 1 (L1) cache; it runs at the same frequency as the processor. L2 occupies the middle position in both speed and capacity (from 128 KB to 12 MB). And L3 is the slowest and largest (up to 40 MB); it is absent on some models. L3 is slow only relative to its faster siblings; it is still far faster than the most productive RAM.
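Returning to the first item in the list above (the application-level cache), here is a minimal sketch of the idea in C#. The folder name and the file-naming scheme are made up for illustration; a real browser uses its own layout and invalidation rules.

using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

static async Task<byte[]> GetResourceAsync(string url)
{
    // Hypothetical cache folder for downloaded resources.
    string cacheDir = Path.Combine(Path.GetTempPath(), "demo-web-cache");
    Directory.CreateDirectory(cacheDir);

    // Derive a file name from the URL (simplified; not collision-safe).
    string fileName = Convert.ToBase64String(System.Text.Encoding.UTF8.GetBytes(url))
                             .Replace('/', '_');
    string cacheFile = Path.Combine(cacheDir, fileName);

    if (File.Exists(cacheFile))
        return await File.ReadAllBytesAsync(cacheFile);    // cache hit: no network access

    using var http = new HttpClient();
    byte[] data = await http.GetByteArrayAsync(url);       // cache miss: download
    await File.WriteAllBytesAsync(cacheFile, data);        // store a copy for next time
    return data;
}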

The processor's scratchpad memory is used to store constantly used data pulled from RAM, along with machine-code instructions. The larger it is, the faster the processor works.

Today, three levels of caching are no longer the limit. With the Sandy Bridge architecture, Intel added an extra L0 cache to its products (intended for storing decoded micro-operations). And the highest-performance CPUs also have a fourth-level cache, implemented as a separate chip.

Schematically, the interaction of cache L0-L3 levels looks like this (for example, Intel Xeon):

How it all works, in plain language

To understand how cache memory works, imagine a person working at a desk. The folders and documents he uses all the time lie on the desk (in the cache). To reach them, he only has to stretch out a hand.

The papers he needs less often are stored nearby on shelves (in RAM). To get them, he has to stand up and walk a few meters. And whatever he is not currently working on has been sent to the archive (written to the hard disk).

The wider the desk, the more documents fit on it, and the faster the employee can reach the information he needs (the larger the cache, the faster a program or device works, in theory).

Sometimes he makes mistakes: he keeps papers with incorrect information on the desk and uses them in his work. As a result, the quality of his work suffers (cache errors lead to software and hardware failures). To fix the situation, the employee has to throw away the erroneous documents and put correct ones in their place (clear the cache memory).

The desk has a limited area (cache memory is limited). Sometimes it can be extended, for example by pushing up a second desk, and sometimes it cannot (the cache size can be increased if the program provides such an option; a hardware cache cannot be changed, since it is fixed in silicon).

Another way to speed up access to more documents than the desk can hold is to find an assistant who hands the worker papers from the shelves (the operating system can allocate part of the unused RAM for caching device data). But this is still slower than taking them straight off the desk.

The documents at hand should be relevant to the current tasks. This is the employee's own responsibility: he needs to tidy up his papers regularly (evicting stale data from the cache falls "on the shoulders" of the applications that use it; some programs have an automatic cache-clearing function).

If the employee forgets to keep his workplace in order and his documentation up to date, he can draw up a desk-cleaning schedule and use it as a reminder, or, as a last resort, entrust the job to an assistant (if an application that depends on its cache has become slower or often loads outdated data, use scheduled cache-cleaning tools or clear the cache manually every few days).

We actually run into "caching" all over the place: buying groceries for the days ahead, doing various things in passing or along the way, and so on. Essentially, it is anything that spares us unnecessary fuss and extra movements, streamlines life and makes work easier. The computer does the same. In short, if there were no cache, a computer would run hundreds or thousands of times slower. And we would not like that.


Almost all developers know that the processor cache is a small but fast memory that stores data from recently accessed memory regions - a short and fairly accurate definition. Nevertheless, knowing the "boring" details of how the cache works is necessary for understanding the factors that affect code performance.

In this article, we will look at a number of examples illustrating various properties of caches and their impact on performance. The examples are in C#; the choice of language and platform does not greatly affect the performance assessment or the final conclusions. Naturally, within reasonable limits: if you choose a language in which reading a value from an array is equivalent to a hash table lookup, you will not get any results suitable for interpretation. Translator's notes are in italics.


Example 1: memory access and performance

How much faster do you think the second loop is than the first?
int[] arr = new int[64 * 1024 * 1024]; // 64M elements (the size used throughout these examples)

// first
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// second
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;


The first loop multiplies every array value by 3; the second multiplies only every sixteenth value. The second loop does only about 6% of the work of the first, yet on modern machines both loops take roughly the same time: 80 ms and 78 ms respectively (on my machine).

The answer is memory access. The speed of these loops is determined primarily by the speed of the memory subsystem, not by the speed of integer multiplication. As we will see in the next example, the number of accesses to RAM is the same in both cases.
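If you want to reproduce the measurement yourself, a sketch of the timing harness might look like this (the 64M-element array is the size assumed throughout these examples; absolute numbers will of course differ from machine to machine):

using System;
using System.Diagnostics;

int[] arr = new int[64 * 1024 * 1024];

var sw = Stopwatch.StartNew();
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;      // touches every element
Console.WriteLine($"Step 1:  {sw.ElapsedMilliseconds} ms");

sw.Restart();
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;  // touches every 16th element
Console.WriteLine($"Step 16: {sw.ElapsedMilliseconds} ms");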

Example 2: Impact of Cache Lines

Let's dig deeper - let's try other step values, not just 1 and 16:
for (int i = 0; i < arr.Length; i += K /* step */) arr[i] *= 3;

Here is the running time of this loop for various values of the step K:

Note that with step values from 1 to 16 the running time practically does not change. But with values greater than 16, the running time roughly halves every time we double the step. This does not mean that the loop somehow magically starts running faster; it is just that the number of iterations also decreases. The key point is the identical running time for step values from 1 to 16.

The reason for this is that modern processors access memory not byte by byte but in small blocks called cache lines. Typically the line size is 64 bytes. When you read any value from memory, at least one whole cache line is brought into the cache. Subsequent accesses to any value in that line are very fast.

Because 16 int values occupy 64 bytes, loops with steps from 1 to 16 touch the same number of cache lines - more precisely, all the cache lines of the array. With a step of 32, every second line is touched; with a step of 64, every fourth.

Understanding this is very important for some optimization techniques. The number of memory accesses an operation needs depends on how the data is laid out in memory. For example, misaligned data may require two RAM accesses instead of one, and, as we found out above, the work will then run twice as slowly.
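A sketch of the sweep that could produce the chart mentioned above (the specific set of step values is my assumption; the article only shows the resulting graph):

using System;
using System.Diagnostics;

int[] arr = new int[64 * 1024 * 1024];

foreach (int k in new[] { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 })
{
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < arr.Length; i += k)
        arr[i] *= 3;
    Console.WriteLine($"K = {k,4}: {sw.ElapsedMilliseconds} ms");
}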

Example 3: First and Second Level Caches (L1 and L2) Sizes

Modern processors typically have two or three levels of caches, commonly referred to as L1, L2 and L3. You can use the CoreInfo utility or the GetLogicalProcessorInformation Windows API function to find out the cache sizes at each level. Both methods also report the cache line size for each level.

On my machine, CoreInfo reports 32KB L1 data caches, 32KB L1 instruction caches, and 4MB L2 data caches. Each core has its own personal L1 caches, L2 caches are common for each pair of cores:

Logical Processor to Cache Map:
*--- Data Cache        0, Level 1, 32 KB, Assoc 8, LineSize 64
*--- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
-*-- Data Cache        1, Level 1, 32 KB, Assoc 8, LineSize 64
-*-- Instruction Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
**-- Unified Cache     0, Level 2,  4 MB, Assoc 16, LineSize 64
--*- Data Cache        2, Level 1, 32 KB, Assoc 8, LineSize 64
--*- Instruction Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
---* Data Cache        3, Level 1, 32 KB, Assoc 8, LineSize 64
---* Instruction Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
--** Unified Cache     1, Level 2,  4 MB, Assoc 16, LineSize 64
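If you prefer to query these parameters from managed code, one possibility is WMI (a sketch of mine, not part of the original article; it assumes Windows and a reference to the System.Management package):

using System;
using System.Management;   // the System.Management NuGet package on modern .NET

var searcher = new ManagementObjectSearcher("SELECT * FROM Win32_CacheMemory");
foreach (ManagementObject cache in searcher.Get())
{
    object level  = cache["Level"];          // 3 = L1, 4 = L2, 5 = L3 in this WMI class
    object sizeKb = cache["InstalledSize"];  // reported in KB
    object assoc  = cache["Associativity"];
    Console.WriteLine($"Level {level}: {sizeKb} KB, associativity code {assoc}");
}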
Let's check this information experimentally. To do so, we will walk through an array, incrementing every 16th value - a simple way to modify every cache line. When we reach the end, we start over from the beginning. We will test different array sizes; we should see drops in performance when the array stops fitting into caches of the different levels.

The code is like this:

int steps = 64 * 1024 * 1024; // number of iterations
int lengthMod = arr.Length - 1; // the array size is a power of two

for (int i = 0; i < steps; i++)
{
    // x & lengthMod == x % arr.Length, because the size is a power of two
    arr[(i * 16) & lengthMod]++;
}


Test results:

On my machine, noticeable drops in performance occur after 32 KB and after 4 MB - the sizes of the L1 and L2 caches.
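A possible driver for this experiment (my sketch; the particular set of array sizes is an assumption - each size must be a power of two so that the bit-mask trick above works):

using System;
using System.Diagnostics;

int steps = 64 * 1024 * 1024;   // the same fixed number of iterations for every size

// Array sizes from 16 KB to 64 MB, powers of two so that (x & lengthMod) == x % arr.Length.
for (int sizeKb = 16; sizeKb <= 64 * 1024; sizeKb *= 2)
{
    int[] arr = new int[sizeKb * 1024 / sizeof(int)];
    int lengthMod = arr.Length - 1;

    var sw = Stopwatch.StartNew();
    for (int i = 0; i < steps; i++)
        arr[(i * 16) & lengthMod]++;   // step of 16 ints: a new cache line on every iteration

    Console.WriteLine($"{sizeKb,6} KB: {sw.ElapsedMilliseconds} ms");
}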

Example 4: Instruction Parallelism

Now let's look at something else. Which of the two loops do you think will run faster?
int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// first
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// second
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }


It turns out that the second loop runs almost twice as fast, at least on all the machines I have tested. Why? Because the statements inside the loops have different data dependencies. In the first loop both increments target a[0], so each operation has to wait for the previous one to finish:

In the second loop the increments target a[0] and a[1] and do not depend on each other:

The functional units of modern processors can execute a certain number of such operations simultaneously - usually not a very large number. For example, two parallel accesses to the L1 cache at different addresses are possible, as is the simultaneous execution of two simple arithmetic instructions. In the first loop the processor cannot exploit these capabilities, but in the second it can.
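The same effect can be exploited deliberately, for example by splitting a long summation into two independent accumulators (my own sketch, not from the original article):

using System;

int[] data = new int[64 * 1024 * 1024];

// Dependent version: each addition has to wait for the previous one to finish.
long sum = 0;
for (int i = 0; i < data.Length; i++)
    sum += data[i];

// Two independent dependency chains: the processor can execute both additions
// of an iteration in parallel; the partial sums are combined once at the end.
// (The array length is even, so no trailing element is missed.)
long s0 = 0, s1 = 0;
for (int i = 0; i < data.Length; i += 2)
{
    s0 += data[i];
    s1 += data[i + 1];
}
Console.WriteLine(sum == s0 + s1);   // both versions compute the same result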

Example 5: Cache Associativity

One of the key questions that needs to be answered when designing a cache is whether data from a certain memory area can be stored in any cache cells or only in some of them. Three possible solutions:
  1. Direct-mapped cache: the data of each cache line in RAM can be stored in only one predefined cache cell. The simplest way to compute the mapping is: line_index % number_of_cache_cells. Two lines mapped to the same cell cannot be in the cache at the same time.
  2. N-way set-associative cache: each line can be stored in N different cache cells. For example, in a 16-way cache a line can be stored in one of the 16 cells that make up its set. Usually, lines whose indices have the same low-order bits share one set.
  3. Fully associative cache: any line can be stored in any cache cell. In its behavior this solution is equivalent to a hash table.
Direct-mapped caches are prone to conflicts: for example, when two lines compete for the same cell, alternately evicting each other from the cache, efficiency drops sharply. Fully associative caches, although free of this drawback, are very complex and expensive to implement. Set-associative caches are the typical trade-off between implementation complexity and efficiency.

For example, on my machine the 4 MB L2 cache is 16-way set associative. All of RAM is divided into sets of lines by the low-order bits of their indices, and the lines of each set compete for one group of 16 cells in the L2 cache.

Since the L2 cache has 65,536 cells (4·2^20 / 64) and each group consists of 16 cells, there are 4,096 groups in total. Thus, the lower 12 bits of a line's index determine which group that line belongs to (2^12 = 4,096). As a result, lines whose addresses differ by a multiple of 262,144 bytes (4,096 · 64) share the same group of 16 cells and compete for a place in it.
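The same arithmetic, written out in code (the constants correspond to the 4 MB, 16-way cache with 64-byte lines described above):

using System;

const int cacheSize = 4 * 1024 * 1024;   // 4 MB L2 cache
const int lineSize  = 64;                // bytes per cache line
const int ways      = 16;                // 16-way set associative

int totalLines = cacheSize / lineSize;   // 65,536 cells (lines) in the cache
int groups     = totalLines / ways;      // 4,096 groups (sets)
int aliasStep  = groups * lineSize;      // 262,144 bytes: addresses this far apart
                                         // land in the same group of 16 cells
Console.WriteLine($"{totalLines} lines, {groups} groups, alias step {aliasStep} bytes");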

For the effects of associativity to work, we need to constantly access a large number of rows from the same group, for example, using the following code:

public static long UpdateEveryKthByte(byte[] arr, int K)
{
    const int rep = 1024 * 1024; // number of iterations

    Stopwatch sw = Stopwatch.StartNew();

    int p = 0;
    for (int i = 0; i < rep; i++)
    {
        arr[p]++;

        p += K;
        if (p >= arr.Length) p = 0;
    }

    sw.Stop();
    return sw.ElapsedMilliseconds;
}


The method increments every K-th element of the array. When it reaches the end, it starts again. After a fairly large number of iterations (2^20) it stops. I made runs for various array sizes and values of the step K. The results are shown below (blue - long running time, white - short):

The blue areas correspond to cases where, with the data being modified constantly, the cache is unable to hold all of the required data at once. Bright blue indicates a running time of about 80 ms, almost white about 10 ms.

Let's deal with the blue areas:

  1. Why do the vertical lines appear? They correspond to step values at which too many lines (more than 16) from one group are accessed. For these values, the 16-way cache on my machine cannot hold all the required data.

    Some of the bad step values are powers of two: 256 and 512. For example, consider a step of 512 and an 8 MB array. With this step there are 32 segments in the array (8·2^20 / 262,144) which compete with each other for the cells of 512 cache groups (262,144 / 512). There are 32 segments, but only 16 cells in the cache for each group, so there is not enough room for all of them.

    Other step values that are not powers of two are simply unlucky, causing many accesses to the same cache groups; they also produce vertical blue lines in the figure. At this point, lovers of number theory are invited to ponder why.

  2. Why do the vertical lines break off at the 4 MB boundary? With an array of 4 MB or smaller, the 16-way cache behaves like a fully associative one, that is, it can accommodate all the data of the array without conflicts: there are no more than 16 regions fighting for any one cache group (262,144 · 16 = 4·2^20 = 4 MB).
  3. Why is there a big blue triangle in the upper left? Because with a small step and a large array the cache cannot fit all the required data. Here the degree of cache associativity plays a secondary role; the limitation is the size of the L2 cache.

    For example, with a 16 MB array and a step of 128, we access every 128th byte, thereby modifying every second cache line of the array. Storing every second line in the cache would take 8 MB, but my machine has only 4 MB.

    Even if the cache were fully associative, it still could not hold 8 MB of data. Note that in the earlier example with a step of 512 and an 8 MB array, we need only 1 MB of cache to store all the required data, but that is impossible because of insufficient cache associativity.

  4. Why does the left side of the triangle gain intensity gradually? The maximum intensity falls at a step value of 64 bytes, which equals the cache line size. As we saw in the first and second examples, sequential accesses to the same line cost almost nothing: with a step of 16 bytes, for instance, we get four memory accesses for the price of one.

    Since the number of iterations is the same in our test for any step value, the cheaper step results in less running time.

The observed effects persist at larger parameter values as well:

Cache associativity is an interesting thing that can show up under certain conditions. Unlike the other problems discussed in this article, it is not so serious. Certainly, this is not something that requires constant attention when writing programs.

Example 6: False Cache Sharing

On multi-core machines you may encounter another problem - maintaining cache coherence. The processor cores have partially or completely separate caches. On my machine the L1 caches are separate (as is usual), and there are two L2 caches, each shared by a pair of cores. The details vary, but in general a modern multi-core processor has a multi-level hierarchy of caches, in which the fastest but smallest caches belong to individual cores.

When one core modifies a value in its cache, the other cores can no longer use the old value; the value in their caches must be updated. Moreover, the entire cache line must be updated, because caches operate on data at the granularity of whole lines.

Let's demonstrate this problem with the following code:

private static int[] s_counter = new int[1024];

private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}


If on my four-core machine I call this method with parameters 0, 1, 2, 3 simultaneously from four threads, then the running time will be 4.3 seconds. But if I call the method with parameters 16, 32, 48, 64, then the running time will be only 0.28 seconds.

Why? In the first case, all four values processed by the threads at any given moment are very likely to end up in the same cache line. Each time one core increments its value, it marks the cache cells holding that value in the other cores as invalid; after that, the other cores have to load the line into their caches again. This renders the caching mechanism useless and kills performance.
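For completeness, here is a sketch of how the four threads could be launched, and of the simple fix: spacing the indices 16 ints (64 bytes, one cache line) apart so that each counter lives on its own line. The thread-launching code is my own; the article describes it only in words.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

int[] counters = new int[1024];

void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
        counters[position] = counters[position] + 3;
}

// Slow: positions 0..3 share a single 64-byte cache line.
var sw = Stopwatch.StartNew();
Task.WaitAll(
    Task.Run(() => UpdateCounter(0)), Task.Run(() => UpdateCounter(1)),
    Task.Run(() => UpdateCounter(2)), Task.Run(() => UpdateCounter(3)));
Console.WriteLine($"Shared line:  {sw.ElapsedMilliseconds} ms");

// Fast: positions 16, 32, 48, 64 each fall on a separate cache line.
sw.Restart();
Task.WaitAll(
    Task.Run(() => UpdateCounter(16)), Task.Run(() => UpdateCounter(32)),
    Task.Run(() => UpdateCounter(48)), Task.Run(() => UpdateCounter(64)));
Console.WriteLine($"Padded lines: {sw.ElapsedMilliseconds} ms");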

Example 7: hardware complexity

Even now, when the principles of cache operation are not a secret for you, hardware will still surprise you. Processors differ from each other in optimization methods, heuristics, and other subtleties of implementation.

The L1 cache of some processors can access two cells in parallel if they belong to different groups, but only sequentially if they belong to the same group. As far as I know, some can even access different quarters of the same cell in parallel.

Processors can also surprise you with clever optimizations. For example, the code from the previous example about false sharing does not work as intended on my home computer: in the simplest cases the processor optimizes the work and reduces the negative effects. Modify the code slightly, and everything falls into place.

Here is another example of strange hardware quirks:

private static int A, B, C, D, E, F, G;

private static void Weirdness()
{
    for (int i = 0; i < 200000000; i++)
    {
        <some code>
    }
}


If you substitute three different variants for <some code>, you can get the following results:

Incrementing the fields A, B, C, D takes longer than incrementing the fields A, C, E, G. Stranger still, incrementing the fields A and C takes longer than incrementing the fields A, C and E, G. I do not know exactly what causes this, but perhaps it is related to memory banks (yes, ordinary memory banks, not the three-liter jars people keep their savings in). If you have any thoughts on this, please share them in the comments.

On my machine the above is not observed; however, abnormally bad results do sometimes occur - most likely the task scheduler is making its own "adjustments".

The lesson of this example is that it is very difficult to fully predict the behavior of hardware. Yes, you can predict a lot, but you need to continually confirm your predictions with measurements and testing.

Conclusion

I hope that everything discussed has helped you understand the structure of processor caches. Now you can put what you learned into practice to optimize your code.

The chips on most modern desktops have four cores, but chip makers have already announced plans to move to six cores, and 16-core processors are far from uncommon for high-end servers today.

The more cores there are, the harder it becomes to distribute memory among them all while they work together. As the number of cores grows, it becomes more and more profitable to minimize the time lost on managing the cores during data processing, because the data exchange rate lags behind both the processor speed and the speed of processing data in memory. You can physically reach into someone else's fast cache, or you can use your own slower one and save on data transfer time. The task is complicated by the fact that the amount of memory that programs request does not map neatly onto the amounts of each type of cache memory.

Physically, only a very limited amount of memory can be placed as close to the processor as possible - the L1 cache, whose capacity is tiny. Daniel Sanchez, Po-An Tsai and Nathan Beckmann, researchers at MIT's Computer Science and Artificial Intelligence Laboratory, have taught a computer to configure its different types of memory into a flexible hierarchy for the programs being run, in real time. Their new system, called Jenga, analyzes how much memory programs need and how often they access it, and redistributes the capacity of each of the three types of processor cache into combinations that provide greater efficiency and energy savings.


To begin with, the researchers measured the performance gains from combining static and dynamic memory while running programs on a single-core processor, and obtained a first hierarchy of when it is better to use which combination - two types of memory or just one. Two parameters were evaluated: signal delay (latency) and the energy consumed while each program ran. Roughly 40% of the programs ran worse with a combination of memory types, the rest ran better. Having recorded which programs "like" mixed performance and which like sheer memory size, the researchers built their Jenga system.

They virtually tested 4 kinds of programs on a virtual machine with 36 cores. Programs tested:

  • omnet - Objective Modular Network Testbed, a C++ simulation library and network simulator platform (blue in the figure)
  • mcf - Meta Content Framework (red color)
  • astar - Virtual Reality Display Software (Green)
  • bzip2 - archiver (purple)


The picture shows where and how the data of each of the programs was processed. The letters show where each application runs (one per quadrant), the colors show where its data resides, and the shading indicates the second level of the virtual hierarchy when present.

Cache levels

The CPU cache is divided into several levels - up to three for general-purpose processors. The fastest is the first-level cache, the L1 cache, since it is located on the same chip as the processor. It consists of an instruction cache and a data cache. Some processors cannot function without an L1 cache. The L1 cache operates at the processor frequency and can be accessed every clock cycle; it is often able to perform several read/write operations at the same time. Its capacity is usually small - no more than 128 KB.

The L1 cache works together with the second-level cache, L2. It is the second fastest. It is usually located either on the chip, like L1, or in close proximity to the core, for example in the processor cartridge; in older processors it sat in the chipset on the motherboard. L2 cache capacity ranges from 128 KB to 12 MB. In modern multi-core processors, the second-level cache on the same die is per-core memory: with a total cache size of 8 MB, each core gets 2 MB. Typically, the latency of an L2 cache located on the core die is 8 to 20 core clock cycles. In tasks with numerous accesses to a limited memory area, for example a DBMS, making full use of it can give a tenfold increase in performance.

The L3 cache is usually even larger, though somewhat slower than L2 (because the bus between L2 and L3 is narrower than the bus between L1 and L2). L3 is usually located separately from the CPU core but can be large - more than 32 MB. The L3 cache is slower than the previous levels, yet still faster than RAM. In multiprocessor systems it is shared. Using a third-level cache is justified only in a rather narrow range of tasks, and it may not only fail to improve performance but, on the contrary, cause an overall drop in system performance.

Disabling the second- and third-level caches is most useful in mathematical problems where the amount of data is smaller than the cache. In that case you can load all the data into the L1 cache at once and then process it.


From time to time, Jenga reconfigures virtual hierarchies at the OS level to minimize the amount of data exchange, taking into account resource constraints and application behavior. Each reconfiguration consists of four steps.

Jenga distributes data not only according to which programs are running - those that prefer a large single-speed memory or those that prefer the speed of mixed caches - but also according to the physical proximity of the memory cells to the data being processed, regardless of which type of cache the program requests by default or by hierarchy. The main goal is to minimize signal delay and power consumption. Depending on how many types of memory a program "likes", Jenga models the latency of each virtual hierarchy with one or two levels: two-level hierarchies form a surface, one-level hierarchies a curve. Jenga then projects the minimum delay onto the VL1 dimension, which yields two curves. Finally, Jenga uses these curves to select the best hierarchy (i.e. the VL1 size).
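As a very rough toy illustration of that last selection step (the latency formulas below are invented purely to show the idea of picking the hierarchy with the minimum modeled delay; the real Jenga models are far more detailed):

using System;
using System.Linq;

// Hypothetical candidate sizes for the first virtual level (VL1), in MB.
int[] vl1Sizes = { 1, 2, 4, 8, 16 };

// Invented latency models: total delay as a function of VL1 size
// for a one-level and a two-level virtual hierarchy.
double OneLevel(int mb)  => 100.0 / mb + 2.0 * mb;
double TwoLevels(int mb) => 80.0 / mb + 3.0 * mb + 5.0;

// Pick the configuration with the minimum modeled latency.
var best = vl1Sizes
    .SelectMany(mb => new[] { (mb, levels: 1, lat: OneLevel(mb)),
                              (mb, levels: 2, lat: TwoLevels(mb)) })
    .OrderBy(c => c.lat)
    .First();

Console.WriteLine($"Best hierarchy: {best.levels} level(s), VL1 = {best.mb} MB");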

The use of Jenga produced a tangible effect: the 36-core virtual chip ran 30 percent faster and used 85 percent less power. Of course, Jenga is so far only a simulation of a working computer, and it will be some time before you see real examples of this cache, and longer still before chip manufacturers adopt it - if they like the technology.

Configuration of the simulated 36-core machine

  • Processors. 36 cores, x86-64 ISA, 2.4 GHz, Silvermont-like OOO: 8B-wide
    ifetch; 2-level bpred with 512x10-bit BHSRs + 1024x2-bit PHT, 2-way decode/issue/rename/commit, 32-entry IQ and ROB, 10-entry LQ, 16-entry SQ; 371pJ/instruction, 163mW/core static power
  • L1 caches. 32 KB, 8-way set-associative, split data and instruction caches,
    3-cycle latency; 15/33 pJ per hit/miss
  • Prefetchers. 16-entry stream prefetchers modeled after and validated against
    Nehalem
  • L2 caches. 128 KB private per-core, 8-way set-associative, inclusive, 6-cycle latency; 46/93 pJ per hit/miss
  • Coherence. 16-way, 6-cycle latency directory banks for Jenga; in-cache L3 directories for others
  • Global NOC. 6×6 mesh, 128-bit flits and links, X-Y routing, 2-cycle pipelined routers, 1-cycle links; 63/71pJ per router/link flit traversal, 12/4mW router/link static power
  • Static memory (SRAM) banks. 18 MB, one 512 KB bank per tile, 4-way 52-candidate zcache, 9-cycle bank latency, Vantage partitioning; 240/500 pJ per hit/miss, 28 mW/bank static power
  • Stacked DRAM (multilayer dynamic memory). 1152 MB, one 128 MB vault per 4 tiles, Alloy with MAP-I DDR3-3200 (1600 MHz), 128-bit bus, 16 ranks, 8 banks/rank, 2 KB row buffer; 4.4/6.2 nJ per hit/miss, 88 mW/vault static power
  • main memory. 4 DDR3-1600 channels, 64-bit bus, 2 ranks/channel, 8 banks/rank, 8 KB row buffer; 20 nJ/access, 4W static power
  • DRAM timings. tCAS=8, tRCD=8, tRTP=4, tRAS=24, tRP=8, tRRD=4, tWTR=4, tWR=8, tFAW=18 (all timings in tCK; stacked DRAM has half the tCK as main memory )

Cache is memory built into the processor into which the most frequently used data (and instructions) from RAM are written; it significantly speeds up operation.

L1 cache size (from 8 to 128 KB)
The amount of cache memory in the first level.
L1 cache is a block of high-speed memory located directly on the processor core.
It copies data retrieved from RAM.

Saving the basic commands allows you to increase the performance of the processor due to the higher speed of data processing (processing from the cache is faster than from RAM).

The capacity of the cache memory of the first level is small and is calculated in kilobytes.
Typically, higher-end processor models have a larger L1 cache.
For multi-core models, the amount of L1 cache for one core is indicated.

L2 cache size (from 128 to 12288 KB)
The amount of cache memory in the second level.
The L2 cache is a block of high-speed memory that performs the same functions as the L1 cache (see "L1 cache size"), but with a slower speed and a larger volume.

If you choose a processor for resource-intensive tasks, then a model with a large amount of L2 cache will be preferable.
For multi-core processors, the total amount of L2 cache is indicated.

L3 cache size (from 0 to 16384 KB)
The amount of cache memory in the third level.
The integrated L3 cache, combined with a fast system bus, forms a high-speed data link to the system memory.

As a rule, only CPUs for server solutions or special editions of "desktop" processors are equipped with a third-level cache.

L3 cache is available, for example, in such processor lines as Intel Pentium 4 Extreme Edition, Xeon DP, Itanium 2, Xeon MP and others.
