So you don't have to search for long

Quote
“Limited PCI bursting with x86 architecture”
Reads originating from an x86 (Intel) architecture CPU and destined for a PCI slave device
(such as the PLX 9050) will never be burst. This is an x86 IA-32 limitation. By their nature,
reads from a PCI slave device are to uncached memory. (The PCI device is mapped into
uncached memory.) As a result, reads are blocking (another read can’t emerge from the
CPU until the earlier read is completed). Consequently, the largest
“burst” of data is confined to the largest “single” read that the x86 CPU can perform
to uncached memory. Currently, this is a 64-bit read, done using the instruction MOVQ
r64, mem. So, the largest read burst that you can get by using an x86 CPU to read
a PCI slave is a 64-bit, or “two” data phase, burst. (A pretty darn short burst, since it
is not even a full cache line.)

Tricks can be played that could allow a faster read, but
they are involved and error-prone. For instance, the PCI slave device’s memory could
be mapped as cacheable (write-through mode would likely be best). Then, when the PCI
slave device is read by the CPU execution unit, the cache unit would pull in a whole
cache line at a time. This could be done for consecutive addresses. The result would be
“bursting” of x86 cache-line size (64 bytes per line), or 16 PCI data transfer clocks. (Not
that fantastic either, but better.) Unfortunately, you then have to flush the CPU cache
before transferring again from the same address. (That is usually detrimental to system
performance, so no one does it.)

In summary, the x86 will block on a read to uncached
memory. This limits the “burst” size to the size of the largest single “read” transaction
to uncached memory, which is 64 bits: 8 bytes, 2 PCI data clocks, with all byte enables asserted. If
you play games with the cache and mark the memory as write-through, then you can
get 64-byte reads, but the cost is flushing the entire CPU cache whenever reads from the
same addresses will repeat. This is so expensive that it defeats the gains realized in the larger bursts, so
it usually is not done. (Plus, getting an OS interface that allows the WBINVD cache-flush
instruction to be executed is a hassle.)

As to burst writes, chipset (Intel 440BX and VIA
MVP3, for example) architectures won’t burst more than 4 LWords at a time out of main
memory to a PCI target device. This is because uncached writes coming out of the CPU
don’t gather in the chipset’s posted write buffers in groups larger than 4 LWords at
a time, so longer bursts cannot form. You won’t get greater bursts, no matter how hard you
try.

You can get limited bursting if you work at it. The recommended combination is:

1. Map the device memory as USWC (write combined) if possible. NT has functions that
   allow for this, and some other OSes do as well. This facility reprograms the x86 architecture
   MTRR registers to give a USWC format to some memory. Some video drivers call this
   interface; try looking at a video driver sample in the NT DDK.
2. Use a chipset with lots of posted write buffers, and one which can combine sequential
   writes to linearly sequential addresses into a single burst. A chipset that can combine
   sequential partial writes into a single PCI burst is the best.
3. Use the MOVQ mem, r64 instruction in a tight loop rather than the REP MOVSD
   instruction. The former at least outputs 64-bit partial writes; the latter only does
   32-bit partial writes.

Write performance with bursts of 4 LWords originating from an x86 architecture CPU is
approximately 50 MB/s. High-performance devices such as the 9054 are set up as bus masters
so they can master their own traffic with longer bursts. The 9054 can achieve transfer
performance of 122 MB/s.
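
The data-phase counts in the quote can be checked with a little arithmetic. A minimal sketch in C, assuming a 32-bit (4-byte wide) PCI bus as the quote does; the function name pci_data_phases is made up for illustration and is not part of any PLX or OS API:

```c
#include <assert.h>
#include <stddef.h>

/* Number of PCI data phases needed to move `bytes` over a bus that is
 * `bus_bytes` wide. Assumption: bus_bytes is 4 for the 32-bit PCI bus
 * discussed in the quote. */
static size_t pci_data_phases(size_t bytes, size_t bus_bytes)
{
    assert(bus_bytes > 0);
    /* Round up: a partial word still costs one full data phase. */
    return (bytes + bus_bytes - 1) / bus_bytes;
}
```

With a 4-byte bus this gives 2 data phases for an 8-byte (64-bit) read, 16 for a 64-byte cache line, and 4 for a 4-LWord (16-byte) write burst, which matches the figures in the quote.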
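
The advice to write in 64-bit pieces from a tight loop can be approximated in C without inline assembly. This is only a sketch under assumptions: write_window_64 is a hypothetical helper, dst is assumed to point at device memory that the OS has already mapped (ideally as USWC), and the mapping step itself is not shown. On a 64-bit build each volatile uint64_t store is normally emitted as one 64-bit move; a 32-bit build would likely split it into two 32-bit stores, which is the case the MOVQ advice in the quote is aimed at.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `count` 64-bit words into a device window, one 64-bit store per word.
 * Assumption: `dst` is an 8-byte-aligned pointer into an already mapped
 * PCI memory window (hypothetical; obtaining it is OS-specific). */
static void write_window_64(volatile uint64_t *dst,
                            const uint64_t *src,
                            size_t count)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < count; i++) {
        /* volatile keeps the compiler from dropping or merging the stores */
        dst[i] = src[i];
    }
}
```

Whether consecutive stores actually leave the chipset as one PCI burst still depends on the memory type and on the chipset's posted write buffers, as the quote explains; the loop only makes sure the CPU side emits the widest stores it can.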