
RSS Implementation Options



The hash function used by Receive-Side Scaling (RSS) is cryptographically secure, so the hash algorithm can require significant resources if it is run on the host CPU. Ideally, RSS should not be implemented on top of a network adapter that cannot generate the RSS hash.

If a network adapter can support the RSS hash, many implementation options are possible. The RSS implementation options described below incorporate hardware-specific tradeoffs, host parallelization tradeoffs, and limitations imposed by the host system implementation (specifically, whether the Message Signaled Interrupts extension (MSI-X) Engineering Change Notice for Peripheral Component Interconnect (PCI) 2.3 is supported). The options differ in their price and performance tradeoffs, which include:

· Where the RSS load-balancing indirection table lookup occurs.

· The number of receive descriptor queues that exist between the network adapter hardware and the miniport adapter.

· The number of simultaneous outstanding message signaled interrupts that are supported by the network adapter. This enables multiple interrupts to be serviced on multiple CPUs at the same time.

The following RSS implementation options assume the presence of a queue between the network adapter and the miniport adapter. The queue contains “receive descriptors” and is referred to as the receive descriptor queue. Receive descriptors are implementation dependent, but presumably must contain a mechanism for the miniport adapter to find the received packets, as well as additional information necessary to pass the packets to NDIS. For RSS, an additional requirement is imposed: the receive descriptor must contain the 32-bit hash value so that the NDIS receive indication can include the 32-bit hash result for the packet.
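Receive descriptors are adapter specific, but a purely illustrative layout that satisfies the requirements above might look like the following (the structure and field names are assumptions made for this sketch, not part of any NDIS-defined format):

    #include <ndis.h>

    // Illustrative receive descriptor. The only field that RSS itself
    // requires is the 32-bit hash value computed by the network adapter;
    // the remaining fields stand in for whatever the hardware needs.
    typedef struct _RX_DESCRIPTOR {
        UINT64 PacketBufferPa;   // DMA address of the received frame
        UINT32 PacketLength;     // length of the received frame, in bytes
        UINT32 HashValue;        // 32-bit RSS hash for this packet
        UINT32 StatusFlags;      // checksum/offload status, error bits, etc.
    } RX_DESCRIPTOR, *PRX_DESCRIPTOR;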

Option 1a and option 2 assume that an atomic counter is maintained in the miniport adapter, which represents the number of launched deferred procedure calls (DPCs). One way to implement the atomic counter is to first initialize the variable according to the number of DPCs that are about to be launched. Then, when the DPCs are ready to exit, the parallel DPCs should decrement the value with the NdisInterlockedDecrement function. Finally, just before the last DPC returns (that is, when NdisInterlockedDecrement causes the atomic counter’s value to decrement to zero), the DPC reenables network adapter interrupts.

This may not be the most efficient way to manage interrupt moderation. More efficient mechanisms are left to vendor innovation.
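As a minimal sketch of the exit path described above, assuming a hypothetical per-adapter context and a hypothetical hardware helper that re-enables the receive interrupt (only NdisInterlockedDecrement is a real NDIS call here):

    #include <ndis.h>

    // Hypothetical per-adapter context; only the field used by this sketch is shown.
    typedef struct _ADAPTER_CONTEXT {
        LONG OutstandingDpcCount;   // number of parallel receive DPCs still running
        /* ... hardware state, register mappings, receive queues ... */
    } ADAPTER_CONTEXT, *PADAPTER_CONTEXT;

    // Hypothetical, hardware-specific helper that re-enables the receive interrupt.
    VOID HwEnableReceiveInterrupt(PADAPTER_CONTEXT Adapter);

    // Called at the end of every parallel receive DPC.
    VOID RssDpcExit(PADAPTER_CONTEXT Adapter)
    {
        // NdisInterlockedDecrement returns the decremented value, so only the
        // last DPC to finish sees zero and re-enables the adapter interrupt.
        if (NdisInterlockedDecrement(&Adapter->OutstandingDpcCount) == 0) {
            HwEnableReceiveInterrupt(Adapter);
        }
    }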

Some implementations may include an additional concept, referred to as a CPU vector in this paper. A CPU vector might be needed in cases where there are multiple receive descriptor queues to be processed on multiple CPUs. The CPU vector is set by the network adapter to track which CPUs should have a DPC scheduled to process newly arrived receive descriptors. Before the network adapter fires an interrupt to cause the miniport adapter to process received packets, it uses a direct memory access (DMA) write to move the CPU vector from the network adapter into host memory. The interrupt service routine then reads from the host memory location and uses the CPU vector to schedule DPCs.
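One possible representation of the CPU vector is a simple bitmask in adapter-writable host memory; the structure below is an assumption made for illustration, not a defined format:

    // Hypothetical host-memory block that the adapter fills with a DMA write
    // just before it fires the interrupt. Bit n set in CpuVector means CPU n
    // has newly queued receive descriptors and needs a DPC scheduled.
    typedef struct _CPU_VECTOR_BLOCK {
        volatile ULONG CpuVector;   // one bit per CPU (32 CPUs in this sketch)
    } CPU_VECTOR_BLOCK, *PCPU_VECTOR_BLOCK;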

RSS implementation options:

· Option 1—Multiple Receive Descriptor Queues. The network adapter computes the RSS hash and uses the indirection table to find the CPU for the packets. The network adapter supports a receive descriptor queue for each CPU. In Figure 1 earlier in this paper, "Line A" represents option 1.

· Option 1a—Single Interrupt (using either line-based interrupts or message signaled interrupts). The network adapter can initiate only a single interrupt at a time. The miniport adapter uses the CPU vector concept defined above; a sketch of the resulting ISR appears after the numbered steps below.

1. As packets arrive, the network adapter queues receive descriptors directly into the correct CPU’s receive descriptor queue.

2. The network adapter delays the interrupt according to its vendor-specific interrupt moderation scheme, but it maintains a local vector of CPUs that need a scheduled DPC to process their receive descriptor queue.

3. Just before the network adapter fires the interrupt, the network adapter uses a Direct Memory Access (DMA) write to move the vector of CPUs into host memory.

4. The network adapter fires the interrupt, which causes the ISR to run.

5. The ISR uses the CPU vector in host memory to set an atomic counter according to the number of DPCs it is going to launch.

6. The ISR launches a DPC for each CPU in the CPU vector.

7. Each DPC processes the receive descriptor queue assigned to the CPU it is currently running on.

8. When each DPC finishes processing its receive descriptor queue, it decrements the atomic counter.

9. The last DPC to decrement the counter reenables the interrupt.
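The following sketch ties steps 5 and 6 together. KeInsertQueueDpc is a standard kernel routine; everything else (the parameter names, the 32-CPU limit, and the assumption that each KDPC was initialized with KeInitializeDpc and bound to its CPU with KeSetTargetProcessorDpc at adapter start) is illustrative:

    #include <ntddk.h>

    VOID RssIsrScheduleDpcs(
        ULONG CpuVector,           // CPU bitmask the adapter DMA'd into host memory
        PKDPC PerCpuDpc,           // array of per-CPU DPC objects, one per possible CPU
        PLONG OutstandingDpcCount  // shared atomic counter decremented by each DPC on exit
        )
    {
        LONG  dpcCount = 0;
        ULONG cpu;

        // Step 5: set the atomic counter before queuing any DPC so that it
        // cannot reach zero until every launched DPC has run to completion.
        for (cpu = 0; cpu < 32; cpu++) {
            if (CpuVector & (1UL << cpu)) {
                dpcCount++;
            }
        }
        *OutstandingDpcCount = dpcCount;

        // Step 6: queue a DPC on each CPU named in the vector. Queuing a DPC
        // that targets a remote CPU causes an interprocessor interrupt.
        for (cpu = 0; cpu < 32; cpu++) {
            if (CpuVector & (1UL << cpu)) {
                KeInsertQueueDpc(&PerCpuDpc[cpu], NULL, NULL);
            }
        }
    }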

· Option 1b—Multiple Interrupts. As in option 1a, the network adapter computes the RSS hash, masks the result, locates the destination processor, adds the BaseCPUNumber (this lookup is sketched in code after this option), and uses a DMA write to move a receive descriptor to the correct queue. However, the network adapter uses message signaled interrupts (MSI-X, which enables support for multiple interrupts from a single I/O device to multiple CPUs) to directly interrupt the processors whose receive descriptor queues are not empty. The ISR that is triggered on each processor then initiates a single DPC on the associated processor to process the receive descriptors for that processor. Thus there is no need for an atomic counter. When a DPC exits, it reenables the interrupt for that processor. For more information about MSI-X, visit the PCI-SIG website at http://www.pcisig.com/home.
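The lookup described above reduces to a masked table index. As a host-side sketch of what the adapter does in hardware (the names and the power-of-two table-size assumption are illustrative):

    #include <ndis.h>

    // Mask the 32-bit hash down to the indirection table size, read the
    // relative processor number from the table, then add BaseCPUNumber.
    // IndirectionTableSize is assumed to be a power of two.
    ULONG RssSelectCpu(
        UINT32       HashValue,            // 32-bit RSS hash for the packet
        const UCHAR *IndirectionTable,     // indirection table supplied by the host stack
        ULONG        IndirectionTableSize,
        ULONG        BaseCpuNumber
        )
    {
        ULONG index = HashValue & (IndirectionTableSize - 1);
        return (ULONG)IndirectionTable[index] + BaseCpuNumber;
    }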

· Option 2—Single Receive Descriptor Queue. The network adapter computes only the RSS hash, and it has a single receive descriptor queue that is linked to the miniport adapter. The miniport adapter supports a primary DPC and a secondary DPC. The primary DPC maintains a receive descriptor queue for each secondary DPC. In Figure 1 earlier in this paper, "Line B" represents option 2. The following enumerated list matches the list shown in option 1a, but is edited to reflect the changes associated with option 2; a sketch of the primary DPC's distribution loop appears after the list.

1. As packets arrive, the network adapter queues receive descriptors into the receive descriptor queue.

2. The network adapter delays the interrupt according to its vendor-specific interrupt moderation scheme.

3. Deleted. This step is not applicable to option 2.

4. The network adapter fires the interrupt, which causes the interrupt service routine to run, which causes the primary DPC to be scheduled.

5. This operation is now significantly more complex:

§ The primary DPC performs the following operations for each receive descriptor:

· Retrieves the 32-bit hash value from the receive descriptor.

· Masks the hash value to the number of bits set in the configuration OID.

· Uses the masked hash value to index into the indirection table and locate the CPU for this packet.

· Copies the receive descriptor to the appropriate secondary DPC receive descriptor queue.

· Maintains a CPU vector of secondary DPCs to schedule.

§ The primary DPC uses the CPU vector to set an atomic counter according to the number of DPCs it is going to launch.

6. The primary DPC launches a secondary DPC for each CPU in the vector (except for the CPU it is executing on), and it then processes the receive descriptor queue assigned to the CPU it is currently running on.

7. When each secondary DPC finishes processing its receive descriptor queue, it decrements the atomic counter.

8. The last secondary DPC to decrement the counter reenables the interrupt.
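As a rough sketch of the per-descriptor work in step 5 of option 2, reusing the hypothetical RX_DESCRIPTOR and RssSelectCpu sketches above (CopyToSecondaryQueue is an illustrative placeholder for miniport-specific queueing code):

    // Hypothetical placeholder: copy one descriptor to the given CPU's
    // secondary-DPC receive descriptor queue.
    VOID CopyToSecondaryQueue(ULONG Cpu, PRX_DESCRIPTOR Descriptor);

    // Primary-DPC distribution loop: map each descriptor's hash to a CPU in
    // software and move the descriptor to that CPU's secondary queue.
    // Returns the CPU vector of secondary DPCs that must be scheduled.
    ULONG RssPrimaryDpcDistribute(
        PRX_DESCRIPTOR Descriptors,        // descriptors taken from the single hardware queue
        ULONG          DescriptorCount,
        const UCHAR   *IndirectionTable,
        ULONG          IndirectionTableSize,
        ULONG          BaseCpuNumber
        )
    {
        ULONG cpuVector = 0;   // bitmask of CPUs whose secondary DPC must run
        ULONG i;

        for (i = 0; i < DescriptorCount; i++) {
            // Retrieve the 32-bit hash, mask it, and index the indirection
            // table to find the CPU for this packet.
            ULONG cpu = RssSelectCpu(Descriptors[i].HashValue,
                                     IndirectionTable,
                                     IndirectionTableSize,
                                     BaseCpuNumber);

            CopyToSecondaryQueue(cpu, &Descriptors[i]);
            cpuVector |= 1UL << cpu;   // remember which secondary DPCs to schedule
        }

        // The primary DPC then sets the atomic counter from this vector and
        // queues the secondary DPCs, as in the option 1a sketch above.
        return cpuVector;
    }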

Option 1a assumes the costs of implementing the hash computation, processor mapping, and multiple receive descriptor queues in the network adapter are acceptable, but requires a single interrupt service routine to queue multiple DPCs on various processors (which in turn requires an interprocessor interrupt). If interrupts are not enabled until after all DPCs have finished processing, there is also additional overhead because of the atomic counter, and there is a potential for delayed receive processing if one receive descriptor queue is lightly loaded and another is heavily loaded.

Option 1b provides the highest level of parallelization. Because interrupts are enabled independently for each CPU, under some workloads it might also be more responsive because it avoids some head-of-queue blocking issues. However, since option 1b requires MSI-X functionality, a feature that is just starting to be supported by host systems, implementations that support option 1b and MSI-X should also support option 1a for systems that do not support MSI-X.

Option 2 is the most cost-effective implementation in terms of network adapter costs, but it implements much of the RSS algorithm in software, thereby increasing the load on the processor that hosts the primary DPC. The increased load comes from mapping the hash value to a processor in software and then copying the receive descriptor into the appropriate processor's queue. Option 2 also incurs synchronization costs to ensure that the interrupt is reenabled only after all DPCs have finished their packet processing; this is also true of option 1a.






