NDIS 6.0 RSS vs. Current Receive Processing

NDIS 6.0 RSS Advantages

NDIS 6.0 resolves single-CPU processing issues by implementing Receive-Side Scaling (RSS). RSS is a Microsoft Scalable Networking initiative technology that enables receive processing to be balanced across multiple processors in the system while maintaining in-order delivery of the data. RSS enables parallel DPCs and, if the computer and network adapter support it, multiple interrupts.

RSS provides the following benefits:

· Parallel Execution. Receive packets from a single network adapter can be processed concurrently on multiple CPUs, while preserving in-order delivery.

· Dynamic Load Balancing. As system load on the host varies, RSS will rebalance the network processing load between the processors.

· Cache Locality. Because packets from a single connection are always mapped to a specific processor, the state for that connection does not have to move from one processor’s cache to another processor’s cache. This avoids cache thrashing and improves performance.

· Send Side Scaling. Transmission Control Protocol (TCP) is often limited in how much data it can send to the remote peer. The reasons can include the TCP congestion window, the size of the advertised receive window, or TCP slow-start. When an application tries to send a buffer larger than the advertised receive window, TCP sends part of the data and then waits for an acknowledgement before sending the rest. When the TCP acknowledgement arrives, additional data is sent in the context of the DPC in which the acknowledgement is indicated. Thus, scaled receive processing can also result in scaled transmit processing.

· Secure Hash. The default generated RSS signature is cryptographically secure, making it much more difficult for malicious remote hosts to force the system into an unbalanced state.

Receive-Side Scaling (RSS) Algorithm

This section defines the RSS algorithm and contrasts it with the current NDIS 5.1 packet processing algorithm. In general, RSS enables packets from a single network adapter to be processed in parallel on multiple CPUs while preserving in-order delivery to TCP connections.

NDIS 6.0 RSS vs. Current Receive Processing

The current NDIS 5.1 architecture for processing incoming packets, supported by the Microsoft Windows Server™ 2003 operating system, is typically implemented by a network adapter vendor by leveraging a receive descriptor queue between the network adapter and the miniport adapter to pass per-packet information. The packets are processed in the following sequence:

1. At the network adapter, as packets arrive off the wire, the packet contents are transferred into host memory using Direct Memory Access (DMA), and a receive descriptor is transferred into the receive descriptor queue (again through DMA). An interrupt will eventually be posted to the host to indicate that new data is present. Exactly when the interrupt fires depends on the vendor’s interrupt moderation scheme.

2. Depending on the system’s interrupt architecture, either the interrupt will be distributed to one of the host processors (based on a vendor-specific heuristic), or it will always be routed to the same processor.

3. At the network adapter, if additional packets arrive, then data and descriptors are transferred to host memory using DMA. An interrupt is not fired.

4. The interrupt service routine (ISR) runs on the host processor that the interrupt was routed to, which disables further interrupts from the network adapter. The ISR then schedules the miniport adapter’s deferred procedure call (DPC) to run on a specific processor—usually the same processor used to run the ISR, unless the DPC is explicitly set to run on another processor.

5. When the DPC runs, it processes the receive descriptor queue. Either the DPC creates an array of packets to hand to the NDIS interface, or it signals each packet to the NDIS interface, one at a time. In either case, no other processor can perform network adapter interrupt processing because interrupts from the network adapter are disabled.

6. The protocol stack processes each indicated packet. For TCP, this involves updating internal state, potentially sending new data if the TCP window allows it to do so, and potentially indicating or completing data to the application.

7. Once all receive descriptors have been consumed or some maximum amount of processing has been done, the DPC reenables interrupts on the network adapter and returns, allowing another interrupt to be triggered on another (potentially different) host processor.

RSS enables parallelism by changing steps 5 and 7 to allow one of the following algorithms to be implemented:

· Fire a single ISR that eventually results in the queuing of not just one DPC to a specific processor, but one DPC to potentially every processor. As in step 4, interrupts from the network adapter remain disabled; they are reenabled in step 7 only after every DPC has executed.

· Fire multiple ISRs to specific processors, which causes multiple DPCs to be scheduled in parallel. As in step 4, a specific interrupt remains disabled; it is reenabled in step 7 only after the single DPC (or group of DPCs) for that ISR has executed.
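The first option above can be illustrated with a toy Python model. It is only a sketch of the bookkeeping, not driver code: the class AdapterModel and all of its members are hypothetical names, and real ISRs and DPCs run concurrently rather than as ordinary method calls. The model shows one ISR splitting the receive descriptors into one work list per target CPU, with interrupts reenabled only when the last DPC finishes.

```python
from collections import defaultdict


class AdapterModel:
    """Toy model of the single-interrupt RSS option (hypothetical names)."""

    def __init__(self):
        self.interrupts_enabled = True
        self.pending_dpcs = 0
        self.indicated = []  # (cpu, packet) pairs handed to the protocol stack

    def isr(self, descriptors):
        """Take (target_cpu, packet) pairs in arrival order; return per-CPU work."""
        self.interrupts_enabled = False      # step 4: disable adapter interrupts
        per_cpu = defaultdict(list)
        for cpu, packet in descriptors:
            per_cpu[cpu].append(packet)      # arrival order is kept per CPU
        self.pending_dpcs = len(per_cpu)     # one DPC per CPU that has work
        if self.pending_dpcs == 0:
            self.interrupts_enabled = True   # nothing to do, so reenable at once
        return dict(per_cpu)

    def dpc(self, cpu, packets):
        """Indicate this CPU's packets; the last DPC to finish reenables interrupts."""
        for packet in packets:
            self.indicated.append((cpu, packet))
        self.pending_dpcs -= 1
        if self.pending_dpcs == 0:           # step 7: every DPC has executed
            self.interrupts_enabled = True


adapter = AdapterModel()
work = adapter.isr([(0, "a1"), (1, "b1"), (0, "a2"), (1, "b2")])
for cpu, packets in work.items():
    adapter.dpc(cpu, packets)
```

In a real miniport the pending count would have to be updated atomically, because the DPCs run on different processors at the same time.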

The sequence of events just described enables parallel processing of received packets; however, if in-order delivery is not preserved, performance will probably be degraded. For example, if packets for a group of connections are processed on different CPUs and one CPU is lightly loaded while the other is heavily loaded, older packets could actually be processed first. Because TCP acknowledgement generation and processing is highly optimized for in-order processing, performance will be degraded unless RSS supports in-order delivery of TCP segments.

RSS enables in-order packet delivery by ensuring that packets for a single TCP connection are always processed by one processor. This RSS feature requires that the network adapter examine each packet header and then use a hashing function to compute a signature for the packet. To ensure that the load is balanced evenly across the CPUs, the hash result is used as an index into an indirection table. Each entry in the indirection table names the CPU that is to run the associated DPC. Because the host protocol stack can change the contents of the indirection table at any time, it can dynamically balance the processing load on each CPU.

Figure 1 shows the RSS processing sequence. As shown on the right side of Figure 1, incoming network packets arrive for processing. The hash function is applied to the header to produce a 32-bit hash result. The hash type controls which incoming packet fields are used to generate the hash result. The hash mask is applied to the hash result to select the bits that index the indirection table. The value read from the indirection table is then added to BaseCPUNumber to give the CPU number, which allows RSS interrupts to be kept off the CPUs numbered below BaseCPUNumber.
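The lookup described above can be sketched as a small Python model. This is an illustration rather than driver code: the function name select_cpu and its parameters are hypothetical, and the sketch assumes that the mask keeps the least-significant bits of the hash result, which this section does not state explicitly.

```python
def select_cpu(hash_result, hash_bits, indirection_table, base_cpu_number):
    """Map a 32-bit hash result to the CPU that should run the DPC.

    Assumption: the mask keeps the low-order hash_bits bits of the hash.
    """
    if not 1 <= hash_bits <= 7:
        raise ValueError("hash_bits must be between 1 and 7")
    if len(indirection_table) != 1 << hash_bits:
        raise ValueError("indirection table must have 2**hash_bits entries")
    mask = (1 << hash_bits) - 1           # for example, 2 bits give 0b11
    index = hash_result & mask            # position in the indirection table
    return base_cpu_number + indirection_table[index]


# Four table entries spread over CPUs 4..7 (BaseCPUNumber = 4).
table = [0, 1, 2, 3]
cpu_for_packet = select_cpu(0x12345677, 2, table, 4)
```

Because the same connection always produces the same hash result, every packet of that connection reaches the same table entry and therefore the same CPU until the host rewrites that entry.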

The RSS processing sequence generates two variables: the scheduled CPU that runs the deferred procedure call (DPC) and the 32-bit hash result. Both are passed to the protocol driver on a per-packet basis. Lines A and B in Figure 1 are possible implementation options that are discussed in “RSS Implementation,” later in this paper.

Figure 1 RSS receive-processing sequence

RSS Initialization

The Receive-Side Scaling (RSS) parameters are selected when the miniport adapter is initialized, and they can be changed while the miniport adapter is operational. During initialization, NDIS requests the set of predefined hashing functions and hashing types that the miniport adapter supports by calling a specific NDIS object identifier (OID) for RSS capability discovery. NDIS then uses another NDIS OID to inform the miniport adapter of the RSS configuration values that were selected.

All network adapters are required to implement the default hash function, referred to as the Toeplitz hash. For more information about the Toeplitz hash, see "Toeplitz Hash Function Specification" and "Next Steps and Resources" later in the paper.
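For orientation, the following is a minimal Python sketch of a Toeplitz-style hash as it is commonly described: the input is read bit by bit from left to right, and for every bit that is set, the 32-bit window of the key that starts at the same bit position is XORed into the result. The function name and argument layout are assumptions made for this sketch, and the specification referenced above remains the authority on field order and key handling.

```python
def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Return a 32-bit Toeplitz hash of data under key (illustrative sketch)."""
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least 4 bytes longer than the input")
    key_bits = int.from_bytes(key, "big")   # whole key as one big integer
    key_len = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        bit_is_set = data[i // 8] & (0x80 >> (i % 8))
        if bit_is_set:
            # 32-bit window of the key starting i bits from its left edge
            window = (key_bits >> (key_len - 32 - i)) & 0xFFFFFFFF
            result ^= window
    return result


# Placeholder key for the example only; a real key is a secret random value.
example_key = bytes(range(1, 41))
example_hash = toeplitz_hash(example_key, bytes([192, 0, 2, 1, 198, 51, 100, 2]))
```

A hardware implementation would normally shift the key by one bit per input bit rather than recompute the window, but the result is the same. Any implementation should be checked against the verification data published with the hash specification before it is relied on.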

The following variables are set during RSS initialization. Note that tuple is a common term in networking; here it indicates the number of packet header fields that are used as input to the hash. For example, 4-tuple means that four fields are used, and 2-tuple means that two fields are used.

· Hash function. The default hash function is the Toeplitz hash. No other hash functions are currently defined.

· Hash type. The fields of the incoming packet that are used as input to the hash function. Depending on what the miniport adapter advertises that it can support, the host protocol stack can enable any combination of the following set of flags:

1. 4-tuple of source TCP Port, source IP version 4 (IPv4) address, destination TCP Port, and destination IPv4 address. This is the only hash type that network adapters are required to support.

2. 4-tuple of source TCP Port, source IP version 6 (IPv6) address, destination TCP Port, and destination IPv6 address.

3. 2-tuple of source IPv4 address and destination IPv4 address.

4. 2-tuple of source IPv6 address and destination IPv6 address.

5. 2-tuple of source IPv6 address and destination IPv6 address, including support for parsing IPv6 extension headers.

See the RSS DDK documentation for additional information about combining hash field flags.

· Hash bits (or mask). The number of hash-result bits that are used to index into the indirection table. All network adapters must support seven bits. The host protocol stack will set the actual number of bits to be used during initialization. The number will be between 1 and 7, inclusive. Because n bits can index 2^n entries, this range effectively defines the size of the indirection table as between 2 and 128 entries.

· Indirection table. The values for the indirection table. The host protocol stack will periodically rebalance the network load by changing the indirection table.

· BaseCPUNumber. The lowest-numbered CPU to use for RSS. BaseCPUNumber is added to the result of the indirection table lookup.

· Secret hash key. The size of the key is dependent upon the hash function. For the Toeplitz hash, the size is 40 bytes for IPv6 and 16 bytes for IPv4.
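The key sizes above follow from the length of the hash input. The Python sketch below builds the hash input for the 2-tuple and 4-tuple hash types and computes the key length that a 32-bit sliding-window hash such as Toeplitz needs. Several details are assumptions that this section does not spell out: the helper names hash_input and min_key_bytes are hypothetical, and the fields are assumed to be concatenated in network byte order as source address, destination address, source port, destination port.

```python
import ipaddress
import struct


def hash_input(src_ip, dst_ip, src_port=None, dst_port=None):
    """Concatenate the selected header fields (assumed order, see lead-in)."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    if src.version != dst.version:
        raise ValueError("source and destination must use the same IP version")
    data = src.packed + dst.packed                       # 2-tuple: addresses only
    if src_port is not None and dst_port is not None:
        data += struct.pack("!HH", src_port, dst_port)   # 4-tuple: add TCP ports
    return data


def min_key_bytes(input_len):
    """Each input bit selects a 32-bit key window, so 4 extra bytes are needed."""
    return input_len + 4


ipv4_input = hash_input("192.0.2.1", "198.51.100.2", 1024, 80)
ipv6_input = hash_input("2001:db8::1", "2001:db8::2", 1024, 80)
ipv4_key_len = min_key_bytes(len(ipv4_input))   # 12 input bytes + 4
ipv6_key_len = min_key_bytes(len(ipv6_input))   # 36 input bytes + 4
```

Under these assumptions the IPv4 4-tuple is 12 bytes long and the IPv6 4-tuple is 36 bytes long, which matches the 16-byte and 40-byte key sizes given above.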

Once RSS is initialized, data transfer can begin. Over a period of time, the host protocol stack will call the configuration OID to modify the indirection table to rebalance the processing load. Normally, all parameters in the OID will be the same except for the values contained in the indirection table; however, after RSS is initialized, the host protocol stack may change other RSS initialization parameters. This occurrence will be extremely rare, so it is acceptable to require a hardware reset to change the hash algorithm, the secret hash key, the hash type, the base CPU number, or the number of hash bits used.
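How the host decides on new indirection table values is not specified here. As one possible illustration, the Python sketch below moves a single table entry from the busiest CPU to the least busy CPU when doing so narrows the gap between them. The function name rebalance_once, the per-entry load figures, and the policy itself are all hypothetical, and the table values are CPU offsets before BaseCPUNumber is added.

```python
def rebalance_once(indirection_table, entry_load, cpu_count):
    """Return a new table with at most one entry moved (hypothetical policy)."""
    cpu_load = [0] * cpu_count
    for entry, cpu in enumerate(indirection_table):
        cpu_load[cpu] += entry_load[entry]

    busiest = max(range(cpu_count), key=lambda c: cpu_load[c])
    idlest = min(range(cpu_count), key=lambda c: cpu_load[c])
    gap = cpu_load[busiest] - cpu_load[idlest]

    # Only entries lighter than the gap can narrow it when moved.
    candidates = [
        entry
        for entry, cpu in enumerate(indirection_table)
        if cpu == busiest and 0 < entry_load[entry] < gap
    ]
    new_table = list(indirection_table)
    if candidates:
        # Moving load x leaves a gap of |gap - 2x|; pick the smallest result.
        best = min(candidates, key=lambda e: abs(gap - 2 * entry_load[e]))
        new_table[best] = idlest
    return new_table


# Three busy entries on CPU 0 and one light entry on CPU 1.
new_table = rebalance_once([0, 0, 0, 1], [5, 3, 2, 1], cpu_count=2)
```

Moving one entry at a time keeps most connections on the processor that already holds their state, which preserves the cache locality benefit described earlier. The new table would then be passed to the miniport adapter through the configuration OID.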
