Re: [Qemu-devel] [PATCH v3 3/9] rocker: add register programming guide
From: Scott Feldman
Subject: Re: [Qemu-devel] [PATCH v3 3/9] rocker: add register programming guide
Date: Fri, 16 Jan 2015 00:14:05 -0800
On Mon, Jan 12, 2015 at 3:40 AM, Paolo Bonzini <address@hidden> wrote:
> On 11/01/2015 04:57, address@hidden wrote:
>> +PCI Configuration Space
>> +-----------------------
>> +
>> +Each switch instance registers as a PCI device with PCI configuration space:
>> +
>> +  offset   width    description          value
>> +  ---------------------------------------------
>> +  0x0      2        Vendor ID            0x1b36
>> +  0x2      2        Device ID            0x0006
>> +  0x4      4        Command/Status
>> +  0x8      1        Revision ID          0x01
>> +  0x9      3        Class code           0x2800
>> +  0xC      1        Cache line size
>> +  0xD      1        Latency timer
>> +  0xE      1        Header type
>> +  0xF      1        Built-in self test
>> +  0x10     4        Base address low
>> +  0x14     4        Base address high
>> +  0x18-28           Reserved
>> +  0x2C     2        Subsystem vendor ID  0x0000
>> +  0x2E     2        Subsystem ID         0x0000
>
> This should not be guaranteed to 0, should it?
You're right. Added a note that subsystem implementation will fill this in.
>
>> +  0x30-38           Reserved
>> +  0x3C     1        Interrupt line
>> +  0x3D     1        Interrupt pin        0x00
>> +  0x3E     1        Min grant            0x00
>> +  0x3F     1        Max latency          0x00
>> +  0x40     1        TRDY timeout
>> +  0x41     1        Retry count
>> +  0x42     2        Reserved
>> +
>> +
>> +SECTION 3: Memory-Mapped Register Space
>> +=======================================
>> +
>> +There are two memory-mapped BARs. BAR0 maps device register space and is
>> +0x2000 in size. BAR1 maps MSI-X vector and PBA tables and is also 0x2000 in
>> +size, allowing for 256 MSI-X vectors. The host BIOS will assign the base
>> +address location. The host driver/OS will map the base address to host memory,
>> +giving the driver mmio access to the device register space.
>
> No need for the bits after "The host BIOS..." since that's just normal PCI.
Gone.
>> +All registers are 4 or 8 bytes long. It is assumed host software will access 4
>> +byte registers with one 4-byte access, and 8 byte registers with either two
>> +4-byte accesses or a single 8-byte access. In the case of two 4-byte accesses,
>> +access must be lower and then upper 4-bytes, in that order.
>
> Double 4-byte accesses are not implemented, are they?
They are now :) Tested on i386. I'll include changes with v4.
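
For readers following along, here is a minimal sketch of the split access
described above: an 8-byte register written as two 4-byte accesses, lower half
first. The fake_bar0 array and the reg_* helper names are made up for
illustration; a real driver would go through its platform's MMIO accessors
rather than a plain array.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the mapped BAR0 register space (0x2000 bytes). */
static uint32_t fake_bar0[0x2000 / 4];

/* One 4-byte access; the offset must be 4-byte aligned and inside BAR0. */
static void reg_write32(uint32_t offset, uint32_t val)
{
    assert(offset % 4 == 0 && offset < sizeof(fake_bar0));
    fake_bar0[offset / 4] = val;
}

static uint32_t reg_read32(uint32_t offset)
{
    assert(offset % 4 == 0 && offset < sizeof(fake_bar0));
    return fake_bar0[offset / 4];
}

/* 8-byte register written as two 4-byte accesses: lower first, then upper. */
static void reg_write64_split(uint32_t offset, uint64_t val)
{
    reg_write32(offset, (uint32_t)(val & 0xffffffffu));
    reg_write32(offset + 4, (uint32_t)(val >> 32));
}

/* 8-byte register read back the same way, lower half first. */
static uint64_t reg_read64_split(uint32_t offset)
{
    uint64_t lo = reg_read32(offset);
    uint64_t hi = reg_read32(offset + 4);
    return lo | (hi << 32);
}
```

For example, reg_write64_split(0x0028, addr) would program TEST_DMA_ADDR with
two 32-bit writes, to 0x0028 and then 0x002c.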
>> +Interrupt credits
>> +^^^^^^^^^^^^^^^^^
>> +
>> +MSI-X vectors used for descriptor ring completions use a credit mechanism for
>> +efficient device, PCIe bus, OS and driver operations. Each descriptor ring has
>> +a credit count which represents the number of outstanding descriptors to be
>> +processed by the driver. As the device marks descriptors complete, the credit
>> +count is incremented. As the driver processes those outstanding descriptors,
>> +it returns credits back to the device. This way, the device knows the driver's
>> +progress and can make decisions about when to fire the next interrupt or not.
>> +When the credit count is zero, and the first descriptors are posted for the
>> +driver, a single interrupt is fired. Once the interrupt is fired, the
>> +interrupt is disabled (auto-masked). In response to the interrupt, the driver
>> +will process descriptors and PIO write a returned credit value for that
>> +descriptor ring. If the driver returns all credits (the driver caught up with
>> +the device and there is no outstanding work), then the interrupt is unmasked,
>> +but not fired. If only partial credits are returned, the interrupt remains
>> +masked but the device generates an interrupt, signaling the driver that more
>> +outstanding work is available.
>
> Perhaps mention that this masking is unrelated to the MSI-X interrupt
> mask register?
Done.
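
The credit rules quoted above can be modelled in a few lines of C. This is only
a sketch of the described behaviour, not the device code: struct ring_irq and
both function names are invented here, and the "fired" counter exists purely so
the effect can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-ring interrupt state. */
struct ring_irq {
    uint32_t credits; /* completed descriptors not yet returned by the driver */
    bool masked;      /* auto-mask state, unrelated to the MSI-X mask bits */
    uint32_t fired;   /* number of interrupts fired, for illustration only */
};

/* Device side: one more descriptor has been marked complete. */
static void device_complete(struct ring_irq *r)
{
    /* Fire once when work appears on an idle, unmasked ring, then auto-mask. */
    if (r->credits++ == 0 && !r->masked) {
        r->fired++;
        r->masked = true;
    }
}

/* Driver side: return n credits after processing n descriptors. */
static void driver_return_credits(struct ring_irq *r, uint32_t n)
{
    assert(n <= r->credits);
    r->credits -= n;
    if (r->credits == 0) {
        r->masked = false; /* caught up: unmask, but do not fire */
    } else {
        r->fired++;        /* work remains: stay masked, fire again */
    }
}
```

With three completions followed by a return of two credits and then one, the
model fires twice in total and ends unmasked with a credit count of zero.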
>> +SECTION 5: Test Registers
>> +=========================
>> +
>> +Rocker switch has several test registers to support troubleshooting register
>
> s/Rocker switch/Rocker/
Done.
>> +access, interrupt generation, and DMA operations:
>> +
>> + TEST_REG, offset 0x0010, 32-bit (R/W)
>> + TEST_REG64, offset 0x0018, 64-bit (R/W)
>> + TEST_IRQ, offset 0x0020, 32-bit (R/W)
>> + TEST_DMA_ADDR, offset 0x0028, 64-bit (R/W)
>> + TEST_DMA_SIZE, offset 0x0030, 32-bit (R/W)
>> + TEST_DMA_CTRL, offset 0x0034, 32-bit (R/W)
>> +
>> +Reads to TEST_REG and TEST_REG64 will read a value 2x the last value written to
>
> s/2x/equal to twice/
Done.
>> +the register. The 32-bit and 64-bit versions are for testing 32-bit and 64-bit
>> +host accesses.
>
> Right now, as mentioned above, 64-bit registers must be accessed with a
> single 64-bit host access.
Fixed in implementation.
> In the case of 32-bit host accesses, should TEST_REG64's value be
> latched until the upper half is written? If so, please mention it and
> describe that this behavior is shared with the other 64-bit Rocker
> registers.
>
>> +Bits written to TEST_IRQ will cause same (unmasked) bits to be written to
>> +IRQ_STAT and an interrupt generated. Use IRQ_MASK to mask and unmask
>> +particular bits.
>
> It looks like actually TEST_IRQ will generate a single interrupt, not
> many of them. So writing 1 sets bit 1 in the PBA, not bit 0. Writing
> 3 sets bit 3, not bits 0 and 1.
Good catch...updated doc.
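
To make the corrected TEST_IRQ semantics concrete: the value written is treated
as a vector number, not as a bitmask. The sketch below models only that bit
selection. The pba array, test_irq_write() and pba_pending() are made-up names,
and real MSI-X hardware only leaves a pending bit set while the vector is
masked, which is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSIX_VECTORS 256 /* BAR1 allows for 256 MSI-X vectors */

/* Hypothetical model of the Pending Bit Array: one bit per vector. */
static uint64_t pba[MSIX_VECTORS / 64];

/* A write of N to TEST_IRQ raises vector N, i.e. sets bit N only. */
static void test_irq_write(uint32_t val)
{
    assert(val < MSIX_VECTORS);
    pba[val / 64] |= UINT64_C(1) << (val % 64);
}

static bool pba_pending(uint32_t vector)
{
    assert(vector < MSIX_VECTORS);
    return (pba[vector / 64] >> (vector % 64)) & 1;
}
```

So test_irq_write(3) leaves vectors 0 and 1 untouched and marks only vector 3.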
> Please do not use "IRQ_STAT", call it the PBA instead. Also remove the
> reference to IRQ_MASK, it's uninteresting.
>
>> +SECTION 7: Switch Control
>> +=========================
>> +
>> +This section covers switch-wide register settings.
>> +
>> +Control
>> +-------
>> +
>> +This register is used for low level control of the switch.
>> +
>> + CONTROL: offset 0x0300, 32-bit, (W)
>> +
>> + bit name description
>> + ------------------------------------------------------------------------
>> + [0] CONTROL_RESET If set, device will perform reset (same
>> + as pci reset)
>
> It's not the same as PCI reset, as it will not reset BARs for example.
Fixed.
>> +
>> +SECTION 8: CPU Packet Processing
>> +================================
>> +
>> +For packets ingressing on switch ports that are not forwarded by the switch but
>> +rather directed to the host CPU for further processing are delivered in the
>> +DMA RX ring. Likewise, for host CPU originating packets destined to egress on
>> +switch ports onto the network are scheduled by software using the DMA TX ring.
>
> Ingress packets for ports that are not forwarded by the switch are
> directed to the host CPU for further processing, and delivered in the
> DMA RX ring. Likewise, the host CPU can use the DMA TX ring to schedule
> packets that will egress onto the network.
Fixed by simplifying.
>> +
>> +Tx Packet Processing
>> +--------------------
>> +
>> +Software schedules packets for egress on switch ports using the DMA TX ring. A
>> +TX descriptor buffer describes the packet location and size in host DMA-able
>> +memory, the destination port, and any hardware-offload functions (such as L3
>> +payload checksum offload). Software then bumps the descriptor head to signal
>> +hardware of new Tx work. In response, hardware will DMA read Tx descriptors up
>> +to head, DMA read descriptor buffer and packet data, perform offloading
>> +functions, and finally frame packet on wire (network). Once packet processing
>> +is complete, hardware will writeback status to descriptor(s) to signal to
>> +software that Tx is complete and software resources (e.g. skb) backing packet
>> +can be released.
>> +
>> +Figure 2 shows an example 3-fragment packet queued with one Tx descriptor. A
>> +TLV is used for each packet fragment.
>> +
>> +  [Figure: a descriptor in the Tx ring (desc 0, desc 1, head) points to a
>> +  desc buf holding three TLVs; each TLV points to one of pkt frag 1,
>> +  pkt frag 2 and pkt frag 3, which together make up the pkt]
>> +
>> +  fig 2.
>> +
>> +The TLVs for Tx descriptor buffer are:
>> +
>> +  field               width    description
>> +  ---------------------------------------------------------------------
>> +  PPORT               4        Destination physical port #
>> +  TX_OFFLOAD          1        Hardware offload modes:
>> +                                 0: no offload
>> +                                 1: insert IP csum (ipv4 only)
>> +                                 2: insert TCP/UDP csum
>> +                                 3: L3 csum calc and insert
>> +                                    into csum offset (TX_L3_CSUM_OFF)
>> +                                    16-bit 1's complement csum value.
>> +                                    IPv4 pseudo-header and IP
>> +                                    already calculated by OS
>> +                                    and inserted.
>> +                                 4: TSO (TCP Segmentation Offload)
>> +  TX_L3_CSUM_OFF      2        For L3 csum offload mode, the offset,
>> +                               from the beginning of the packet,
>> +                               of the csum field in the L3 header
>> +  TX_TSO_MSS          2        For TSO offload mode, the
>> +                               Maximum Segment Size in bytes
>> +  TX_TSO_HDR_LEN      2        For TSO offload mode, the
>> +                               length of ethernet, IP, and
>> +                               TCP/UDP headers, including IP
>> +                               and TCP options.
>> +  TX_FRAGS            <array>  Packet fragments
>> +    TX_FRAG           <nest>   Packet fragment
>> +      TX_FRAG_ADDR    8        DMA address of packet fragment
>> +      TX_FRAG_LEN     2        Packet fragment length
>> +
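
Offload mode 3 above refers to a 16-bit 1's complement checksum written at
TX_L3_CSUM_OFF. As a reference point, here is a generic sketch of that checksum
(the usual Internet checksum) and of storing it at an offset in the packet.
The function names are made up, and exactly which bytes the device sums is not
spelled out in the quoted text, so callers choose the range here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * 16-bit one's complement checksum over len bytes, read as big-endian
 * 16-bit words. 'initial' lets a caller fold in a partial sum such as a
 * pseudo-header. The 32-bit accumulator is adequate for packet-sized
 * buffers; very large inputs would need intermediate folding.
 */
static uint16_t csum16(const uint8_t *data, size_t len, uint32_t initial)
{
    uint32_t sum = initial;
    size_t i;

    for (i = 0; i + 1 < len; i += 2) {
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    }
    if (len & 1) {
        sum += (uint32_t)data[len - 1] << 8; /* pad odd trailing byte */
    }
    while (sum >> 16) {
        sum = (sum & 0xffff) + (sum >> 16);  /* fold carries back in */
    }
    return (uint16_t)~sum;
}

/* Store the checksum in network byte order at csum_off within the packet. */
static void store_csum(uint8_t *pkt, size_t csum_off, uint16_t csum)
{
    pkt[csum_off] = (uint8_t)(csum >> 8);
    pkt[csum_off + 1] = (uint8_t)(csum & 0xff);
}
```

For the bytes 00 01 f2 03 f4 f5 f6 f7 the words sum to 0x2ddf0, which folds to
0xddf2, so csum16() returns 0x220d.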
>> +Possible status return codes in descriptor on completion are:
>> +
>> + DESC_COMP_ERR reason
>> + --------------------------------------------------------------------
>> + 0 OK
>> + ENXIO address or data read err on desc buf or packet
>> + fragment
>
> This is more like EFAULT actually.
>
>> + EINVAL bad pport or TSO or csum offloading error
>> + ENOMEM no memory for internal staging tx fragment
>
> QEMU is portable and these values are not, unfortunately. So please
> hardcode them to be 6/22/12 respectively.
>
> Or even better, to avoid the temptation, make them 1/2/3 and create new
> constants ROCKER_OK, ROCKER_ERR_FAULT, ROCKER_ERR_INVAL, ROCKER_ERR_NOMEM.
Since Linux driver is already out there in 3.18, we're stuck with the
values defined in errno.h for x86_64. But, no problem, I've
hard-coded those values for ROCKER_EINVAL, ROCKER_ENOMEM, etc. I'll
switch the Linux driver over to these constants when it's touched
again.
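
Roughly what the hard-coded constants could look like, as a sketch rather than
the actual v4 header: only ROCKER_EINVAL and ROCKER_ENOMEM are names I gave
above, the remaining names just follow the same pattern for the codes listed in
this doc, and the numbers are the Linux x86_64 errno.h values, sorted
numerically as requested. rocker_err_name() is a made-up helper.

```c
#include <stddef.h>

/* Fixed status codes, independent of the host's errno.h. */
enum {
    ROCKER_OK       = 0,
    ROCKER_ENOENT   = 2,
    ROCKER_ENXIO    = 6,
    ROCKER_ENOMEM   = 12,
    ROCKER_EFAULT   = 14,
    ROCKER_EBUSY    = 16,
    ROCKER_EEXIST   = 17,
    ROCKER_ENODEV   = 19,
    ROCKER_EINVAL   = 22,
    ROCKER_ENOSPC   = 28,
    ROCKER_EMSGSIZE = 90
};

/* Hypothetical helper: name of a known status code, or NULL if unknown. */
static const char *rocker_err_name(int err)
{
    switch (err) {
    case ROCKER_OK:       return "ROCKER_OK";
    case ROCKER_ENOENT:   return "ROCKER_ENOENT";
    case ROCKER_ENXIO:    return "ROCKER_ENXIO";
    case ROCKER_ENOMEM:   return "ROCKER_ENOMEM";
    case ROCKER_EFAULT:   return "ROCKER_EFAULT";
    case ROCKER_EBUSY:    return "ROCKER_EBUSY";
    case ROCKER_EEXIST:   return "ROCKER_EEXIST";
    case ROCKER_ENODEV:   return "ROCKER_ENODEV";
    case ROCKER_EINVAL:   return "ROCKER_EINVAL";
    case ROCKER_ENOSPC:   return "ROCKER_ENOSPC";
    case ROCKER_EMSGSIZE: return "ROCKER_EMSGSIZE";
    default:              return NULL;
    }
}
```

Because the values are spelled out, the device and a driver built on any host
agree on the numbers even where the host errno.h differs or lacks a code.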
> In any case, since you are at it, sort them in either numeric order or
> alphabetic order (apart from OK which can remain first).
>
>> +Rx Packet Processing
>> +--------------------
>> +
>> +Packets ingressing on switch ports that are not forwarded by the switch but
>> +rather directed to the host CPU for further processing are delivered in the
>> +DMA RX ring. Rx descriptor buffers are allocated by software and placed on the
>> +ring. Hardware will fill Rx descriptor buffers with packet data, write the
>> +completion, and signal to software that a new packet is ready. Since Rx packet
>> +size is not known a-priori, the Rx descriptor buffer must be allocated for
>> +worst-case packet size. A single Rx descriptor will contain the entire Rx
>> +packet data in one RX_PACKET TLV. Other Rx TLVs describe any hardware offloads
>> +performed on the packet, such as checksum validation.
>> +
>> +The TLVs for Rx descriptor buffer are:
>> +
>> +  field             width    description
>> +  ---------------------------------------------------
>> +  PPORT             4        Source physical port #
>> +  RX_FLAGS          2        Packet parsing flags:
>> +                               (1 << 0): IPv4 packet
>> +                               (1 << 1): IPv6 packet
>> +                               (1 << 2): csum calculated
>> +                               (1 << 3): IPv4 csum good
>> +                               (1 << 4): IP fragment
>> +                               (1 << 5): TCP packet
>> +                               (1 << 6): UDP packet
>> +                               (1 << 7): TCP/UDP csum good
>> +  RX_CSUM           2        IP calculated checksum:
>> +                               IPv4: IP payload csum
>> +                               IPv6: header and payload csum
>> +                             (Only valid if RX_FLAGS:csum calc is set)
>> +  RX_PACKET (N)     <var>    Packet data
>> +
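
A sketch of how a driver might consume RX_FLAGS follows. The bit positions come
from the table above; the macro names and both helpers are invented for
illustration, and the policy in rx_l4_csum_ok() is just one plausible reading,
not something the doc mandates.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the RX_FLAGS table; the names are hypothetical. */
#define RX_FLAGS_IPV4             (1u << 0)
#define RX_FLAGS_IPV6             (1u << 1)
#define RX_FLAGS_CSUM_CALC        (1u << 2)
#define RX_FLAGS_IPV4_CSUM_GOOD   (1u << 3)
#define RX_FLAGS_IP_FRAG          (1u << 4)
#define RX_FLAGS_TCP              (1u << 5)
#define RX_FLAGS_UDP              (1u << 6)
#define RX_FLAGS_TCP_UDP_CSUM_GOOD (1u << 7)

/* RX_CSUM is only meaningful when the "csum calculated" flag is set. */
static bool rx_csum_field_valid(uint16_t flags)
{
    return (flags & RX_FLAGS_CSUM_CALC) != 0;
}

/*
 * One possible policy: trust the hardware's L4 verdict only for
 * unfragmented TCP or UDP packets whose IPv4 header checksum, when the
 * packet is IPv4, was also reported good.
 */
static bool rx_l4_csum_ok(uint16_t flags)
{
    if (flags & RX_FLAGS_IP_FRAG) {
        return false;
    }
    if ((flags & RX_FLAGS_IPV4) && !(flags & RX_FLAGS_IPV4_CSUM_GOOD)) {
        return false;
    }
    if (!(flags & (RX_FLAGS_TCP | RX_FLAGS_UDP))) {
        return false;
    }
    return (flags & RX_FLAGS_TCP_UDP_CSUM_GOOD) != 0;
}
```

A driver that gets false from rx_l4_csum_ok() would simply fall back to
verifying the checksum in software.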
>> +Possible status return codes in descriptor on completion are:
>> +
>> + DESC_COMP_ERR reason
>> + --------------------------------------------------------------------
>> + 0 OK
>> + ENXIO address or data read err on desc buf
>> + ENOMEM no memory for internal staging desc buf
>> + EMSGSIZE Rx descriptor buffer wasn't big enough to contain
>> + packet data TLV and other TLVs.
>
> EMSGSIZE in fact doesn't exist on Windows even. So make this
> ROCKER_ERR_MSGSIZE==4.
>
>
>> +  field             width    description
>> +  ----------------------------------------------------
>> +  OF_DPA_CMD        2        CMD_[ADD|MOD]
>> +  OF_DPA_TBL        2        Flow table ID
>> +                               0: ingress port
>> +                               10: vlan
>> +                               20: termination mac
>> +                               30: unicast routing
>> +                               40: multicast routing
>> +                               50: bridging
>> +                               60: ACL policy
>
> Decimal, I guess. Better mention it, if only for completeness.
>
>> +Possible status return codes in descriptor on completion are:
>> +
>> +  DESC_COMP_ERR   command              reason
>> +  --------------------------------------------------------------------
>> +  0               all                  OK
>> +  EFAULT          all                  head or tail index outside
>> +                                       of ring
>> +  ENXIO           all                  address or data read err on
>> +                                       desc buf
>> +  ENOSPC          GET_STATS            cmd descriptor buffer wasn't
>> +                                       big enough to contain write-back
>> +                                       TLVs
>> +  EINVAL          ADD|MOD              invalid parameters passed in
>> +  EEXIST          ADD                  entry already exists
>> +  ENOSPC          ADD                  no space left in flow table
>> +  ENOENT          MOD|DEL|GET_STATS    group ID invalid
>> +  EBUSY           DEL                  group reference count non-zero
>> +  ENODEV          ADD                  next group ID doesn't exist
>
> Same as above, please add decimal values instead of overloading errno.
Updated doc with new ROCKER_Exxx return codes.
>
> Paolo