Attempt to fix powerpc64 LINT kernel broken by r308000. Netmap's use of
a uint64_t wchan seems odd, but in the interest of minimizing this
change just cast through uintptr_t to silence the compiler warning.
Reviewed by: jhb
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D8669
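
For readers unfamiliar with the idiom, the cast described above can be sketched as follows. This is a minimal illustration, not the committed diff: the helper names are hypothetical, and which direction the real change converts in is not stated here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers (not from the netmap sources): move a pointer-valued
 * wait channel in and out of a 64-bit integer field.
 */
static uint64_t
ex_wchan_to_u64(const void *wchan)
{
	/*
	 * Pointer -> uintptr_t is the portable pointer/integer conversion;
	 * uintptr_t -> uint64_t is then a plain integer conversion, so the
	 * compiler has no "cast to integer of different size" to warn about.
	 */
	return (uint64_t)(uintptr_t)wchan;
}

static void *
ex_u64_to_wchan(uint64_t v)
{
	/* Reverse path: narrow to uintptr_t first, then convert to a pointer. */
	return (void *)(uintptr_t)v;
}
```

The round trip is lossless as long as uintptr_t is no wider than 64 bits, which holds on the platforms discussed here.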
we have to refresh it ... always. This fixes problems reported in NetMap
with em(4) devices after conversion to extended descriptor format in
svn r293331.
Submitted by: luigi@
Reported by: franco@opnsense.org
MFC after: 2 days
- use PCI_VENDOR and PCI_DEVICE ids from a publicly allocated range
(thanks to RedHat)
- export memory pool information through PCI registers
- improve mechanism for configuring passthrough on different hypervisors
Code is from Vincenzo Maffione as a follow up to his GSOC work.
fix build on 32 bit platforms
simplify logic in netmap_virt.h
The commands (in net/netmap.h) to configure communication with the
hypervisor may be revised soon.
At the moment they are unused so this will not be a change of API.
This commit, long overdue, contains contributions in the last 2 years
from Stefano Garzarella, Giuseppe Lettieri, Vincenzo Maffione, including:
+ fixes on monitor ports
+ the 'ptnet' virtual device driver, and ptnetmap backend, for
high speed virtual passthrough on VMs (bhyve fixes in an upcoming commit)
+ improved emulated netmap mode
+ more robust error handling
+ removal of stale code
+ various fixes to code and documentation (some mixup between RX and TX
parameters, and private and public variables)
We also include an additional tool, nmreplay, which is functionally
equivalent to tcpreplay but operating on netmap ports.
netmap_kern.h currently requires all drivers including it to include
selinfo.h.
Submitted by: mmacy@nextbsd.org
Reviewed by: gnn
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D5334
The variables that are extern in the netmap header file should be
defined in ixl_txrx.c (the file that is included in both ixl(4) and ixlv(4)),
not in the main driver source files.
Reported by: ed@, dim@, ngie@
Several files use the internal name of `struct device` instead of
`device_t` which is part of the public API. This patch changes all
`struct device *` to `device_t`.
The remaining occurrences of `struct device` are those referring to the
Linux or OpenBSD version of the structure, or the code is not built on
FreeBSD and it's unclear what to do.
Submitted by: Matthew Macy <mmacy@nextbsd.org> (previous version)
Approved by: emaste, jhibbits, sbruno
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D7447
m_unshare passes on the source mbuf's flags as-is to m_getcl and this
results in a leak if the flags include M_NOFREE. The fix is to clear
the bits not listed in M_COPYFLAGS before calling m_getcl. M_RDONLY
should probably be filtered out too but that's outside the scope of this
fix.
Add assertions in the zone_mbuf and zone_pack ctors to catch similar
bugs.
Update netmap_get_mbuf to not pass M_NOFREE to m_getcl. It's not clear
what the original code was trying to do but it's likely incorrect.
Updated code is no different functionally but it avoids the newly added
assertions.
Reviewed by: gnn@
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D5698
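
The flag filtering described above amounts to a single mask operation. The sketch below uses made-up flag values and an illustrative copy mask (all EX_* names are assumptions, not the definitions in sys/mbuf.h) to show why an unfiltered "no free" bit would otherwise reach the allocator.

```c
#include <assert.h>

/* Illustrative bit values only; the real ones live in sys/mbuf.h. */
#define EX_M_PKTHDR	0x0002
#define EX_M_EOR	0x0004
#define EX_M_RDONLY	0x0008
#define EX_M_NOFREE	0x0040

/* Assumed stand-in for the "flags that may be copied" mask. */
#define EX_COPY_MASK	(EX_M_PKTHDR | EX_M_EOR | EX_M_RDONLY)

/*
 * Flags to hand to the allocator when duplicating a buffer: keep only the
 * bits listed in the copy mask, so that EX_M_NOFREE (and anything else not
 * meant to be inherited) is dropped.
 */
static int
ex_alloc_flags(int src_flags)
{
	return (src_flags & EX_COPY_MASK);
}
```

Note that EX_M_RDONLY still passes through this mask, matching the remark above that filtering it is left for a separate change.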
e1000/e1000e split in linux.
Split rxbuffer and txbuffer apart to support the new RX descriptor format
structures. Move rxbuffer manipulation to em_setup_rxdesc() to unify the
new behavior changes.
Add a RSSKEYLEN macro for help in generating the RSSKEY data structures
in the card.
Change em_receive_checksum() to process the new rxdescriptor format
status bit.
MFC after: 2 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D3447
This is a subtle use-after-free race that results in some very undesirable
hang behaviour.
Reviewed by: pkelsey
Obtained from: Kip Macy, NextBSD (91a9bd1dbb)
(detected by jenkins run with gcc 4.9)
Update documentation on the use of netmap_priv_d,
rename the refcount and use the same structure in
FreeBSD and linux
No functional changes.
This commit contains large contributions from Giuseppe Lettieri and
Stefano Garzarella, is partly supported by grants from Verisign and Cisco,
and brings in the following:
- fix zerocopy monitor ports and introduce copying monitor ports
(the latter are lower performance but give access to all traffic
in parallel with the application)
- exclusive open mode, useful to implement solutions that recover
from crashes of the main netmap client (suggested by Patrick Kelsey)
- revised memory allocator in preparation for the 'passthrough mode'
(ptnetmap) recently presented at bsdcan. ptnetmap is described in
S. Garzarella, G. Lettieri, L. Rizzo;
Virtual device passthrough for high speed VM networking,
ACM/IEEE ANCS 2015, Oakland (CA) May 2015
http://info.iet.unipi.it/~luigi/research.html
- fix rx CRC handling on ixl
- add module dependencies for netmap when building drivers as modules
- minor simplifications to device-specific routines (*txsync, *rxsync)
- general code cleanup (remove unused variables, introduce macros
to access rings and remove duplicate code)
Applications do not need to be recompiled, unless of course
they want to use the new features (monitors and exclusive open).
Those willing to try this code on stable/10 can just update the
sys/dev/netmap/*, sys/net/netmap* with the version in HEAD
and apply the small patches to individual device drivers.
MFC after: 1 month
Sponsored by: (partly) Verisign, Cisco
up to 2 rx/tx queues for the 82574.
Program the 82574 to enable 5 msix vectors, assign 1 to each rx queue,
1 to each tx queue and 1 to the link handler.
Inspired by DragonFlyBSD, enable some RSS logic for tx queue
handling/processing.
Move multiqueue handler functions so that they line up better in a diff
review to if_igb.c
Always enqueue tx work to be done in em_mq_start, if unable to acquire
the TX lock, then this will be processed in the background later by the
taskqueue. Remove mbuf argument from em_start_mq_locked() as the work
is always enqueued. (stolen from igb)
Setup TARC, TXDCTL and RXDCTL registers for better performance and stability
in multiqueue and singlequeue implementations. Handle Intel errata 3 and
generic multiqueue behavior with the initialization of TARC(0) and TARC(1)
Bind interrupt threads to cpus in order. (stolen from igb)
Add 2 new DDB functions, one to display the queue(s) and their settings and
one to reset the adapter. Primarily used for debugging.
In the multiqueue configuration, bump RXD and TXD ring size to max for the
adapter (4096). Setup an RDTR of 64 and an RADV of 128 in multiqueue configuration
to cut down on the number of interrupts. RADV was arbitrarily set to 2x RDTR
and can be adjusted as needed.
Cleanup the display in top a bit to make it clearer where the taskqueue threads
are running and what they should be doing.
Ensure that both queues are processed by em_local_timer() by writing them both
to the IMS register to generate soft interrupts.
Ensure that a soft interrupt is generated when em_msix_link() is run so that
any races between assertion of the link/status interrupt and a rx/tx interrupt
are handled.
Document existing tunables: hw.em.eee_setting, hw.em.msix, hw.em.smart_pwr_down, hw.em.sbp
Document use of hw.em.num_queues and the new kernel option EM_MULTIQUEUE
Thanks to Intel for their continued support of FreeBSD.
Reviewed by: erj jfv hiren gnn wblock
Obtained from: Intel Corporation
MFC after: 2 weeks
Relnotes: Yes
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D1994
was granted via rings and ni_bufs_list_head represented in those rings
and lists (e.g., via SIGKILL), those buffers are no longer available
for subsequent users for the lifetime of the system. To mitigate this
resource leak, reset the allocator state when the last ref to that
allocator is released.
Note that this only recovers leaked resources for an allocator when
there are no longer any users of that allocator, so there remain
circumstances in which leaked allocator resources may not ever be
recovered - consider a set of multiple netmap processes that are all
using the same allocator (say, the global allocator) where members of
that set may be killed and restarted over time but at any given point
there is one member of that set running.
Based on initial work by adrian@.
Reviewed by: Giuseppe Lettieri (g.lettieri@iet.unipi.it), luigi
Approved by: jmallett (mentor)
MFC after: 1 week
Sponsored by: Norse Corp, Inc.
the right solution but I will leave it to experts to untangle this
problem to properly stop the build failures.
At the moment only if_ix.c includes dev/netmap/ixgbe_netmap.h which is
good as ixgbe_netmap.h defines a couple of (file) static variables--thus
local to if_ix.c.
static int ix_crcstrip however now also got checked from ix_txrx.c
(as an extern) and should not be visible there. In fact we do see
powerpc and powerpc64 build failures because of this. It is unclear
to me why on other (clang built?) architectures this does not lead
to a reference of an undefined symbol and similar build breakage.
Preliminary tests indicate 32 Mpps on tx, 24 Mpps on rx
with source and receiver on two different ports of the same 40G card.
Optimizations are likely possible.
The code follows closely the one for ixgbe so i do not
expect stability issues.
Hardware kindly supplied by Intel.
Reviewed by: Jack Vogel
MFC after: 1 week
1. handle errors from nm_config(), if any (none of the FreeBSD drivers
currently returns an error on this function, so this change
is a no-op at this time)
2. use a full memory barrier on ioctls
from the FreeBSD network code. The flag is still kept around in the
"sys/mbuf.h" header file, but no longer has any users. Instead
the "m_pkthdr.rsstype" field in the mbuf structure is now used to
decide the meaning of the "m_pkthdr.flowid" field. To modify the
"m_pkthdr.rsstype" field please use the existing "M_HASHTYPE_XXX"
macros as defined in the "sys/mbuf.h" header file.
This patch introduces new behaviour in the transmit direction.
Previously network drivers checked if "M_FLOWID" was set in "m_flags"
before using the "m_pkthdr.flowid" field. This check has now been
replaced by checking if "M_HASHTYPE_GET(m)" is different from
"M_HASHTYPE_NONE". In the future more hashtypes will be added, for
example hashtypes for hardware dedicated flows.
"M_HASHTYPE_OPAQUE" indicates that the "m_pkthdr.flowid" value is
valid and has no particular type. This change removes the need for an
"if" statement in TCP transmit code checking for the presence of a
valid flowid value. The "if" statement mentioned above is now a direct
variable assignment which is then later checked by the respective
network drivers like before.
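
The transmit-side check described above can be sketched as follows. The structure, macro names and numeric values are simplified stand-ins (everything prefixed ex_/EX_ is hypothetical); the real macros are the M_HASHTYPE_XXX ones in "sys/mbuf.h".

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative hash type values; not the ones from sys/mbuf.h. */
enum { EX_HASHTYPE_NONE = 0, EX_HASHTYPE_OPAQUE = 63 };

/* Simplified stand-in for the relevant packet header fields. */
struct ex_pkthdr {
	uint32_t flowid;	/* meaningful only if rsstype != NONE */
	uint8_t	 rsstype;
};

#define EX_HASHTYPE_GET(h)	((h)->rsstype)
#define EX_HASHTYPE_SET(h, v)	((h)->rsstype = (v))

/*
 * Pick a transmit queue: trust flowid whenever a hash type is set,
 * otherwise fall back to a caller-supplied default (e.g. current CPU).
 */
static uint32_t
ex_select_txq(const struct ex_pkthdr *h, uint32_t nqueues, uint32_t fallback)
{
	assert(nqueues > 0);
	if (EX_HASHTYPE_GET(h) != EX_HASHTYPE_NONE)
		return (h->flowid % nqueues);
	return (fallback % nqueues);
}
```

A producer such as the TCP transmit path would then assign flowid and call EX_HASHTYPE_SET(h, EX_HASHTYPE_OPAQUE) unconditionally, leaving the validity test to the driver.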
Additional notes:
- The SCTP code changes will be committed as a separate patch.
- Removal of the "M_FLOWID" flag will also be done separately.
- The FreeBSD version has been bumped.
MFC after: 1 month
Sponsored by: Mellanox Technologies
Mostly bugfixes or features developed in the past 6 months,
so this is a 10.1 candidate.
Basically no user API changes (some bugfixes in sys/net/netmap_user.h).
In detail:
1. netmap support for virtio-net, including in netmap mode.
Under bhyve and with a netmap backend [2] we reach over 1Mpps
with standard APIs (e.g. libpcap), and 5-8 Mpps in netmap mode.
2. (kernel) add support for multiple memory allocators, so we can
better partition physical and virtual interfaces giving access
to separate users. The most visible effect is one additional
argument to the various kernel functions to compute buffer
addresses. All netmap-supported drivers are affected, but changes
are mechanical and trivial
3. (kernel) simplify the prototype for *txsync() and *rxsync()
driver methods. All netmap drivers affected, changes mostly mechanical.
4. add support for netmap-monitor ports. Think of it as a mirroring
port on a physical switch: a netmap monitor port replicates traffic
present on the main port. Restrictions apply. Drive carefully.
5. if_lem.c: support for various paravirtualization features,
experimental and disabled by default.
Most of these are described in our ANCS'13 paper [1].
Paravirtualized support in netmap mode is new, and beats the
numbers in the paper by a large factor (under qemu-kvm,
we measured guest-host throughput up to 10-12 Mpps).
A lot of refactoring and additional documentation in the files
in sys/dev/netmap, but apart from #2 and #3 above, almost nothing
of this stuff is visible to other kernel parts.
Example programs in tools/tools/netmap have been updated with bugfixes
and to support more of the existing features.
This is meant to go into 10.1 so we plan an MFC before the Aug.22 deadline.
A lot of this code has been contributed by my colleagues at UNIPI,
including Giuseppe Lettieri, Vincenzo Maffione, Stefano Garzarella.
MFC after: 3 days.
* The way rings are updated changed with the last API bump.
Also sync ->head when moving slots in netmap_sw_to_nic().
* Remove a crashing selrecord() call.
* Unclog the logic surrounding netmap_rxsync_from_host().
* Add timestamping to RX host ring.
* Remove a couple of obsolete comments.
Submitted by: Franco Fichtner
MFC after: 3 days
Sponsored by: Packetwerk
interface allows the ifnet structure to be defined as an opaque
type in NIC drivers. This then allows the ifnet structure to be
changed without a need to change or recompile NIC drivers.
Put differently, NIC drivers can be written and compiled once and
be used with different network stack implementations, provided of
course that those network stack implementations have an API and
ABI compatible interface.
This commit introduces the 'if_t' type to replace 'struct ifnet *'
as the type of a network interface. The 'if_t' type is defined as
'void *' to enable the compiler to perform type conversion to
'struct ifnet *' and vice versa where needed and without warnings.
The functions that implement the API are the only functions that
need to have an explicit cast.
The MII code has been converted to use the driver API to avoid
unnecessary code churn. Code churn comes from having to work with
both converted and unconverted drivers in correlation with having
callback functions that take an interface. By converting the MII
code first, the callback functions can be defined so that the
compiler will perform the typecasts automatically.
As soon as all drivers have been converted, the if_t type can be
redefined as needed and the API functions can be fixed to not need
an explicit cast.
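
A rough sketch of the opaque-handle pattern described above follows. Only the 'if_t' typedef comes from the text; the structure contents and the ex_-prefixed accessors are hypothetical and stand in for whatever the real driver API provides.

```c
#include <assert.h>
#include <stdint.h>

/* Opaque interface handle, defined as 'void *' as described above. */
typedef void *if_t;

/*
 * Stand-in for the stack-private structure. In the scheme described above,
 * drivers would never see this definition, only the accessors below.
 */
struct ex_ifnet {
	uint32_t mtu;
};

/* Accessors are the only place that needs an explicit cast. */
static uint32_t
ex_if_getmtu(if_t ifp)
{
	return (((struct ex_ifnet *)ifp)->mtu);
}

static void
ex_if_setmtu(if_t ifp, uint32_t mtu)
{
	((struct ex_ifnet *)ifp)->mtu = mtu;
}
```

Because 'void *' converts implicitly to and from any object pointer in C, converted and unconverted code can pass the same handle around without casts or warnings.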
The immediate beneficiaries of this change are:
1. Juniper Networks - The network stack implementation in Junos
is entirely different from FreeBSD's one and this change
allows Juniper to build "stock" NIC drivers that can be used
in combination with both the FreeBSD and Junos stacks.
2. FreeBSD - This change opens the door towards changing ifnet
and implementing new features and optimizations in the network
stack without it requiring a change in the many NIC drivers
FreeBSD has.
Submitted by: Anuranjan Shukla <anshukla@juniper.net>
Reviewed by: glebius@
Obtained from: Juniper Networks, Inc.
- intercept FIONBIO and FIOASYNC ioctls on netmap file descriptors.
libpcap calls them to set non blocking I/O on the file descriptor,
for netmap this is a no-op because there is no read/write,
but not intercepting would cause fcntl() to return -1
- rate limit and put under netmap.verbose some messages that occur
when threads use the same file descriptor concurrently.
- netmap pipes, providing bidirectional blocking I/O while moving
100+ Mpps between processes using shared memory channels
(no mistake: over one hundred million. But mind you, i said
*moving* not *processing*);
- kqueue support (BHyVe needs it);
- improved user library. Just the interface name lets you select a NIC,
host port, VALE switch port, netmap pipe, and individual queues.
The upcoming netmap-enabled libpcap will use this feature.
- optional extra buffers associated to netmap ports, for applications
that need to buffer data yet don't want to make copies.
- segmentation offloading for the VALE switch, useful between VMs.
and a number of bug fixes and performance improvements.
My colleagues Giuseppe Lettieri and Vincenzo Maffione did a substantial
amount of work on these features so we owe them a big thanks.
There are some external repositories that can be of interest:
https://code.google.com/p/netmap
our public repository for netmap/VALE code, including
linux versions and other stuff that does not belong here,
such as python bindings.
https://code.google.com/p/netmap-libpcap
a clone of the libpcap repository with netmap support.
With this any libpcap client has access to most netmap
features with no recompilation. E.g. tcpdump can filter
packets at 10-15 Mpps.
https://code.google.com/p/netmap-ipfw
a userspace version of ipfw+dummynet which uses netmap
to send/receive packets. Speed is up in the 7-10 Mpps
range per core for simple rulesets.
Both netmap-libpcap and netmap-ipfw will be merged upstream at some
point, but until this happens it is useful to have access to them.
And yes, this code will be merged soon. It is infinitely better
than the version currently in 10 and 9.
MFC after: 3 days
add separate rx/tx ring indexes
add ring specifier in nm_open device name
netmap.c, netmap_vale.c
more consistent errno numbers
netmap_generic.c
correctly handle failure in registering interfaces.
tools/tools/netmap/
massive cleanup of the example programs
(a lot of common code is now in netmap_user.h.)
nm_util.[ch] are going away soon.
pcap.c will also go when i commit the native netmap support for libpcap.
Most relevant features:
- netmap emulation on any NIC, even those without native netmap support.
On the ixgbe we have measured about 4Mpps/core/queue in this mode,
which is still a lot more than with sockets/bpf.
- seamless interconnection of VALE switch, NICs and host stack.
If you disable accelerations on your NIC (say em0)
ifconfig em0 -txcsum -rxcsum
you can use the VALE switch to connect the NIC and the host stack:
vale-ctl -h valeXX:em0
allowing sharing the NIC with other netmap clients.
- THE USER API HAS SLIGHTLY CHANGED (head/cur/tail pointers
instead of pointers/count as before). This was unavoidable to support,
in the future, multiple threads operating on the same rings.
Netmap clients require very small source code changes to compile again.
On the plus side, the new API should be easier to understand
and the internals are a lot simpler.
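
To give a feel for the head/cur/tail scheme mentioned above, here is a simplified model. It is a sketch under assumptions: the ex_ structure and helpers are hypothetical and only mimic the idea that slots between cur and tail are available to the client, which then advances cur and head as it consumes them. The manual page remains the authoritative description.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified, hypothetical model of a ring using head/cur/tail indexes. */
struct ex_ring {
	uint32_t num_slots;
	uint32_t head;	/* first slot still owned by the client */
	uint32_t cur;	/* next slot the client will look at */
	uint32_t tail;	/* first slot NOT available to the client */
};

/* Number of slots available between cur and tail, with wraparound. */
static uint32_t
ex_ring_space(const struct ex_ring *r)
{
	int64_t n = (int64_t)r->tail - (int64_t)r->cur;

	if (n < 0)
		n += r->num_slots;
	return ((uint32_t)n);
}

/* Successor of index i on the circular ring. */
static uint32_t
ex_ring_next(const struct ex_ring *r, uint32_t i)
{
	return ((i + 1 == r->num_slots) ? 0 : i + 1);
}

/* Consume up to 'want' slots, then release them by moving head to cur. */
static uint32_t
ex_ring_consume(struct ex_ring *r, uint32_t want)
{
	uint32_t n = ex_ring_space(r);

	if (n > want)
		n = want;
	for (uint32_t k = 0; k < n; k++)
		r->cur = ex_ring_next(r, r->cur);
	r->head = r->cur;
	return (n);
}
```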
The manual page has been updated extensively to reflect the current
features and give some examples.
This is the result of work of several people including Giuseppe Lettieri,
Vincenzo Maffione, Michio Honda and myself, and has been financially
supported by EU projects CHANGE and OPENLAB, from NetApp University
Research Fund, NEC, and of course the Universita` di Pisa.
This includes the following:
- use separate memory regions for VALE ports
- locking fixes
- some simplifications in the NIC-specific routines
- performance improvements for the VALE switch
- some new features in the pkt-gen test program
- documentation updates
There are small API changes that require programs to be recompiled
(NETMAP_API has been bumped so you will detect old binaries at runtime).
In particular:
- struct netmap_slot now is 16 bytes to support an extra pointer,
which may save one data copy when using VALE ports or VMs;
- the struct netmap_if has two extra fields;
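
As an illustration of the 16-byte slot mentioned above, the layout below shows one way four fields can add up to that size. The field names and ordering are assumptions made for this sketch; check net/netmap.h for the actual definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte slot: 4 + 2 + 2 bytes of metadata plus a pointer. */
struct ex_slot {
	uint32_t buf_idx;	/* buffer index */
	uint16_t len;		/* packet length */
	uint16_t flags;		/* per-slot flags */
	uint64_t ptr;		/* extra pointer-sized field */
};

/* Fail the build, not the run, if the layout ever drifts from 16 bytes. */
_Static_assert(sizeof(struct ex_slot) == 16, "slot must be 16 bytes");

/* Bytes needed for the slot array of a ring with 'nslots' entries. */
static size_t
ex_slots_bytes(uint32_t nslots)
{
	return ((size_t)nslots * sizeof(struct ex_slot));
}
```

Storing the pointer as a fixed 64-bit integer keeps the layout identical for 32-bit and 64-bit clients sharing the same mapped region.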
MFC after: 3 days
to this event, adding if_var.h to files that do need it. Also, include
all includes that now are included due to implicit pollution via if_var.h
Sponsored by: Netflix
Sponsored by: Nginx, Inc.
- This version has support for the new Intel Avoton systems,
including 2.5Gb support, further it now has IPv6/TSO6 support as
well. Shared code has been updated where necessary as well. Thanks
to my new assistant Eric Joyner for doing the transmit path changes
to bring in the IPv6/TSO6 support. Thanks to Gleb for catching the
one bug and change needed in NETMAP.
Approved by: re
from each batch flowing on the VALE switch
- feature: add glue for 'indirect' buffers on the sender side:
if a slot has NS_INDIRECT set, the netmap buffer contains pointer(s)
to the actual userspace buffers, which are accessed with copyin().
The feature is not finalised yet, as it will likely need to deal
with some iovec variant for proper scatter/gather support.
This will save one copy for clients (e.g. qemu) that cannot
use the netmap buffer directly.
A curiosity: on amd64 copyin() appears to be 10-15% faster than pkt_copy()
or bcopy() at least for sizes of 256 and greater.
- the VALE switch now supports up to 254 destinations per switch,
unicast or broadcast (multicast goes to all ports).
- we can attach hw interfaces and the host stack to a VALE switch,
which means we will be able to use it more or less as a native bridge
(minor tweaks still necessary).
A 'vale-ctl' program is supplied in tools/tools/netmap
to attach/detach ports to/from the switch, and list the current configuration.
- the lookup function in the VALE switch can be reassigned to
something else, similar to the pf hooks. This will enable
attaching the firewall, or other processing functions (e.g. in-kernel
openvswitch) directly on the netmap port.
The internal API used by device drivers does not change.
Userspace applications should be recompiled because we
bump NETMAP_API as we now use some fields in the struct nmreq
that were previously ignored -- otherwise, data structures
are the same.
Manpages will be committed separately.
- netmap_rx_irq()/netmap_tx_irq() can now be called by FreeBSD drivers
hiding the logic for handling NIC interrupts in netmap mode.
This also simplifies the case of NICs attached to VALE switches.
Individual drivers will be updated with separate commits.
- use the same refcount() API for FreeBSD and linux
- plus some comments, typos and formatting fixes
Portions contributed by Michio Honda
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.
The change is mostly mechanical but few notes are reported:
* The KPI changes as follows:
- VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
- VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
- VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
- VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
(in order to avoid visibility of implementation details)
- The read-mode operations are added:
VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (which forces consumers
to include sys/mutex.h directly to cater for its inline functions
using VM_OBJECT_LOCK()) means that all the vm/vm_pager.h
consumers must now also include sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
the compat layer because the name clash between FreeBSD and solaris
versions must be avoided.
For this purpose zfs redefines the vm_object locking functions
directly, isolating the FreeBSD components in specific compat stubs.
The KPI is heavily broken by this commit. Third-party ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).
Sponsored by: EMC / Isilon storage division
Reviewed by: jeff
Reviewed by: pjd (ZFS specific review)
Discussed with: alc
Tested by: pho
By setting dev.netmap.fwd=1 (or enabling the feature with a per-ring flag),
packets are forwarded between the NIC and the host stack unless the
netmap client clears the NS_FORWARD flag on the individual descriptors.
This feature greatly simplifies applications where some traffic
(think of ARP, control traffic, ssh sessions...) must be processed
by the host stack, whereas the bulk is handled by the netmap process
which simply (un)marks packets that should not be forwarded.
The default is chosen so that now a netmap receiver operates
in a mode very similar to bpf.
Of course there is no free lunch: traffic to/from the host stack
still operates at OS speed (or less, as there is one extra copy in
one direction).
HOWEVER, since traffic goes to the user process before being
reinjected, and reinjection occurs in a user context, you get some
form of livelock protection for free.
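
From the client's point of view, the per-descriptor decision described above might look like the sketch below. The slot layout, the flag value and the classifier are all illustrative assumptions (ex_/EX_ names are hypothetical); only the idea of clearing NS_FORWARD on packets the client keeps comes from the text.

```c
#include <assert.h>
#include <stdint.h>

#define EX_NS_FORWARD	0x0004u	/* illustrative bit value */

/* Minimal stand-in for a receive descriptor. */
struct ex_fwd_slot {
	uint16_t len;
	uint16_t flags;
};

/* Toy classifier (assumption): the client keeps "bulk" packets only. */
static int
ex_wants_packet(const struct ex_fwd_slot *s)
{
	return (s->len >= 128);
}

/*
 * Walk a batch of received slots. Packets the client handles itself get
 * the forward flag cleared; everything else keeps it and would be passed
 * on to the host stack. Returns how many slots remain marked for forwarding.
 */
static unsigned
ex_filter_slots(struct ex_fwd_slot *slots, unsigned n)
{
	unsigned forwarded = 0;

	for (unsigned i = 0; i < n; i++) {
		if (ex_wants_packet(&slots[i]))
			slots[i].flags &= (uint16_t)~EX_NS_FORWARD;
		else if (slots[i].flags & EX_NS_FORWARD)
			forwarded++;
	}
	return (forwarded);
}
```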
two upcoming features:
semi-transparent mode:
when a device is opened in this mode, the
user program will be able to mark slots that must be forwarded
to the "other" side (i.e. from NIC to host stack, or vice versa),
and the forwarding will occur automatically at the next netmap syscall.
This saves the need to open another file descriptor and do
the forwarding manually.
direct-forwarding mode:
when operating with a VALE port, the user can specify in the slot
the actual destination port, overriding the forwarding decision
made by a lookup of the destination MAC. This can be useful to
implement packet dispatchers.
No API changes will be introduced.
No new functionality in this patch yet.
previous names, 'ptag' and 'pmap' -- p stands for packet.
This change reduces the difference between the code in stable/9
and head, and also helps using the same ixgbe_netmap.h on both branches.
Approved by: Jack Vogel
that revises the netmap memory allocator so that the
various parameters (number and size of buffers, rings, descriptors)
can be modified at runtime through sysctl variables.
The changes become effective when no netmap clients are active.
The API is mostly unchanged, although the NIOCUNREGIF ioctl now
does not bring the interface back to normal mode, and you
need to close the file descriptor for that.
This change was necessary to track who is using the mapped region,
and since it is a simplification of the API there was no
incentive in trying to preserve NIOCUNREGIF.
We will remove the ioctl from the kernel next time we need
a real API change (and version bump).
Among other things, buffer allocation when opening devices is
now much faster: it used to take O(N^2) time, now it is linear.
Submitted by: Giuseppe Lettieri
- Move destruction of per-ring locks to netmap_dtor_locked to mirror the
initialization that happens in NIOCREGIF. Otherwise unloading a netmap-
capable interface that was never put into netmap mode would try to
mtx_destroy an uninitialized mutex, and panic.
- Destroy core_lock in netmap_detach, mirroring init in netmap_attach.
- Also comment out the knlist_destroy for now as there is currently no
knlist_init.
Sponsored by: ADARA Networks
Reviewed by: luigi@
http://info.iet.unipi.it/~luigi/vale/
VALE lets you dynamically instantiate multiple software bridges
that talk the netmap API (and are *extremely* fast), so you can test
netmap applications without the need for high end hardware.
This is particularly useful as I am completing a netmap-aware
version of ipfw, and VALE provides an excellent testing platform.
I also have netmap backends for qemu mostly ready for commit
to the port, and this too will let you interconnect virtual machines
at high speed without fiddling with bridges, tap or other slow solutions.
The API for applications is unchanged, so you can use the code
in tools/tools/netmap (which i will update soon) on the VALE ports.
This commit also syncs the code with the one in my internal repository,
so you will see some conditional code for other platforms.
The code should run mostly unmodified on stable/9 so people interested
in trying it can just copy sys/dev/netmap/ and sys/net/netmap*.h
from HEAD
VALE is joint work with my colleague Giuseppe Lettieri, and
is partly supported by the EU Projects CHANGE and OPENLAB
Contrary to what I wrote in my previous commit, the 82599
does include the CRC in the length. The operating mode is
reset in ixgbe_init_locked() and so we need to hook into
the places where the two registers (HLREG0 and RDRXCTL) are
modified.
does not include the CRC irrespective of the setting
of CRCSTRIP. The 82599 data sheets (sec. 7.1.6) say differently.
Very strange. Need to check what happens on legacy descriptors,
but for the time being this restores functionality.
and make it easier to replace it with a different implementation.
On passing, also fix indentation.
NOTE: I know that #include "foo.c" is ugly, but the alternative
(add another entry to sys/conf/files, add a separate header with
structs and prototypes, and expose functions that are meant to
be private) looks even worse to me.
We need a more modular way to specify dependencies and build options.
- add a sysctl, dev.netmap.ix_crcstrip, to control whether ixgbe should
strip the CRC on received frames. Defaults to 0, which keeps the CRC.
and improves performance when receiving min-sized (64-byte) frames.
This matters because min-sized frames are one of the standard
benchmarks for switches and routers, some chipsets seem to issue
read-modify-write cycles for PCIe transactions that are not a
full cache line, and a min-sized frame triggers the bug, resulting
in reduced throughput -- 9.7 instead of 14.88 Mpps -- and heavy
bus load.
- for the time being, always look for incoming packets on a select/poll
even if there has not been an interrupt in the meantime. This is
only a temporary workaround for a probable race condition in keeping
track of rx interrupts.
Add a couple of diagnostic vars to help studying the problem.
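
For reference, the 14.88 Mpps figure quoted above is the theoretical 10 Gbit/s Ethernet line rate for 64-byte frames. The helper below (a hypothetical name, written for this note) reproduces the arithmetic: each frame occupies its own length plus 20 bytes of preamble, start-of-frame delimiter and inter-frame gap on the wire.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame wire overhead: 7 preamble + 1 SFD + 12 inter-frame gap bytes. */
#define EX_ETH_OVERHEAD	20u

/*
 * Maximum frames per second on a link of 'link_bps' bits per second for
 * frames of 'frame_bytes' bytes (CRC included), rounded down.
 */
static uint64_t
ex_line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	return (link_bps / ((uint64_t)(frame_bytes + EX_ETH_OVERHEAD) * 8u));
}
```

With link_bps = 10000000000 and frame_bytes = 64 this yields 14880952 frames per second, i.e. the 14.88 Mpps mentioned above.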
USERSPACE:
1. add support for devices with different number of rx and tx queues;
2. add better support for zero-copy operation, adding an extra field
to the netmap ring to indicate how many buffers we have already processed
but not yet released (with help from Eddie Kohler);
3. The two changes above unfortunately require an API change, so while
at it add a version field and some spares to the ioctl() argument
to help detect mismatches.
4. update the manual page for the two changes above;
5. update sample applications in tools/tools/netmap
KERNEL:
1. simplify the internal structures moving the global wait queues
to the 'struct netmap_adapter';
2. simplify the functions that map kring<->nic ring indexes
3. normalize device-specific code, which helps maintenance;
4. start exploring the impact of micro-optimizations (prefetch etc.)
in the ixgbe driver.
Using 'legacy' descriptors on the tx ring and prefetching slots gives
about 20% speedup at 900 MHz. Another 7-10% would come from removing
the explicit calls to bus_dmamap* in the core (they are effectively
NOPs in this case, but it takes an expensive load of the per-buffer
dma maps to figure out that they are all NULL).
Rx performance not investigated.
I am postponing the MFC so i can import a few more improvements
before merging.
- remove the KEVENT code, which was incomplete and not compiled anyways;
- change some while() loops into for()
- adjust indentation
- remove extra whitespace
MFC after: 1 week
Introduce some functions to map NIC ring indexes into netmap ring
indexes and vice versa. This way we can implement the bound
checks only in one place (and hopefully in a correct way).
On passing, make the code and comments more uniform across the
various drivers.
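
The idea of keeping the index translation and its bound checks in one place can be sketched as below. This is an illustration under assumptions, not the committed functions: the structure, the single signed offset and the ex_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-ring state: ring size and NIC-to-netmap index offset. */
struct ex_kring {
	uint32_t num_slots;
	int32_t	 hwofs;		/* netmap index = NIC index + hwofs (mod size) */
};

/* The single place where wraparound and bounds are handled. */
static uint32_t
ex_idx_wrap(const struct ex_kring *k, int64_t i)
{
	int64_t n = k->num_slots;

	assert(n > 0);
	i %= n;
	if (i < 0)
		i += n;
	return ((uint32_t)i);
}

/* NIC ring index -> netmap ring index. */
static uint32_t
ex_idx_n2k(const struct ex_kring *k, uint32_t nic_i)
{
	return (ex_idx_wrap(k, (int64_t)nic_i + k->hwofs));
}

/* netmap ring index -> NIC ring index. */
static uint32_t
ex_idx_k2n(const struct ex_kring *k, uint32_t nm_i)
{
	return (ex_idx_wrap(k, (int64_t)nm_i - k->hwofs));
}
```

Drivers that call only these two helpers never need to repeat the modular arithmetic, which is where off-by-one mistakes tend to creep in.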
txsync() and rxsync() callbacks, removing some variables made
useless by this change;
- add generic lock and irq handling routines. These can be useful
in case there are no driver locks that we can reuse;
- add a few macros to reduce differences with the Linux version.
TUNABLE variable (hw.netmap.buf_size) so we can experiment
with values different from 2048 which may give better cache performance.
- rearrange the memory allocation code so it will be easier
to replace it with a different implementation. The current code
relies on a single large contiguous chunk of memory obtained through
contigmalloc.
The new implementation (not committed yet) uses multiple
smaller chunks which are easier to fit in a fragmented address
space.
- remove experimental code for disabling CRC
- use the correct constant for conversion between interrupt rate
and EITR values (the previous values were off by a factor of 2)
- make dev.ix.N.queueM.interrupt_rate a RW sysctl variable.
Changing individual values affects the queue immediately,
and propagates to all interfaces at the next reinit.
- add dev.ix.N.queueM.irqs rdonly sysctl, to export the actual
interrupt counts
Netmap-related changes for ixgbe:
- use the "new" format for TX descriptors in netmap mode.
- pass interrupt mitigation delays to the user process doing poll()
on a netmap file descriptor.
On the RX side this means we will not check the ring more than once
per interrupt. This gives the process a chance to sleep and process
packets in larger batches, thus reducing CPU usage.
On the TX side we take this even further: completed transmissions are
reclaimed every half ring even if the NIC interrupts more often.
This saves even more CPU without any additional tx delays.
Generic Netmap-related changes:
- align the netmap_kring to cache lines so that there is no false sharing
(possibly useful for multiqueue NICs and MSIX interrupts, which are
handled by different cores). It's a minor improvement but it does not
cost anything.
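
The cache-line alignment mentioned above can be expressed in C11 as in the sketch below. The 64-byte line size and the structure contents are assumptions for illustration; the kernel sources use their own alignment macros.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define EX_CACHE_LINE	64	/* assumed cache line size in bytes */

/*
 * Hypothetical per-ring state. Aligning the first member raises the
 * alignment of the whole structure, and sizeof() is padded up to a
 * multiple of it, so consecutive array elements never share a line.
 */
struct ex_kring_aligned {
	alignas(EX_CACHE_LINE) uint32_t hwcur;
	uint32_t hwavail;
};

/* One entry per queue; each queue may be serviced by a different core. */
struct ex_kring_aligned ex_rings[2];
```

The cost is a few bytes of padding per ring, in exchange for cores that update different rings not invalidating each other's cache lines.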
Reviewed by: Jack Vogel
Approved by: Jack Vogel
1. as reported by Alexander Fiveg, the allocator was reporting
half of the allocated memory. Fix this by exiting from the
loop earlier (not too critical because this code is going
away soon).
2. following a discussion on freebsd-current
http://lists.freebsd.org/pipermail/freebsd-current/2012-January/031144.html
turns out that (re)loading the dmamap was expensive and not optimized.
This operation is in the critical path when doing zero-copy forwarding
between interfaces.
At least on netmap and i386/amd64, the bus_dmamap_load can be
completely bypassed if the map is NULL, so we do it.
The latter change gives an almost 3x improvement in forwarding
performance, from the previous 9.5Mpps at 2.9GHz to the current
line rate (14.2Mpps) at 1.733GHz. (this is for 64+4 byte packets,
in other configurations the PCIe bus is a bottleneck).
the memory allocator used by netmap. No functional change,
two small bug fixes:
- in if_re.c add a missing bus_dmamap_sync()
- in netmap.c comment out a spurious free() in an error handling block