mean that you need a world built to reliably build pkg-gen but this keeps
the build from failing when your source doesn't match your host running
version, e.g. building 12 on 11.
Submitted by: Matt Macy <mmacy@nextbsd.org>
MFC after: 2 weeks
Sponsored by: Limelight Networks
This commit, long overdue, contains contributions in the last 2 years
from Stefano Garzarella, Giuseppe Lettieri, Vincenzo Maffione, including:
+ fixes on monitor ports
+ the 'ptnet' virtual device driver, and ptnetmap backend, for
high speed virtual passthrough on VMs (bhyve fixes in an upcoming commit)
+ improved emulated netmap mode
+ more robust error handling
+ removal of stale code
+ various fixes to code and documentation (some mixup between RX and TX
parameters, and private and public variables)
We also include an additional tool, nmreplay, which is functionally
equivalent to tcpreplay but operating on netmap ports.
IPv4 addresses/ports.
When doing traffic testing of actual code that /does/ things to the
packet (rather than say, 'bridge.c'), it's typically a good idea to
use a variety of cache-busting and flow-tracking-busting packet
spreads. The pkt-gen method of testing an IP range was to walk
it linearly - which is fine, but not useful enough.
This can be used to completely randomize the source/destination
addresses (eg to test out flow-tracking-busting) and to keep the
destination fixed whilst randomising the source (eg to test out
what a DDoS may look like.)
Tested:
* Intel ixgbe 10G (82599) netmap
Differential Revision: https://reviews.freebsd.org/D2309
MFC after: 2 weeks
Sponsored by: Norse Corp, Inc.
Mostly bugfixes or features developed in the past 6 months,
so this is a 10.1 candidate.
Basically no user API changes (some bugfixes in sys/net/netmap_user.h).
In detail:
1. netmap support for virtio-net, including in netmap mode.
Under bhyve and with a netmap backend [2] we reach over 1Mpps
with standard APIs (e.g. libpcap), and 5-8 Mpps in netmap mode.
2. (kernel) add support for multiple memory allocators, so we can
better partition physical and virtual interfaces giving access
to separate users. The most visible effect is one additional
argument to the various kernel functions to compute buffer
addresses. All netmap-supported drivers are affected, but changes
are mechanical and trivial
3. (kernel) simplify the prototype for *txsync() and *rxsync()
driver methods. All netmap drivers affected, changes mostly mechanical.
4. add support for netmap-monitor ports. Think of it as a mirroring
port on a physical switch: a netmap monitor port replicates traffic
present on the main port. Restrictions apply. Drive carefully.
5. if_lem.c: support for various paravirtualization features,
experimental and disabled by default.
Most of these are described in our ANCS'13 paper [1].
Paravirtualized support in netmap mode is new, and beats the
numbers in the paper by a large factor (under qemu-kvm,
we measured gues-host throughput up to 10-12 Mpps).
A lot of refactoring and additional documentation in the files
in sys/dev/netmap, but apart from #2 and #3 above, almost nothing
of this stuff is visible to other kernel parts.
Example programs in tools/tools/netmap have been updated with bugfixes
and to support more of the existing features.
This is meant to go into 10.1 so we plan an MFC before the Aug.22 deadline.
A lot of this code has been contributed by my colleagues at UNIPI,
including Giuseppe Lettieri, Vincenzo Maffione, Stefano Garzarella.
MFC after: 3 days.
and finish the job. ncurses is now the only Makefile in the tree that
uses it since it wasn't a simple mechanical change, and will be
addressed in a future commit.
- netmap pipes, providing bidirectional blocking I/O while moving
100+ Mpps between processes using shared memory channels
(no mistake: over one hundred million. But mind you, i said
*moving* not *processing*);
- kqueue support (BHyVe needs it);
- improved user library. Just the interface name lets you select a NIC,
host port, VALE switch port, netmap pipe, and individual queues.
The upcoming netmap-enabled libpcap will use this feature.
- optional extra buffers associated to netmap ports, for applications
that need to buffer data yet don't want to make copies.
- segmentation offloading for the VALE switch, useful between VMs.
and a number of bug fixes and performance improvements.
My colleagues Giuseppe Lettieri and Vincenzo Maffione did a substantial
amount of work on these features so we owe them a big thanks.
There are some external repositories that can be of interest:
https://code.google.com/p/netmap
our public repository for netmap/VALE code, including
linux versions and other stuff that does not belong here,
such as python bindings.
https://code.google.com/p/netmap-libpcap
a clone of the libpcap repository with netmap support.
With this any libpcap client has access to most netmap
feature with no recompilation. E.g. tcpdump can filter
packets at 10-15 Mpps.
https://code.google.com/p/netmap-ipfw
a userspace version of ipfw+dummynet which uses netmap
to send/receive packets. Speed is up in the 7-10 Mpps
range per core for simple rulesets.
Both netmap-libpcap and netmap-ipfw will be merged upstream at some
point, but while this happens it is useful to have access to them.
And yes, this code will be merged soon. It is infinitely better
than the version currently in 10 and 9.
MFC after: 3 days
add separate rx/tx ring indexes
add ring specifier in nm_open device name
netmap.c, netmap_vale.c
more consistent errno numbers
netmap_generic.c
correctly handle failure in registering interfaces.
tools/tools/netmap/
massive cleanup of the example programs
(a lot of common code is now in netmap_user.h.)
nm_util.[ch] are going away soon.
pcap.c will also go when i commit the native netmap support for libpcap.
Most relevant features:
- netmap emulation on any NIC, even those without native netmap support.
On the ixgbe we have measured about 4Mpps/core/queue in this mode,
which is still a lot more than with sockets/bpf.
- seamless interconnection of VALE switch, NICs and host stack.
If you disable accelerations on your NIC (say em0)
ifconfig em0 -txcsum -txcsum
you can use the VALE switch to connect the NIC and the host stack:
vale-ctl -h valeXX:em0
allowing sharing the NIC with other netmap clients.
- THE USER API HAS SLIGHTLY CHANGED (head/cur/tail pointers
instead of pointers/count as before). This was unavoidable to support,
in the future, multiple threads operating on the same rings.
Netmap clients require very small source code changes to compile again.
On the plus side, the new API should be easier to understand
and the internals are a lot simpler.
The manual page has been updated extensively to reflect the current
features and give some examples.
This is the result of work of several people including Giuseppe Lettieri,
Vincenzo Maffione, Michio Honda and myself, and has been financially
supported by EU projects CHANGE and OPENLAB, from NetApp University
Research Fund, NEC, and of course the Universita` di Pisa.
This includes the following:
- use separate memory regions for VALE ports
- locking fixes
- some simplifications in the NIC-specific routines
- performance improvements for the VALE switch
- some new features in the pkt-gen test program
- documentation updates
There are small API changes that require programs to be recompiled
(NETMAP_API has been bumped so you will detect old binaries at runtime).
In particular:
- struct netmap_slot now is 16 bytes to support an extra pointer,
which may save one data copy when using VALE ports or VMs;
- the struct netmap_if has two extra fields;
MFC after: 3 days
+ pkt-gen -f rx now remains active even when traffic stops
Previous behaviour (exit after 1 second of silence) can be
restored with the -W option
+ the -X option does a hexdump of the content of a packet (both tx and rx).
This can be useful to check what goes in and out.
+ the -I option instructs the sender to use indirect buffers
(not really useful other than to test the kernel module in the
VALE switch)
- the VALE switch now support up to 254 destinations per switch,
unicast or broadcast (multicast goes to all ports).
- we can attach hw interfaces and the host stack to a VALE switch,
which means we will be able to use it more or less as a native bridge
(minor tweaks still necessary).
A 'vale-ctl' program is supplied in tools/tools/netmap
to attach/detach ports the switch, and list current configuration.
- the lookup function in the VALE switch can be reassigned to
something else, similar to the pf hooks. This will enable
attaching the firewall, or other processing functions (e.g. in-kernel
openvswitch) directly on the netmap port.
The internal API used by device drivers does not change.
Userspace applications should be recompiled because we
bump NETMAP_API as we now use some fields in the struct nmreq
that were previously ignored -- otherwise, data structures
are the same.
Manpages will be committed separately.
Set a flag and allow worker threads to finish upon ^C, instead of
immediately cancelling them, so that final packet count and rate
stats can be displayed.
- in pcap_dispatch(), issue a prefetch on the buffer before the
callback, this may save a little bit of time if the client
is very fast.
- in pcap_inject(), use a fast copy routine, which also helps
saving a few nanoseconds with fast clients.
USERSPACE:
1. add support for devices with different number of rx and tx queues;
2. add better support for zero-copy operation, adding an extra field
to the netmap ring to indicate how many buffers we have already processed
but not yet released (with help from Eddie Kohler);
3. The two changes above unfortunately require an API change, so while
at it add a version field and some spares to the ioctl() argument
to help detect mismatches.
4. update the manual page for the two changes above;
5. update sample applications in tools/tools/netmap
KERNEL:
1. simplify the internal structures moving the global wait queues
to the 'struct netmap_adapter';
2. simplify the functions that map kring<->nic ring indexes
3. normalize device-specific code, helps mainteinance;
4. start exploring the impact of micro-optimizations (prefetch etc.)
in the ixgbe driver.
Use 'legacy' descriptors on the tx ring and prefetch slots gives
about 20% speedup at 900 MHz. Another 7-10% would come from removing
the explict calls to bus_dmamap* in the core (they are effectively
NOPs in this case, but it takes expensive load of the per-buffer
dma maps to figure out that they are all NULL.
Rx performance not investigated.
I am postponing the MFC so i can import a few more improvements
before merging.
TUNABLE variable (hw.netmap.buf_size) so we can experiment
with values different from 2048 which may give better cache performance.
- rearrange the memory allocation code so it will be easier
to replace it with a different implementation. The current code
relies on a single large contiguous chunk of memory obtained through
contigmalloc.
The new implementation (not committed yet) uses multiple
smaller chunks which are easier to fit in a fragmented address
space.
A link reset now is completely transparent for the netmap client:
even if the NIC resets its own ring (e.g. restarting from 0),
the client will not see any change in the current rx/tx positions,
because the driver will keep track of the offset between the two.
2. make the device-specific code more uniform across different drivers
There were some inconsistencies in the implementation of the netmap
support routines, now drivers have been aligned to a common
code structure.
3. import netmap support for ixgbe . This is implemented as a very
small patch for ixgbe.c (233 lines, 11 chunks, mostly comments:
in total the patch has only 54 lines of new code) , as most of
the code is in an external file sys/dev/netmap/ixgbe_netmap.h ,
following some initial comments from Jack Vogel about making
changes less intrusive.
(Note, i have emailed Jack multiple times asking if he had
comments on this structure of the code; i got no reply so
i assume he is fine with it).
Support for other drivers (em, lem, re, igb) will come later.
"ixgbe" is now the reference driver for netmap support. Both the
external file (sys/dev/netmap/ixgbe_netmap.h) and the device-specific
patches (in sys/dev/ixgbe/ixgbe.c) are heavily commented and should
serve as a reference for other device drivers.
Tested on i386 and amd64 with the pkt-gen program in tools/tools/netmap,
the sender does 14.88 Mpps at 1050 Mhz and 14.2 Mpps at 900 MHz
on an i7-860 with 4 cores and 82599 card. Haven't tried yet more
aggressive optimizations such as adding 'prefetch' instructions
in the time-critical parts of the code.
I/O from userspace, capable of line rate at 10G, see
http://info.iet.unipi.it/~luigi/netmap/
At this time I am bringing in only the generic code (sys/dev/netmap/
plus two headers under sys/net/), and some sample applications in
tools/tools/netmap. There is also a manpage in share/man/man4 [1]
In order to make use of the framework you need to build a kernel
with "device netmap", and patch individual drivers with the code
that you can find in
sys/dev/netmap/head.diff
The file will go away as the relevant pieces are committed to
the various device drivers, which should happen in a few days
after talking to the driver maintainers.
Netmap support is available at the moment for Intel 10G and 1G
cards (ixgbe, em/lem/igb), and for the Realtek 1G card ("re").
I have partial patches for "bge" and am starting to work on "cxgbe".
Hopefully changes are trivial enough so interested third parties
can submit their patches. Interested people can contact me
for advice on how to add netmap support to specific devices.
CREDITS:
Netmap has been developed by Luigi Rizzo and other collaborators
at the Universita` di Pisa, and supported by EU project CHANGE
(http://www.change-project.eu/)
The code is distributed under a BSD Copyright.
[1] In my opinion is a bad idea to have all manpage in one directory.
We should place kernel documentation in the same dir that contains
the code, which would make it much simpler to keep doc and code
in sync, reduce the clutter in share/man/ and incidentally is
the policy used for all of userspace code.
Makefiles and doc tools can be trivially adjusted to find the
manpages in the relevant subdirs.