Commit graph

2131 commits

Author SHA1 Message Date
Gordon Bergling
a24d709243 netinet6: Fix a typo in a sysctl description
- remove a double 'a'

(cherry picked from commit 3cf59750eb)
2021-12-03 16:53:41 +01:00
Mark Johnston
f763c81c95 netinet: Remove unneeded mb_unmapped_to_ext() calls
in_cksum_skip() now handles unmapped mbufs on platforms where they're
permitted.

Reviewed by:	glebius, jhb
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 44775b163b)
2021-12-01 07:43:18 -05:00
Roy Marples
ec5691aa2f net: Allow binding of unspecified address without address existance
Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty
V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind
did the same for empty V_in6_ifaddrhead (no IPv6 addresses).

An equivalent test has existed since 4.4-Lite.  It was presumably done
to avoid extra work (assuming the address isn't going to be found
later).

In normal system operation *_ifaddrhead will not be empty: they will
at least have the loopback address(es).  In practice no work will be
avoided.

Further, this case caused net/dhcpd to fail when run early in boot
before assignment of any addresses.  It should be possible to bind the
unspecified address even if no addresses have been configured yet, so
just remove the tests.

The now-removed "XXX broken" comments were added in 59562606b9,
which converted the ifaddr lists to TAILQs.  As far as I (emaste) can
tell the brokenness is the issue described above, not some aspect of
the TAILQ conversion.

PR:		253166
Reviewed by:	ae, bz, donner, emaste, glebius
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32563

(cherry picked from commit 5c5340108e)
2021-11-18 19:28:56 -05:00
Mark Johnston
bcc7e68fcf nd6: Make the DAD callout MPSAFE
Interface addresses with pending duplicate address detection (DAD) live
in a global queue.  In this case, a callout is associated with each
entry.  The callout transmits neighbour solicitations until the system
decides the address is no longer tentative, or until a duplicate address
is discovered.  At this point the entry is dequeued and freed.  DAD may
be manually stopped as well.

The callout currently runs (and potentially transmits packets) with
Giant held.  Reorganize DAD queue locking to interlock properly with the
callout:

- Configure the callout to acquire the DAD queue lock before running.
  The lock is dropped before transmitting any packets.  Stop protecting
  the callout with Giant.
- When looking up DAD queue entries for an incoming NS or NA, don't
  bother fiddling with the DAD queue entry reference count.
- Split nd6_dad_starttimer() so that the caller is responsible to
  transmitting a NS if it so desires.
- Remove the DAD entry from the queue before stopping the timer.  Use a
  temporary reference to make sure that the entry doesn't get freed by
  the callout while we're draining.

Reported by:	mav
Reviewed by:	bz, hrs
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 9a94097cd0)
2021-09-21 09:37:52 -04:00
Mark Johnston
3c9c3ab05c ip6mrouter: Make the expiration callout MPSAFE
- Protect the `expire_upcalls` callout with the MFC6 mutex.  The callout
  handler needs this mutex anyway.
- Convert the MROUTER6 mutex to a sleepable sx lock.  It is only used
  when configuring the global v6 multicast routing socket, so is only
  used in system call paths where sleeping is safe.  This lets us drain
  the callout without having to drop the lock.
- For all locking macros in the file, convert to using a _LOCKPTR macro.

Reported by:	mav
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 353783964c)
2021-09-21 09:37:32 -04:00
Alexander V. Chernikov
4e97cbba1c lltable: fix crash introduced in c541bd368f.
Reported by:	cy

(cherry picked from commit f8c1b1a929)
2021-09-07 21:02:58 +00:00
Alexander V. Chernikov
48f38f47b1 lltable: Add support for "child" LLEs holding encap for IPv4oIPv6 entries.
Currently we use pre-calculated headers inside LLE entries as prepend data
 for `if_output` functions. Using these headers allows saving some
 CPU cycles/memory accesses on the fast path.

However, this approach makes adding L2 header for IPv4 traffic with IPv6
 nexthops more complex, as it is not possible to store multiple
 pre-calculated headers inside lle. Additionally, the solution space is
 limited by the fact that PCB caching saves LLEs in addition to the nexthop.

Thus, add support for creating special "child" LLEs for the purpose of holding
 custom family encaps and store mbufs pending resolution. To simplify handling
 of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE.
 Such LLEs are not visible when iterating LLE table. Their lifecycle is bound
 to the "parent" LLE - it is not possible to delete "child" when parent is alive.
 Furthermore, "child" LLEs are static (RTF_STATIC), avoding complex state
 machine used by the standard LLEs.

nd6_lookup() and nd6_resolve() now accepts an additional argument, family,
 allowing to return such child LLEs. This change uses `LLE_SF()` macro which
 packs family and flags in a single int field. This is done to simplify merging
 back to stable/. Once this code lands, most of the cases will be converted to
 use a dedicated `family` parameter.

Differential Revision: https://reviews.freebsd.org/D31379

(cherry picked from commit c541bd368f)
2021-09-07 21:02:58 +00:00
Alexander V. Chernikov
4151d8ccdc [lltable] Restructure nd6 code.
Factor out lltable locking logic from lltable_try_set_entry_addr()
 into a separate lltable_acquire_wlock(), so the latter can be used
 in other parts of the code w/o duplication.

Create nd6_try_set_entry_addr() to avoid code duplication in nd6.c
 and nd6_nbr.c.

Move lle creation logic from nd6_resolve_slow() into a separate
 nd6_get_llentry() to simplify the former.

These changes serve as a pre-requisite for implementing
 RFC8950 (IPv4 prefixes with IPv6 nexthops).

Differential Revision: https://reviews.freebsd.org/D31432

(cherry picked from commit 0b79b007eb)
2021-09-07 21:02:58 +00:00
Alexander V. Chernikov
0ea561762b Use lltable calculated header when sending lle holdchain after successful lle resolution.
Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D31391

(cherry picked from commit 8482aa7748)
2021-09-07 21:02:58 +00:00
Alexander V. Chernikov
2802014380 [lltable] Unify datapath feedback mechamism.
Use newly-create llentry_request_feedback(),
 llentry_mark_used() and llentry_get_hittime() to
 request datapatch usage check and fetch the results
 in the same fashion both in IPv4 and IPv6.

While here, simplify llentry_provide_feedback() wrapper
 by eliminating 1 condition check.

Differential Revision: https://reviews.freebsd.org/D31390

(cherry picked from commit f3a3b06121)
2021-09-07 21:02:58 +00:00
Gordon Bergling
82c70f2f16 inet6(4): Fix a few common typos in source code comments
- s/reshedule/reschedule/

(cherry picked from commit 10e0082fff)
2021-08-31 08:11:24 +02:00
Mateusz Guzik
86a96281df frag6: do less work in frag6_slowtimo if possible
frag6_slowtimo avoidably uses CPU on otherwise idle boxes

Reviewed by:	kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31528

(cherry picked from commit 8afe9481cf)
2021-08-18 09:44:45 +00:00
Mateusz Guzik
d368dc0b42 frag6: drop the volatile keyword from frag6_nfrags and mark with __exclusive_cache_line
The keyword adds nothing as all operations on the var are performed
through atomic_*

Reviewed by:	kp
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D31528

(cherry picked from commit c17ae18080)
2021-08-18 09:44:44 +00:00
Mark Johnston
3c785b12e0 in6: Enter the net epoch in in6_tmpaddrtimer()
We need to do so to safely traverse the ifnet list.

Reviewed by:	bz
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 8ee0826f75)
2021-08-16 09:01:39 -04:00
Mark Johnston
75c0e1a8e0 nd6: Mark several callouts as MPSAFE
The use of Giant here is vestigal and does not provide any useful
synchronization.  Furthermore, non-MPSAFE callouts can cause the
softclock threads to block waiting for long-running newbus operations to
complete.

Reported by:	mav
Reviewed by:	bz
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 663428ea17)
2021-08-16 09:01:11 -04:00
Andrey V. Elsukov
40ec2323e6 Fix panic in IPv6 multicast code.
Add check that ifp supports IPv6 multicasts in in6_getmulti.
This fixes panic when user application tries to join into multicast
group on an interface that doesn't support IPv6 multicasts, like
IFT_PFLOG interfaces.

PR:             257302
Reviewed by:	melifaro
Differential Revision: https://reviews.freebsd.org/D31420

(cherry picked from commit d477a7feed)
2021-08-13 10:31:11 +03:00
Roy Marples
f452713408 socket: Implement SO_RERROR
SO_RERROR indicates that receive buffer overflows should be handled as
errors. Historically receive buffer overflows have been ignored and
programs could not tell if they missed messages or messages had been
truncated because of overflows. Since programs historically do not
expect to get receive overflow errors, this behavior is not the
default.

This is really really important for programs that use route(4) to keep
in sync with the system. If we loose a message then we need to reload
the full system state, otherwise the behaviour from that point is
undefined and can lead to chasing bogus bug reports.

Reviewed by:	philip (network), kbowling (transport), gbe (manpages)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D26652

(cherry picked from commit 7045b1603b)
2021-08-10 18:54:00 -07:00
Michael Tuexen
a3f96f1e57 sctp: Fix errno in case of association setup failures
Do not report always ETIMEDOUT, but only when appropriate. In
other cases report ECONNABORTED.

(cherry picked from commit 105b68b42d)
2021-07-13 20:30:57 +02:00
Michael Tuexen
6a4f29a3c4 sctp: initialize sequence numbers for ECN correctly
Reported by:	Junseok Yang (for the userland stack)

(cherry picked from commit c7f048ab35)
2021-07-13 20:28:48 +02:00
Michael Tuexen
fa50e98328 mend 2021-06-07 11:01:28 +02:00
Mark Johnston
544b5d2482 Fix mbuf leaks in various pru_send implementations
The various protocol implementations are not very consistent about
freeing mbufs in error paths.  In general, all protocols must free both
"m" and "control" upon an error, except if PRUS_NOTREADY is specified
(this is only implemented by TCP and unix(4) and requires further work
not handled in this diff), in which case "control" still must be freed.

This diff plugs various leaks in the pru_send implementations.

Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit d8acd2681b)
2021-05-25 21:49:53 -04:00
Mark Johnston
010f197770 nd6: Avoid using an uninitialized sockaddr in nd6_prefix_offlink()
Commit 81728a538 ("Split rtinit() into multiple functions.") removed
the initialization of sa6, but not one of its uses.  This meant that we
were passing an uninitialized sockaddr as the address to
lltable_prefix_free().  Remove the variable outright to fix the problem.
The caller is expected to hold a reference on pr.

Fixes:		81728a538 ("Split rtinit() into multiple functions.")
Reported by:	KMSAN
Reviewed by:	donner
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D30166

(cherry picked from commit c1dd4d642f)
2021-05-19 09:32:28 -04:00
Mark Johnston
1e066db6cd Add missing sockaddr length and family validation to various protocols
Several protocol methods take a sockaddr as input.  In some cases the
sockaddr lengths were not being validated, or were validated after some
out-of-bounds accesses could occur.  Add requisite checking to various
protocol entry points, and convert some existing checks to assertions
where appropriate.

Reported by:	syzkaller+KASAN
Reviewed by:	tuexen, melifaro
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D29519

(cherry picked from commit f161d294b9)
2021-05-17 13:43:07 -04:00
Kristof Provost
3eebc6234b in6_mcast: Return EADDRINUSE when we've already joined the group
Distinguish between truly invalid requests and those that fail because
we've already joined the group. Both cases fail, but differentiating
them allows userspace to make more informed decisions about what the
error means.

For example. radvd tries to join the all-routers group on every SIGHUP.
This fails, because it's already joined it, but this failure should be
ignored (rather than treated as a sign that the interface's multicast is
broken).

This puts us in line with OpenBSD, NetBSD and Linux.

Reviewed by:	donner
MFC after:	1 week
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D30111

(cherry picked from commit 2ef5d803e3)
2021-05-17 09:50:26 +02:00
Hans Petter Selasky
b7622437f5 net: Introduce IPV6_DSCP(), IPV6_ECN() and IPV6_TRAFFIC_CLASS() macros
Introduce convenience macros to retrieve the DSCP, ECN or traffic class
bits from an IPv6 header.

Use them where appropriate.

Reviewed by:	ae (previous version), rscheff, tuexen, rgrimes
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Differential Revision:	https://reviews.freebsd.org/D29056

(cherry picked from commit bb4a7d94b9)
2021-05-10 16:30:44 +02:00
Alexander V. Chernikov
8aafa7a027 Flush remaining routes from the routing table during VNET shutdown.
Summary:
This fixes rtentry leak for the cloned interfaces created inside the
 VNET.

Loopback teardown order is `SI_SUB_INIT_IF`, which happens after `SI_SUB_PROTO_DOMAIN` (route table teardown).
Thus, any route table operations are too late to schedule.
As the intent of the vnet teardown procedures to minimise the amount of effort by doing global cleanups instead of per-interface ones, address this by adding a relatively light-weight routing table cleanup function, `rib_flush_routes()`.
It removes all remaining routes from the routing table and schedules the deletion, which will happen later, when `rtables_destroy()` waits for the current epoch to finish.

Test Plan:
```
set_skip:set_skip_group_lo  ->  passed  [0.053s]
tail -n 200 /var/log/messages | grep rtentry
```

PR:	253998
Reported by:	rashey at superbox.pl
Reviewed By: kp
Differential Revision: https://reviews.freebsd.org/D29116

(cherry picked from commit b1d63265ac)
2021-03-13 20:19:17 +00:00
Alexander V. Chernikov
83cedad139 Fix 'in6_purgeaddr: err=65, destination address delete failed' message.
P2P ifa may require 2 routes: one is the loopback route, another is
 the "prefix" route towards its destination.

Current code marks loopback routes existence with IFA_RTSELF and
 "prefix" p2p routes with IFA_ROUTE.

For historic reasons, we fill in ifa_dstaddr for loopback interfaces.
To avoid installing the same route twice, we preemptively set
 IFA_RTSELF when adding "prefix" route for loopback.
However, the teardown part doesn't have this hack, so we try to
 remove the same route twice.

Fix this by checking if ifa_dstaddr is different from the ifa_addr
 and moving this logic into a separate function.

Reviewed By: kp
Differential Revision: https://reviews.freebsd.org/D29121

(cherry picked from commit 7634919e15)
2021-03-11 08:35:22 +00:00
Alexander V. Chernikov
9cd7f222d5 Enforce net epoch in in6_selectsrc().
in6_selectsrc() may call fib6_lookup() in some cases, which requires
 epoch. Wrap in6_selectsrc* calls into epoch inside its users.
Mark it as requiring epoch by adding NET_EPOCH_ASSERT().

Differential Revision:	https://reviews.freebsd.org/D28647

(cherry picked from commit 605284b894)
2021-03-10 21:57:59 +00:00
Alexander V. Chernikov
8a25d3f6ce Fix dpdk/ldradix fib lookup algorithm preference calculation.
The current preference number were copied from IPv4 code,
 assuming 500k routes to be the full-view. Adjust with the current
 reality (100k full-view).

Reported by:	Marek Zarychta <zarychtam at plan-b.pwste.edu.pl>

(cherry picked from commit d5be41beb7)
2021-03-10 21:50:19 +00:00
Alexander V. Chernikov
3f241e7aac Remove per-packet ifa refcounting from IPv6 fast path.
Currently ip6_input() calls in6ifa_ifwithaddr() for
 every local packet, in order to check if the target ip
 belongs to the local ifa in proper state and increase
 its counters.

in6ifa_ifwithaddr() references found ifa.
With epoch changes, both `ip6_input()` and all other current callers
 of `in6ifa_ifwithaddr()` do not need this reference
 anymore, as epoch provides stability guarantee.

Given that, update `in6ifa_ifwithaddr()` to allow
 it to return ifa without referencing it, while preserving
 option for getting referenced ifa if so desired.

Differential Revision:	https://reviews.freebsd.org/D28648

(cherry picked from commit 8268d82cff)
2021-03-10 21:45:55 +00:00
Alexander V. Chernikov
3cff9b2ccf Do not reference returned ifa in in6_ifawithifp().
The only place where in6_ifawithifp() is used is ip6_output(),
 which uses the returned ifa to bump traffic counters.
Given ifa stability guarantees is provided by epoch, do not refcount ifa.

This eliminates 2 atomic ops from IPv6 fast path.

Reviewed By:	rstone
Differential Revision:	https://reviews.freebsd.org/D28649

(cherry picked from commit 1bd44b11e5)
2021-03-10 21:44:31 +00:00
Kristof Provost
05ac80ac0f arp/nd: Cope with late calls to iflladdr_event
When tearing down vnet jails we can move an if_bridge out (as
part of the normal vnet_if_return()). This can, when it's clearing out
its list of member interfaces, change its link layer address.
That sends an iflladdr_event, but at that point we've already freed the
AF_INET/AF_INET6 if_afdata pointers.

In other words: when the iflladdr_event callbacks fire we can't assume
that ifp->if_afdata[AF_INET] will be set.

Reviewed by:	donner@, melifaro@
MFC after:	1 week
Sponsored by:	Orange Business Services
Differential Revision:	https://reviews.freebsd.org/D28860

(cherry picked from commit c139b3c19b)
2021-03-02 15:50:21 +01:00
Alexander V. Chernikov
7dfdd039a3 Fix crash with rtadv-originated multipath IPv6 routes.
PR:		253800
Reported by:	Frederic Denis <freebsdml at hecian.net>
MFC after:	immediately

(cherry picked from commit cc3fa1e29f)
2021-02-25 21:43:37 +00:00
Alexander V. Chernikov
ea10694336 Fix nd6 rib_action() handling.
rib_action() guarantees valid rc filling IFF it returns without error.
Check rib_action() return code instead of checking rc fields.

PR:		253800
Reported by:	Frederic Denis <freebsdml@hecian.net>

(cherry picked from commit 9c4a8d24f0)
2021-02-23 22:42:28 +00:00
Andrey V. Elsukov
8cb4c36316 [udp6] fix possible panic due to lack of locking.
The lookup for a IPv6 multicast addresses corresponding to
the destination address in the datagram is protected by the
NET_EPOCH section. Access to each PCB is protected by INP_RLOCK
during comparing. But access to socket's so_options field is
not protected. And in some cases it is possible, that PCB
pointer is still valid, but inp_socket is not. The patch wides
lock holding to protect access to inp_socket. It copies locking
strategy from IPv4 UDP handling.

PR:	232192
Obtained from:	Yandex LLC
Sponsored by:	Yandex LLC
Differential Revision:	https://reviews.freebsd.org/D28232

(cherry picked from commit 3c782d9c91)
2021-02-15 13:48:56 +03:00
Randall Stewart
538a81520b When we are about to send down to the driver layer
we need to make sure that the m_nextpkt field is NULL
else the lower layers may do unwanted things.

Reviewed By:  gallatin, melifaro
Differential Revision: https://reviews.freebsd.org/D28377

(cherry picked from commit 24a8f6d369)
2021-02-11 21:15:22 +00:00
Gleb Smirnoff
97ded49cae Catch up with 6edfd179c8: mechanically rename IFCAP_NOMAP to IFCAP_MEXTPG.
Originally IFCAP_NOMAP meant that the mbuf has external storage pointer
that points to unmapped address.  Then, this was extended to array of
such pointers.  Then, such mbufs were augmented with header/trailer.
Basically, extended mbufs are extended, and set of features is subject
to change.  The new name should be generic enough to avoid further
renaming.

(cherry-picked from commit 3f43ada98c)
2021-02-08 14:33:35 -08:00
Alexander V. Chernikov
10977b9f84 MFC 91f2c69ec2: Fix unused-function waring when compiling with FIB_ALGO. 2021-02-04 22:33:53 +00:00
Alexander V. Chernikov
f9e0752e35 Create new in6_purgeifaddr() which purges bound ifa prefix if
it gets unused.

Currently if_purgeifaddrs() uses in6_purgeaddr() to remove IPv6
 ifaddrs. in6_purgeaddr() does not trrigger prefix removal if
 number of linked ifas goes to 0, as this is a low-level function.
 As a result, if_purgeifaddrs() purges all IPv4/IPv6 addresses but
 keeps corresponding IPv6 prefixes.

Fix this by creating higher-level wrapper which handles unused
 prefix usecase and use it in if_purgeifaddrs().

Differential revision:	https://reviews.freebsd.org/D28128
2021-01-17 20:32:25 +00:00
Alexander V. Chernikov
81728a538d Split rtinit() into multiple functions.
rtinit[1]() is a function used to add or remove interface address prefix routes,
  similar to ifa_maintain_loopback_route().
It was intended to be family-agnostic. There is a problem with this approach
 in reality.

1) IPv6 code does not use it for the ifa routes. There is a separate layer,
  nd6_prelist_(), providing interface for maintaining interface routes. Its part,
  responsible for the actual route table interaction, mimics rtenty() code.

2) rtinit tries to combine multiple actions in the same function: constructing
  proper route attributes and handling iterations over multiple fibs, for the
  non-zero net.add_addr_allfibs use case. It notably increases the code complexity.

3) dstaddr handling. flags parameter re-uses RTF_ flags. As there is no special flag
 for p2p connections, host routes and p2p routes are handled in the same way.
 Additionally, mapping IFA flags to RTF flags makes the interface pretty messy.
 It make rtinit() to clash with ifa_mainain_loopback_route() for IPV4 interface
 aliases.

4) rtinit() is the last customer passing non-masked prefixes to rib_action(),
 complicating rib_action() implementation.

5) rtinit() coupled ifa announce/withdrawal notifications, producing "false positive"
 ifa messages in certain corner cases.

To address all these points, the following has been done:

* rtinit() has been split into multiple functions:
- Route attribute construction were moved to the per-address-family functions,
 dealing with (2), (3) and (4).
- funnction providing net.add_addr_allfibs handling and route rtsock notificaions
 is the new routing table inteface.
- rtsock ifa notificaion has been moved out as well. resulting set of funcion are only
 responsible for the actual route notifications.

Side effects:
* /32 alias does not result in interface routes (/32 route and "host" route)
* RTF_PINNED is now set for IPv6 prefixes corresponding to the interface addresses

Differential revision:	https://reviews.freebsd.org/D28186
2021-01-16 22:42:41 +00:00
Alexander V. Chernikov
e58c8da068 Map IPv6 link-local prefix to the link-local ifa.
Currently we create link-local route by creating an always-on IPv6 prefix
 in the prefix list. This prefix is not tied to the link-local ifa.

This leads to the following problems:

First, when flushing interface addresses we skip on-link route, leaving
 fe80::/64 prefix on the interface without any IPv6 addresses.
Second, when creating and removing link-local alias we lose fe80::/64 prefix
 from the routing table.

Fix this by attaching link-local prefix to the ifa at the initial creation.

Reviewed by:	hrs
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D28129
2021-01-13 10:03:15 +00:00
Alexander V. Chernikov
2defbe9f0e Use rn_match instead of doing indirect calls in fib_algo.
Relevant inet/inet6 code has the control over deciding what
 the RIB lookup function currently is. With that in mind,
 explicitly set it to the current value (rn_match) in the
 datapath lookups. This avoids cost on indirect call.

Differential Revision: https://reviews.freebsd.org/D28066
2021-01-11 23:30:35 +00:00
Alexander V. Chernikov
0da3f8c98d Bump amount of queued packets in for unresolved ARP/NDP entries to 16.
Currently default behaviour is to keep only 1 packet per unresolved entry.
Ability to queue more than one packet was added 10 years ago, in r215207,
 though the default value was kep intact.

Things have changed since that time. Systems tend to initiate multiple
 connections at once for a variety of reasons.
For example, recent kern/252278 bug report describe happy-eyeball DNS
 behaviour sending multiple requests to the DNS server.

The primary driver for upper value for the queue length determination is
 memory consumption. Remote actors should not be able to easily exhaust
 local memory by sending packets to unresolved arp/ND entries.

For now, bump value to 16 packets, to match Darwin implementation.

The proper approach would be to switch the limit to calculate memory
 consumption instead of packet count and limit based on memory.

We should MFC this with a variation of D22447.

Reviewers: #manpages, #network, bz, emaste

Reviewed By: emaste, gbe(doc), jilles(doc)
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D28068
2021-01-11 19:51:11 +00:00
Alexander V. Chernikov
d68cf57b7f Refactor rt_addrmsg() and rt_routemsg().
Summary:
* Refactor rt_addrmsg(): make V_rt_add_addr_allfibs decision locally.
* Fix rt_routemsg() and multipath by accepting nexthop instead of interface pointer.
* Refactor rtsock_routemsg(): avoid accessing rtentry fields directly.
* Simplify in_addprefix() by moving prefix search to a separate  function.

Reviewers: #network

Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D28011
2021-01-07 19:38:19 +00:00
Alexander V. Chernikov
f5baf8bb12 Add modular fib lookup framework.
This change introduces framework that allows to dynamically
 attach or detach longest prefix match (lpm) lookup algorithms
 to speed up datapath route tables lookups.

Framework takes care of handling initial synchronisation,
 route subscription, nhop/nhop groups reference and indexing,
 dataplane attachments and fib instance algorithm setup/teardown.
Framework features automatic algorithm selection, allowing for
 picking the best matching algorithm on-the-fly based on the
 amount of routes in the routing table.

Currently framework code is guarded under FIB_ALGO config option.
An idea is to enable it by default in the next couple of weeks.

The following algorithms are provided by default:
IPv4:
* bsearch4 (lockless binary search in a special IP array), tailored for
  small-fib (<16 routes)
* radix4_lockless (lockless immutable radix, re-created on every rtable change),
  tailored for small-fib (<1000 routes)
* radix4 (base system radix backend)
* dpdk_lpm4 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
  for large-fib (D27412)
IPv6:
* radix6_lockless (lockless immutable radix, re-created on every rtable change),
  tailed for small-fib (<1000 routes)
* radix6 (base system radix backend)
* dpdk_lpm6 (DPDK DIR24-8-based lookups), lockless datastrucure, optimized
  for large-fib (D27412)

Performance changes:
Micro benchmarks (I7-7660U, single-core lookups, 2048k dst, code in D27604):
IPv4:
8 routes:
  radix4: ~20mpps
  radix4_lockless: ~24.8mpps
  bsearch4: ~69mpps
  dpdk_lpm4: ~67 mpps
700k routes:
  radix4_lockless: 3.3mpps
  dpdk_lpm4: 46mpps

IPv6:
8 routes:
  radix6_lockless: ~20mpps
  dpdk_lpm6: ~70mpps
100k routes:
  radix6_lockless: 13.9mpps
  dpdk_lpm6: 57mpps

Forwarding benchmarks:
+ 10-15% IPv4 forwarding performance (small-fib, bsearch4)
+ 25% IPv4 forwarding performance (full-view, dpdk_lpm4)
+ 20% IPv6 forwarding performance (full-view, dpdk_lpm6)

Control:
Framwork adds the following runtime sysctls:

List algos
* net.route.algo.inet.algo_list: bsearch4, radix4_lockless, radix4
* net.route.algo.inet6.algo_list: radix6_lockless, radix6, dpdk_lpm6
Debug level (7=LOG_DEBUG, per-route)
net.route.algo.debug_level: 5
Algo selection (currently only for fib 0):
net.route.algo.inet.algo: bsearch4
net.route.algo.inet6.algo: radix6_lockless

Support for manually changing algos in non-default fib will be added
soon. Some sysctl names will be changed in the near future.

Differential Revision: https://reviews.freebsd.org/D27401
2020-12-25 11:33:17 +00:00
Andrew Gallatin
a034518ac8 Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
In order to efficiently serve web traffic on a NUMA
machine, one must avoid as many NUMA domain crossings as
possible. With SO_REUSEPORT_LB, a number of workers can share a
listen socket. However, even if a worker sets affinity to a core
or set of cores on a NUMA domain, it will receive connections
associated with all NUMA domains in the system. This will lead to
cross-domain traffic when the server writes to the socket or
calls sendfile(), and memory is allocated on the server's local
NUMA node, but transmitted on the NUMA node associated with the
TCP connection. Similarly, when the server reads from the socket,
he will likely be reading memory allocated on the NUMA domain
associated with the TCP connection.

This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A
server can now tell the kernel to filter traffic so that only
incoming connections associated with the desired NUMA domain are
given to the server. (Of course, in the case where there are no
servers sharing the listen socket on some domain, then as a
fallback, traffic will be hashed as normal to all servers sharing
the listen socket regardless of domain). This allows a server to
deal only with traffic that is local to its NUMA domain, and
avoids cross-domain traffic in most cases.

This patch, and a corresponding small patch to nginx to use
TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted
https media content from dual-socket Xeons with only 13% (as
measured by pcm.x) cross domain traffic on the memory controller.

Reviewed by:	jhb, bz (earlier version), bcr (man page)
Tested by: gonzo
Sponsored by:	Netfix
Differential Revision:	https://reviews.freebsd.org/D21636
2020-12-19 22:04:46 +00:00
Hans Petter Selasky
ac4dd4cd95 Expose nonstandard IPv6 kernel definitions to standalone builds.
No functional change.

Reviewed by:	bz@
MFC after:	1 week
Sponsored by:	Mellanox Technologies // NVIDIA Networking
2020-12-04 21:51:47 +00:00
Alexander V. Chernikov
d1d941c5b9 Remove RADIX_MPATH config option.
ROUTE_MPATH is the new config option controlling new multipath routing
 implementation. Remove the last pieces of RADIX_MPATH-related code and
 the config option.

Reviewed by:	glebius
Differential Revision:	https://reviews.freebsd.org/D27244
2020-11-29 19:43:33 +00:00
Alexander V. Chernikov
b712e3e343 Refactor fib4/fib6 functions.
No functional changes.

* Make lookup path of fib<4|6>_lookup_debugnet() separate functions
 (fib<46>_lookup_rt()). These will be used in the control plane code
 requiring unlocked radix operations and actual prefix pointer.
* Make lookup part of fib<4|6>_check_urpf() separate functions.
 This change simplifies the switch to alternative lookup implementations,
 which helps algorithmic lookups introduction.
* While here, use static initializers for IPv4/IPv6 keys

Differential Revision:	https://reviews.freebsd.org/D27405
2020-11-29 13:41:49 +00:00
Bjoern A. Zeeb
dd4d5a5ffb IPv6: set ifdisabled in the kernel rather than in rc
Enable ND6_IFF_IFDISABLED when the interface is created in the
kernel before return to user space.

This avoids a race when an interface is create by a program which
also calls ifconfig IF inet6 -ifdisabled and races with the
devd -> /etc/pccard_ether -> .. netif start IF -> ifdisabled
calls (the devd/rc framework disabling IPv6 again after the program
had enabled it already).

In case the global net.inet6.ip6.accept_rtadv was turned on,
we also default to enabling IPv6 on the interfaces, rather than
disabling them.

PR:		248172
Reported by:	Gert Doering (gert greenie.muc.de)
Reviewed by:	glebius (, phk)
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D27324
2020-11-25 20:58:01 +00:00