Commit graph

6987 commits

Author SHA1 Message Date
Mike Karels
cdae3f501d kernel: deprecate Internet Class A/B/C
Hide historical Class A/B/C macros unless IN_HISTORICAL_NETS is defined;
define it for user level.  Define IN_MULTICAST separately from IN_CLASSD,
and use it in pf instead of IN_CLASSD.  Stop using class for setting
default masks when not specified; instead, define new default mask
(24 bits).  Warn when an Internet address is set without a mask.

(cherry picked from commit 20d5940396)
2021-12-10 10:24:15 -06:00
Michael Tuexen
c5f3adcce3 sctp: unbreak NOINET6 builds.
PR:		260119
Reported by:	kostikbel
MFC after:	1 week

(cherry picked from commit 54912d47b6)
2021-12-10 11:41:44 +01:00
Michael Tuexen
89bbc84988 sctp: inherit IP level socket options from listening socket
Ensure that TTL and TOS values set on a listener get inheritet
to the accepted sockets.

PR:		260119
MFC after:	1 week

(cherry picked from commit d79676fb13)
2021-12-10 11:40:59 +01:00
Michael Tuexen
7a49fc8439 sctp: use the correct traffic class when sending SCTP/IPv6 packets
When sending packets the stcb was used to access the inp and then
access the endpoint specific IPv6 level options. This fails when
there exists an inp, but no stcb yet. This is the case for sending
an INIT-ACK in response to an INIT when no association already
exists. Fix this by just providing the inp instead of the stcb.

PR:		260120
MFC after:	1 week

(cherry picked from commit f32357be53)
2021-12-10 11:39:44 +01:00
Michael Tuexen
70d91ea04c sctp: improve handling of assoc ids in socket options
For socket options related to local and remote addresses providing
generic association ids does not make sense. Report EINVAL in this
case.

MFC after:	1 week

(cherry picked from commit 13c196a41e)
2021-12-10 11:29:43 +01:00
Michael Tuexen
6f7ab8ac9e sctp: cleanup, no functional change intended.
MFC after:	1 week

(cherry picked from commit a01b8859cb)
2021-12-10 11:23:42 +01:00
Alexander V. Chernikov
b9772822a6 routing: fix source address selection rules for IPv4 over IPv6.
Current logic always selects an IFA of the same family from the
 outgoing interfaces. In IPv4 over IPv6 setup there can be just
 single non-127.0.0.1 ifa, attached to the loopback interface.

Create a separate rt_getifa_family() to handle entire ifa selection
 for the IPv4 over IPv6.

Differential Revision: https://reviews.freebsd.org/D31868
MFC after:	1 week

(cherry picked from commit 4b631fc832)
2021-12-04 19:02:52 +00:00
Alexander V. Chernikov
e72b873b7c lltable: do not require prefix lookup when checking lle allocation rules.
With the new FIB_ALGO infrastructure, nearly all subsystems use
 fib[46]_lookup() functions, which provides lockless lookups.
A number of places remains that uses old-style lookup functions, that
 still requires RIB read lock to return the result. One of such places
 is arp processing code.
FIB_ALGO implementation makes some tradeoffs, resulting in (relatively)
 prolonged periods of holding RIB_WLOCK. If the lock is held and datapath
 competes for it, the RX ring may get blocked, ending in traffic delays and losses.
As currently arp processing is performed directly in the interrupt handler,
 handling ARP replies triggers the problem descibed above when the amount of
 ARP replies is high.

To be more specific, prior to creating new ARP entry, routing lookup for the entry
 address in interface fib is executed. The following conditions are the verified:

1. If lookup returns an empty result, or the resulting prefix is non-directly-reachable,
 failure is returned. The only exception are host routes w/ gateway==address.
2. If the routing lookup returns different interface and non-host route,
 we want to support the use case of having multiple interfaces with the same prefix.
 In fact, the current code just checks if the returned prefix covers target address
 (always true) and effectively allow allocating ARP entries for any directly-reachable prefix,
 regardless of its interface.

Change the code to perform the following:

1) use fib4_lookup() to get the nexthop, instead of requesting exact prefix.
2) Rewrite first condition check using nexthop flags (1:1 match)
3) Rewrite second condition to check for interface addresses matching target address on
 the input interface.

Differential Revision: https://reviews.freebsd.org/D31824
Reviewed by:	ae
MFC after:	1 week
PR:	257965

(cherry picked from commit 936f4a42fa)
2021-12-04 19:02:23 +00:00
Gordon Bergling
baf6ef6951 netinet: Fix a common typo in source code comments
- s/segement/segment/

(cherry picked from commit 1dadeab367)
2021-12-03 16:53:56 +01:00
Gordon Bergling
f2cd4c877c tcp(4): Fix a typo in a sysctl description
- s/entires/entries/

(cherry picked from commit b4aa9cb217)
2021-12-03 16:53:14 +01:00
Gordon Bergling
a79a0d9cdd inet(3): Fix two typos in sysctl descriptions
- s/sequental/sequential/

(cherry picked from commit 27c4abc7cd)
2021-12-03 16:52:31 +01:00
Mark Johnston
f763c81c95 netinet: Remove unneeded mb_unmapped_to_ext() calls
in_cksum_skip() now handles unmapped mbufs on platforms where they're
permitted.

Reviewed by:	glebius, jhb
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 44775b163b)
2021-12-01 07:43:18 -05:00
Mark Johnston
dfd5240189 netinet: Implement in_cksum_skip() using m_apply()
This allows it to work with unmapped mbufs.  In particular,
in_cksum_skip() calls no longer need to be preceded by calls to
mb_unmapped_to_ext() to avoid a page fault.

PR:		259645
Reviewed by:	gallatin, glebius, jhb
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 0d9c3423f5)
2021-12-01 07:43:03 -05:00
Mark Johnston
1d250ec707 netinet: Deduplicate most in_cksum() implementations
in_cksum() and related routines are implemented separately for each
platform, but only i386 and arm have optimized versions.  Other
platforms' copies of in_cksum.c are identical except for style
differences and support for big-endian CPUs.

Deduplicate the implementations for the rest of the platforms.  This
will make it easier to implement in_cksum() for unmapped mbufs.  On arm
and i386, define HAVE_MD_IN_CKSUM to mean that the MI implementation is
not to be compiled.

No functional change intended.

Reviewed by:	kp, glebius
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit ecbbe83144)
2021-12-01 07:42:43 -05:00
Mark Johnston
53e965ff94 netinet: Remove in_cksum.c
It does not get compiled into the kernel.  No functional change
inteneded.

Reviewed by:	kp, glebius, cy
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 5195bcc212)
2021-12-01 07:42:26 -05:00
Mark Johnston
5b5bbf2e7c sctp: Remove now-unneeded mb_unmapped_to_ext() calls
sctp_delayed_checksum() now handles unmapped mbufs, thanks to m_apply().

No functional change intended.

Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 756bb50b6a)
2021-11-29 20:35:20 -05:00
Mark Johnston
422456ae27 sctp: Use m_apply() to calcuate a checksum for an mbuf chain
m_apply() works on unmapped mbufs, so this will let us elide
mb_unmapped_to_ext() calls preceding sctp_calculate_cksum() calls in
the network stack.

Modify sctp_calculate_cksum() to assume it's passed an mbuf header.
This assumption appears to be true in practice, and we need to know the
full length of the chain.

No functional change intended.

Reviewed by:	tuexen, jhb
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit b4d758a0cc)
2021-11-29 20:35:08 -05:00
Marius Halden
0909e05779 carp: deal with negative net.inet.carp.demotion
Given nodes 1 and 2, where node 1 has an advskew of 0 and node 2 has an
advskew of 100, making them master and backup respectively.

If net.inet.carp.demotion is set to a negative value on node 1, node 2
might become master while node 1 still retains it master status. Wether
or not node 2 becomes master seems to depend on the nodes advskew and
what the demotion sysctl was set to on node 1.

The reason for node 2 becoming master seems to be that the calculated
advskew taking demotion into account is truncated to a single unsigned
byte when copied into the carp header for sending, and node 1 stays
master since it takes uses the whole non-truncated calculated advskew
when deciding wether to stay master.

PR:		259528
Reviewed by:	donner, glebius
MFC after:	3 weeks
Sponsored by:	Modirum MDPay
Differential Revision:	https://reviews.freebsd.org/D32759

(cherry picked from commit 1019354b54)
2021-11-22 02:55:02 +01:00
Roy Marples
ec5691aa2f net: Allow binding of unspecified address without address existance
Previously in_pcbbind_setup returned EADDRNOTAVAIL for empty
V_in_ifaddrhead (i.e., no IPv4 addresses configured) and in6_pcbbind
did the same for empty V_in6_ifaddrhead (no IPv6 addresses).

An equivalent test has existed since 4.4-Lite.  It was presumably done
to avoid extra work (assuming the address isn't going to be found
later).

In normal system operation *_ifaddrhead will not be empty: they will
at least have the loopback address(es).  In practice no work will be
avoided.

Further, this case caused net/dhcpd to fail when run early in boot
before assignment of any addresses.  It should be possible to bind the
unspecified address even if no addresses have been configured yet, so
just remove the tests.

The now-removed "XXX broken" comments were added in 59562606b9,
which converted the ifaddr lists to TAILQs.  As far as I (emaste) can
tell the brokenness is the issue described above, not some aspect of
the TAILQ conversion.

PR:		253166
Reviewed by:	ae, bz, donner, emaste, glebius
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D32563

(cherry picked from commit 5c5340108e)
2021-11-18 19:28:56 -05:00
Andrey V. Elsukov
faba420cb9 ip_divert: calculate delayed checksum for IPv6 adress family
Before passing an IPv6 packet to application apply delayed checksum
calculation. Mbuf flags will be lost when divert listener will return a
packet back, so we will not be able to do delayed checksum calculation
later. Also an application will get a packet with correct checksum.

Reviewed by:	donner
Differential Revision: https://reviews.freebsd.org/D32807

(cherry picked from commit 4a9e95286c)
2021-11-12 15:19:19 +03:00
Gordon Bergling
e3f2519c5c Fix a common typo in syctl descriptions
- s/maxiumum/maximum/

(cherry picked from commit c28e39c3d6)
2021-11-06 08:52:57 +01:00
Gordon Bergling
d843e777a5 netinet: Fix a common typo in source code comments
- s/writting/writing/

(cherry picked from commit bb91496a85)
2021-11-06 08:52:38 +01:00
Mike Karels
3ee882bf21 Change lowest address on subnet (host 0) not to broadcast by default.
The address with a host part of all zeros was used as a broadcast long
ago, but the default has been all ones since 4.3BSD and RFC1122.  Until
now, we would broadcast the host zero address as well as the configured
address.  Change to not broadcasting that address by default, but add a
sysctl (net.inet.ip.broadcast_lowest) to re-enable it.  Note that the
correct way to use the zero address for broadcast would be to configure
it as the broadcast address for the network.

See https:/datatracker.ietf.org/doc/draft-schoen-intarea-lowest-address/
and the discussion in https://reviews.freebsd.org/D19316.  Note, Linux
now implements this.

Reviewed by:	rgrimes, tuexen; melifaro (previous version)
Relnotes:	yes
Differential Revision: https://reviews.freebsd.org/D31861

(cherry picked from commit fd0765933c)
2021-10-19 08:16:32 -05:00
Marko Zec
602f81ea50 [fib_algo][dxr] Retire counters which are no longer used
The number of chunks can still be tracked via vmstat -z|fgrep dxr.

MFC after:	3 days
2021-10-13 22:06:49 +02:00
Marko Zec
0eeef61aec [fib_algo][dxr] Improve incremental updating strategy
Tracking the number of unused holes in the trie and the range table
was a bad metric based on which full trie and / or range rebuilds
were triggered, which would happen in vain by far too frequently,
particularly with live BGP feeds.

Instead, track the total unused space inside the trie and range table
structures, and trigger rebuilds if the percentage of unused space
exceeds a sysctl-tunable threshold.

MFC after:	3 days
PR:		257965
2021-10-13 22:06:10 +02:00
Mark Johnston
f983298883 socket: Rename sb(un)lock() and interlock with listen(2)
In preparation for moving sockbuf locks into the containing socket,
provide alternative macros for the sockbuf I/O locks:
SOCK_IO_SEND_(UN)LOCK() and SOCK_IO_RECV_(UN)LOCK().  These operate on a
socket rather than a socket buffer.  Note that these locks are used only
to prevent concurrent readers and writters from interleaving I/O.

When locking for I/O, return an error if the socket is a listening
socket.  Currently the check is racy since the sockbuf sx locks are
destroyed during the transition to a listening socket, but that will no
longer be true after some follow-up changes.

Modify a few places to check for errors from
sblock()/SOCK_IO_(SEND|RECV)_LOCK() where they were not before.  In
particular, add checks to sendfile() and sorflush().

Reviewed by:	tuexen, gallatin
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit f94acf52a4)
2021-10-07 09:56:47 -04:00
Marko Zec
94ad8d7c7a [fib_algo][dxr] Split unused range chunk list in multiple buckets
Traversing a single list of unused range chunks in search for a block
of optimal size was suboptimal.

The experience with real-world BGP workloads has shown that on average
unused range chunks are tiny, mostly in length from 1 to 4 or 5, when
DXR is configured with K = 20 which is the current default (D16X4R).

Therefore, introduce a limited amount of buckets to accomodate descriptors
of empty blocks of fixed (small) size, so that those can be found in O(1)
time.  If no empty chunks of the requested size can be found in fixed-size
buckets, the search continues in an unsorted list of empty chunks of
variable lengths, which should only happen infrequently.

This change should permit us to manage significantly more empty range
chunks without sacrifying the speed of incremental range table updating.

MFC after:	3 days
2021-09-29 22:40:56 +02:00
Marko Zec
c5981a8130 [fib_algo][dxr] Merge adjacent empty range table chunks.
MFC after:	3 days
2021-09-29 22:40:01 +02:00
Gordon Bergling
81d34d466c sctp: Fix a typo in a comment
- s/assue/assume/

(cherry picked from commit d2e616147d)
2021-09-29 19:18:27 +02:00
Mark Johnston
32f1d05f78 sctp: Allow blocking on I/O locks even with non-blocking sockets
There are two flags to request a non-blocking receive on a socket:
MSG_NBIO and MSG_DONTWAIT.  They are handled a bit differently in that
soreceive_generic() and soreceive_stream() will block on the socket I/O
lock when MSG_NBIO is set, but not if MSG_DONTWAIT is set.  In general,
MSG_NBIO seems to mean, "don't block if there is no data to receive" and
MSG_DONTWAIT means "don't go to sleep for any reason".

SCTP's soreceive implementation did not allow blocking on the I/O lock
if either flag is set, but this violates an assumption in
aio_process_sb(), which specifies MSG_NBIO but nonetheless
expects to make progress if data is available to read.  Change
sctp_sorecvmsg() to block on the I/O lock only if MSG_DONTWAIT
is not set.

Reported by:	syzbot+c7d22dbbb9aef509421d@syzkaller.appspotmail.com
Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit e6c19aa94d)
2021-09-21 09:38:39 -04:00
Marko Zec
ec47ee78b4 [fib algo][dxr] Fix division by zero.
A division by zero would occur if DXR would be activated on a vnet
with no IP addresses configured on any interfaces.

PR:		257965
MFC after:	3 days
Reported by:	Raul Munoz

(cherry picked from commit eb3148cc4d)
2021-09-18 19:38:09 +02:00
Marko Zec
ad2cca48ed [fib algo][dxr] Optimize trie updating.
Don't rebuild in vain trie parts unaffected by accumulated incremental
RIB updates.

PR:		257965
Tested by:	Konrad Kreciwilk
MFC after:	3 days

(cherry picked from commit b51f8bae57)
2021-09-18 19:37:35 +02:00
Marko Zec
d3b9b83623 [fib algo][dxr] Fix undefined behavior.
The result of shifting uint32_t by 32 (or more) is undefined: fix it.

(cherry picked from commit 442c8a245e)
2021-09-18 19:36:32 +02:00
orange30
7959799d93 net: Fix memory leaks upon arp_fillheader() failures
Free memory before return from arprequest_internal().  In in_arpinput(),
if arp_fillheader() fails, it should use goto drop.

Reviewed by:	melifaro, imp, markj
Pull Request:	https://github.com/freebsd/freebsd-src/pull/534

(cherry picked from commit f5777c123a)
2021-09-17 09:14:12 -04:00
Mark Johnston
adfb7f807c sctp: Clear assoc socket references when freeing a PCB
This restores behaviour present in the first import of SCTP.  Commit
ceaad40ae7 commented this out and commit
62fb761ff2 removed it.  However, once
sctp_inpcb_free() returns, the socket reference is gone no matter what,
so we need to clear it.

Reported by:	syzbot+30dd69297fcbc5f0e10a@syzkaller.appspotmail.com
Reported by:	syzbot+7b2f9d4bcac1c9569291@syzkaller.appspotmail.com
Reported by:	syzbot+ed3e651f7d040af480a6@syzkaller.appspotmail.com
Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 4250aa1188)
2021-09-16 08:37:53 -04:00
Mark Johnston
97d24f3dfa sctp: Fix iterator synchronization in sctp_sendall()
- The SCTP_PCB_FLAGS_SND_ITERATOR_UP check was racy, since two threads
  could observe that the flag is not set and then both set it.  I'm not
  sure if this is actually a problem in practice, i.e., maybe there's no
  problem having multiple sends for a single PCB in the iterator list?
- sctp_sendall() was modifying sctp_flags without the inp lock held.

The change simply acquires the PCB write lock before toggling the flag,
fixing both problems.

Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 173a7a4ee4)
2021-09-14 08:51:54 -04:00
Mark Johnston
086a3ea828 sctp: Remove an unused sctp_inpcb field
This appears to be unused in usrsctp as well.  No functional change
intended.

Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit e8e23ec127)
2021-09-14 08:51:45 -04:00
Mark Johnston
072901b7bc sctp: Fix races around sctp_inpcb_free()
sctp_close() and sctp_abort() disassociate the PCB from its socket.
As a part of this, they attempt to free the PCB, which may end up
lingering.  Fix some bugs in this area:

- For some reason, sctp_close() and sctp_abort() set
  SCTP_PCB_FLAGS_SOCKET_GONE using an atomic compare-and-set without the
  PCB lock held.  This is racy since sctp_flags is normally updated
  without atomics, using the PCB lock to synchronize.  So, the update
  can be lost, which can cause all sort of races with other SCTP
  components which look for the _GONE flag.  Fix the problem simply by
  acquiring the PCB lock in order to set the flag.  Note that we have to
  drop and re-acquire the lock again in sctp_inpcb_free(), but I don't
  see a good way around that for now.  If it's a real problem, the _GONE
  flag could be split out of sctp_flags and into a dedicated sctp_inpcb
  field.
- In sctp_inpcb_free(), load sctp_socket after acquiring the PCB lock,
  to avoid possible races with parallel sctp_inpcb_free() calls.
- Add an assertion sctp_inpcb_free() to verify that _ALLGONE is not set.

Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit c17b531bed)
2021-09-14 08:51:35 -04:00
Artem Khramov
8d783b1dcd netinet: prevent NULL pointer dereference in in_aifaddr_ioctl()
It appears that maliciously crafted ifaliasreq can lead to NULL
pointer dereference in in_aifaddr_ioctl(). In order to replicate
that, one needs to

1. Ensure that carp(4) is not loaded

2. Issue SIOCAIFADDR call setting ifra_vhid field of the request
   to a negative value.

A repro code would look like this.

int main() {
    struct ifaliasreq req;
    struct sockaddr_in sin, mask;
    int fd, error;

    bzero(&sin, sizeof(struct sockaddr_in));
    bzero(&mask, sizeof(struct sockaddr_in));

    sin.sin_len = sizeof(struct sockaddr_in);
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = inet_addr("192.168.88.2");

    mask.sin_len = sizeof(struct sockaddr_in);
    mask.sin_family = AF_INET;
    mask.sin_addr.s_addr = inet_addr("255.255.255.0");

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return (-1);

    memset(&req, 0, sizeof(struct ifaliasreq));
    strlcpy(req.ifra_name, "lo0", sizeof(req.ifra_name));
    memcpy(&req.ifra_addr, &sin, sin.sin_len);
    memcpy(&req.ifra_mask, &mask, mask.sin_len);
    req.ifra_vhid = -1;

    return ioctl(fd, SIOCAIFADDR, (char *)&req);
}

To fix, discard both positive and negative vhid values in
in_aifaddr_ioctl, if carp(4) is not loaded. This prevents NULL pointer
dereference and kernel panic.

Reviewed by:	imp@
Pull Request:	https://github.com/freebsd/freebsd-src/pull/530

(cherry picked from commit 620cf65c2b)
2021-09-12 19:12:31 -06:00
Mark Johnston
aacbd4dd57 sctp: Implement sctp_inpcb_bind_locked()
This will be used by sctp_listen() to avoid dropping locks when
performing an implicit bind.  No functional change intended.

Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 457abbb857)
2021-09-08 08:41:16 -04:00
Mark Johnston
6bfe4afe73 sctp: Release the socket reference when detaching an association
Later in sctp_free_assoc(), when we clean up chunk lists,
sctp_free_spbufspace() is used to reset the byte count in the socket
send buffer.  However, if the PCB is going away, the socket may already
have been detached from the PCB, in which case this becomes a use-after
free.  Clear the socket reference from the association before detaching
it from the PCB, if the PCB has already lost its socket reference.

Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 65f30a39e1)
2021-09-08 08:40:36 -04:00
Mark Johnston
d30602a2b4 sctp: Hold association locks across socket wakeups when freeing
At this point we do not hold the inpcb lock, so the only thing holding
the socket reference live is the TCB lock, which needs to be acquired by
sctp_inpcb_free() in order to destroy associations.  Defer the unlock to
until after we dereference the socket reference.

Reported by:	syzbot+1d0f2c4675de76a4cf1e@syzkaller.appspotmail.com
Reported by:	syzbot+fabee77954fe69d3a5ad@syzkaller.appspotmail.com
Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit d35be50f57)
2021-09-08 08:40:33 -04:00
Mark Johnston
2d0d1d6e07 sctp: Add macros to assert on inp info lock state
Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit be8ee77e9e)
2021-09-08 08:40:29 -04:00
Zhenlei Huang
e8df60a69a routing: Allow using IPv6 next-hops for IPv4 routes (RFC 5549).
Implement kernel support for RFC 5549/8950.

* Relax control plane restrictions and allow specifying IPv6 gateways
 for IPv4 routes. This behavior is controlled by the
 net.route.rib_route_ipv6_nexthop sysctl (on by default).

* Always pass final destination in ro->ro_dst in ip_forward().

* Use ro->ro_dst to exract packet family inside if_output() routines.
 Consistently use RO_GET_FAMILY() macro to handle ro=NULL case.

* Pass extracted family to nd6_resolve() to get the LLE with proper encap.
 It leverages recent lltable changes committed in c541bd368f.

Presence of the functionality can be checked using ipv4_rfc5549_support feature(3).
Example usage:
  route add -net 192.0.0.0/24 -inet6 fe80::5054:ff:fe14:e319%vtnet0

Differential Revision: https://reviews.freebsd.org/D30398

(cherry picked from commit 62e1a437f3)
2021-09-07 21:25:06 +00:00
Alexander V. Chernikov
48f38f47b1 lltable: Add support for "child" LLEs holding encap for IPv4oIPv6 entries.
Currently we use pre-calculated headers inside LLE entries as prepend data
 for `if_output` functions. Using these headers allows saving some
 CPU cycles/memory accesses on the fast path.

However, this approach makes adding L2 header for IPv4 traffic with IPv6
 nexthops more complex, as it is not possible to store multiple
 pre-calculated headers inside lle. Additionally, the solution space is
 limited by the fact that PCB caching saves LLEs in addition to the nexthop.

Thus, add support for creating special "child" LLEs for the purpose of holding
 custom family encaps and store mbufs pending resolution. To simplify handling
 of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE.
 Such LLEs are not visible when iterating LLE table. Their lifecycle is bound
 to the "parent" LLE - it is not possible to delete "child" when parent is alive.
 Furthermore, "child" LLEs are static (RTF_STATIC), avoding complex state
 machine used by the standard LLEs.

nd6_lookup() and nd6_resolve() now accepts an additional argument, family,
 allowing to return such child LLEs. This change uses `LLE_SF()` macro which
 packs family and flags in a single int field. This is done to simplify merging
 back to stable/. Once this code lands, most of the cases will be converted to
 use a dedicated `family` parameter.

Differential Revision: https://reviews.freebsd.org/D31379

(cherry picked from commit c541bd368f)
2021-09-07 21:02:58 +00:00
Alexander V. Chernikov
10e0976103 Simplify nhop operations in ip_output().
Consistently use `nh` instead of always dereferencing
 ro->ro_nh inside the if block.
Always use nexthop mtu, as it provides guarantee that mtu is accurate.
Pass `nh` pointer to rt_update_ro_flags() to allow upcoming uses
 of updating ro flags based on different nexthop.

Differential Revision: https://reviews.freebsd.org/D31451
Reviewed by:	kp

(cherry picked from commit 9748eb7427)
2021-09-07 21:02:58 +00:00
Alexander V. Chernikov
0ea561762b Use lltable calculated header when sending lle holdchain after successful lle resolution.
Subscribers: imp, ae, bz

Differential Revision: https://reviews.freebsd.org/D31391

(cherry picked from commit 8482aa7748)
2021-09-07 21:02:58 +00:00
Alexander V. Chernikov
2802014380 [lltable] Unify datapath feedback mechamism.
Use newly-create llentry_request_feedback(),
 llentry_mark_used() and llentry_get_hittime() to
 request datapatch usage check and fetch the results
 in the same fashion both in IPv4 and IPv6.

While here, simplify llentry_provide_feedback() wrapper
 by eliminating 1 condition check.

Differential Revision: https://reviews.freebsd.org/D31390

(cherry picked from commit f3a3b06121)
2021-09-07 21:02:58 +00:00
Mark Johnston
6053349c46 sctp: Fix racy UNBOUND flag check in sctp_inpcb_bind()
SCTP needs to avoid binding a given socket twice.  The check used to
avoid this is racy since neither the inpcb lock nor the global info lock
is held.  Fix it by synchronizing using the global info lock.  In
particular, sctp_inpcb_bind() may drop the inpcb lock in some cases, but
the info lock is sufficient to prevent double insertion into PCB hash
tables.

Reported by:	syzbot+548a8560d959669d0e12@syzkaller.appspotmail.com
Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 4a36122b1d)
2021-09-07 09:36:19 -04:00
Mark Johnston
8522f7ddac sctp: Simplify the free port search in sctp_inpcb_bind()
Eliminate a flag variable and reduce indentation.  No functional change
intended.

Reviewed by:	tuexen
Sponsored by:	The FreeBSD Foundation

(cherry picked from commit 2496d812a9)
2021-09-07 09:36:19 -04:00