This very theoretical edge case was discovered by Coverity, not sure if
it was introduced by 2af953b132 or was there before.
CID: 1593695
Fixes: 2af953b132
do not free the original mbuf, it is already freed by the
mb_unmapped_to_ext().
Reviewed by: glebius
Sponsored by: NVidia networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D49305
The SO_REUSEPORT_LB isn't a standard option, neither ENOMEM is a specified
return code from bind(2), but it definitely is more appropriate than
EAGAIN or the masked ENOBUFS.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49153
This structure originates from the pre-FreeBSD times when system RAM was
measured in single digits of MB and Internet speeds were measured in Kb.
At first level the database hashes the port value only to calculate index
into array of pointers to lazily allocated headers that hold lists of
inpcbs with the same local port. This design apparently was made to
preserve kernel memory.
In the modern kernel size of the first level of the hash is derived from
maxsockets, which is derived from maxfiles, which in its turn is derived
from amount of physical memory. Then the size of the hash is capped by
IPPORT_MAX, cause it doesn't make any sense to have hash table larger then
the set of possible values. In practice this cap works even on my laptop.
I haven't done precise calculation or experiments, but my guess is that
any system with > 8 Gb of RAM will be autotuned to IPPORT_MAX sized hash.
Apparently, this hash is a degenerate one: it never has more than one
entries in any slot. You can check this with kgdb:
set $i = 0
while ($i <= tcbinfo->ipi_porthashmask)
set $p = tcbinfo->ipi_porthashbase[$i].clh_first
set $c = 0
while ($p != 0)
set $c = $c + 1
set $p = $p->phd_hash.cle_next
end
if ($c > 1)
printf "Slot %u count %u", $i, $c
end
set $i = $i + 1
end
Retiring the two level hash we remove a lot of complexity at the cost of
only one comparison 'inp->inp_lport != lport' in the lookup cycle, which
is going to be always false on most machines anyway. This comparison
definitely shall be cheaper than extra pointer traversal.
Another positive change to be singled out is that now we no longer need to
allocate memory in non-sleepable context in in_pcbinshash(), so a
potential ENOMEM on connect(2) is removed.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49151
This combination doesn't make any sense. This socket option makes sense
only on a socket that is going to be a listening one. There are two
options here: refuse connect(2) on a socket that has the option set
previously, or ignore (and clear) the option. After some discussion on
phabricator, we have chosen the former, for safety and consistency
reasons. Any programmer that runs this sequence is doing something wrong
and should be informed of that with appropriate error code.
Since connect(2) is a SUS API that has a defined set of error codes, none
of which corresponds to "a socket has non-standard incompatible socket
option set", we decided to return the same error that an already listening
socket would return.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49150
The separation had been done back in 5200e00e72 for the purposes of
removing a true temporary connect of an unconnected UDP socket that does
sendto(2) in 90162a4e87. Now, with 69c05f4287 in place, the
separation is no longer needed. There should be no functional change.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49142
This is the error code specified by SUS. Only the TCP over IPv6 required
this fix.
Fixes: bd4a39cc93
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49275
A globally enabled random IP id generation maybe useful in most IP
contexts, but it may be unnecessary in the case of IPsec encapsulated
packets because IPsec can be configured to use anti-replay windows.
This commit adds a new net.inet.ipsec.random_id sysctl to control whether
or not IPsec packets should use random IP id generation.
Rest of the protocols/modules are still controlled by the global
net.inet.ip.random_id, but can be easily augmented with a knob.
Reviewed by: glebius
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D49164
This changes ABI due to the changed opcodes and includes the
following:
* rule numbers and named object indexes converted to 32-bits
* all hardcoded maximum rule number was replaced with
IPFW_DEFAULT_RULE macro
* now it is possible to grow maximum numbers or rules in
build time
* several opcodes converted to ipfw_insn_u32 to keep rulenum:
O_CALL, O_SKIPTO
* call stack modified to keep u32 rulenum. The behaviour of
O_CALL opcode was changed to avoid possible packets looping.
Now when call stack is overflowed or mbuf tag allocation
failed, a packet will be dropped instead of skipping to next
rule.
* 'return' action now have two modes to specify return point:
'next-rulenum' and 'next-rule'
* new lookup key added for O_IP_DST_LOOKUP opcode 'lookup rulenum'
* several opcodes converted to keep u32 named object indexes
in special structure ipfw_insn_kidx
* tables related opcodes modified to use two structures:
ipfw_insn_kidx and ipfw_insn_table
* added ability for table value matching for specific value type
in 'table(name,valtype=value)' opcode
* dynamic states and eaction code converted to use u32 rulenum
and named objects indexes
* added insntod() and insntoc() macros to cast to specific
ipfw instruction type
* default sockopt version was changed to IP_FW3_OPVER=1
* FreeBSD 7-11 rule format support was removed
* added ability to generate special rtsock messages via log opcode
* added IP_FW_SKIPTO_CACHE sockopt to enable/disable skipto cache.
It helps to reduce overhead when many rules are modified in batch.
* added ability to keep NAT64LSN states during sets swapping
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D46183
The function in_localip() was changed to return bool but the comment was
left unchanged.
Fixes: c8ee75f231 Use network epoch to protect local IPv4 addresses hash
MFC after: 3 days
in_pcblisten() moves an inpcb from the per-group list into the array, at
which point it becomes visible to inpcb lookups in the datapath. It
assumes that there is space in the array for this, but that's not
guaranteed, since in_pcbinslbgrouphash() doesn't reserve space in the
array if the inpcb isn't associated with a listening socket.
We could resize the array in in_pcblisten(), but that would introduce a
failure case where there currently is none. Instead, keep track of the
number of pending inpcbs as well, and modify in_pcbinslbgrouphash() to
reserve space for each pending (i.e., not-yet-listening) inpcb.
Add a regression test.
Reviewed by: glebius
Reported by: netchild
Fixes: 7cbb6b6e28 ("inpcb: Close some SO_REUSEPORT_LB races, part 2")
Differential Revision: https://reviews.freebsd.org/D49100
UDP allows to sendto(2) on unconnected socket. The original BSD devise
was that such action would create a temporary (for the duration of the
syscall) connection between our inpcb and remote addr:port specified in
sockaddr 'to' of the syscall. This devise was broken in 2002 in
90162a4e87. For more motivation on the removal of the temporary
connection see email [1].
Since the removal of the true temporary connection the sendto(2) on
unconnected socket has the following side effects:
1) After first sendto(2) the "unconnected" socket will receive datagrams
destined to the selected port.
2) All subsequent sendto(2) calls will use the same source port.
Effectively, such sendto(2) acts like a bind(2) to INADDR_ANY:0. Indeed,
if you do this:
s1 = socket(PF_INET, SOCK_DGRAM, 0);
s2 = socket(PF_INET, SOCK_DGRAM, 0);
sendto(s1, ..., &somedestination, ...);
bind(s2, &{ .sin_addr = INADDR_ANY, sin_port = 0 });
And then look into kgdb at resulting inpcbs, you would find them equal in
all means modulo bound to different anonymous ports.
What is even more interesting is that Linux kernel had picked up same
behavior, including that "unconnected" socket will receive datagrams. So
it seems that such behavior is now an undocumented standard, thus I
covered it in recently added tests/sys/netinet/udp_bindings.
Now, with the above knowledge at hand, why are we using
in_pcbconnect_setup() and in_pcbinshash(), which are supposed to be
private to in_pcb.c, to achieve the binding? Let's use public KPI
in_pcbbind() on the first sendto(2) and use in_pcbladdr() on all
sendto(2)s. Apart from finally hiding these two should be private
functions, we no longer acquire global INP_HASH_WLOCK() for every
sendto(2) on unconnected socket as well as remove a couple workarounds.
[1] https://mail-archive.FreeBSD.org/cgi/mid.cgi?200210141935.aa83883
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49043
When a socket has SO_BROADCAST set and destination address is INADDR_ANY
or INADDR_BROADCAST, the kernel shall pick up first broadcast capable
interface and broadcast the packet out of it. Since this API is not
reliable on a machine with > 1 broadcast capable interfaces, all practical
software seems to use IP_ONESBCAST or other mechanisms to send broadcasts.
This has been broken at least since FreeBSD 6.0, see bug 99558. Back then
the problem was in the fact that in_broadcast() check was always done
against the gateway address, not the destination address. Later, with
90cc51a1ab, a second problem piled on top - we aren't checking for
INADDR_ANY and INADDR_BROADCAST at all.
Better late than never, fix that by checking destination address.
PR: 99558
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49042
This aligns with existing in_ifaddr_broadcast() and aligns with other
simple functions or macros with bare "in_" prefix that operator just on
struct in_addr and nothing else, e.g. in_nullhost(). No functional
change.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49041
During my progress on updating cc_cubic to RFC9438, found such redundancy as:
- W_est: we use the alternative stack local variable `W_est` in
`cubic_ack_received()`.
- cwnd_prior: it is used for Reno-Friendly Region in RFC9438 Section 4.3,
but we use the alternative cwnd from NewReno for Reno-Friendly as
in commit ee45061051.
No functional change intended.
Reviewed by: rscheff, tuexen
Differential Revision: https://reviews.freebsd.org/D49008
While here annotate deprecated condition with __predict_false() and
slightly refactor in_broadcast() removing leftovers from old address list
locking. Should be no functional change.
There are several functions that keep database locked and do address
and port selection before a caller commits the changes to the inpcb.
Mark the inpcb argument with a good documenting const.
The in_pcbconnect_setup() function is not supposed to modify inpcb.
It may be entered with read-only lock via UDP path. Also at this
point we aren't yet sure that the binding is going to be successful.
Thus, update the multipath routing information only at the end of a
succesful in_pcbconnect().
Fixes: 0c325f53f1
Using the same random jitter for multiple rate limits allows an
attacker to use one rate limiter to figure out the current jitter
and then use this knowledge to de-randomize the other rate limiters.
This can be mitigated by using a separate randomized jitter for each
rate limiter.
This issue was reported as issue number 10 in Keyu Man et al.:
SCAD: Towards a Universal and Automated Network Side-Channel
Vulnerability Detection
Reviewed by: rrs, Peter Lei, glebius
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D48804
It's only needed for in_pcb.c and in6_pcb.c, so can go to the private
header.
No functional change intended.
Reported by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
As with net.inet.{tcp,udp}.bind_all_fibs, this causes raw sockets to
accept only packets from the same FIB.
Reviewed by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48707
In particular, we store a FIB number in both struct socket and in struct
inpcb. When updating the FIB number with setsockopt(SO_SETFIB), make
the update atomic. This is required to support the new bind_all_fibs
mode, since in that mode changing the FIB of a bound socket is not
permitted.
This requires a bit more code, but avoids a layering violation in
sosetopt(), where we hard-code the list of protocol families that
implement SO_SETFIB.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48666
Introduce the net.inet.udp.bind_all_fibs tunable, set to 1 by default
for compatibility with current behaviour. When set to 0, all received
datagrams will be dropped unless an inpcb bound to the same FIB exists.
No functional change intended, as the new behaviour is not enabled by
default.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48664
Introduce the net.inet.tcp.bind_all_fibs tunable, set to 1 by default
for compatibility with current behaviour. When set to 0, all TCP
listening sockets are private to their FIB. Inbound connection requests
will only succeed if a matching inpcb is bound to the same FIB as the
request.
No functional change intended, as the new behaviour is not enabled by
default.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48663
Allow protocol layers to look up an inpcb belonging to a particular FIB.
This is indicated by setting INPLOOKUP_FIB; if it is set, the FIB to be
used is obtained from the specificed mbuf or ifnet.
No functional change intended.
Reviewed by: glebius, melifaro
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48662
Add a flag, INPBIND_FIB, which means that the inpcb is local to its FIB
number. When this flag is specified, duplicate bindings are permitted,
so long as each FIB contains at most one inpcb bound to the same
address/port. If an inpcb is bound with this flag, it'll have the
INP_BOUNDFIB flag set.
No functional change intended.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48661