This structure originates from the pre-FreeBSD times when system RAM was
measured in single digits of MB and Internet speeds were measured in Kb.
At first level the database hashes the port value only to calculate index
into array of pointers to lazily allocated headers that hold lists of
inpcbs with the same local port. This design apparently was made to
preserve kernel memory.
In the modern kernel size of the first level of the hash is derived from
maxsockets, which is derived from maxfiles, which in its turn is derived
from amount of physical memory. Then the size of the hash is capped by
IPPORT_MAX, cause it doesn't make any sense to have hash table larger then
the set of possible values. In practice this cap works even on my laptop.
I haven't done precise calculation or experiments, but my guess is that
any system with > 8 Gb of RAM will be autotuned to IPPORT_MAX sized hash.
Apparently, this hash is a degenerate one: it never has more than one
entries in any slot. You can check this with kgdb:
set $i = 0
while ($i <= tcbinfo->ipi_porthashmask)
set $p = tcbinfo->ipi_porthashbase[$i].clh_first
set $c = 0
while ($p != 0)
set $c = $c + 1
set $p = $p->phd_hash.cle_next
end
if ($c > 1)
printf "Slot %u count %u", $i, $c
end
set $i = $i + 1
end
Retiring the two level hash we remove a lot of complexity at the cost of
only one comparison 'inp->inp_lport != lport' in the lookup cycle, which
is going to be always false on most machines anyway. This comparison
definitely shall be cheaper than extra pointer traversal.
Another positive change to be singled out is that now we no longer need to
allocate memory in non-sleepable context in in_pcbinshash(), so a
potential ENOMEM on connect(2) is removed.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D49151
This changes ABI due to the changed opcodes and includes the
following:
* rule numbers and named object indexes converted to 32-bits
* all hardcoded maximum rule number was replaced with
IPFW_DEFAULT_RULE macro
* now it is possible to grow maximum numbers or rules in
build time
* several opcodes converted to ipfw_insn_u32 to keep rulenum:
O_CALL, O_SKIPTO
* call stack modified to keep u32 rulenum. The behaviour of
O_CALL opcode was changed to avoid possible packets looping.
Now when call stack is overflowed or mbuf tag allocation
failed, a packet will be dropped instead of skipping to next
rule.
* 'return' action now have two modes to specify return point:
'next-rulenum' and 'next-rule'
* new lookup key added for O_IP_DST_LOOKUP opcode 'lookup rulenum'
* several opcodes converted to keep u32 named object indexes
in special structure ipfw_insn_kidx
* tables related opcodes modified to use two structures:
ipfw_insn_kidx and ipfw_insn_table
* added ability for table value matching for specific value type
in 'table(name,valtype=value)' opcode
* dynamic states and eaction code converted to use u32 rulenum
and named objects indexes
* added insntod() and insntoc() macros to cast to specific
ipfw instruction type
* default sockopt version was changed to IP_FW3_OPVER=1
* FreeBSD 7-11 rule format support was removed
* added ability to generate special rtsock messages via log opcode
* added IP_FW_SKIPTO_CACHE sockopt to enable/disable skipto cache.
It helps to reduce overhead when many rules are modified in batch.
* added ability to keep NAT64LSN states during sets swapping
Obtained from: Yandex LLC
Relnotes: yes
Sponsored by: Yandex LLC
Differential Revision: https://reviews.freebsd.org/D46183
Using the same random jitter for multiple rate limits allows an
attacker to use one rate limiter to figure out the current jitter
and then use this knowledge to de-randomize the other rate limiters.
This can be mitigated by using a separate randomized jitter for each
rate limiter.
This issue was reported as issue number 10 in Keyu Man et al.:
SCAD: Towards a Universal and Automated Network Side-Channel
Vulnerability Detection
Reviewed by: rrs, Peter Lei, glebius
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D48804
As with net.inet.{tcp,udp}.bind_all_fibs, this causes raw sockets to
accept only packets from the same FIB.
Reviewed by: glebius
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48707
In particular, we store a FIB number in both struct socket and in struct
inpcb. When updating the FIB number with setsockopt(SO_SETFIB), make
the update atomic. This is required to support the new bind_all_fibs
mode, since in that mode changing the FIB of a bound socket is not
permitted.
This requires a bit more code, but avoids a layering violation in
sosetopt(), where we hard-code the list of protocol families that
implement SO_SETFIB.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48666
Introduce the net.inet.udp.bind_all_fibs tunable, set to 1 by default
for compatibility with current behaviour. When set to 0, all received
datagrams will be dropped unless an inpcb bound to the same FIB exists.
No functional change intended, as the new behaviour is not enabled by
default.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48664
Allow protocol layers to look up an inpcb belonging to a particular FIB.
This is indicated by setting INPLOOKUP_FIB; if it is set, the FIB to be
used is obtained from the specificed mbuf or ifnet.
No functional change intended.
Reviewed by: glebius, melifaro
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48662
Add a flag, INPBIND_FIB, which means that the inpcb is local to its FIB
number. When this flag is specified, duplicate bindings are permitted,
so long as each FIB contains at most one inpcb bound to the same
address/port. If an inpcb is bound with this flag, it'll have the
INP_BOUNDFIB flag set.
No functional change intended.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48661
This is to enable a mode where duplicate inpcb bindings are permitted,
and we want to look up an inpcb with a particular FIB. Thus, add a
"fib" parameter to in_pcblookup() and related functions, and plumb it
through.
A fib value of RT_ALL_FIBS indicates that the lookup should ignore FIB
numbers when searching. Otherwise, it should refer to a valid FIB
number, and the returned inpcb should belong to the specific FIB. For
now, just add the fib parameter where needed, as there are several
layers to plumb through.
No functional change intended.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48660
To comply with Common Criteria certification requirements, it may be
necessary to ensure that packets to 0.0.0.0/::0 are dropped and logged
by the system firewall. Currently, such packets are dropped by
ip_input() and ip6_input() before reaching pfil hooks; let's defer the
checks slightly to give firewalls a chance to drop the packets
themselves, as this gives better observability. Add some regression
tests for this with pf+pflog.
Note that prior to commit 713264f6b8, v4 packets to the unspecified
address were not dropped by the IP stack at all.
Note that ip_forward() and ip6_forward() ensure that such packets are
not forwarded; they are passed back unmodified.
Add a regression test which ensures that such packets are visible to
pflog.
Reviewed by: glebius
MFC after: 3 weeks
Sponsored by: Klara, Inc.
Sponsored by: OPNsense
Differential Revision: https://reviews.freebsd.org/D48163
Only map mbuf when a policy is looked up and indicates that IPSEC needs
to transform the packet. If IPSEC is inline offloaded, it is up to the
interface driver to request remap if needed.
Fetch the IP header using m_copydata() instead of using mtod() to select
policy/SA.
Reviewed by: markj
Sponsored by: NVidia networking
Differential revision: https://reviews.freebsd.org/D48265
but instead of tripping the assert in debug kernel, and silently falling
into UB for prod, skip IPSEC processing for KTLS framed packets when
mb_unmapped_to_ext() failed.
Reviewed by: markj
Sponsored by: NVidia networking
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D48265
See commit 4f02a7d739 for more background.
I cannot see a good reason to continue ignoring mismatching UIDs when
binding to INADDR_ANY. Looking at the sdr.V2.4a7n sources (mentioned in
bugzilla PR 7713), there is a CANT_MCAST_BIND hack wherein the
application binds to INADDR_ANY instead of a multicast address, but
CANT_MCAST_BIND isn't defined for FreeBSD builds.
It seems unlikely that we still have a use-case for allowing sockets
from different UIDs to bind to the same port when binding to the
unspecified address. And, as noted in D47832, applications like sdr
would have been broken by the inverted SO_REUSEPORT check removed in
that revision, apparently without any bug reports. Let's break
compatibility and simply disallow this case outright.
Also, add some comments, remove a hack in a regression test which tests
this funtionality, and add a new regression test to exercise the
remaining checks that were added in commit 4658dc8325.
MFC after: 1 month
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D47870
Actually check the conditions that are enforced by the error checking
code instead of a condition which is
* checking a number to be non-negative instead of positive
* depending on a random number
Perform the checks consistently for ICMPv4 and ICMPv6.
Reviewed by: glebius, rrs, cc
MFC after: 1 week
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D48001
This check for SO_REUSEPORT was added way back in commit 52b65dbe85.
Per the commit log, this commit restricted this port-stealing check to
unicast addresses, and then only if the existing socket does not have
SO_REUSEPORT set. In other words, if there exists a socket bound to
INADDR_ANY, and we bind a socket to INADDR_ANY with the same port, then
the two sockets need not be owned by the same user if the existing
socket has SO_REUSEPORT set.
This is a surprising semantic; bugzilla PR 7713 gives some additional
context. That PR makes a case for the behaviour described above when
binding to a multicast address. But, the SO_REUSEPORT check is only
applied when binding to a non-multicast address, so it doesn't really
make sense. In the PR the committer notes that "unicast applications
don't set SO_REUSEPORT", which makes some sense, but also refers to
"multicast applications that bind to INADDR_ANY", which sounds a bit
suspicious.
OpenBSD performs the multicast check, but not the SO_REUSEPORT check.
DragonflyBSD removed the SO_REUSEPORT (and INADDR_ANY) checks back in
2014 (commit 0323d5fde12a4). NetBSD explicitly copied our logic and
still has it.
The plot thickens: 20 years later, SO_REUSEPORT_LB was ported from
DragonflyBSD: this option provides similar semantics to SO_REUSEPORT,
but for unicast addresses it causes incoming connections/datagrams to be
distributed among all sockets in the group. This commit (1a43cff92a)
inverted the check for SO_REUSEPORT while adding one for
SO_REUSEPORT_LB; this appears to have been inadvertent. However:
- apparently no one has noticed that the semantics were changed;
- sockets belonging to different users can now be bound to the same port
so long as they belong to a single lbgroup bound to INADDR_ANY, which
is not correct.
Simply remove the SO_REUSEPORT(_LB) checks, as their original
justification was dubious and their current implementation is wrong; add
some tests.
Reviewed by: glebius
MFC after: 1 month
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D47832
For a long time, the inpcb lookup path has been lockless in the common
case: we use net_epoch to synchronize lookups. However, the routines
which update lbgroups were not careful to synchronize with unlocked
lookups. I believe that in the worst case this can result in spurious
connection aborts (I have a regression test case to exercise this), but
it's hard to be certain.
Modify in_pcblbgroup* routines to synchronize with unlocked lookup:
- When removing inpcbs from an lbgroup, do not shrink the array.
The maximum number of lbgroup entries is INPCBLBGROUP_SIZMAX (256),
and it doesn't seem worth the complexity to shrink the array when a
socket is removed.
- When resizing an lbgroup, do not insert it into the hash table until
it is fully initialized; otherwise lookups may observe a partially
constructed lbgroup.
- When adding an inpcb to the group, increment the counter after adding
the array entry, using a release store. Otherwise it's possible for
lookups to observe a null array slot.
- When looking up an entry, use a corresponding acquire load.
Reviewed by: ae, glebius
MFC after: 1 month
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D48020
A large portion of these functions just determines whether the inpcb can
bind to the address/port. This portion has no side effects, so is a
good candidate to move into its own helper function. This patch does
so, making the callers less complicated and reducing indentation.
While moving this code, also make some changes:
- Load socket options (SO_REUSEADDR etc.) only once. There is nothing
preventing another thread from toggling the socket options, so make
this function easier to reason about by avoiding races.
- When checking whether the bind address is an interface address, make a
separate sockaddr rather than temporarily modifying the one passed to
in_pcbbind().
Reviewed by: ae, glebius
MFC after: 1 month
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D47590
- Use the local var "laddr" instead of sin->sin_addr in one block.
- Use in_nullhost() instead of explicit comparisons with INADDR_ANY.
- Combine multiple socket options checks into one.
- Fix indentation.
- Remove some unhelpful comments.
This is in preparation for some simplification and bug-fixing.
No functional change intended.
Reviewed by: glebius
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D47451
in_pcblookup_hash_wild_* looks up unconnected inpcbs, so there is no
point in passing the foreign address and port, and indeed those
parameters are not used. So, remove them.
No functional change intended.
MFC after: 1 week
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Differential Revision: https://reviews.freebsd.org/D47385
Found while auditing calls to M_WRITABLE to see if M_EXTPG could be
removed from its checks.
Reviewed by: gallatin
Differential Revision: https://reviews.freebsd.org/D46785
If the V_connect_ifaddr_wild sysctl says that we shouldn't infer a
destination address, return an error. Otherwise it's possible for use
of an unspecified foreign address to trigger a subsequent assertion
failure, for example in in_pcblookup_hash_locked().
Similarly, if no interface addresses are assigned, fail quickly upon an
attempt to connect to the unspecified address.
Reported by: Shawn Webb <shawn.webb@hardenedbsd.org>
MFC after: 2 weeks
Reviewed by: zlei, allanjude, emaste
Differential Revision: https://reviews.freebsd.org/D46454
The inp_route pointer should only be provided to the network
layer, when no destination address is provided. This is only
one of the conditions, where a write lock is needed.
If, for example, the route is also cached, when the socket is
unbound, problems show up, when the sendto is called, then
connect and finally send, when the route for the addresses
provided in the sendto and connect call use different outgoing
interfaces.
While there, clearly document why the write lock is taken.
Reported by: syzbot+59122d2e848087d3355a@syzkaller.appspotmail.com
Reviewed by: Peter Lei, glebius
MFC after: 3 days
Sponsored by: Netflix, Inc.
Differential Revision: https://reviews.freebsd.org/D46056
The nd6 code listens for RTM_DELETE events so that it can mark the
corresponding default router as inactive in the case where the default
route is deleted. A subsequent RA from the router may then reinstall
the default route.
Commit fedeb08b6a broke this for non-multipath routes, as
rib_decompose_notification() only invokes the callback for multipath
routes. Restore the old behaviour. Also ensure that we update the
router only for RTM_DELETE notifications, lost in commit 2259a03020.
Reviewed by: bz
Fixes: fedeb08b6a ("Introduce scalable route multipath.")
Fixes: 2259a03020 ("Rework part of routing code to reduce difference to D26449.")
MFC after: 2 weeks
Sponsored by: Klara, Inc.
Sponsored by: Bell Tower Integration
Differential Revision: https://reviews.freebsd.org/D46020
To be able to pass ifp and mtu to the ipsec_output() and ipsec
accelerator filter.
Sponsored by: NVIDIA networking
Differential revision: https://reviews.freebsd.org/D44225
Similarly, mtu is needed to decide inline IPSEC offloiad for the driver.
Sponsored by: NVIDIA networking
Differential revision: https://reviews.freebsd.org/D44224
The information about the interface is needed to coordinate inline
offloading of IPSEC processing with corresponding driver.
Sponsored by: NVIDIA networking
Differential revision: https://reviews.freebsd.org/D44223
The code deleted predates FreeBSD history. The comment deleted is 99%
outdated. Why KAME decided to use these constants instead of normal ones
also lost in centuries.
The only element of of in6_addr that is specified in RFC 3493 or
in POSIX.1-2017 is s6_addr, implemented via a #define to a union
member. However, FreeBSD and other BSD systems have additional
definitions for the other union members, s6_addr{8,16,32} which
are defined for the kernel and loader. Some Linux applications
also use them, and they seem to be allowed by the RFC and POSIX.
Remove the current ifdefs, exposing the additional fields to user
level, and replace with #if __BSD_VISIBLE. Add an explanatory
comment expanding on the previous "nonstandard" comment.
MFC after: 1 week
Reviewed by: bz
Differential Revision: https://reviews.freebsd.org/D44979
in6_mapped_sockaddr() and in6_mapped_peeraddr() both define a local
variable named 'inp', but in the non-INET case, this variable is set
and never used, causing a compiler error:
/src/freebsd/src/lf/sys/netinet6/in6_pcb.c:547:16: error:
variable 'inp' set but not used [-Werror,-Wunused-but-set-variable]
547 | struct inpcb *inp;
| ^
/src/freebsd/src/lf/sys/netinet6/in6_pcb.c:573:16: error:
variable 'inp' set but not used [-Werror,-Wunused-but-set-variable]
573 | struct inpcb *inp;
Fix this by guarding all the INET-specific logic, including the variable
definition, behind #ifdef INET.
While here, tweak formatting in in6_mapped_peeraddr() so both functions
are the same.
Reviewed by: imp
Pull Request: https://github.com/freebsd/freebsd-src/pull/1155
When debugging network issues one common clue is an unexpectedly
incrementing error counter. This is helpful, in that it gives us an
idea of what might be going wrong, but often these counters may be
incremented in different functions.
Add a static probe point for them so that we can use dtrace to get
futher information (e.g. a stack trace).
For example:
dtrace -n 'mib:ip:count: { printf("%d", arg0); stack(); }'
This can be disabled by setting the following kernel option:
options KDTRACE_NO_MIB_SDT
Reviewed by: gallatin, tuexen (previous version), gnn (previous version)
Sponsored by: Rubicon Communications, LLC ("Netgate")
Differential Revision: https://reviews.freebsd.org/D43504
Use counter_ratecheck() instead of racy and slow ppsratecheck. Use a
separate counter for every currently known type of ICMPv6. Provide logging
of ratelimit events. Provide jitter to counter open UDP port detection.
Reviewed by: tuexen, zlei
Differential Revision: https://reviews.freebsd.org/D44482
Most of them can be declared as static after the move out of in6_proto.c.
Keeping sysctl(9) declarations with their text descriptions next to the
variable declaration create self-documenting code. There should be no
functional changes.
Differential Revision: https://reviews.freebsd.org/D44481