kernel. This will be used for passing in things like the system table
from EFI or other similar metadata that can be used by the kernel to
communicate with the firmware.
The ci20 port (by kan@) is going to reuse almost all of the intrng code
since the SoC in question looks suspiciously like someone took an ARM
SoC design and replaced the ARM core with a MIPS core.
* migrate out the code;
* rename ARM_ -> INTR_;
* rename arm_ -> intr_;
* move the interrupt flush routine from intr.c / intrng.c into
arm/machdep_intr.c - removing the code duplication and removing
the ARM specific bits from here.
Thanks to the Star Wars: The Force Awakens premiere line for allowing
me a couple hours of quiet time to finish the universe builds.
Tested:
* make universe
TODO:
* The structure definitions in subr_intr.c still includes machine/intr.h
which requires one duplicates all of the intrng definitions in
the platform code (which kan has done, and I think we don't have to.)
Instead I should break out the generic things (function declarations,
common intr structures, etc) into a separate header.
* Kan has requested I make the PIC based IPI stuff optional.
I don't know what alternate universe I was inhabiting when I wrote it
originally, but apparently the basic workings of mathematics were different
than in this universe. I also can't explain how it ever worked, except "by
accident", because completely bogus values were being written into the
divisor register.
Different revisions support different operations. Refer to Intel
External Design Specifications to figure out what your hardware
supports.
Sponsored by: EMC / Isilon Storage Division
instead of GMBUS access for I2C transfers. The GMBUS driver falls back
to this mode when a transfer times out. However, the first transfer to
timeout was sending the request back to itself resulting in an panic due
to recursing on a lock. Fix it to forward the request on to the proper
device. This appears to have been accidentally changed in r277487.
Reported by: Joe Maloney <jmaloney@pcbsd.org>
Reviewed by: adrian, dumbbell, imp
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D4599
It turns out the callers of vm_page_replace know exactly which page they are
replacing and would like to assert about it. Change those from hard panics to
KASSERTs, and provide them with a wrapper so they don't have to deal with
warnings from an INVARIANTS-dependent dead store of the return value of
vm_page_replace.
Submitted by: Ryan Libby <rlibby@gmail.com>
Reviewed by: alc, kib (earlier version)
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4497
the #address-cells property set. For this we need to read more data before
the parent interrupt description.
this is only enabled on arm64 for now as it's not quite compliant with the
ePAPR spec. We should use a default of 2 where the #address-cells property
is missing, however this will need further testing across architectures.
Obtained from: ABT Systems Ltd
Sponsored by: SoftIron Inc
Differential Revision: https://reviews.freebsd.org/D4518
Rather than pushing all eight possible arguments into dtrace_probe()'s
stack frame, make the syscall_args struct for the current syscall available
via the current thread. Using a custom getargval method for the systrace
provider, this allows any syscall argument to be fetched, even in kernels
that have modified the maximum number of system call arguments.
Sponsored by: EMC / Isilon Storage Division
With the new VOP_GETPAGES() KPI the "count" argument counts pages already,
and doesn't need to be translated from bytes to pages.
While here make it consistent that *rbehind and *rahead are updated only
if we doesn't return error.
Pointy hat to: glebius
- Use SDT_PROBE<N>() instead of SDT_PROBE(). This has no functional effect
at the moment, but will be needed for some future changes.
- Don't hardcode the module component of the probe identifier. This is
set automatically by the SDT framework.
MFC after: 1 week
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceeded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strenghtening the promises on the business of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
pxeboot in tftp loader mode (when built with LOADER_TFTP_SUPPORT) now
prefix all the path to open with the path obtained via the option 'root-path'
directive.
This allows to be able to use the traditional content /boot out of box. Meaning
it now works pretty much like all other loaders. It simplifies hosting hosting
multiple version of FreeBSD on a tftp server.
As a consequence, pxeboot does not look anymore for a pxeboot.4th (which was
never provided)
Note: that pxeboot in tftp loader mode is not built by default.
Reviewed by: rpokala
Relnotes: yes
Sponsored by: Gandi.net
Differential Revision: https://reviews.freebsd.org/D4590
The EFI memory map may change before or during the first
ExitBootServices call. In that case ExitBootServices returns an error,
and GetMemoryMap and ExitBootServices must be retried.
Glue together calls to GetMemoryMap(), ExitBootServices() and storage of
(now up-to-date) MODINFOMD_EFI_MAP metadata within a single function.
That new function - bi_add_efi_data_and_exit() - uses space previously
allocated in bi_load_efi_data() to store the memory map (it will fail if
that space is too short). It handles re-calling GetMemoryMap() once to
update the map key if necessary. Finally, if ExitBootServices() is
successful, it stores the memory map and its header as MODINFOMD_EFI_MAP
metadata.
ExitBootServices() calls are now done earlier, from within arch-
independent bi_load() code.
PR: 202455
Submitted by: Ganael LAPLANCHE
Reviewed by: kib
MFC after: 2 weeks
Relnotes: Yes
Differential Revision: https://reviews.freebsd.org/D4296
Before the change, things like lle state were queried via
SIOCGNBRINFO_IN6 by ndp(8) for _each_ lle entry in dump.
This ioctl was added in 1999, probably to avoid touching rtsock code.
This change maps SIOCGNBRINFO_IN6 data to standard rtsock dump the
following way:
expire (already) maps to rtm_rmx.rmx_expire
isrouter -> rtm_flags & RTF_GATEWAY
asked -> rtm_rmx.rmx_pksent
state -> rtm_rmx.rmx_state (maps to rmx_weight via define)
Reviewed by: ae
If source of ARP request didn't pass the routing check
(e.g. not in directly connected network), be polite and
still answer the request instead of dropping frame.
Reported by: quadro at irc@rusnet
buffer for each block number in the range with gbincore(), look up the
next instantiated buffer with the logical block number which is
greater or equal to the next lblkno. This significantly speeds up the
iteration for sparce-populated range.
Move the iteration into new helper bnoreuselist(), which is structured
similarly to flushbuflist().
Reported and tested by: pho
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
the type stability of the buffers memory. Instead of memoizing
pointer to the next buffer and validating it, remember the next
logical block number in the bo list and re-lookup.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
to do is to clean up the timer handling using the async-drain.
Other optimizations may be coming to go with this. Whats here
will allow differnet tcp implementations (one included).
Reviewed by: jtl, hiren, transports
Sponsored by: Netflix Inc.
Differential Revision: D4055
When a multipath member is orphaned its access members are zeroed before its
removed if marked for wither, so prevent any future calls to g_access on
such members.
This prevents a panic on debug kernels which validates the resultant values
aren't negative.
Reviewed by: mav
MFC after: 2 weeks
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D4416
The intention was to just limit leading zeroes on numeric names. That
check is now improved to also catch the leading spaces and '+' that
strtoul can pass through.
PR: 204897
MFC after: 3 days
(1) The pmap argument passed to the function must be current pmap only.
(2) The process must be single threaded as the function is called either
when a process is exiting or from exec_new_vmspace().
Remove pmap_tlb_flush_ng() which is not used anywhere now.
Approved by: kib (mentor)
When using lagg failover mode neither Gratuitous ARP (IPv4) or Unsolicited
Neighbour Advertisements (IPv6) are sent to notify other nodes that the
address may have moved.
This results is slow failover, dropped packets and network outages for the
lagg interface when the primary link goes down.
We now use the new if_link_state_change_cond with the force param set to
allow lagg to force through link state changes and hence fire a
ifnet_link_event which are now monitored by rip and nd6.
Upon receiving these events each protocol trigger the relevant
notifications:
* inet4 => Gratuitous ARP
* inet6 => Unsolicited Neighbour Announce
This also fixes the carp IPv6 NA's that stopped working after r251584 which
added the ipv6_route__llma route.
The new behavour can be controlled using the sysctls:
* net.link.ether.inet.arp_on_link
* net.inet6.icmp6.nd6_on_link
Also removed unused param from lagg_port_state and added descriptions for the
sysctls while here.
PR: 156226
MFC after: 1 month
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D4111
in pmap_remove_pages().
Some points were considered:
(1) There is no range TLB flush cp15 function.
(2) There is no target selection for hardware TLB flush broadcasting.
(3) Some memory ranges could be mapped sparsely.
(4) Some memory ranges could be quite large.
Tested by buildworld on RPi2 and Jetson TK1, i.e. 4 core platforms.
It turned out that the buildworld time is faster. On the other hand,
when the postponed TLB flush was also removed from pmap_remove_pages(),
the result was worse. But pmap_remove_pages() is called for removing
all user mapping from a process, thus it's quite expected.
Note that the postponed TLB flushes came here from i386 pmap where
hardware TLB flush broadcasting is not available.
Approved by: kib (mentor)
This fixes an issue observed on Cortex A7 (RPi2) and on Cortex A15
(Jetson TK1) causing various memory corruptions. It turned out that
even L2 page table with no valid mapping might be a subject of such
caching.
Note that not all platforms have intermediate TLB caching implemented.
An open question is if this fix is sufficient for all platforms with
this feature.
Approved by: kib (mentor)
Without the patch, there is a race condition: when poll() is invoked(),
if kvp_globals.daemon_busy is false, the daemon won't be timely
woke up, because hv_kvp_send_msg_to_daemon() can't wake up the daemon
in this case.
Submitted by: Dexuan Cui <decui@microsoft.com>
Sponsored by: Microsoft OSTC
Reviewed by: delphij, royger
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D4258
Current code doesn't try to make use of the full page when bouncing because
the size is only expanded to be a multiple of the alignment. Instead try to
always create segments of PAGE_SIZE when using bounce pages.
This allows us to remove the specific casing done for
BUS_DMA_KEEP_PG_OFFSET, since the requirement is to make sure the offsets
into contiguous segments are aligned, and now this is done by default.
Sponsored by: Citrix Systems R&D
Reviewed by: hps, kib
Differential revision: https://reviews.freebsd.org/D4119
panics when unloading the dummynet and IPFW modules:
- The callout drain function can sleep and should not be called having
a non-sleepable lock locked. Remove locks around "ipfw_dyn_uninit(0)".
- Add a new "dn_gone" variable to prevent asynchronous restart of
dummynet callouts when unloading the dummynet kernel module.
- Call "dn_reschedule()" locked so that "dn_gone" can be set and
checked atomically with regard to starting a new callout.
Reviewed by: hiren
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D3855
This is a holdover from how reset is handled in the ARGE_MDIO world.
You need to define the mdio bus device if you want to use the ethernet
device or the arge setup path doesn't bring the MAC out of reset.
The new flag, -c <period>, sets the interrupt coalescing period in
microseconds through the new ioat(4) API ioat_set_interrupt_coalesce().
Also add a -z flag to zero ioat statistics before tests, to make it easy
to measure results.
Sponsored by: EMC / Isilon Storage Division
In I/OAT, this is done through the INTRDELAY register. On supported
platforms, this register can coalesce interrupts in a set period to
avoid excessive interrupt load for small descriptor workflows. The
period is configurable anywhere from 1 microsecond to 16.38
milliseconds, in microsecond granularity.
Sponsored by: EMC / Isilon Storage Division
directly into a loader (and thus kernel) env var, using the syntax
ubenv import ldvarname=ubvarname
Without the varname= prefix it uses the historical behavior of importing
to the name uboot.ubvarname.
Tested with RTL8188EU and RTL8188CUS in STA mode.
Reviewed by: kevlo
Approved by: adrian (mentor)
Differential Revision: https://reviews.freebsd.org/D4523
Certain interfaces (e.g. pfsync0) do not have ip6 addresses (in other words,
ifp->if_afdata[AF_INET6] is NULL). Ensure we don't panic when the MTU is
updated.
pfsync interfaces will never have ip6 support, because it's explicitly disabled
in in6_domifattach().
PR: 205194
Reviewed by: melifaro, hrs
Differential Revision: https://reviews.freebsd.org/D4522
sys/dev/mpr/mpr_sas_lsi.c
sys/dev/mps/mps_sas_lsi.c
When mp[rs]sas_get_sata_identify returns
MPI2_IOCSTATUS_SCSI_PROTOCOL_ERROR, don't bother retrying. Protocol
errors aren't likely to be fixed by sleeping.
Without this change, a system that generated may protocol errors due
to signal integrity issues was taking more than an hour to boot, due
to all the retries.
Reviewed by: slm
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4553
* Use the interrupt-map property to route interrupts
* Remove the IRQ rman, it's now unneeded
* Support MSI/MSI-X interrupts
With this I'm able to use the two NICs I've tested (em and msk), however
while I can boot with an AHCI devie attached it fails when any drives are
connected.
Obtained from: ABT Systems Ltd
Sponsored by: SoftIron Inc
Currently, in case when npkts >= 2, RSSI and Rx radiotap fields
will be overridden by the next packet. As a result, every packet
from this chain will use the same RSSI / radiotap data.
After this change, RSSI and radiotap structure will be filled
for every frame right before ieee80211_input() call.
Tested with RTL8188EU / RTL8188CUS, STA and MONITOR modes.
Reviewed by: kevlo
Approved by: adrian (mentor)
Differential Revision: https://reviews.freebsd.org/D4487
absolute position. This seems to be correlated with only removing a single
finger. To work around this report no movement on from the first packet
when the user exits scrolling.
the kernel. These registers are all callee saved, and as such will be
restored before returning to the exception handler.
Userland still needs these registers to be restored as they may be changed
by the kernel and we don't currently track these places.
An implementation from rum(4) was used (it looks simpler for me).
Will be used for h/w encryption support.
Reviewed by: kevlo
Approved by: adrian (mentor)
Differential Revision: https://reviews.freebsd.org/D4447
- Add IEEE80211_GET_SLOTTIME(ic) macro.
- Use predefined macroses to set slot time.
Approved by: adrian (mentor)
Differential Revision: https://reviews.freebsd.org/D4044
Before r291643, adding new interface prefix had the following logic:
try_add:
EEXIST && (PINNED) {
try_del(w/o PINNED flag)
if (OK)
try_add(PINNED)
}
In r291643, deletion was performed w/ PINNED flag held which leaded
to new interface prefixes (like ::1) overriding older ones.
Fix this by requesting deletion w/o RTF_PINNED.
PR: kern/205285
Submitted by: Fabian Keil <fk at fabiankeil.de>
Strictly speaking, missing devinfo is error which can be caused
by instantiating child using device_add_child() instead of
BUS_ADD_CHILD(). However, we can tolerate it.
Approved by: kib (mentor)
By using this functions, we can parse a list of tuples, each of them holds
xref and variable number of values.
This kind of list is used in DT for clocks, gpios, resets ...
Discussed with: ian, nwhitehorn
Approved by: kib (mentor)
Differential Revision: https://reviews.freebsd.org/D4316
LLE structure is mostly unchanged during its lifecycle: there are only 2
things relevant for fast path lookup code:
1) link-level address change. Since r286722, these updates are performed
under AFDATA WLOCK.
2) Some sort of feedback indicating that this particular entry is used so
we send NS to perform reachability verification instead of expiring entry.
The only signal that is needed from fast path is something like binary
yes/no.
The latter is solved by the following changes:
Special r_skip_req (introduced in D3688) value is used for fast path feedback.
It is read lockless by fast path, but updated under req_mutex mutex. If this
field is non-zero, then fast path will acquire lock and set it back to 0.
After transitioning to STALE state, callout timer is armed to run each
V_nd6_delay seconds to make sure that if packet was transmitted at the start
of given interval, we would be able to switch to PROBE state in V_nd6_delay
seconds as user expects.
(in STALE state) timer is rescheduled until original V_nd6_gctimer expires
keeping lle in STALE state (remaining timer value stored in lle_remtime).
(in STALE state) timer is rescheduled if packet was transmitted less that
V_nd6_delay seconds ago to make sure we transition to PROBE state exactly
after V_n6_delay seconds.
As a result, all packets towards lle in REACHABLE/STALE/PROBE states are handled
by fast path without acquiring lle read lock.
Differential Revision: https://reviews.freebsd.org/D3780
requests which page alignment + size is greater than MAXPHYS. Right
now md(4) over vnode would use the physical buffer of the size MAXPHYS
to map a data of size MAXPHYS + page offset of the user buffer. This
typically corrupts next pbuf, or, if the pbuf used was the last pbuf
in the map, the next page after the pbuf's map.
Split request up to the size of io which fits into pbuf KVA with
alignment, and retry if a part of the bio is left unprocessed.
Reported by: Fabian Keil <fk@fabiankeil.de>
Tested by: Fabian Keil, pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
A panicking thread always executes with a critical section held, so any
attempt to allocate or free memory while dumping will otherwise cause a
second panic. This can occur, for example, if xpt_polled_action() completes
non-dump I/O that was pending at the time of the panic. The fact that this
can occur is itself a bug, but asserting in this case does little but
reduce the reliability of kernel dumps.
Suggested by: kib
Reported by: pho
http://www.t-es-t.hu/download/mips/md00090c.pdf this is bit 3 of the
config0 word, not bit 2. This should fix virtually indexed caches
(relatively new in the MIPS world, so no current platforms used this
and current code just uses it as an optimization). It was causing
false positives on newer platforms that default to large values for
the kseg0 cache coherency attribute.
Submitted by: Stanislav Galabov
PR: 205249
It is a part of MCDI rework to share more code among NIC families.
Submitted by: Andy Moreton <amoreton at solarflare.com>
Sponsored by: Solarflare Communications, Inc.
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D4481
tables. Some drivers needed some slight re-arrangement of declarations
to accommodate this. Change the USB pnp tables slightly to allow
better compatibility with the system by moving linux driver info from
start of each entry to the end. All other PNP tables in the system
have the per-device flags and such at the end of the elements rather
that at the beginning.
Differential Review: https://reviews.freebsd.org/D3458
block. Use it in all the PNP drivers to export either the current PNP
table. For uart, create a custom table and export it using
MODULE_PNP_INFO since it's the only one that matches on function
number.
Differential Review: https://reviews.freebsd.org/D3461
Intel NVMe controllers have a slow path for I/Os that span a 128KB stripe boundary but ZFS limits ashift, which is derived from d_stripesize, to 13 (8KB) so we limit the stripesize reported to geom(8) to 4KB.
This may result in a small number of additional I/Os to require splitting in nvme(4), however the NVMe I/O path is very efficient so these additional I/Os will cause very minimal (if any) difference in performance or CPU utilisation.
This can be controller by the new sysctl kern.nvme.max_optimal_sectorsize.
MFC after: 1 week
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D4446
e500mc, e5500, and e6500 all use the normal FPU, with the same behavior as AIM
hardware. e6500 also supports Altivec, so, although we don't yet have e6500
hardware to test on, add these IVORs as well. Theoretically, since it boots the
same as a e5500, it should work, single-threaded, single-core, with full altivec
support as of this commit.
With this commit, and some other patches to be committed shortly FreeBSD now
boots on the P5020, single-core, all the way to user space, and should boot just
fine on e500mc.
Relnotes: Yes (e500mc, e5500 support)
Sponsored by: Alex Perez/Inertial Computing
IAP_F_FM's as well as incorrect umask specifications for
some of the new Broadwell/Skylake PMC's. Also silvermont
had a *lot* of missing IAP_F_FM.
Sponsored by: Netflix Inc.
Remove redundant lookup of the old page from vm_page_replace. Verification
that the old page exists is already done by vm_radix_replace.
Submitted by: Ryan Libby <rlibby@gmail.com>
Reviewed by: alc, kib
Sponsored by: EMC / Isilon Storage Division
Follow-up to: https://reviews.freebsd.org/D4326
Differential Revision: https://reviews.freebsd.org/D4471
except during split, add, or create operations. This fixes a bug where the
wrong disk could be returned, and higher layers of ZFS would immediately
eject it again.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
o When opening by GUID, require both the pool and vdev GUIDs to
match. While it is highly unlikely for two vdevs to have the same
vdev GUIDs, the ZFS storage pool allocator only guarantees they
are unique within a pool.
o Modify the open behavior to:
- If we are opening a vdev that hasn't previously been opened,
open by path without checking GUIDs.
- Otherwise, open by path and verify GUIDs.
- If that fails, search all geom providers for a device with
matching GUIDs.
- If that fails, return ENOENT.
Submitted by: gibbs, asomers
Reviewed by: smh
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4486
This is (oddly) specified in the ARM Server Base System Architecture. It
extends the GICv2 to support MSI and MSI-X interrupts, however only the
latter are currently supported.
Only the FDT attachment is currently supported, however the attachment
and core driver are split to help adding ACPI support in the future.
Obtained from: ABT Systems Ltd
Relnotes: yes
Sponsored by: SoftIron Inc
This routine checks that there are no locks held for an inp,
without having any lock on the inp. This breaks if the inp
goes away when it is called. This happens on stress tests
on a RPi B+.
MFC after: 3 days
When we are detecting a partition table and didn't find PMBR, try to
read backup GPT header from the last sector and if it is correct,
assume that we have GPT.
Reviewed by: rpokala
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D4282
To reduce code duplication in common code, consolidate similar privilege
check functions.
Submitted by: Richard Houldsworth <rhouldsworth at solarflare.com>
Reviewed by: gnn
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D4480
Submitted by: Richard Houldsworth <rhouldsworth at solarflare.com>
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D4455
The hardware supports descriptors with two non-contiguous pages. This
allows issuing one descriptor for an 8k copy from/to non-contiguous but
otherwise page-aligned memory.
Sponsored by: EMC / Isilon Storage Division
ip_dooptions(), icmp6_redirect_input(), in6_lltable_rtcheck(),
in6p_lookup_mcast_ifp() and in6_selecthlim() use new routing api.
Eliminate now-unused ip_rtaddr().
Fix lookup key fib6_lookup_nh_basic() which was lost diring merge.
Make fib6_lookup_nh_basic() and fib6_lookup_nh_extended() always
return IPv6 destination address with embedded scope. Currently
rw_gateway has it scope embedded, do the same for non-gatewayed
destinations.
Sponsored by: Yandex LLC
Update of common code to provide a query on the MAC_SPOOFING_TX and
CHANGE_MAC privileges instead of the deprecated MAC_SPOOFING privilege.
Submitted by: Richard Houldsworth <rhouldsworth at solarflare.com>
Reviewed by: gnn
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D4436
other end till it reaches predetermined threshold which is 3 for us right now.
Once that happens, we trigger fast-retransmit to do loss recovery.
Main problem with the current implementation is that we don't honor SACK
information well to detect whether an incoming ack is a dupack or not. RFC6675
has latest recommendations for that. According to it, dupack is a segment that
arrives carrying a SACK block that identifies previously unknown information
between snd_una and snd_max even if it carries new data, changes the advertised
window, or moves the cumulative acknowledgment point.
With the prevalence of Selective ACK (SACK) these days, improper handling can
lead to delayed loss recovery.
With the fix, new behavior looks like following:
0) th_ack < snd_una --> ignore
Old acks are ignored.
1) th_ack == snd_una, !sack_changed --> ignore
Acks with SACK enabled but without any new SACK info in them are ignored.
2) th_ack == snd_una, window == old_window --> increment
Increment on a good dupack.
3) th_ack == snd_una, window != old_window, sack_changed --> increment
When SACK enabled, it's okay to have advertized window changed if the ack has
new SACK info.
4) th_ack > snd_una --> reset to 0
Reset to 0 when left edge moves.
5) th_ack > snd_una, sack_changed --> increment
Increment if left edge moves but there is new SACK info.
Here, sack_changed is the indicator that incoming ack has previously unknown
SACK info in it.
Note: This fix is not fully compliant to RFC6675. That may require a few
changes to current implementation in order to keep per-sackhole dupack counter
and change to the way we mark/handle sack holes.
PR: 203663
Reviewed by: jtl
MFC after: 3 weeks
Sponsored by: Limelight Networks
Differential Revision: https://reviews.freebsd.org/D4225
Vast majority of rtalloc(9) users require only basic info from
route table (e.g. "does the rtentry interface match with the interface
I have?". "what is the MTU?", "Give me the IPv4 source address to use",
etc..).
Instead of hand-rolling lookups, checking if rtentry is up, valid,
dealing with IPv6 mtu, finding "address" ifp (almost never done right),
provide easy-to-use API hiding all the complexity and returning the
needed info into small on-stack structure.
This change also helps hiding route subsystem internals (locking, direct
rtentry accesses).
Additionaly, using this API improves lookup performance since rtentry is not
locked.
(This is safe, since all the rtentry changes happens under both radix WLOCK
and rtentry WLOCK).
Sponsored by: Yandex LLC
r281257 added support for lazyload mode by allowing dtrace(1) to register
a DOF section on behalf of a traced process. This was implemented by
having libdtrace copy the DOF section into a heap-allocated buffer and
passing its address to the ioctl handler. However, DTrace uses the DOF
section address as a lookup key in certain cases, so the ioctl handler
should be given the target process' DOF section address instead. This
change modifies the ADDDOF handler to copy the DOF section in from the
target process, rather than from dtrace(1).
These helper functions can be used to read in or write a buffer from or to
an arbitrary process' address space. Without them, this can only be done
using proc_rwmem(), which requires the caller to fill out a uio. This is
onerous and results in code duplication; the new functions provide a simpler
interface which is sufficient for most existing callers of proc_rwmem().
This change also adds a manual page for proc_rwmem() and the new functions.
Reviewed by: jhb, kib
Differential Revision: https://reviews.freebsd.org/D4245
r259397 (it contained the CAM_EXTLUN_VALID bit) and I added the
same type name with a different set of values back in r291716.
The old ccb_xflags enumeration still exists in FreeBSD stable/10.
Shift all of the new values by one bit to avoid compatibility
issues when merged to stable/10.
MFC after: 3 days
Sponsored by: Spectra Logic
from 1500 to 1496 bytes. The MTU should remain at 1500, extending the
frame size as per IEEE 802.3. Adding IFCAP_VLAN_MTU to the
if_capabilities field in the smsc driver solves the problem. The
datasheet for the LAN9512 chip, section 3.2.3 states that the chip
supports the extended frame.
Submitted by: rpp@ci.com.au
MFC after: 1 week
PR: 205050
new headers x86/include x86_var.h and x86_smp.h.
Reviewed by: emaste, jhb
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D4358
The pcb is saved at the top of the kernel stack on x86 platforms.
The initial kenrel stack pointer is set in the TSS so that the trapframe
from user -> kernel transitions begins directly below the pcb and grows
down.
The XSAVE changes moved the FPU save area out of the pcb and into a
variable-sized area after the pcb. This required updating the expressions
to calculate the initial stack pointer from 'stacktop - sizeof(pcb)' to
'stacktop - sizeof(pcb) + FPU save area size'.
The i386_set_ioperm() system call allows user applications to access
individual I/O ports via the I/O port permission bitmap in the TSS.
On FreeBSD this requires allocating a custom per-process TSS instead of
using the shared per-CPU TSS.
The expression to initialize the initial kernel stack pointer in the
per-process TSS created for i386_set_ioperm() was not properly updated
after the XSAVE changes. Processes that used i386_set_ioperm() would
trash the trapframe during subsequent context switches resulting in
panics from memory corruption.
This changes fixes the kernel stack pointer calculation for the per-process
TSS.
Reviewed by: kib, n_hibma
Reported by: n_hibma
MFC after: 1 week
include the following list of changes:
- Added eswitch ACL table management
Introduce API for managing ACL table.
This API include the following features:
1) vlan filter - for VST/VGT+ support.
2) spoofcheck.
3) robust functionality to allow/drop general untagged/tagged traffic.
4) support for both ingress and egress ACL types.
- Added loopback filter to the vacl table.
- Added multicast list set in the vPort context
- Added promiscuous mode set in the vPort context
- Set the vlan list in vPort context
1) Check caps if VLAN list is not longer than FW supports
2) Set MODIFY_NIC_VPORT_CONTEXT command
- Changed MLX5_EEPROM_MAX_BYTES from 48 to 32 so that a single EEPROM
reading cannot cross the 128-byte boundary. Previously reading the
MCIA register was done in batches of 48 bytes. The third reading
would then by-pass the 127th byte, which means that part of the low
page and part of the high page would be read at the same time, which
created a bug:
1st: 0-47 bytes
2nd: 48-95 bytes
3rd: 96-143 bytes
MFC after: 1 week
Sponsored by: Mellanox Technologies
Differential Revision: https://reviews.freebsd.org/D4411
driver. This includes binding all interrupt and worker threads
according to the RSS configuration, setting up correct Toeplitz
hashing keys as given by RSS and setting the correct mbuf
hashtype for all received traffic.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Differential Revision: https://reviews.freebsd.org/D4410
clock_gettime(2) on ARMv7 and ARMv8 systems which have architectural
generic timer hardware. It is similar how the RDTSC timer is used in
userspace on x86.
Fix a permission problem where generic timer access from EL0 (or
userspace on v7) was not properly initialized on APs.
For ARMv7, mark the stack non-executable. The shared page is added for
all arms (including ARMv8 64bit), and the signal trampoline code is
moved to the page.
Reviewed by: andrew
Discussed with: emaste, mmel
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D4209
Cleanup setting of ctime/mtime/birthtime: do not set IN_ACCESS or
IN_UPDATE, then clear them with ufs_itimes(), making transient
(possibly inconsistent) change to the times, and then copy
user-supplied times into the inode. Instead, directly clear IN_ACCESS
or IN_UPDATE when user supplied the time, and copy the value into the
inode.
Minor inconsistency left is that the inode ctime is updated even when
birthtime update attempt is performed on a UFS1 volume.
Submitted by: bde
MFC after: 2 weeks
completion events can be moderated in the same way like RX completion
events. Expose this functionality by a sysctl variable.
MFC after: 1 week
Sponsored by: Mellanox Technologies
Differential Revision: https://reviews.freebsd.org/D4409
Set the port MTU and then query it and report if any problems instead.
MFC after: 1 week
Submitted by: Shahar Klein <shahark@mellanox.com>
Sponsored by: Mellanox Technologies
Differential Revision: https://reviews.freebsd.org/D4408
TLV routines use 'uint8_t *', NVRAM code uses caddr_t. Just cast to
required type to fix the warning.
Required to build with -Werror=pointer-signg.
Reviewed by: gnn
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D4391
The header is not present on FreeBSD, but exists on OmniOS where sfxge
common code is used as well.
Reviewed by: gnn
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D4390
It is better do not mix TxQ creation and receive event flags since only
checksum flags are applicable to TxQ.
Also it will allow to add a new TxQ creation specific flags.
Reviewed by: gnn
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D4389
When writing the MUM firmware the chunk size must be equal to the erase
size.
Submitted by: Laurence Evans <levans at solarflare.com>
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D4388
Use flag on vadapter alloc when reported as a supported capability.
Use the slow device reset only when the capability is missing.
Submitted by: Richard Houldsworth <rhouldsworth at solarflare.com>
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D4387
iscsi's shutdown_pre_sync prio was SHUTDOWN_PRI_FIRST which caused it to
run before other high priority handlers such as filesystems e.g. ZFS.
This meant the iscsi sessions where removed before the ZFS geom consumer
was closed, resulting in a panic from g_access calls on debug kernels
due to negative acr.
Instead use the same as the old iscsi_initiator SHUTDOWN_PRI_DEFAULT-1
which allows it to run before dashutdown etc but after filesystems.
MFC after: 2 weeks
Sponsored by: Multiplay
On vm_page_rename failure, fix a missing object unlock and a double free of
a page.
First remove the old page, then rename into other page into first_object,
then free the old page. This avoids the problem on rename failure. This is
a little ugly but seems to be the most straightforward solution.
Tested with:
$ sysctl debug.fail_point.uma_zalloc_arg="1%return"
$ kyua test -k /usr/tests/sys/Kyuafile
Submitted by: Ryan Libby <rlibby@gmail.com>
Reviewed by: kib
Seen by: alc
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4326
Typical TLBs have 40-512 entries available. At some point, iterating
every single page in a requested invalidation range and issuing invlpg
on it is more expensive than flushing the TLB and allowing it to reload
on demand.
Broadwell CPUs have 1536 L2 TLB entries, so I've picked the arbitrary
number 4096 entries as a hueristic at which point we flush TLB rather
than invalidating every single potential page.
Reviewed by: alc
Feedback from: jhb, kib
MFC notes: Depends on r291688
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D4280
* When processing a cookie, use the number of
streams announced in the INIT-ACK.
* When sending an INIT-ACK for an existing
association, use the value from the association,
not from the end-point.
MFC after: 1 week
The erase size is reported by the nvram info command.
Submitted by: Paul Fox <pfox at solarflare.com>
Reviewed by: gnn
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D4386
- Restore R92C_TXDW4_HWSEQ_EN bit - it is used by non-8188EU chips.
- Fix DRVRATE bit usage.
Tested with:
- RTL8188EU, STA mode.
- RTL8188CUS, STA mode.
Reviewed by: kevlo
Approved by: adrian (mentor)
Differential Revision: https://reviews.freebsd.org/D4352
While here update for armv6 to a tested value.
Submitted by: Howard Su <howard0su@gmail.com>
Reviewed by: stat
Differential Revision: https://reviews.freebsd.org/D4315
LLE structure is mostly unchanged during its lifecycle.
To be more specific, there are 2 things relevant for fast path
lookup code:
1) link-level address change. Since r286722, these updates are performed
under AFDATA WLOCK.
2) Some sort of feedback indicating that this particular entry is used so
we re-send arp request to perform reachability verification instead of
expiring entry. The only signal that is needed from fast path is something
like binary yes/no.
The latter is solved by the following changes:
1) introduce special r_skip_req field which is read lockless by fast path,
but updated under (new) req_mutex mutex. If this field is non-zero, then
fast path will acquire lock and set it back to 0.
2) introduce simple state machine: incomplete->reachable<->verify->deleted.
Before that we implicitely had incomplete->reachable->deleted state machine,
with V_arpt_keep between "reachable" and "deleted". Verification was performed
in runtime 5 seconds before V_arpt_keep expire.
This is changed to "change state to verify 5 seconds before V_arpt_keep,
set r_skip_req to non-zero value and check it every second". If the value
is zero - then send arp verification probe.
These changes do not introduce any signifficant control plane overhead:
typically lle callout timer would fire 1 time more each V_arpt_keep (1200s)
for used lles and up to arp_maxtries (5) for dead lles.
As a result, all packets towards "reachable" lle are handled by fast path without
acquiring lle read lock.
Additional "req_mutex" is needed because callout / arpresolve_slow() or eventhandler
might keep LLE lock for signifficant amount of time, which might not be feasible
for fast path locking (e.g. having rmlock as ether AFDATA or lltable own lock).
Differential Revision: https://reviews.freebsd.org/D3688
Boundary Trace to assembly to reduce the overhead of these checks.
Submitted by: Howard Su <howard0su@gmail.com>
Relnotes: Yes
Differential Revision: https://reviews.freebsd.org/D4266
suppression but the version of the IOAPIC reported is 0x11 and neither
IOAPIC EOIR nor the Linux trick of temporal reprogramming of the pin
to edge-trigger mode to issue EOI work.
Disable eoi suppression if KVM is detected. The mode can still be
forced with the tunable.
Reported and tested by: Roman Mamontov <mr.xanto@gmail.com>
Sponsored by: The FreeBSD Foundation
ptes mapping the kernel on CPUs where global TLB entries are
supported, revert to flushing only non-global entries, i.e. to the
pre-r291688 state. There is no need to flush global TLB entries,
since only global entries created during the previous iterations of
the loop could exist at this moment.
Submitted by: alc
Differential revision: https://reviews.freebsd.org/D4368
Required to build with -Werror=unused-but-set-variable.
Keep it under #if 0 as a reminder for parse error processing.
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
The SFL9122 "Huntington" controller was never built.
Submitted by: Mark Spender <mspender at solarflare.com>
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Submitted by: Artem V. Andreev <Artem.Andreev at oktetlabs.ru>
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D4355
If, for example, a VF is configured to use a 1500 byte MTU, but the port
it is attached to is set to 9000 bytes, overlength frames can be received
by the VF. As Huntington scatters by default, these overlength packets
would be scattered across several descriptors, with all except the last
having the CONT bit set.
To avoid this, disable scatter when creating RXQs if the firmware
supports doing so, which all recent versions do. Then we only get
a single descriptor from an overlength frame. This will have the CONT
bit set to indicate it was truncated, so we can discard it.
Submitted by: Mark Spender <mspender at solarflare.com>
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D4354
Submitted by: Paul Fox <pfox at solarflare.com>
Sponsored by: Solarflare Communications, Inc.
MFC after: 2 days
Differential Revision: https://reviews.freebsd.org/D4353
For completeness add a VNASSERT that there are no threads waiting on a
range lock (this was previously checked on every vnode free).
Reported by; Rick Macklem
Fix from: Mateusz Guzik
PR: 204949
Add a new bp argument to g_disk_maxsegs(), and add a new function,
g_disk_maxsize() tha will properly determine the maximum I/O size for a
delete or non-delete bio.
Submitted by: will
MFC after: 1 week
Sponsored by: Spectra Logic