If the affinity watch is removed by setting an empty list the
disconnect callback won't be called which was the only place
the watch ID was cleared. This resulted in the next SetProperty call
to think a watch existed, and attempt to compare the sender address
which would be NULL.
The watch ID should be cleared inside the destroy callback, not
the disconnect callback.
When the affinity is set to the current BSS lower the roaming
threshold to loosly lock IWD to the current BSS. The lower
threshold is automatically removed upon roaming/disconnection
since the affinity array is also cleared out.
This property will hold an array of object paths for
BasicServiceSet (BSS) objects. For the purpose of this patch
only the setter/getter and client watch is implemented. The
purpose of this array is to guide or loosely lock IWD to certain
BSS's provided that some external client has more information
about the environment than what IWD takes into account for its
roaming decisions.
For the time being, the array is limited to only the connected
BSS path, and any roams or disconnects will clear the array.
The intended use case for this is if the device is stationary
an external client could reduce the likelihood of roaming by
setting the affinity to the current BSS.
A user reported a crash which was due to the roam trigger timeout
being overwritten, followed by a disconnect. Post-disconnect the
timer would fire and result in a crash. Its not clear exactly where
the overwrite was happening but upon code inspection it could
happen in the following scenario:
1. Beacon loss event, start roam timeout
2. Signal low event, no check if timeout is running and the timeout
gets overwritten.
The reported crash actually didn't appear to be from the above
scenario but something else, so this logic is being hardened and
improved
Now if a roam timeout already exists and trying to be rearmed IWD
will check the time remaining on the current timer and either keep
the active timer or reschedule it to the lesser of the two values
(current or new rearm time). This will avoid cases such as a long
roam timer being active (e.g. 60 seconds) followed by a beacon or
packet loss event which should trigger a more agressive roam
schedule.
This will help to get rid of magic number use throughout the project.
The definitions should be limited to global magic numbers that are used
throughout the project, for example SSID length, MAC address length,
etc.
Due to an unnoticed bug after adding the BasicServiceSet object into
network, it became clear that since station already owns the scan_bss
objects it makes sense for it to manage the associated DBus objects
as well. This way network doesn't have to jump through hoops to
determine if the scan_bss object was remove, added, or updated. It
can just manage its list as it did prior.
From the station side this makes things very easy. When scan results
come in we either update or add a new DBus object. And any time a
scan_bss is freed we remove the DBus object.
This will tell network the BSS list is being updated and it can
act accordingly as far as the BSS DBus registrations/unregistration.
In addition any scan_bss object needing to be freed has to wait
until after network_bss_stop_update() because network has to be able
to iterate its old list and unregister any BSS's that were not seen
in the scan results. This is done by pushing each BSS needing to be
freed into a queue, then destroying them after the BSS's are all
added.
The workaround for Cisco APs reporting an operating class of zero
is still a bug that remains in Cisco equipment. This is made even
worse with the introduction of 6GHz where the channel numbers
overlap with both 2.4 and 5GHz bands. This makes it impossible to
definitively choose a frequency given only a channel number.
To improve this workaround and cover the 6GHz band we can calculate
a frequency for each band and see what is successful. Then append
each frequency we get to the list. This will result in more
frequencies scanned, but this tradeoff is better than potentially
avoiding a roam to 6GHz or high order 5ghz channel numbers.
Gets the current autoconenct setting. This is not the current
autoconnect state. Will be used in DPP to reset station's autoconnect
setting back to what it was prior to DPP, in case of failure.
After adding the NETDEV_RESULT_DISCONNECTED enum, handshake failures
initiated by the AP come in via this result so the existing logic
to call network_connect_failed() was broken. We could still get a
handshake failure generated internally, so that has been preserved
(via NETDEV_RESULT_HANDSHAKE_FAILED) but a check for a 4-way
handshake timeout reason code was also added.
This new event is sent during a connection if netdev recieves a
disconnect event. This patch cleans up station to handle this
case and leave the existing NETDEV_EVENT_DISCONNECTED_BY_{AP,SME}
handling only for CONNECTED, NETCONFIG, and FW_ROAMING states.
This new result is meant to handle cases where a disconnect
event (deauth/disassoc) was received during an ongoing connection.
Whether that's during authentication, association, the 4-way
handshake, or key setting.
There are a few values which are nice to see in debug logs. Namely
the BSS load and SNR. Both of these values may not be available
either due to the AP or local hardware limiations. Rather than print
dummy values for these refactor the print so append the values only
if they are set in the scan result.
This shouldn't be possible in theory since the roam_bss_list being
iterated is a subset of entire scan_bss list station/network has
but to be safe, and catch any issues due to future changes warn on
this condition.
In order to complete the learned default group behavior station needs
to be aware of when an SAE/OWE connection retried. This is all
handled within netdev/sae so add a new netdev event so station can
set the appropriate network flags to prevent trying the non-default
group again.
There is special handling for buggy OWE APs which set a network flag
to use the default OWE group. Utilize the more persistent setting
within known-networks as well as the network object (in case there
is no profile).
This also renames the get/set APIs to be generic to ECC groups rather
than only OWE.
For anyone debugging or trying to identify network infrastructure
problems the IWD DBus API isn't all that useful and ultimately
requires going through debug logs to figure out exactly what
happened. Having a concise set of debug logs containing only
relavent information would be very useful. In addition, having
some kind of syntax for these logs to be parsed by tooling could
automate these tasks.
This is being done, starting with station, by using iwd_notice
which internally uses l_notice. The use of the notice log level
(5) in IWD will be strictly for the type of messages described
above.
The known frequency list is now a sorted list and the roam scan
results were not complying with this new requirement. The fix is
easy though since the iteration order of the scan results does
not matter (the roam candidates are inserted by rank). To fix
the known frequencies order we can simply reverse the scan results
list before iterating it.
In very large network deployments there could be a vast amount of APs
which could create a large known frequency list after some time once
all the APs are seen in scan results. This then increases the quick
scan time significantly, in the very worst case (but unlikely) just
as long as a full scan.
To help with this support in knownnetworks was added to limit the
number of frequencies per network. Station will now only get 5
recent frequencies per network making the maximum frequencies 25
in the worst case (~2.5s scan).
The magic values are now defines, and the recent roam frequencies
was also changed to use this define as well.
There was an unhandled corner case if netconfig was running and
multiple roam conditions happened in sequence, all before netconfig
had completed. A single roam before netconfig was already handled
(23f0f5717c) but this did not take into account any additional roam
conditions.
If IWD is in this state, having started netconfig, then roamed, and
again restarted netconfig it is still in a roaming state which will
prevent any further roams. IWD will remain "stuck" on the current
BSS until netconfig completes or gets disconnected.
In addition the general state logic is wrong here. If IWD roams
prior to netconfig it should stay in a connecting state (from the
perspective of DBus).
To fix this a new internal station state was added (no changes to
the DBus API) to distinguish between a purely WiFi connecting state
(STATION_STATE_CONNECTING/AUTO) and netconfig
(STATION_STATE_NETCONFIG). This allows IWD roam as needed if
netconfig is still running. Also, some special handling was added so
the station state property remains in a "connected" state until
netconfig actually completes, regardless of roams.
For some background this scenario happens if the DHCP server goes
down for an extended period, e.g. if its being upgraded/serviced.
This gives the tests a lot more fine-tune control to wait for
specific state transitions rather than only what is exposed over
DBus.
The additional events for "ft-roam" and "reassoc-roam" were removed
since these are now covered by the more generic state change events
("ft-roaming" and "roaming" respectively).
Using this will provide netdev with a connect callback and unify the
roaming result notification between FT and reassociation. Both paths
will now end up in station_reassociate_cb.
This also adds another return case for ft_handshake_setup which was
previously ignored by ft_associate. Its likely impossible to actually
happen but should be handled nevertheless.
Fixes: 30c6a10f28 ("netdev: Separate connect_failed and disconnected paths")
If the FT-Authenticate frame has been sent then a deauth is received
the work item for sending the FT-Associate frame is never canceled.
When this runs station->connected_network is NULL which causes a
crash:
src/station.c:station_try_next_transition() 7, target xx:xx:xx:xx:xx:xx
src/wiphy.c:wiphy_radio_work_insert() Inserting work item 5843
src/wiphy.c:wiphy_radio_work_insert() Inserting work item 5844
src/wiphy.c:wiphy_radio_work_done() Work item 5842 done
src/wiphy.c:wiphy_radio_work_next() Starting work item 5843
src/netdev.c:netdev_mlme_notify() MLME notification Remain on Channel(55)
src/ft.c:ft_send_authenticate()
src/netdev.c:netdev_mlme_notify() MLME notification Frame TX Status(60)
src/netdev.c:netdev_link_notify() event 16 on ifindex 7
src/netdev.c:netdev_mlme_notify() MLME notification Del Station(20)
src/netdev.c:netdev_mlme_notify() MLME notification Deauthenticate(39)
src/netdev.c:netdev_deauthenticate_event()
src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
src/netdev.c:netdev_disconnect_event()
Received Deauthentication event, reason: 7, from_ap: true
src/station.c:station_disconnect_event() 7
src/station.c:station_disassociated() 7
src/station.c:station_reset_connection_state() 7
src/station.c:station_roam_state_clear() 7
src/netconfig.c:netconfig_event_handler() l_netconfig event 2
src/netconfig-commit.c:netconfig_commit_print_addrs() removing address: yyy.yyy.yyy.yyy
src/resolve.c:resolve_systemd_revert() ifindex: 7
[DHCPv4] l_dhcp_client_stop:1264 Entering state: DHCP_STATE_INIT
src/station.c:station_enter_state() Old State: connected, new state: disconnected
src/station.c:station_enter_state() Old State: disconnected, new state: autoconnect_quick
src/wiphy.c:wiphy_radio_work_insert() Inserting work item 5845
src/netdev.c:netdev_mlme_notify() MLME notification Cancel Remain on Channel(56)
src/wiphy.c:wiphy_radio_work_done() Work item 5843 done
src/wiphy.c:wiphy_radio_work_next() Starting work item 5844
"Program terminated with signal SIGSEGV, Segmentation fault.",
"#0 0x0000565359ee3f54 in network_bss_find_by_addr ()",
"#0 0x0000565359ee3f54 in network_bss_find_by_addr ()",
"#1 0x0000565359ec9d23 in station_ft_work_ready ()",
"#2 0x0000565359ec0af0 in wiphy_radio_work_next ()",
"#3 0x0000565359f20080 in offchannel_mlme_notify ()",
"#4 0x0000565359f4416b in received_data ()",
"#5 0x0000565359f40d90 in io_callback ()",
"#6 0x0000565359f3ff4d in l_main_iterate ()",
"#7 0x0000565359f4001c in l_main_run ()",
"#8 0x0000565359f40240 in l_main_run_with_signal ()",
"#9 0x0000565359eb3888 in main ()"
This mispelling was present in the configuration, so I retained parsing
of the legacy BandModifier*Ghz options for compatibility. Without this
change anyone spelling GHz correctly in their configs would be very
confused.
Beacon loss handling was removed in the past because it was
determined that this even always resulted in a disconnect. This
was short sighted and not always true. The default kernel behavior
waits for 7 lost beacons before emitting this event, then sends
either a few nullfuncs or probe requests to the BSS to determine
if its really gone. If these come back successfully the connection
will remain alive. This can give IWD some time to roam in some
cases so we should be handling this event.
Since beacon loss indicates a very poor connection the roam scan
is delayed by a few seconds in order to give the kernel a chance
to send the nullfuncs/probes or receive more beacons. This may
result in a disconnect, but it would have happened anyways.
Attempting a roam mainly handles the case when the connection can
be maintained after beacon loss, but is still poor.
This is being done to allow the DPP module to work correctly. DPP
currently uses __station_connect_network incorrectly since it
does not (and cannot) change the state after calling. The only
way to connect with a state change is via station_connect_network
which requires a DBus method that triggered the connection; DPP
does not have this due to its potentially long run time.
To support DPP there are a few options:
1. Pass a state into __station_connect_network (this patch)
2. Support a NULL DBus message in station_connect_network. This
would require several NULL checks and adding all that to only
support DPP just didn't feel right.
3. A 3rd connect API in station which wraps
__station_connect_network and changes the state. And again, an
entirely new API for only DPP felt wrong (I guess we did this
for network_autoconnect though...)
Its about 50/50 between call sites that changed state after calling
and those that do not. Changing the state inside
__station_connect_network felt useful enough to cover the cases that
could benefit and the remaining cases could handle it easily enough:
- network_autoconnect(), and the state is changed by station after
calling so it more or less follows the same pattern just routes
through network. This will now pass the CONNECTING_AUTO state
from within network vs station.
- The disconnect/reconnect path. Here the state is changed to
ROAMING prior in order to avoid multiple state changes. Knowing
this the same ROAMING state can be passed which won't trigger a
state change.
- Retrying after a failed BSS. The state changes on the first call
then remains the same for each connection attempt. To support this
the current station->state is passed to avoid a state change.
The packet loss handler puts a higher priority on roaming compared
to the low signal roam path. This is generally beneficial since this
event usually indicates some problem with the BSS and generally is
an indicator that a disconnect will follow sometime soon.
But by immediately issuing a scan we run the risk of causing many
successive scans if more packet loss events arrive following
the roam scans (and if no candidates are found). Logs provided
further.
To help with this handle the first event with priority and
immediately issue a roam scan. If another event comes in within a
certain timeframe (2 seconds) don't immediately scan, but instead
rearm the roam timer instead of issuing a scan. This also handles
the case of a low signal roam scan followed by a packet loss
event. Delaying the roam will at least provide some time for packets
to get out in between roam scans.
Logs were snipped to be less verbose, but this cycled happened
5 times prior. In total 7 scans were issued in 5 seconds which may
very well have been the reason for the local disconnect:
Oct 27 16:23:46 src/station.c:station_roam_failed() 9
Oct 27 16:23:46 src/wiphy.c:wiphy_radio_work_done() Work item 29 done
Oct 27 16:23:47 src/netdev.c:netdev_mlme_notify() MLME notification Notify CQM(64)
Oct 27 16:23:47 src/station.c:station_packets_lost() Packets lost event: 10
Oct 27 16:23:47 src/station.c:station_roam_scan() ifindex: 9
Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_insert() Inserting work item 30
Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_next() Starting work item 30
Oct 27 16:23:47 src/station.c:station_start_roam() Using cached neighbor report for roam
Oct 27 16:23:47 src/scan.c:scan_notify() Scan notification Trigger Scan(33)
Oct 27 16:23:47 src/scan.c:scan_request_triggered() Active scan triggered for wdev a
Oct 27 16:23:47 src/scan.c:scan_notify() Scan notification New Scan Results(34)
Oct 27 16:23:47 src/netdev.c:netdev_link_notify() event 16 on ifindex 9
... scan results ...
Oct 27 16:23:47 src/station.c:station_roam_failed() 9
Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_done() Work item 30 done
Oct 27 16:23:47 src/netdev.c:netdev_mlme_notify() MLME notification Notify CQM(64)
Oct 27 16:23:47 src/station.c:station_packets_lost() Packets lost event: 10
Oct 27 16:23:47 src/station.c:station_roam_scan() ifindex: 9
Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_insert() Inserting work item 31
Oct 27 16:23:47 src/wiphy.c:wiphy_radio_work_next() Starting work item 31
Oct 27 16:23:47 src/station.c:station_start_roam() Using cached neighbor report for roam
Oct 27 16:23:47 src/scan.c:scan_notify() Scan notification Trigger Scan(33)
Oct 27 16:23:47 src/scan.c:scan_request_triggered() Active scan triggered for wdev a
Oct 27 16:23:48 src/scan.c:scan_notify() Scan notification New Scan Results(34)
Oct 27 16:23:48 src/netdev.c:netdev_link_notify() event 16 on ifindex 9
... scan results ...
Oct 27 16:23:48 src/station.c:station_roam_failed() 9
Oct 27 16:23:48 src/wiphy.c:wiphy_radio_work_done() Work item 31 done
Oct 27 16:23:48 src/netdev.c:netdev_mlme_notify() MLME notification Notify CQM(64)
Oct 27 16:23:48 src/station.c:station_packets_lost() Packets lost event: 10
Oct 27 16:23:48 src/station.c:station_roam_scan() ifindex: 9
Oct 27 16:23:48 src/wiphy.c:wiphy_radio_work_insert() Inserting work item 32
Oct 27 16:23:48 src/wiphy.c:wiphy_radio_work_next() Starting work item 32
Oct 27 16:23:48 src/station.c:station_start_roam() Using cached neighbor report for roam
Oct 27 16:23:48 src/scan.c:scan_notify() Scan notification Trigger Scan(33)
Oct 27 16:23:48 src/scan.c:scan_request_triggered() Active scan triggered for wdev a
Oct 27 16:23:49 src/netdev.c:netdev_link_notify() event 16 on ifindex 9
Oct 27 16:23:49 src/netdev.c:netdev_mlme_notify() MLME notification Del Station(20)
Oct 27 16:23:49 src/netdev.c:netdev_mlme_notify() MLME notification Deauthenticate(39)
Oct 27 16:23:49 src/netdev.c:netdev_deauthenticate_event()
Oct 27 16:23:49 src/netdev.c:netdev_mlme_notify() MLME notification Disconnect(48)
Oct 27 16:23:49 src/netdev.c:netdev_disconnect_event()
Oct 27 16:23:49 Received Deauthentication event, reason: 4, from_ap: false
If netconfig is canceled before completion (when roaming) the
settings are freed and never loaded again once netconfig is started
post-roam. Now after a roam make sure to re-load the settings and
start netconfig.
Commit 23f0f5717c did not correctly handle the reassociation
case where the state is set from within station_try_next_transition.
If IWD reassociates netconfig will get reset and DHCP will need to
be done over again after the roam. Instead get the state ahead of
station_try_next_transition.
Fixes: 23f0f5717c ("station: allow roaming before netconfig finishes")
For IWD to work correctly either 2.4GHz or 5GHz bands must be enabled
(even for 6GHz to work). Check this and don't allow IWD to initialize
if both 2.4 and 5GHz is disabled.
The function wiphy_band_is_disabled() return was a bit misleading
because if the band was not supported it would return true which
could be misunderstood as the band is supported, but disabled.
There was only one call site and because of this behavior
wiphy_band_is_disabled needed to be paired with checking if the
band was supported.
To be more descriptive to the caller, wiphy_band_is_disabled() now
returns an int and if the band isn't supported -ENOTSUP will be
returned, otherwise 1 is returned if the band is disabled and 0
otherwise.
This adds support to allow users to disable entire bands, preventing
scanning and connecting on those frequencies. If the
[Rank].BandModifier* options are set to 0.0 it will imply those
bands should not be used for scanning, connecting or roaming. This
now applies to autoconnect, quick, hidden, roam, and dbus scans.
This is a station only feature meaning other modules like RRM, DPP,
WSC or P2P may still utilize those bands. Trying to limit bands in
those modules may sometimes conflict with the spec which is why it
was not added there. In addition modules like DPP/WSC are only used
in limited capacity for connecting so there is little benefit gained
to disallowing those bands.
The auth/action status is now tracked in ft.c. If an AP rejects the
FT attempt with "Invalid PMKID" we can now assume this AP is either
mis-configured for FT or is lagging behind getting the proper keys
from neighboring APs (e.g. was just rebooted).
If we see this condition IWD can now fall back to reassociation in
an attempt to still roam to the best candidate. The fallback decision
is still rank based: if a BSS fails FT it is marked as such, its
ranking is reset removing the FT factor and it is inserted back
into the queue.
The motivation behind this isn't necessarily to always force a roam,
but instead to handle two cases where IWD can either make a bad roam
decision or get 'stuck' and never roam:
1. If there is one good roam candidate and other bad ones. For
example say BSS A is experiencing this FT key pull issue:
Current BSS: -85dbm
BSS A: -55dbm
BSS B: -80dbm
The current logic would fail A, and roam to B. In this case
reassociation would have likely succeeded so it makes more sense
to reassociate to A as a fallback.
2. If there is only one candidate, but its failing FT. IWD will
never try anything other than FT and repeatedly fail.
Both of the above have been seen on real network deployments and
result in either poor performance (1) or eventually lead to a full
disconnect due to never roaming (2).
The PMKID derivation has gotten messy due to the spec
updating/clarifying the hash size for the FT-8021X AKM. This
has led to hostapd updating the derivation which leaves older
hostapd versions using SHA1 and newer versions using SHA256.
To support this the checksum type is being fed to
handshake_state_get_pmkid so the caller can decide what sha to
use. In addition handshake_state_pmkid_matches is being added
which uses get_pmkid() but handles sorting out the hash type
automatically.
This lets preauthentication use handshake_state_get_pmkid where
there is the potential that a new PMKID is derived and eapol
can use handshake_state_pmkid_matches which only derives the
PMKID to compare against the peers.