In the current state, the alertmanager-irc-relay already sends minutely
IRC PINGs. This allows to check the IRC connection's health in protocol
without having to deal with specific TCP settings. However, even when
we are sending those PINGs, we don't process the server's PONGs or their
absence.
On one of my alertmanager-irc-relay instances, the time between the last
received PONG and the TCP read to fail was round about fifteen minutes.
All this time, the connection was already dead, but there was no attempt
to reestablish it.
The introduces changes keep book on the last received PONG and fails if
there was no new PONG within twice the pingFrequencySecs time. When
establishing a new connection during the SetupPhase, the current time
will be set as the last PONG's time to reset the time comparison.
After a connection loss on an IRC session with a ngIRCd, the
alertmanager-irc-relay was unable to reconnect. After some debugging,
the error's origin was the state tracking within the used goirc library.
When using an unidentified session, ngIRCd prefixes the user's ident
with a `~`. The state tracking registers this and keeps `~${NICK}` as
the current and the new ident for future reconnects. However, `~` is not
a valid char for the `<user>` part in the `USER` command, at least not
for ngIRCd.
To clarify this behaviour, take a look at the following log. First, the
initial connection is begin established correctly. Keep an eye on the
`USER` command being sent to the server.
> http.go:132: INFO Starting HTTP server
> irc.go:308: INFO Connected to IRC server, waiting to establish session
> connection.go:543: DEBUG -> NICK alertbot
> connection.go:543: DEBUG -> USER alertbot 12 * :Alertmanager IRC Relay
> connection.go:474: DEBUG <- :__SERVER__ 001 alertbot :Welcome to the Internet Relay Network alertbot!~alertbot@__IP__
Now, there was a network incident and the session needs to be recreated.
> connection.go:466: ERROR irc.recv(): read tcp __REDACTED__: read: connection timed out
> connection.go:577: INFO irc.Close(): Disconnected from server.
> irc.go:150: INFO Disconnected from IRC
> reconciler.go:129: INFO Channel #alerts monitor: context canceled while monitoring
> irc.go:300: INFO Connecting to IRC __SERVER__
> backoff.go:111: INFO Backoff for 0s starts
> backoff.go:114: INFO Backoff for 0s ends
> connection.go:390: INFO irc.Connect(): Connecting to __SERVER__.
> irc.go:308: INFO Connected to IRC server, waiting to establish session
> connection.go:543: DEBUG -> NICK alertbot
> connection.go:543: DEBUG -> USER ~alertbot 12 * :Alertmanager IRC Relay
> connection.go:474: DEBUG <- ERROR :Invalid user name
> connection.go:577: INFO irc.Close(): Disconnected from server.
> irc.go:150: INFO Disconnected from IRC
> irc.go:319: WARN Receiving a session down before the session is up, this is odd
This time, the used `user` part of the `USER` command has the prefixed
`~` and fails. However, without using `-debug` and taking a very close
look, this error can be missed very easy.
As the new ident is invalid, the alertmanager-irc-relay is now stuck in
an endless reconnection loop.
This fix is kind of straight forward and just checks if the ident has
changed before trying to reconnect. It might not be the prettiest
solution, but recreating the whole *irc.Config resulted in other bugs as
it was still referenced - even after being `Close`d.
handle nickserv identify requests instead of blindly issuing a message
when connected
this helps if nickserv's state is wiped and we are being asked to
re-identify
introduce a `nickserv_identify_patterns` config option. these patterns
are used to guess identify requests of the various nickserv implementations
Signed-off-by: Luca Bigliardi <shammash@google.com>
Also adopt interface for querying time information, so it can be faked
properly during at test time
Signed-off-by: Luca Bigliardi <shammash@google.com>
Make sure the underlying library context cancellation happens only
after the session has been shutdown.
Signed-off-by: Luca Bigliardi <shammash@google.com>
Use newly added ConnectContext function.
Note: This is using a yet-to-be-released goirc module version obtained
with "go get -u github.com/fluffle/goirc@master"
Signed-off-by: Luca Bigliardi <shammash@google.com>
Make sure the session is up before consuming alert messages.
Also, split main run loop for readability.
Signed-off-by: Luca Bigliardi <shammash@google.com>