
This queue is for tickets about the Net-SNMP CPAN distribution.

Report information
The Basics
Id: 70589
Status: open
Worked: 10 min
Priority: 0/
Queue: Net-SNMP

People
Owner: dtown [...] cpan.org
Requestors: BitCard [...] ResonatorSoft.org
Cc:
AdminCc:

Bug Information
Severity: Normal
Broken in: v6.0.1
Fixed in: (no value)



Subject: Max Requests per host/IP patch
Moved from #69514. This implements the following:

* A new "max_requests" parameter, defaulting to 3, which limits how many requests an individual host is asked to process at the same time. This required the following changes:
  * A new $event array variable called _HOSTNAME (detailed in another patch; this patch assumes it's already in place).
  * Changes to register/deregister to keep track of hostnames in much the same way as the descriptor object.
  * Code at the top of _event_handle to skip over events that are past the max request limit until another request has finished.
  * An automatic change of max_requests to 1 if the host times out once, since it obviously cannot handle the requests it already had.
  * Renaming the unused "return_response_pdu" to "send_pdu_priority" and using it throughout SNMP.pm for _existing_ requests. This way, existing requests (for, say, get_tables) are sent immediately through the pipe and only the receive timers are put into the event list. This makes existing requests immune to the max_request limit (and post-select lag), and ensures that the host is not left waiting too long for our reply asking for more information.
  * A new parameter (plus help text) in both SNMP.pm and Net::SNMP::Transport.

This patch, along with the receive buffer patch, fixes both ends of the large request problem.

The receive buffer patch fixes the one-to-many IPs problem. In other words, if a single client (us, via Net::SNMP) is sending many requests to different hosts, it can be assumed that those hosts will collectively process those requests and send the results back faster than the one client can process all of the return packets. It's like a 75-core server processing everything and sending it back to the single-core client, which, before that patch, kept overloading itself.

This patch fixes the one-to-many _requests_ problem. In other words, a single client sends a host many different requests and forces the host to process all of them at the same time. For the client, sending a request for 20 large tables is easy; actually getting the data is a lot harder. Depending on how smart or dumb the host's SNMP software is, it may try to process all 20 requests at once. This results in timeouts, since it never manages to send any one packet in time. (In fact, I end up seeing late packets that get rejected because the msgID has already been thrown away.) The retries don't help at all, because all 20 requests time out at the same time and the code just resends the same 20 requests within the same time frame. Rinse and repeat until the retry limit is reached, and you end up with an angry server and no data to show for it. This is a problem even when sending to a single host, so it's not just an issue for large multi-host requests.

So, this patch keeps it to a reasonable 3 requests per host. Existing requests are still processed as normal, but new ones are held back until one of the other requests has finished (a rough sketch of the idea follows the numbered list below). Yes, 3 is somewhat of an arbitrary limit, but:

1. It's reasonable to assume that most hosts probably can't (and shouldn't) handle more than 3 table pull requests at a time.
2. It's adjustable per host by the user.
3. It has the potential to be replaced with an auto-threshold that adjusts the limit according to the response rate of the host, thus eliminating the arbitrary number.
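Roughly, the throttling idea looks like this. This is an illustrative Perl sketch, not the patch itself: the variable and sub names (%pending_by_host, @deferred, queue_request, request_finished, send_request) are invented for the example, while the real patch hooks the same logic into the Dispatcher via _HOSTNAME, register/deregister, and _event_handle.

    #!/usr/bin/perl
    # Illustrative sketch only -- not the actual patch.
    use strict;
    use warnings;

    my $MAX_REQUESTS = 3;    # default per-host limit; drop to 1 after a timeout
    my %pending_by_host;     # hostname => number of requests currently in flight
    my @deferred;            # [ $host, $request ] pairs waiting for a free slot

    # Stand-in for the real transmit path.
    sub send_request { my ($host, $request) = @_; print "sending '$request' to $host\n"; }

    sub queue_request {
        my ($host, $request) = @_;
        if (($pending_by_host{$host} // 0) >= $MAX_REQUESTS) {
            push @deferred, [ $host, $request ];   # over the limit, hold it back
            return;
        }
        $pending_by_host{$host}++;
        send_request($host, $request);
    }

    sub request_finished {
        my ($host) = @_;
        $pending_by_host{$host}--;
        # A slot opened up: release the first deferred request whose host
        # is now under its limit.
        for my $i (0 .. $#deferred) {
            my ($h, $r) = @{ $deferred[$i] };
            next if ($pending_by_host{$h} // 0) >= $MAX_REQUESTS;
            splice @deferred, $i, 1;
            $pending_by_host{$h}++;
            send_request($h, $r);
            last;
        }
    }

    # Twenty table pulls to one host: only three go out at first, the rest
    # follow as earlier requests complete.
    queue_request('host.example.com', "table-$_") for 1 .. 20;
    request_finished('host.example.com');    # e.g. called when a response arrives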
Subject: MaxRequests.patch.txt

Message body is not shown because it is too large.

The application using the Net::SNMP module is in a better position to determine the rate at which messages should be sent to a particular host. If no responses are being received from a particular host, it should be up to the application to throttle messages to that host. The lower layers of the Net::SNMP module do not have the appropriate information to make the proper decision as to whether to reduce the rate at which messages are being sent. Messages could time out because the host is busy, they could time out because the SNMPv1 community string is invalid, or they could time out because a particular table takes longer to respond than another. By implementing throttling in the module, the application is restricted by whatever method is used by the module to "decide" when messages should be sent.
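For reference, application-side throttling with the existing non-blocking interface looks roughly like the sketch below. It is only an illustration; the host name, community string, table OIDs, and the limit of three concurrent table pulls are placeholder values.

    use strict;
    use warnings;
    use Net::SNMP qw(snmp_dispatcher);

    # Sketch: the application keeps at most $limit table pulls outstanding
    # per host and starts the next one from the completion callback.
    my $limit     = 3;
    my $in_flight = 0;
    my @pending   = ('1.3.6.1.2.1.2.2', '1.3.6.1.2.1.4.20');   # tables still to fetch

    my ($session, $error) = Net::SNMP->session(
        -hostname    => 'host.example.com',   # placeholder host
        -community   => 'public',
        -nonblocking => 1,
    );
    die "session error: $error\n" unless defined $session;

    sub start_next {
        return if !@pending || $in_flight >= $limit;
        my $oid = shift @pending;
        $in_flight++;
        $session->get_table(
            -baseoid  => $oid,
            -callback => sub {
                my ($s) = @_;
                $in_flight--;
                # ... handle $s->var_bind_list() or $s->error() here ...
                start_next();    # a slot freed up, start another pull
            },
        );
    }

    start_next() for 1 .. $limit;   # prime up to $limit concurrent requests
    snmp_dispatcher();              # run the event loop until everything is done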
On Tue Nov 01 23:15:06 2011, DTOWN wrote:
> The application using the Net::SNMP module is in a better position to
> determine the rate at which messages should be sent to a particular
> host. If no responses are being received from a particular host, it
> should be up to the application to throttle messages to that host.
I actually wrote these patches because the application DIDN'T have enough information about how fast to send the messages. The application has no way of knowing how long the select time took on the receive side, for example, and so I had no idea if the Dispatcher (even with _once) was blocking because of processing time or because it was waiting for information to come down the pipe. I probably still have the code somewhere that was timing everything down the line to try to increase or decrease the send rate. It didn't work because of lack of information.
> The lower layers of the Net::SNMP module do not have the appropriate
> information to make the proper decision as to whether to reduce the rate
> at which messages are being sent. Messages could time out because the
> host is busy, they could time out because the SNMPv1 community string is
> invalid, or they could time out because a particular table takes longer
> to respond than another.
I disagree. The lower layers have the best information available about the packet-level transfer of information. If the host is busy, then sending 20 more requests at the same time is not a good idea. If the host is timing out because a table takes longer to respond, then sending 20 more requests at the same time is really not a good idea. And if the community string is wrong, then nothing is really going to help that, and the app won't be able to tell the difference either. On the flip side, you don't want the app to think that the community string is wrong when, in fact, the 20 requests sent had overloaded the host. Remove the problems associated with request sending/receiving and you give the app -more- information by process of elimination.
> By implementing throttling in the module, the application is restricted
> by whatever method is used by the module to "decide" when messages
> should be sent.
No, it's not. We're talking about a user-configurable option with what would seem to be a reasonable default. If we are arguing that the app has the best information, then let the app configure the throttle the way it wants. Heck, we can even make the default off and let the user decide to use the throttle. I'm not against that option.
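Concretely, that could be exposed as a session option. This is hypothetical usage only; the -maxrequests option name simply mirrors the patch's "max_requests" parameter and does not exist in any released Net::SNMP.

    use Net::SNMP;

    # Hypothetical: per-host throttle exposed as a session option, with 0
    # (or omitting the option) meaning "no throttle".
    my ($session, $error) = Net::SNMP->session(
        -hostname    => 'host.example.com',
        -community   => 'public',
        -nonblocking => 1,
        -maxrequests => 3,    # hypothetical option name; per-host concurrent request limit
    );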
Same question here: What kind of tests would be necessary to prove out the different use cases and make sure everything acts predictably?