-
Notifications
You must be signed in to change notification settings - Fork 284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SEGV after infinite loop retrying proxy_socks5_read_connection_response #2352
Comments
If it's a UDP packet it's coming from the testnet. You're probably running toxic with the testnet enabled which is why it's not working, and it's probably trying to send packets to your other client. |
No it's not Toxic - I double checked . I checked to see if it is old Echobot or the like but there's nothing running. It must be coming over the Mainnet as I'm behind a firewall. As I'm still seeing 05 packets, maybe the libtoxcore could define 0x05 on mainnet so that LOG_TRACE could be explicit when dropping them, and maybe log the IP. Then when the log says unknown it really is unknown. But the SEGV is unrelated to this, as is the occurqnce of an infinite networking loop. |
When I see 200 lines like
doesn't that mean something is retrying too hard and should have failed? |
The tcp code right now does not timeout established connections, that don't receive data. The same happens on tcp_relays. (#2332) |
@Green-Sky " tcp code right now does not timeout established connections," - no matter what the root cause of the problem is, I think the code should be bulletprooffed/tightened so that both the client and server TCP code have timeouts. It can't be that hard to do (YMMV :0) and #2332 gives a simple reproducible test case, so it that could be done it's a great step. That can be done right away and make help or solve the problem anyway. The TCP code probably doesn't get as much testing/fuzzing as the UDP and it may get attacked, like Tor gets attacked, so it should be hardened, @nurupo "did you modify the toxcore you are using in any way? " no -it's -DDebug and some CMakeLists things turned on but a clean TokTok/c-toxcore checkout of a couple of weeks ago, that have been running toxic and toxygen fine for weeks/months. It's a recent problem - last few days, and the logs always have these errors:
@nurupo " it's actually not toxcore that caused the crash but toxygen" No I don't think so. I know the stacktrace sez it SEGVed in the logger, but the logger is just print statements and hasn't changed in months. You can't ignore the logger worked fine for the preceding 200 lines of errors! My feeling is, and I've seen this before, the code infinite looped until resource exhaustion or something, which caused the SEGV - the code is spending most of its time in the logger, so it got caught there. I can CNP the logger code but the code is a CTYPEs wrapping of a Python callback that allocates no resources. We're very lucky because:
PS: in writing TCP client or server code remember that "reasonable" timeouts may have to be seriously extended when it's over a proxy. |
According to that stack trace, it crashes when calling Line 78 in 0a277b5
Either the As for the
No it doesn't. The reason why you are seeing so many of such messages is because there are A LOT of SOCKS5 connections going on, and toxcore keeps checking all of them for the data on every iteration, until the TCP connection is timed out, which is 10 seconds. It's not an infinite loop in the regular sense of a hot
Yes it does time out. As mentioned on IRC, TCP client has a clear timeout: if it doesn't establish a connection in 10 seconds -- the connection gets closed:
Now to the weird stuff going on in the log. A SOCKS5 negotiation goes like this:
Does this look familiar? For example, take a look at the TCP communication with the
First toxcore sends 3 bytes of Note that the first message is logged as Lines 636 to 645 in 8054854
Note that (2) is similarly logged as zeros, however because toxcore replies with (3), we can deduct that the node has in fact sent us Also note that step 4 of SOCKS5 negotiation is missing, the node never replies back, which is the cause of what @emdee-is called an "infinite loop" -- toxcore is checking on the connection request reply from the proxy, with
So a list of issues discovered:
|
I checked this locally and everything seems to be doing what it's supposed to be doing. I can't reproduce any of this buggy behaviour locally. Also, I'd wonder how people could be connecting successfully to groupchats using proxies if it were not setting a valid SOCKS5 proxy. |
Sorry, looks like the logs had me confused. When they said:
I thought toxcore was communicating directly with 85.143.221.42:33445, when in fact it was communicating with 127.0.0.1:9050. The I have edited my last comment, crossing out the mistaken parts. The other parts are still relevant though. |
@nurupo - thanks for you detailed analysis and I'm glad the bug report found some real issues. Just a quick comments that may help:
|
This is because when you connect to a groupchat (that is, you establish a connection with any one peer in a group) you get many redundant opportunities to directly connect to all the peers in the chat through the sync protocol, whereas with peers in your friends list, you're forced to wait for a DHT lookup response. |
So, what is happening is only the first three of the the four SOCKS5 negotiation packets are getting exchanged, e.g. the SOCKS5 proxy doesn't reply back to toxcore when toxcore requests the proxy to connect to 85.143.221.42:33445. I wonder why. It should reply back with success/failure at least. It could be that with the Tor proxy being Tor, it is taking a long time to establish a connection to the destination IP:port, and Toxygen crashed before Tor managed to established it and reply back to toxcore with the success/failure. Or perhaps it took Tor over 10 seconds to do so, at which point toxcore timedout and closed the SOCKS5 connection. |
@nurupo "SOCKS5 proxy doesn't reply back to toxcore when toxcore requests the proxy" I think that's the right general direction; I've had 2 more crashes today, and the first one kindly gets toxygen's logger off the hook:
I'll put the full log and analysis up tomorrow, but I just wanted to clarify that these take place in the initial bootstrapping phase of starting up, long before any NGC groups would be accessed by the user. The next one is almost a replay of the original with it crashing in the logger callback during bootstrapping, but a SIGILL this time:
each after dozens of previous There are warnings in the Tor log that the network is having problems:
The warnings are not that uncommon, but like it says, it could be an attack. There's nothing like it in the rest of my normal web usage with HTTP/POP... I don't want to beat a dead horse, and I really don't want to waste your time, but 3 crashes in 3 days is unheard of - I haven't had one in months - and it is possible Tox over Tor is targeted. Similar logs of short reads. Although JF, as you reminded me, LOG_TRACE is supposed to generate tonnes of logging, and nurupo you read them as being "normal", this is actually not the case for me - I've rarely seen the logger say anything, although it's been wired up for a couple of months. So I think there is a real problem. Thanks for your help and I'll put the logs for these 2 up tomorrow. |
As an aside to trouble shooting the issue: now that we known that this behaviour exists, should the code, instead of "hammering" at the proxy for a couple of hundred times in 10 seconds, maybe back off as the null reads or writes mount up, and wait longer but with less volume? |
(I'll speak of first,second,and third logs/events) This doesn't look right. I'm at the gdb command prompt of the third SEGV - hunting and pecking as I don't really know gdb.
Does that mean we tried to read a 33M packet, or is len not meaningful? |
@JFreegman I was in a NGC at the time: I could whisper to you and @Green-Sky but not chat to him or Tha - their chat was not coming online, but I could /whisper (or the toxygen equivalent) back and forth to them. (Did you get the URL I sent you in /whisper? toxygen is a little wierd in how it shows /whisper receipts.) @nurupo So the DHT can be blocked when the NGC sync protocol is resilient? |
are you sure the save is valid? i had the same kind of issues when i fed a garbage save to toxcore. edit: to be specific, a garbage save pointer that was not null. |
also, do me a favor and compile and run it with ASAN (-fsanitize=address) and without gdb. (asan does not work in gdb) |
Maybe - remember I don't know C - running a Python app under gdb was a huge step for me and I'm still exploring a system that when you ask for help on the commands gives you >200 lines of "help" -:) It would be useful if I knew what ASAN is, what it does, and where to get it. I asked gentoo if it could install if for me and it offered:
@Green-Sky Does that |
I have no doubts about the savefile, and I can check them section by section using my tox_profile utility. There's crap in there like dead and duplicated nodes in the DHT section #2347 #2343 but these SEGVs happen long after startup and always during bootstrap. By guess is someone has found a way of using guard or exit nodes to be malicious and it affects the DHT and not NGC - remember when I could /whisper to you but not chat with you? Bear in mind that Tox over Tor is a whole different ballgame than Tox over UDP, and is only lightly tested: not at all really - just simple SOCKS. |
@nurupo has done an excellent analysis here. not much to add except that i remembered it wrong :) |
ASAN or libasan or google address sanitizer is a library that intercepts calls to memory, tracks allocation etc. it can not be run with gdb though. You will have to recompile toxcore with it enabled. seems the toxcore cmake is mising an option for this, so you will have to add a edit: asan will tell you the place allocations happen, accessviolations doublefrees, (exclusive!!) use-after-frees etc |
I think I'll leave that for much later - my C and Cmake skills are primitive. Maybe write that up with instructions in INSTALL.md and CMakeLists.txt so it's easier to try to do for C noobs like me. Right now I think the best thing to do is to run down #2332 in case it's related as it's reproducible and a blocker. (Thanks for the "remembered it wrong" - I was just confused) (I don't think len = 33576277 is important; length=10 was the relevant one.) |
Here;s one of the SEGVs I had recently: It starts:
I poked around up the stack and down and found this:
I have 4158503296 friends. I;m popular! Does this mean I have a corrupt friends list. Recall earlier that I asked about:
which sounded like a warning to me. Could that be related - it's in most or all of the SEGV logs. |
closing this as old, and likely memory corruption that was my fault. |
I have weird networking problems with SOCKS proxies.
This is a very rare SEGV caught by gdb running python
(gdb -ex r --args /usr/bin/python3.9)
. This is an infinite loop leading to a segv undertoxygen
when connected to a SOCKS5 proxy (Tor). This happening at a time when toxic on the same machine is completely blocked from Tox chat and NGC. SEGVs are a rare case under toxygen - I've very rarely seen it.The tracebacks are Python running under gdb. toxygen is just wrapping libtoxcore, and any errors in the wrapping are almost always immediately fatal when calling, so I don't suspect toxygen.
So the trace is a mixture of:
a recent c-toxcore compiled TRACE - those messages start
TRAC>
Python logging, which all looks normal. It starts:
modname DEBUG|INFO debugmessage
or DBUG+ or sometimes no prefix at all -
[
gdb traceback at the end
Comments:
@Green-Sky I'm not someone who knows the code, but I'd say the the code in
TCP_client.c#950
is retryingproxy_socks5_read_connection_response
an infinite number of times, and at the very least, as a failsafe, it must give up and fail after a set amount of time or number of tries. There should be no infinite loops.@JFreegman Why is net_packet_type_name returning "unknown"?
There's no 0x05 entry in
typedef enum Net_Packet_Type
.Traceback:
https://bin.nixnet.services/?bf35a2c2c484f266#9JRCgpKJ62Rsruu99Fdr4gzYGpZwWVtyncGWYWdWMosc
The text was updated successfully, but these errors were encountered: