Kqueue working
Sometime last week I finished the Kqueue implementation. After a failed deployment (it crashed the server an hour after the restart) I added some final touches, and now it works great. There is still a bug to fix where a socket sometimes attempts to send data to a nonexistent client, but this is pretty rare and will only cause someone to be disconnected from the server.
I am, however, trying to see what causes that problem.
And now for the performance results:
Using SDL_net to check which sockets had data, the server was taking about 15% of the CPU with 700 connections. Now, with Kqueue, it rarely goes over 6%. The nice thing about Kqueue is that it scales very well. For example, if it takes 0.01 ms to check a socket, it will take 1 ms to check 100 sockets. With select() or poll() it might take 0.01 ms to check a socket, but 5 ms to check 100 sockets, and 20 ms to check 200 sockets (all these numbers are made up, purely for illustration).
What this means for those who do not understand technical terms is that before this change, the server couldn't have handled more than, say, 1500 connections (with enough spare CPU for an invasion). Now, I think it can handle up to 5K connections or more.
It also means that the server should be more responsive than before, although there is not much of a difference if the server responds in 130 rather than 140 ms (most of it being the network latency).
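To make the select()/poll() versus Kqueue comparison above a bit more concrete, here is a minimal sketch. It uses Python's `select` module purely for brevity; the server's own code is not shown in this post, and the function names (`ready_with_select`, `register_with_kqueue`, `ready_with_kqueue`) are made up for the illustration. The kqueue functions only work on BSD and macOS.

```python
import select
import socket


def ready_with_select(socks, timeout=0.0):
    """Ask select() which of the given sockets are readable.

    The whole list is handed over on every call, so the work grows with
    the number of watched sockets, not with the number of ready ones.
    """
    readable, _, _ = select.select(socks, [], [], timeout)
    return readable


def register_with_kqueue(kq, socks):
    """Register each socket for read events once (BSD/macOS only)."""
    changes = [
        select.kevent(s.fileno(), select.KQ_FILTER_READ, select.KQ_EV_ADD)
        for s in socks
    ]
    # max_events=0: apply the registrations without fetching any events.
    kq.control(changes, 0, 0)


def ready_with_kqueue(kq, max_events=64, timeout=0.0):
    """Return the file descriptors that the kqueue reports as readable."""
    # Only ready descriptors come back; idle sockets are not listed.
    return [ev.ident for ev in kq.control(None, max_events, timeout)]
```

With select() the list of sockets is passed in again on each check, while with a kqueue the sockets are registered once and each later call only returns the ones that have something to read.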
The conclusion of working with Kqueue is that it is well worth the effort to implement it, and it is also pretty easy to do so, once you understand how it works. However, the documentation is not very good, and it is wrong, or at least misleading in some areas.
For example, if you try to change an event in the same kevent() call where you also fetch the pending events, the change is not applied to the events that call returns; it only takes effect afterwards, which can cause problems.
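One way to sidestep this kind of surprise is to never mix the two in a single call: submit the changes in one call that fetches no events, then fetch events in a second call. This is only a sketch of that idea, not necessarily what the server does; it uses Python's `select.kqueue` wrapper (BSD/macOS only), and `apply_then_wait` is a hypothetical helper name.

```python
import select
import socket


def apply_then_wait(kq, changes, max_events=64, timeout=None):
    """Apply kevent changes first, then fetch pending events separately."""
    # First call: changes only. max_events=0 means no events are returned,
    # so nothing reported here can predate the changes.
    kq.control(changes, 0, 0)
    # Second call: no changes, just collect whatever is pending now.
    return kq.control(None, max_events, timeout)
```

The cost is one extra system call whenever there are changes to submit, in exchange for events that always reflect the current registrations.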
And there is a nasty bug where if you add a listening socket to a kqueue, the sockets added after that via accept() will become, for kevent, listening sockets as well, so you won't get any notifications from them. The way I did it was to create two kqueues, one for the listening sockets and one for the normal sockets.
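The two-kqueue arrangement described above could look roughly like this. It is a sketch in Python's `select` module (BSD/macOS only) rather than the server's actual code, and the names `make_queues`, `accept_pending` and the `clients` dictionary are assumptions made for the example.

```python
import select
import socket


def make_queues(listener):
    """Create one kqueue for the listening socket and one for clients."""
    listen_kq = select.kqueue()
    client_kq = select.kqueue()
    watch = select.kevent(
        listener.fileno(), select.KQ_FILTER_READ, select.KQ_EV_ADD
    )
    listen_kq.control([watch], 0, 0)  # register only, fetch nothing
    return listen_kq, client_kq


def accept_pending(listener, listen_kq, client_kq, clients, timeout=0.0):
    """Accept waiting connections and watch them on the client kqueue."""
    for _ev in listen_kq.control(None, 8, timeout):
        # One accept per event; the read filter is level-triggered by
        # default, so further waiting connections show up on the next call.
        conn, _addr = listener.accept()
        conn.setblocking(False)
        clients[conn.fileno()] = conn
        watch = select.kevent(
            conn.fileno(), select.KQ_FILTER_READ, select.KQ_EV_ADD
        )
        client_kq.control([watch], 0, 0)
```

The main loop would then call `accept_pending` for new connections and poll `client_kq` separately for data from existing clients, so the listening socket and the accepted sockets never share a queue.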
3 Comments:
A nice read about this is Fefe's http://bulk.fefe.de/scalable-networking.pdf
That's a very nice link.
Luckily, I don't really care about the file I/O, since it is pretty small (especially because we don't use a db system, but flat files).
Right now, I managed to get the system time to about 1% or less, and that's with 3 servers running, one of them not yet using Kqueue.
Now I'll just focus on finding out why the bug I mentioned happens, then I'll be able to actually add some new features and stuff.
I also enjoyed the read. Keep up the good work on Eternal Lands and the blog!