Hello Brian,

Brian Long wrote:
This server has the settings "1 15 100" there, it is running kernel 2.4.21-20.ELsmp and has an uptime of 183 days. I am sure that this is exactly the kernel that showed the bad behaviour I described and caused many outages of our service. I am not sure about the settings in /proc/sys/vm/pagecache, but I think they were not changed from their default. One correction to my original message: this server has 4 GB RAM, not 2 GB RAM. Right now I cannot experiment with this server, because it runs fine and I couldn't justify any outages of the very important services running there. If we were to do anything there, we would install RHEL 4 on another server and switch our services over to it. I am sorry, but this is a 24x7 production environment and we have no room for experiments...

Why would you have a critical machine in place and not be able to reproduce the error on a development or test server?
I never wrote that. We have test servers. However, considering that we have a perfectly working solution (RHEL 3 AS with swapping disabled), I don't know how I could justify spending resources (mostly people's time) on changing it. Additionally, every change increases the risk of breaking stuff, you know...
I'm used to a different model. Everything is developed in dev, tested in stage and then put into production. Yes, it means 3 servers for each production service, but it's worth it.
Sure, we went through a very similar process, and we ended up with a fully working scenario in production. What's wrong with that?
FYI, I was wrong about pagecache. The old values were 1 15 100 and the new values are 1 15 30. This means pagecache will only occupy 30% of your total RAM.
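To put the 30% figure in concrete terms for the 4 GB server mentioned earlier, here is a minimal Python sketch of the arithmetic. It assumes the three numbers are percentages of total RAM and that only the third one is the upper bound, as the statement above implies. The names `pagecache_cap_bytes`, `_min_pct` and `_borrow_pct` are made up for illustration, and reading the first two fields as "min" and "borrow" is an assumption, not something stated in this thread.

```python
GIB = 1024 ** 3


def pagecache_cap_bytes(setting: str, total_ram_bytes: int) -> int:
    """Return the pagecache upper bound implied by a three-value setting string.

    Assumption: the string holds three integer percentages and the third one
    is the maximum share of RAM the pagecache may occupy.
    """
    fields = setting.split()
    if len(fields) != 3:
        raise ValueError(f"expected three values, got {len(fields)}: {setting!r}")
    # The first two values are parsed but not used in this calculation.
    _min_pct, _borrow_pct, max_pct = (int(field) for field in fields)
    return total_ram_bytes * max_pct // 100


if __name__ == "__main__":
    total_ram = 4 * GIB  # the server discussed in this thread has 4 GB RAM
    for setting in ("1 15 100", "1 15 30"):
        cap = pagecache_cap_bytes(setting, total_ram)
        print(f"{setting!r}: pagecache capped at about {cap / GIB:.1f} GiB")
```

With the old values the cap equals all of RAM, while with the new values it works out to roughly 1.2 GB on a 4 GB machine.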
Thanks for this hint. However, I still think that if we needed to change our setup, we would simply move away from RHEL 3, either to RHEL 4, to an even newer RHEL, or to a different OS (maybe not even Linux, all we need is a working JVM).
Leos