
Re: question about conserver scaling out ability

John Hascall john@iastate.edu
Wed, 18 Nov 2009 13:30:58 GMT

2009/11/18 Guang Cheng Li <liguangc@cn.ibm.com>

We are using conserver to handle the consoles in our cluster. Everything worked perfectly until several months ago, as our cluster kept growing larger. Our cluster now has 2,000 nodes and will grow to 16,000 nodes in the near future, and we are already seeing problems at 2,000 nodes.

1. conserver starts responding slowly after it has been running for a while (maybe several days; I am not sure). When it responds slowly, it probably takes more than 10 seconds to open a node console, and occasionally it cannot open the node consoles at all. We have to restart conserver to fix the problem.

I would suspect that you have started swapping, due either to a memory leak or just to overall memory consumption.
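
One way to check the swapping theory is to watch the resident and swapped-out memory of the conserver process over a few days. The following is only a rough sketch, assuming a Linux host whose /proc/<pid>/status exposes the VmRSS and VmSwap fields (older kernels may not report VmSwap); the helper names parse_proc_status and memory_of are made up for illustration and are not part of conserver.

```python
def parse_proc_status(text):
    """Extract VmRSS and VmSwap (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key in ("VmRSS", "VmSwap"):
            # Values look like "  123456 kB"; keep only the number.
            fields[key] = int(value.split()[0])
    return fields


def memory_of(pid):
    """Read and parse /proc/<pid>/status for a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_proc_status(f.read())
```

Calling memory_of() with the conserver PID at intervals and logging the result would show whether VmRSS keeps climbing (suggesting a leak) or VmSwap becomes non-zero (suggesting the process is being swapped out).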

2. A conserver restart takes a very long time, about 5 minutes, to finish initialization with 2,000 nodes. During initialization, rcons gets a "Connection refused" error.

What method are you using to connect to the nodes?  Our conserver (only ~500 nodes) connects via Cyclades ACS-48 boxes, and we quickly found that 'raw socket' connections scaled vastly better than 'ssh' ones.
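
To get a feel for how cheap a raw socket connection is, you could time a plain TCP connect to one of the terminal server's console ports. This is just a sketch: the function name time_raw_connect is made up, and the host and port are whatever your terminal server is configured to expose for raw access (an assumption about your setup, not something conserver provides).

```python
import socket
import time


def time_raw_connect(host, port, timeout=5.0):
    """Return the seconds taken to open a plain TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # The connection is closed again as soon as the 'with' block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

Comparing this figure against the time an ssh login to the same port takes gives a rough per-console cost, which you can multiply by the node count to estimate startup time.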