Discussion:
CPU not being fully used
Alceu R. de Freitas Jr.
2011-06-03 13:35:54 UTC
Hello everybody,

I started doing some tests by running a Perl program in a two node cluster.

One node has 4 CPUs and the other has 2.

The Perl program forks 6 child processes in total, and my expectation was that I could use at least 50% of all CPUs during processing.

I started executing the program with no DISTANT_FORK or CAN_MIGRATE capabilities enabled. Since I was starting the program from the node with 4 CPUs, all the CPUs on this node were being used at an average of 80%. I think this is the expected behavior.

After adding the DISTANT_FORK capability, CAN_MIGRATE, or both, I could see that the second node's CPUs were being used, but only at an average of 10%. The same average could be seen on the first node too. Run this way, the overall performance is so low that running the same algorithm with a single process (or on a single node with all of its CPUs in use) actually finishes the data processing faster.


The program itself does very few I/O operations and almost zero network operations (each child process writes its own exclusive file to the NFS root file system). Still, I noticed that the network was being heavily used (checking with ifconfig).


My questions:

1 - Am I using the scheduler correctly? The program forks 6 children, but as soon as one child finishes its job it terminates and another child process is forked, so I believe it is better to execute a distant fork than to migrate processes. Still, CPU usage is too low.
2 - Is there any specific care to be taken to have a better CPU utilization?
3 - Is there any way to check network usage (and try to reduce it)?
4 - Can the regular tools used to find bottlenecks on Linux (vmstat, iperf, ntop, and sar, for example) be used within a Kerrighed cluster?


Thanks,
Alceu
Louis Rilling
2011-06-06 15:02:35 UTC
Post by Alceu R. de Freitas Jr.
Hello everybody,
I started doing some tests by running a Perl program in a two node cluster.
One node has 4 CPUs and the other has 2.
The Perl program forks 6 child processes in total, and my expectation was that I could use at least 50% of all CPUs during processing.
I started executing the program with no DISTANT_FORK or CAN_MIGRATE capabilities enabled. Since I was starting the program from the node with 4 CPUs, all the CPUs on this node were being used at an average of 80%. I think this is the expected behavior.
After adding the DISTANT_FORK capability, CAN_MIGRATE, or both, I could see that the second node's CPUs were being used, but only at an average of 10%. The same average could be seen on the first node too. Run this way, the overall performance is so low that running the same algorithm with a single process (or on a single node with all of its CPUs in use) actually finishes the data processing faster.
The program itself does very few I/O operations and almost zero network operations (each child process writes its own exclusive file to the NFS root file system). Still, I noticed that the network was being heavily used (checking with ifconfig).
1 - Am I using the scheduler correctly? The program forks 6 children, but as soon as one child finishes its job it terminates and another child process is forked, so I believe it is better to execute a distant fork than to migrate processes. Still, CPU usage is too low.
DISTANT_FORK does indeed look like the best option, especially if workers
have equivalent computation times.
Post by Alceu R. de Freitas Jr.
2 - Is there any specific care to be taken to have a better CPU utilization?
The most frequent cause of such low performance is that processes are too
short-lived, so that the remote fork/migration time costs more than what
can be saved by using remote CPUs. Increasing the size (and thus the
runtime) of individual jobs should help. For instance, don't create a new
process for each job. This can be achieved with a simple wrapper script
that remote-forks as many sub-shells as there are workers; each sub-shell
then locally forks jobs for as long as required. For a loop from i=1 to
10000 with 6 workers, this can be achieved by having each sub-shell
worker w (w in 0..5) execute the job for each i in 1..10000 for which
i mod 6 = w.
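The scheme above can be sketched as a plain /bin/sh wrapper. Everything here is illustrative: run_job is a stand-in for the real per-job command, the counts are scaled down, and the DISTANT_FORK capability is assumed to have been enabled on the wrapper's shell beforehand (e.g. via krgcapset) so that each backgrounded sub-shell can be forked remotely:

```shell
#!/bin/sh
# Sketch of the wrapper: W persistent sub-shell workers, each locally
# looping over the jobs i with i mod W == w, so that only the W
# sub-shells (not every individual job) pay the remote-fork cost.

W=6      # number of workers
N=100    # total number of jobs (10000 in the example above)

run_job() {
    # Placeholder for the real per-job command.
    echo "worker $1 ran job $2"
}

main() {
    w=0
    while [ "$w" -lt "$W" ]; do
        (
            # Each sub-shell handles only its share of the jobs.
            i=1
            while [ "$i" -le "$N" ]; do
                if [ $((i % W)) -eq "$w" ]; then
                    run_job "$w" "$i"
                fi
                i=$((i + 1))
            done
        ) &    # with DISTANT_FORK set, each sub-shell forks remotely once
        w=$((w + 1))
    done
    wait       # wait for all workers to finish
}

main
```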
Post by Alceu R. de Freitas Jr.
3 - Is there any way to check network usage (and try to reduce)?
Not really. However, seeing krgrpc kernel threads using much CPU is a good sign
of (too) frequent migrations/remote forks, or heavy IO redirection (which should
not happen on NFS regular files).
Post by Alceu R. de Freitas Jr.
4 - The regular tools used to find bottlenecks in Linux (vmstat, iperf, ntop and sar for example) can be used within a Kerrighed cluster?
Not many of them unfortunately, since the kernel ABI used by those tools is
mostly not Kerrighed-aware, and ptrace is not supported for remote processes.

Thanks,

Louis
--
Dr Louis Rilling - Kerlabs
Skype: louis.rilling
Phone: (+33|0) 6 80 89 08 23
http://www.kerlabs.com/
Batiment Germanium, 80 avenue des Buttes de Coesmes, 35700 Rennes
Alceu Rodrigues de Freitas Junior
2011-06-06 16:34:25 UTC
Post by Louis Rilling
DISTANT_FORK does indeed look like the best option, especially if workers
have equivalent computation times.
Well, they don't. Each newly created worker actually gets an
incrementally shorter task.
Post by Louis Rilling
Post by Alceu R. de Freitas Jr.
2 - Is there any specific care to be taken to have a better CPU utilization?
The most frequent cause of such low performance is that processes are too
short-lived, so that the remote fork/migration time costs more than what
can be saved by using remote CPUs. Increasing the size (and thus the
runtime) of individual jobs should help. For instance, don't create a new
process for each job. This can be achieved with a simple wrapper script
that remote-forks as many sub-shells as there are workers; each sub-shell
then locally forks jobs for as long as required. For a loop from i=1 to
10000 with 6 workers, this can be achieved by having each sub-shell
worker w (w in 0..5) execute the job for each i in 1..10000 for which
i mod 6 = w.
For that I would need to make several changes to the program, and since
it's a proof of concept it does not make much sense to change it.

Actually, there is a version of it (using IPC shared memory) that does
exactly this: Parallel::ForkManager creates 6 child processes and they
run until all tasks are processed, so each child process runs for a long
time. My best option is therefore to try to run that version without
issues on Kerrighed. :-)

I created both programs as proofs of concept and had already identified
the bad design of the "many forks" version on a single multicore PC: the
cost of process creation (measured via the vmstat cs column, i.e.
context switches) is higher than in the shared-memory version, so it
takes more time to process the same number of tasks.
Post by Louis Rilling
Post by Alceu R. de Freitas Jr.
3 - Is there any way to check network usage (and try to reduce)?
Not really. However, seeing krgrpc kernel threads using much CPU is a good sign
of (too) frequent migrations/remote forks, or heavy IO redirection (which should
not happen on NFS regular files).
Well, if I enable:

-e +DISTANT_FORK
-e +CAN_MIGRATE

I can see a lot of those messages in the terminal. This is the worst
combination I've tried. :-)

Doing:

-k <parent PID> -e +DISTANT_FORK
-d +CAN_MIGRATE

works better (not as much as I would like, but better anyway).
Post by Louis Rilling
Post by Alceu R. de Freitas Jr.
4 - The regular tools used to find bottlenecks in Linux (vmstat, iperf, ntop and sar for example) can be used within a Kerrighed cluster?
Not many of them unfortunately, since the kernel ABI used by those tools is
mostly not Kerrighed-aware, and ptrace is not supported for remote processes.
That's bad. I was counting on using at least vmstat. Is it OK to use it?

Thank you,
Alceu
Louis Rilling
2011-06-06 18:53:07 UTC
Post by Alceu Rodrigues de Freitas Junior
Post by Louis Rilling
DISTANT_FORK does indeed look like the best option, especially if workers
have equivalent computation times.
Well, they don't. Each newly created worker actually gets an
incrementally shorter task.
Post by Louis Rilling
Post by Alceu R. de Freitas Jr.
2 - Is there any specific care to be taken to have a better CPU utilization?
The most frequent cause of such low performance is that processes are too
short-lived, so that the remote fork/migration time costs more than what
can be saved by using remote CPUs. Increasing the size (and thus the
runtime) of individual jobs should help. For instance, don't create a new
process for each job. This can be achieved with a simple wrapper script
that remote-forks as many sub-shells as there are workers; each sub-shell
then locally forks jobs for as long as required. For a loop from i=1 to
10000 with 6 workers, this can be achieved by having each sub-shell
worker w (w in 0..5) execute the job for each i in 1..10000 for which
i mod 6 = w.
For that I would need to make several changes to the program, and since
it's a proof of concept it does not make much sense to change it.
Actually, there is a version of it (using IPC shared memory) that does
exactly this: Parallel::ForkManager creates 6 child processes and they
run until all tasks are processed, so each child process runs for a long
time. My best option is therefore to try to run that version without
issues on Kerrighed. :-)
I created both programs as proofs of concept and had already identified
the bad design of the "many forks" version on a single multicore PC: the
cost of process creation (measured via the vmstat cs column, i.e.
context switches) is higher than in the shared-memory version, so it
takes more time to process the same number of tasks.
Post by Louis Rilling
Post by Alceu R. de Freitas Jr.
3 - Is there any way to check network usage (and try to reduce)?
Not really. However, seeing krgrpc kernel threads using much CPU is a good sign
of (too) frequent migrations/remote forks, or heavy IO redirection (which should
not happen on NFS regular files).
-e +DISTANT_FORK
-e +CAN_MIGRATE
I can see a lot of those messages in the terminal. This is the worst
combination I've tried. :-)
-e +CAN_MIGRATE will make your shell migrate, which can indeed confuse
the remote-fork round-robin policy.
Post by Alceu Rodrigues de Freitas Junior
-k <parent PID> -e +DISTANT_FORK
-d +CAN_MIGRATE
works better (not as much as I would like, but better anyway).
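For reference, that better setup corresponds to a sequence like the following (a sketch only; the flag semantics are assumed from the krgcapset usage quoted above, so verify them against your Kerrighed version):

```shell
# Give the parent process the ability to fork its children remotely.
# Replace <parent PID> with the actual PID of the process that forks
# the workers:
krgcapset -k <parent PID> -e +DISTANT_FORK

# Let forked children inherit CAN_MIGRATE, without putting CAN_MIGRATE
# in the shell's own effective set (-e +CAN_MIGRATE on the shell is
# what confuses the remote-fork round-robin policy):
krgcapset -d +CAN_MIGRATE
```

This keeps the shell itself pinned while still allowing the worker processes to be placed and rebalanced across nodes.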
Post by Louis Rilling
Post by Alceu R. de Freitas Jr.
4 - The regular tools used to find bottlenecks in Linux (vmstat, iperf, ntop and sar for example) can be used within a Kerrighed cluster?
Not many of them unfortunately, since the kernel ABI used by those tools is
mostly not Kerrighed-aware, and ptrace is not supported for remote processes.
That's bad. I was counting on using at least vmstat. Is it OK to use it?
/proc/vmstat is not Kerrighed-aware, although most of it could be made Kerrighed-aware
rather easily (it's close to how /proc/meminfo is made Kerrighed-aware).

If you want to have a look, it's in kerrighed/procfs/proc.c and
fs/proc/meminfo.c for /proc/meminfo (resp. mm/vmstat.c for /proc/vmstat).

Thanks,

Louis
Post by Alceu Rodrigues de Freitas Junior
Thank you,
Alceu
--
Dr Louis Rilling - Kerlabs
Skype: louis.rilling
Phone: (+33|0) 6 80 89 08 23
http://www.kerlabs.com/
Batiment Germanium, 80 avenue des Buttes de Coesmes, 35700 Rennes