Usage of BIC resources - payment requested?

vfonov · April 11, 2017, 2:38pm

I have encountered the following problem while using the GPU on the BIC system:

As some people might know, there is a machine with a Tesla K80 GPU, known as gpunode. I have started using it for our lab’s deep learning experiments, connecting to the node directly via SSH.

However, on 2017-04-10 at 06:07 PM I received the following email from the BIC system administrator, Sylvain Milot:

Hi guys,
could you please start using gpu.q instead of using it interactively ?
actually I’m telling you … thanks!
should you need anything else, ask JF … I’m back on April 24th.

Since my computation task uses both GPU cores provided by the Tesla K80, I submitted a job to gpu.q using a parallel environment with two slots. The following morning I received the following email from the same system administrator:

Hi Vlad,
a 2-node parallel environment on gpu.q, really ? Isn’t this a little greedy ? Unless you give me a good reason for this, aside from it being possible, I think we should disable parallel environments for that queue.
… and your job has been running for 8 hours already … must be a tough problem to solve!
Anyway I suppose you have until my return (April 24th) to play!

I replied that I needed to use both GPU nodes, and received the following email:

I think you’re missing the point Vlad.
This is in fact just one adapter with 2 GPUs and I have disabled parallel environments.

Also, at the time there were no jobs in the queue waiting to be executed, which I pointed out in my reply.
I received the following response:

I understand that Vlad, but this is a limited resource and I want to give others the chance to use it, especially if you plan to run jobs which take days to run …
If you have money to spend, this server can house 6 adapters total, at ~ $5500 per adapter (Tesla K80)
Highlighted by me.

So, I would like to find out: is usage of the GPU considered a “premium” service provided by the BIC? Am I allowed to use it for my research?

mcousi13 · April 11, 2017, 3:38pm

I am in no way an authority on BIC data management & policies, but I think Sylvain’s point was that these shared clusters are available to BIC users for free so long as they are used in a fair way. Since the BIC also provides “premium” services, if you’re going to use the clusters in a manner that monopolizes the whole cluster (in this case, using both of the available GPUs for a significant amount of time), it might be worth investing in premium so that it does not penalize users who might need it for smaller jobs not worth a premium investment.

Wouldn’t it be possible for you to tweak your job so that it only requires one GPU, at the cost of having it take a longer time to run? If you describe what library and data you are training your network on, I can try to help. With 12GB per core this should be more than enough for most types of data – try reducing your batch size?
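
To illustrate the batch-size suggestion: when the loss is a mean over samples, the gradient of a full batch equals the average of the gradients of equally sized smaller batches. A job could therefore process two half batches one after the other on a single GPU and accumulate the gradients before each update, trading time for memory. Below is a minimal sketch in plain Python; the one-parameter linear model, the toy data, and the function names are all hypothetical and are not taken from Vlad's code or any particular library.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, n_chunks):
    """Average the gradients of n_chunks equally sized micro-batches."""
    size = len(xs) // n_chunks
    assert size * n_chunks == len(xs), "batch must split evenly"
    total = 0.0
    for i in range(n_chunks):
        lo, hi = i * size, (i + 1) * size
        total += grad_mse(w, xs[lo:hi], ys[lo:hi])
    return total / n_chunks


# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

full = grad_mse(w, xs, ys)                          # one pass over the full batch
halves = accumulated_grad(w, xs, ys, n_chunks=2)    # two half batches in sequence
print(abs(full - halves) < 1e-9)
```

The equivalence holds only for losses averaged per sample. Layers that depend on batch statistics, such as batch normalization, would not match the full-batch result exactly.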

vfonov · April 11, 2017, 3:52pm

Yes, I can run my code for twice as long, using half the memory.
However, I fail to see the demand: there are no jobs in the queue waiting to be executed.

What is particularly troubling is that I fail to see a reason to arbitrarily change the rules.

tstrau · April 11, 2017, 6:47pm

I’m inviting JF to weigh in on this issue, since Sylvain is now on vacation.

pdonha · April 11, 2017, 7:46pm

Since I was cc’ed on Sylvain’s email this morning, I guess I’m the only other person who has used the gpunode so far. To me it seemed that, as long as the demand is that low, there would be no reason to limit the usage. If more people start using it, it probably makes sense to have jobs limited to one card at a time?

(btw, I’m not planning to submit anything in the coming days…)

Peter

jmalou · April 11, 2017, 7:53pm

Well, first, this seems to be a rather volatile discussion. So let’s cool it a bit.

Despite Vlad’s objections, I think Sylvain’s motivation is still valid: a fair usage of the gpu resources by all users.
Rather than policing everything with strict rules through the SGE environment, I think it would be more productive to have users understand this ‘fair usage principle’. In the meantime, this hardware is very new and, afaik, users are still not aware of its existence, so I’d say to Vlad: go ahead and monopolise both GPUs, but be aware that at some point users will complain!

vfonov · April 11, 2017, 8:04pm

I think disabling the parallel environment has nothing to do with fair usage.
SGE does not enforce the actual usage of resources.

Also, as far as I know, it does not actually allocate GPUs to a specific job.
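
If the scheduler does not bind a job to a GPU, the job has to choose one itself. One cooperative approach, sketched here under assumptions, is to query `nvidia-smi` for per-GPU usage, pick the least busy device, and expose only that device through `CUDA_VISIBLE_DEVICES`. The function names are hypothetical, and the sample text mirrors the usage figures a two-GPU K80 node might report. The sketch also assumes the driver reports utilization as a number.

```python
import os
import subprocess


def parse_gpu_csv(text):
    """Parse 'index, memory.used, utilization.gpu' rows into tuples of ints."""
    rows = []
    for line in text.strip().splitlines():
        index, mem_used, util = (int(field.strip()) for field in line.split(","))
        rows.append((index, mem_used, util))
    return rows


def pick_idle_gpu(rows):
    """Return the index of the GPU with the lowest (utilization, memory used)."""
    return min(rows, key=lambda r: (r[2], r[1]))[0]


def query_gpus():
    """Ask nvidia-smi for per-GPU usage as CSV (needs an NVIDIA driver installed)."""
    return subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )


# Sample text in the format query_gpus() returns, so the sketch runs without a GPU.
sample = "0, 9889, 100\n1, 177, 0\n"
gpu = pick_idle_gpu(parse_gpu_csv(sample))

# Must be set before the deep learning framework initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
print("using GPU", gpu)
```

In a real job script, `sample` would be replaced by the result of `query_gpus()`. This is purely cooperative: nothing prevents two jobs that start at the same moment from picking the same GPU, and nothing stops a job from ignoring the convention.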

vfonov · April 12, 2017, 4:20pm

So, everybody is happy now?
We are happily sharing resources!

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 179974 0.25011 GPU_1      vfonov       r     04/12/2017 12:12:06 gpu.q@gpunode.bic.mni.mcgill.c     1

Wed Apr 12 12:18:14 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39                 Driver Version: 375.39                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:06:00.0     Off |                    0 |
| N/A   50C    P0   103W / 149W |   9889MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:07:00.0     Off |                    0 |
| N/A   30C    P0    71W / 149W |    177MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     26981    C   python                                        8526MiB |
|    0     31570    C   ...tine/experimental/1.9.15/torch/bin/luajit  1358MiB |
|    1     31570    C   ...tine/experimental/1.9.15/torch/bin/luajit   175MiB |
+-----------------------------------------------------------------------------+

pdonha · April 12, 2017, 6:51pm

Looks good!

kelmok · April 12, 2017, 11:49pm

Where can I locate more information about the new shared resources / queues that are available?

We can definitely add to the demand for this

vfonov · April 13, 2017, 4:29am

Ask your friendly system administrator, I suppose?

tstrau · April 13, 2017, 2:05pm

I’ve started a new thread for that here:

https://forum.bic.mni.mcgill.ca/t/how-to-get-started-using-bics-gpus-in-the-image-processing-queue/335?u=tstrau

If this thread is resolved, then I will close it soon. We want to keep each thread in the forum focused on one issue at a time.