Training a network on multiple GPUs simultaneously with Keras


References: the official documentation for multi_gpu_model, and Google.


Keras currently supports training a network on multiple GPUs at once, and doing so is very easy, but the following line on its own does not achieve it:

os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

If you monitor GPU usage (nvidia-smi -l 1), you will see that although none of the visible GPUs looks free, essentially only one GPU is doing any work; the others are occupied but idle. In other words, if your computer has several graphics cards, Keras by default occupies every GPU it can detect, with or without the line above, yet computes on only one of them.

This line is really meant for the case where you need only one GPU: it hides the other GPUs in the machine from Keras. Suppose you have three graphics cards, labelled 0, 1 and 2, and to avoid interfering with other users you want to use only one of them, say the one labelled 1. Then:

os.environ["CUDA_VISIBLE_DEVICES"] = "1"

If you then monitor GPU usage (nvidia-smi -l 1), you will see that only one GPU is occupied and all the others are idle. So this is a common misconception about Keras and multiple graphics cards: setting CUDA_VISIBLE_DEVICES does not make Keras use multiple GPUs.
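One practical detail: the variable is read when CUDA is initialized, so the safest place to set it is before TensorFlow or Keras is imported. A minimal sketch of that ordering follows; the helper name restrict_visible_gpus is hypothetical and not part of Keras.

```python
import os

def restrict_visible_gpus(gpu_ids):
    """Expose only the given GPU indices to CUDA-based libraries.

    Hypothetical helper; call it before importing TensorFlow/Keras.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value

restrict_visible_gpus([1])  # same effect as os.environ["CUDA_VISIBLE_DEVICES"] = "1"
# import keras              # import the framework only after the variable is set
```

Inside the process the visible GPUs are renumbered from 0, so the card labelled 1 by nvidia-smi appears as the first GPU to the framework.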


Why train with multiple GPUs at the same time?

A single graphics card has too little memory, so batch_size cannot be set very large; sometimes even batch_size=1 runs out of memory (OUT OF MEMORY).

From my experience running deep networks, a larger batch_size is generally better: the network sees more samples before each weight update, so individual iterations are less likely to overfit in different directions (see "Don't Decay the Learning Rate, Increase the Batch Size"). Admittedly, I have also read papers saying that batch_size should not be set too large either; I don't know the reason, and I have never had the chance to try it anyway. My suggestion is to keep batch_size roughly within 64 to 256, which generally causes no problems.

But as networks get deeper and deeper, their GPU memory requirements keep growing as well. For many newcomers the biggest problem is often not the code itself: the code they copied from GitHub cannot run as intended on their own weak GPU, so they have to reduce batch_size and end up with results worse than the ones reported.

There are two solutions: either buy a top-end GPU with a huge amount of memory, or buy several mediocre GPUs and use them together.

The first option doesn't really work: even the best NVIDIA graphics cards currently top out at a dozen or so gigabytes of memory, which runs out as soon as the network gets deep, and a top-end card is not cost-effective anyway. So learning to use multiple GPUs with Keras is the more reliable option.


Using multiple GPUs is very simple:

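import tensorflow as tf
import keras
from keras.optimizers import Adam
from model import unet

G = 3  # number of GPUs to use simultaneously

# Build the template model on the CPU so that its weights are hosted in CPU
# memory, as the Keras documentation recommends
with tf.device("/cpu:0"):
    M = unet(input_rows, input_cols, 1)

# Replicate the template model on G GPUs
model = keras.utils.training_utils.multi_gpu_model(M, gpus=G)
model.compile(optimizer=Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])

# Each batch is split across the G GPUs, so scale the batch size by G
model.fit(X_train, y_train, batch_size=batch_size*G, epochs=nb_epoch, verbose=0, shuffle=True, validation_data=(X_valid, y_valid))

model.save_weights('/path/to/save/model.h5')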
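The batch_size*G in the fit call follows from how multi_gpu_model works: each incoming batch is cut into G sub-batches, one per GPU, and the outputs are concatenated afterwards. The sketch below only illustrates that slicing with NumPy. The name split_batch is hypothetical, and putting the remainder on the last device is my reading of the Keras implementation rather than a documented guarantee.

```python
import numpy as np

def split_batch(batch, parts):
    """Cut a batch along axis 0 into `parts` slices, remainder on the last one."""
    n = batch.shape[0]
    step = n // parts
    slices = []
    for i in range(parts):
        start = i * step
        stop = n if i == parts - 1 else start + step
        slices.append(batch[start:stop])
    return slices

G = 3
batch_size = 64                     # batch size each GPU should see
X = np.zeros((batch_size * G, 8))   # one global batch of dummy data
print([s.shape[0] for s in split_batch(X, G)])  # [64, 64, 64]
```

With batch_size*G samples per step, every GPU therefore processes batch_size samples, which is what a single-GPU run with the original batch_size would see.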


3.1 Compile the model

For an ordinary network structure there is no problem: just compile as above (model.compile(optimizer=Adam(lr=1e-5), loss='binary_crossentropy', metrics=['accuracy'])). However, a multi-task network such as Faster-RCNN has several output branches and therefore several losses. The branches are usually named when the network is defined, and at compile time you refer to each branch by its layer name, like this:

model.compile(optimizer=optimizer,
              loss={'main_output': jaccard_distance_loss, 'aux_output': 'binary_crossentropy'},
              metrics={'main_output': jaccard_distance_loss, 'aux_output': 'acc'},
              loss_weights={'main_output': 1., 'aux_output': 0.5})

Here main_output and aux_output are layer names you defined yourself. If you use keras.utils.training_utils.multi_gpu_model(), however, the output names are automatically changed to defaults such as concatenate_1, concatenate_2, and so on. So first call model.summary() to print the network structure, work out which output corresponds to which branch, and then recompile the network as follows.

from keras.optimizers import Adam, RMSprop, SGD

model.compile(optimizer=RMSprop(lr=0.045, rho=0.9, epsilon=1.0),
              loss={'concatenate_1': jaccard_distance_loss, 'concatenate_2': 'binary_crossentropy'},
              metrics={'concatenate_1': jaccard_distance_loss, 'concatenate_2': 'acc'},
              loss_weights={'concatenate_1': 1., 'concatenate_2': 0.5})
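Retyping the dictionaries by hand is error-prone. Assuming the outputs of the wrapped model appear in the same order as those of the original network (worth checking against model.summary()), the keys can be renamed mechanically. This is only a sketch: remap_output_keys is a hypothetical helper, and the two name lists are written out by hand here rather than read from the models.

```python
def remap_output_keys(spec, old_names, new_names):
    """Return a copy of a per-output dict with its keys renamed by position."""
    if len(old_names) != len(new_names):
        raise ValueError("output name lists differ in length")
    rename = dict(zip(old_names, new_names))
    return {rename[key]: value for key, value in spec.items()}

old_names = ['main_output', 'aux_output']        # names given in the network definition
new_names = ['concatenate_1', 'concatenate_2']   # names shown by model.summary()

loss_weights = remap_output_keys({'main_output': 1., 'aux_output': 0.5},
                                 old_names, new_names)
print(loss_weights)  # {'concatenate_1': 1.0, 'concatenate_2': 0.5}
```

The same call can be applied to the loss and metrics dictionaries before passing them to model.compile().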

3.2 Save the model

Models trained on multiple GPUs have an issue that Keras has not yet resolved. Saving the model raises an error:

TypeError: can't pickle module objects

or

RuntimeError: Unable to create attribute (object header message is too large)

The reason is:

"It is clearly stated that the model can be used like the normal model, but it cannot be saved, very funny. I can't even perform reinforced training just because I cannot save the previous model trained with multiple GPUs. If trained with single GPU, the rest of my invested GPUs will become useless. Please urge the developer to look into this bug ASAP."

Normally Keras provides a callback that automatically saves the best network (keras.callbacks.ModelCheckpoint()), but it saves the model internally in the same way, so it no longer works here. You need to write your own callback, CustomModelCheckpoint(), to save the best model.

class CustomModelCheckpoint(keras.callbacks.Callback):
    def __init__(self, model, path):
        super(CustomModelCheckpoint, self).__init__()
        self.model = model
        self.path = path
        self.best_loss = np.inf  # lowest validation loss seen so far

    def on_epoch_end(self, epoch, logs=None):
        val_loss = logs['val_loss']
        # Save the weights only when the validation loss improves
        if val_loss < self.best_loss:
            print("\nValidation loss decreased from {} to {}, saving model".format(self.best_loss, val_loss))
            self.model.save_weights(self.path, overwrite=True)
            self.best_loss = val_loss

model.fit(X_train, y_train, batch_size=batch_size*G, epochs=nb_epoch, verbose=0, shuffle=True, validation_data=(X_valid, y_valid), callbacks=[CustomModelCheckpoint(model, '/path/to/save/model.h5')])

Even so, if the model is still too large, saving can still fail with the error below. In that case, use the following method and save the weights in npy format instead of hdf5 format.

RuntimeError: Unable to create attribute (Object header message is too large)

  • model.get_weights(): returns a list of all weight tensors in the model, as Numpy arrays.
  • model.set_weights(weights): sets the values of the weights of the model, from a list of Numpy arrays. The arrays in the list should have the same shape as those returned by get_weights().

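# save model (inside CustomModelCheckpoint)
weight = self.model.get_weights()
np.save(self.path + '.npy', weight)

# load model
# the weights are stored as a pickled object array, so recent NumPy
# versions need allow_pickle=True to read them back
weight = np.load(load_path, allow_pickle=True)
model.set_weights(weight)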
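Calling np.save on a list of differently shaped arrays stores a pickled object array, which recent NumPy versions refuse to load unless allow_pickle=True is passed. One alternative, sketched here as an assumption and not as the method used in this post, is to store each array under its own key with np.savez. The function names are hypothetical; the list passed in would come from model.get_weights(), and the loaded list would go to model.set_weights().

```python
import numpy as np

def save_weight_list(path, weights):
    """Save a list of arrays to a .npz archive, one entry per array."""
    # Zero-padded keys record the original position of each array.
    named = {"w{:05d}".format(i): np.asarray(w) for i, w in enumerate(weights)}
    np.savez(path, **named)  # NumPy appends ".npz" if the path lacks it

def load_weight_list(path):
    """Load the arrays back in their original order."""
    with np.load(path) as archive:
        return [archive[key] for key in sorted(archive.files)]

# Intended use (requires a Keras model, so left as comments):
# save_weight_list('/path/to/save/model.npz', model.get_weights())
# model.set_weights(load_weight_list('/path/to/save/model.npz'))
```

Because every array is written as a plain numeric entry, no pickling is involved and the shapes may differ freely between layers.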

3.3 Load the model

By the same token, loading an .h5 weight file that was saved from a model trained on multiple graphics cards also raises an error:

ValueError: You are trying to load a weight file containing 3 layers into a model with 1 layers.

The reason is that the internal layout of the .h5 file differs from one saved after single-GPU training, so you also need to wrap the model with keras.utils.training_utils.multi_gpu_model() before loading the weights.

from model import unet

# Rebuild the template model, wrap it the same way as during training,
# then load the weights into the wrapped model
with tf.device("/cpu:0"):
    M = unet(input_rows, input_cols, 1)
model = keras.utils.training_utils.multi_gpu_model(M, gpus=G)
model.load_weights(load_path)

After that, there are no more problems.
