tensorflow: TensorFlow model with CTC loss has a save-and-restore problem
I am using TensorFlow 0.12 without GPU support and have been testing it with various models. My template structure is:
```python
import sys
import tensorflow as tf

# Load some data from file

graph = tf.Graph()
with graph.as_default():
    # Build network
    saver = tf.train.Saver()  # created after the network so it sees all variables

with tf.Session(graph=graph) as session:
    if sys.argv[1] == "load":
        saver.restore(session, "weight_last")
    else:
        init_op = tf.global_variables_initializer()
        session.run(init_op)
    # Continue training
```
Now I am facing a strange issue. When I build an MLP or RNN with this structure and a categorical cross-entropy loss, saving and restoring works perfectly: after a restore, the loss shows exactly the value it had at the last save. Unfortunately, when the network uses CTC loss, the model starts training almost from scratch after restoring. I am not sure what is going wrong; any help would be highly appreciated.
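For reference, a minimal sketch of the kind of CTC graph under discussion, with the Saver built last. This assumes TF 1.x APIs (in 0.12 the argument order of tf.nn.ctc_loss was inputs, labels, sequence_length); the 16 features per time step come from a later comment in this thread, while num_classes, the cell size, and the optimizer are placeholders, not the original code:

```python
import tensorflow as tf

num_classes = 30  # hypothetical: e.g. 29 symbols + 1 CTC blank

graph = tf.Graph()
with graph.as_default():
    # Time-major inputs: [max_time, batch_size, num_features].
    inputs = tf.placeholder(tf.float32, [None, None, 16])
    labels = tf.sparse_placeholder(tf.int32)   # CTC labels are a SparseTensor
    seq_len = tf.placeholder(tf.int32, [None])

    cell = tf.nn.rnn_cell.LSTMCell(128)
    outputs, _ = tf.nn.dynamic_rnn(cell, inputs, sequence_length=seq_len,
                                   dtype=tf.float32, time_major=True)
    logits = tf.layers.dense(outputs, num_classes)

    # TF 1.x order: labels first; ctc_loss returns one loss per example.
    loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits, seq_len))
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

    # Create the Saver after everything above so it covers the LSTM weights.
    saver = tf.train.Saver()
```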
About this issue
- Original URL
- State: closed
- Created 7 years ago
- Comments: 72 (19 by maintainers)
I found my problem: it was shuffling the data before building the dictionary… 😢
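A hypothetical reconstruction of what that bug looks like: building the label dictionary from shuffled data yields a different char-to-id mapping on every run, so a correctly restored model still scores like an untrained one.

```python
import random

chars = list("handwriting")
random.shuffle(chars)  # order differs on each execution
bad_map = {c: i for i, c in enumerate(dict.fromkeys(chars))}

# Fix: derive the mapping from a deterministic order instead.
good_map = {c: i for i, c in enumerate(sorted(set("handwriting")))}
print(bad_map, good_map, sep="\n")
```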
I can also confirm the issue: whenever we are using tf.contrib.rnn.BasicLSTMCell, tf.train.Saver (or the alternatives) does not save the weights of the LSTM cells properly. This also seems to be the case when using … So right now it seems we have no reliable way to save and load an LSTM based on tf.contrib.rnn.BasicLSTMCell.

I train and store the model in Python 2 and restore it with Python 3, and I get terrible results. But when I restore the model with Python 2, the results are good. If I train and store in Python 3, I also get awful results.
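One way to narrow this down is to compare the variables the graph declares with what the checkpoint actually holds; missing LSTM kernels would point at a save problem rather than a load problem. A sketch assuming the graph and the "weight_last" prefix from the original post:

```python
import tensorflow as tf

reader = tf.train.NewCheckpointReader("weight_last")
ckpt_vars = set(reader.get_variable_to_shape_map())

with graph.as_default():  # the graph built as in the original post
    for v in tf.global_variables():
        name = v.name.split(":")[0]
        print(name, "ok" if name in ckpt_vars else "MISSING from checkpoint")
```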
I have the same problem with saving RNN cells. I use a simple tf.nn.rnn_cell.BasicRNNCell for testing. Whenever I train my network, stop the program, and restart it to, for example, generate a sequence or continue training from the latest checkpoint, it behaves as if the model was never trained.

BUT when I train the model, save it, and, in the same run, restore it before further training or text generation, it works! That is crazy!
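The "works in the same run, fails across runs" symptom is consistent with the Saver being constructed before the RNN variables exist (a default tf.train.Saver only covers variables that exist at construction time), or with variable names silently changing between runs. A sketch of a defensive setup; the sizes are placeholders:

```python
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    inputs = tf.placeholder(tf.float32, [None, None, 16])
    cell = tf.nn.rnn_cell.BasicRNNCell(64)
    outputs, _ = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

    # Construct the Saver after the whole network, RNN cells included.
    saver = tf.train.Saver()

    # Printing variable names on every run exposes silent renames
    # (e.g. rnn/basic_rnn_cell/kernel vs rnn_1/... if the graph code changed).
    for v in tf.global_variables():
        print(v.name)
```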
I found the root cause: my input data vectors were modeled differently on each execution due to an issue in the data-modeling part. Otherwise, I found no issue with the TensorFlow graph after restoration. After fixing the input data model, I was able to retrain from the last checkpoint properly. I suggest cross-checking the input data model for consistency across multiple executions.
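Following up on that diagnosis, a small check one could use to confirm the input pipeline is deterministic across executions; the batch here is a stand-in for the first real training batch:

```python
import hashlib
import numpy as np

def fingerprint(batch):
    # Stable digest of a numpy batch: if it differs between executions,
    # the input pipeline, not the restored graph, is what changed.
    return hashlib.md5(np.ascontiguousarray(batch).tobytes()).hexdigest()

batch = np.arange(16, dtype=np.float32).reshape(1, 1, 16)  # stand-in batch
print("first-batch fingerprint:", fingerprint(batch))
```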
@michaelisard I am extremely sorry for my delayed response. My input comes from an H5 file containing features extracted from online handwriting samples; at every time step I have 16 features. Every time I read from this file, the order of the data is shuffled. Here is the part where I am creating the graph.
Just after this, I run the training and save it.
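The actual training-and-saving code was not captured in this thread; as a hypothetical sketch of that step, reusing the graph, saver, train_op, and loss names from the sketch earlier in the thread, with make_feed() standing in for however the H5 batches are fed:

```python
with tf.Session(graph=graph) as session:
    session.run(tf.global_variables_initializer())
    for step in range(1000):  # hypothetical step count
        _, loss_val = session.run([train_op, loss], feed_dict=make_feed())
        if step % 100 == 0:
            saver.save(session, "weight_last")  # prefix from the original post
            print("step", step, "loss", loss_val)
```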
Now, whenever I load the model from the last save or the best save, it shows no sign of the previous training and seems to start from scratch. I also tried import_meta_graph() without any success. But the same strategy works absolutely fine with an RNN model tested against the well-known IRIS dataset (hence a classification problem). I am completely in the dark; your help is highly appreciated.
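For completeness, a sketch of the import_meta_graph() route, assuming checkpoints written with the "weight_last" prefix from the original post; the tensor name looked up at the end is hypothetical:

```python
import tensorflow as tf

with tf.Session() as session:
    # import_meta_graph rebuilds the graph and returns a Saver for it.
    saver = tf.train.import_meta_graph("weight_last.meta")
    saver.restore(session, "weight_last")

    # Look up tensors by name afterwards; "loss:0" is a hypothetical name.
    loss = tf.get_default_graph().get_tensor_by_name("loss:0")
```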