folk-rnn: Data formatting issue: missing unique IDs & stray HTML

I’ve been trying out ABC with GPT-2 along the lines of my poetry generation (max-likelihood training, and then I’ll use OA’s RL preference-learning for finetuning), and I’ve found that some of the ABC files in data/ appear to have two issues:

  1. there are ~2k stray </html> lines which should be removed, as they are not ABC music (I’m unsure whether they are syntactically invalid, but they definitely shouldn’t be there). These can easily be removed with any search-and-replace.

  2. lack of unique IDs: some ABC compilers (specifically, abc2midi) require unique IDs like X: 1 for each ABC song. http://abcnotation.com/wiki/abc:standard:v2.1#xreference_number notes that

    The X: field may be empty, although this is not recommended.

    (I worked around this by using an Emacs macro to prefix each tune with an X: field and an incrementing integer; see the sketch below for both fixes.)
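
A minimal sketch of both fixes, under the assumption that the tunes are separated by blank lines and have no X: fields of their own (my actual workaround was just the Emacs macro, so treat this as illustrative only):

```python
import sys

def clean_abc(text: str) -> str:
    """Drop stray </html> lines and give every tune a unique X: reference number."""
    tunes = [t for t in text.split("\n\n") if t.strip()]
    fixed = []
    for i, tune in enumerate(tunes, 1):
        # (1) remove the stray </html> lines
        lines = [l for l in tune.splitlines() if l.strip() != "</html>"]
        # (2) prepend a unique X: reference number so abc2midi is satisfied
        fixed.append(f"X: {i}\n" + "\n".join(lines))
    return "\n\n".join(fixed) + "\n"

if __name__ == "__main__":
    sys.stdout.write(clean_abc(sys.stdin.read()))
```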


To update this: I figured out the problem which was causing divergence. It turned out not to be the prompt issue, but once I fixed that, the real problem became more obvious. The OA code has a sort of heuristic where it looks for a particular character/BPE as a validity check, and ignores the reward model’s score in favor of a fixed penalty if the check fails; the check they had was to search for ‘.’, which they interpret as indicating a valid English sentence. Unfortunately, periods show up very infrequently in ABC music… So every sample was being given a heavy penalty and its quality was being ignored. Once I removed that check entirely, the RL tuning code ran just fine.
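
For concreteness, the check amounts to something like this (illustrative names only, not the actual OA code):

```python
FIXED_PENALTY = -1.0

def shaped_reward(sample: str, rm_score: float) -> float:
    # Original heuristic: a '.' is taken as evidence of a complete English
    # sentence; if absent, the reward model's score is discarded in favor of
    # a fixed penalty. ABC notation almost never contains periods, so nearly
    # every music sample was penalized regardless of quality.
    if "." not in sample:
        return FIXED_PENALTY
    return rm_score

# The fix was simply to drop the check and always use the reward model's score:
def shaped_reward_fixed(sample: str, rm_score: float) -> float:
    return rm_score
```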

I’ve begun the iterative loop; so far I have done ~3 iterations and rated ~2136 pairs of samples.

My first impressions are that the RL finetuning is improving the overall syntactic validity of the ABC samples, and has largely eliminated the repetition failure mode (like in the poetry samples). I think it has also made them more ‘musical’ but it’s hard for me to say for sure. (For example, there seem to be much more in the way of ‘endings’ than I usually see with char-RNN stuff, but I’m not sure how much the RL finetuning is helping there versus the Transformer seeing the entire ABC sample.)


Incidentally, I noticed some more problems with the original data. There are English comments dumped in at random in various places. For example, if you grep for ‘should’ you can find lines like

End of second last line should read: |{cd}e3dF2|G6|G6:|
The last bar of the B part should be: |afeg fedB||
Martin Connolly says that the 1st bar should read |AFD ECA,| instead of |AFD EDA,|. I've had another listen to my recording of Brendan Bulger and he's definitely playing a D in his version.
I think something is wrong with ABC. It should read |A3G E2DE| or |A3G E2D2|.
This transcription is inaccurate. There should be a 2nd time ending on the B-part. Last 2 bars are |2 cefe ~a3f|ecBd cAA2||
peakfiddler, a small change in your ABC: |"G"1 should be |1"G", and |"G"2 should be |2"G", just so the repeat counting is next to the bar signs.
Last 2 measures should probably be  |efg  edB| A3 A2z:|
So, ceolachan, my  |Acd/e/|f  should in fact have been  |Ace/f/|g.F|D2F|FGF|E2E|GAG|CDE|A,B,C|DCD|A,B,C|D2F|FGF|E2E|GAG|CEG|ADE|F3|F2:|D|D2G|BGB,/C/|D2F|AFD|D2C|B,2A,|G,B,G,|A,B,C|D3|=C3|B,3|^G,3|A,GF|EDA|F3|F2:|
After listening to CD I reckon last bar should read, | ed cA B2 :||
Bars 3 & 4 (A music) should read;  | FD/F/ EC | D/E/D/C/ B,/A,/G, |Bars 7 & 8 (A music) should read;  | FD/F/ EC | DB, C2 :|Bars 10 & 11 (B music) should read;  | ED/C/ B,>C | D/C/D/E/ F/E/D/C/ | Bars 14 & 15 (B music) should read;  | ED/C/ B,>C | DD/E/ F/E/D/C/ |
I had a note missing on the second ending of the B part. Should be |2GFGF E3B||

These don’t look like ABC to me, and they make abc2midi unhappy when they pop up in GPT-2-generated outputs.
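
A hedged sketch of one way to filter such lines out (every example above happens to contain ‘should’, so I key on a few telltale English words; the assumption, which may not hold everywhere, is that real ABC body lines never contain these as standalone words, while header fields like T: legitimately contain English and are left alone):

```python
import re

# Words that mark a stray prose comment rather than ABC notation (heuristic).
PROSE_WORDS = re.compile(r"\b(should|reads?|bars?|transcription|instead)\b", re.IGNORECASE)
# ABC header fields (T:, M:, K:, ...) may contain English, so skip them.
HEADER_FIELD = re.compile(r"^[A-Za-z]:")

def is_stray_comment(line: str) -> bool:
    return not HEADER_FIELD.match(line) and bool(PROSE_WORDS.search(line))

def strip_comments(text: str) -> str:
    return "\n".join(l for l in text.splitlines() if not is_stray_comment(l))
```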

I haven’t put the models up so far because they’re not really done. For example, the 1k sample from above had a loss of 0.46, but it turns out it can go as low as 0.09 on just the original corpus (minus the other transformed versions) before my simple plagiarism checks using grep began turning up hits of ABC from the original corpus, indicating memorization of more than just titles. (Where more ‘musical’-level plagiarism begins, I couldn’t say.) I discovered I had simply given up too early in the training when I went back to modify the corpus to make it more acceptable to abc2midi and trained it some more. After I got down to 0.09, I retrained on the concatenated corpus down to 0.29, so it roughly halved the loss. The result is that the syntax seems much more correct and most samples now pass abc2midi, and the musical quality seems somewhat better to me. Anyway:

So that’s where the max-likelihood model currently is.
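
As an aside, the plagiarism spot-check mentioned above was just grep by hand; automated, the same idea would look something like this sketch (the 20-character cutoff is an arbitrary choice to skip short bar patterns that legitimately recur across tunes):

```python
def memorized_lines(sample: str, corpus: str, min_len: int = 20) -> list[str]:
    """Return generated lines that appear verbatim in the training corpus."""
    return [line.strip() for line in sample.splitlines()
            if len(line.strip()) >= min_len and line.strip() in corpus]
```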

I’m still struggling with the RL tuning. My hacks worked fine for poetry generation, but led to really bad divergence in every training run for the currently-final 117M model I trained, even after I rated 800+ pairs of samples. The average reward would get worse every iteration, and it would degenerate, until it sometimes discovered a highly repetitive sequence like ‘X. X. X.’ which somehow earned a relatively high reward from the reward model, and then it would just continue with that…

My current theory as to what is going on is that the OpenAI RL tuning code is framed as a conditional generation task: a prompt->response. I’ve been using it for unconditional generation by hacking the config to change one of the tasks to poetry and expecting the model to simply ignore the random English words being used to prompt it. This is OK for the poetry model because it just ignores the prompts or works off of them (poetry is robust like that), but I think what happens with the ABC 117M reward model is that it is so finely tuned to ABC notation that when it sees a string with the random English words in it, it knows it’s invalid ABC and penalizes it harshly, so every generated sample looks bad to it, and this destroys the training dynamics. What I need to do is somehow eliminate the prompt or make it innocuous, like zeroing it out with spaces, so all the generated samples can be valid ABC and the reward model can start producing a sane training signal for the generator… I haven’t quite done that because all of the rating of the poem and music samples has really exhausted me. I switched back to working on the poems while I meditated on what was going wrong with the music, since adding more ratings clearly was not fixing things.
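
What ‘making the prompt innocuous’ might look like, as a hedged sketch (hypothetical helper, not an actual hook in the OA config):

```python
def neutral_prompt(tokenizer, prompt_len: int) -> list[int]:
    """Build a 'blank' prompt of space tokens so the conditional-generation
    machinery is unchanged, but the continuation the reward model scores is
    pure ABC rather than ABC polluted by random English words."""
    space_id = tokenizer.encode(" ")[0]  # GPT-2's BPE has a token for a bare space
    return [space_id] * prompt_len
```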

(The poetry isn’t suffering any weird problems like that, but the improvements from RL training are thus far small, and I suspect poetry is intrinsically much more difficult than music and may simply require way more ratings to train a reward model good enough to make a dramatic difference.)

In the training data, which comes from thesession.org, tunes appear with settings, so one might have several settings of Connaughtman’s Rambles one after another. That could be why.

Ah. Now I see what you mean. Yes, that would explain it. I think that’s OK to leave in place? I thought it was some sort of error by the model, but since they are all settings or ‘variants’ in the original (right?), it’d be interesting to see the model also generate multiple ‘variants’ on a theme.

The resulting folk-rnn models are surprisingly successful, even though they have on the order of 1000 times fewer parameters than GPT-2 (https://www.youtube.com/channel/UC7wzmG64y2IbTUeWji_qKhA , https://soundcloud.com/oconaillfamilyandfriends ).

Sure. You can always get decent performance with small models, because the log-loss decreases roughly logarithmically with parameter count. And NNs are always highly overparameterized, so you should be able to cut down the trained GPT-2 by a factor of 10 or 100 without degrading quality too badly. My belief is that even a 117M model is more than big enough and expressive enough to generate great music, and that the problems are architectural or training-related (optimizing for the wrong thing).

I did hope that starting with OA’s GPT-2-117M would make the English titles more interesting, because the original model knows all sorts of names and objects, but it seems that once you’ve trained it far enough to generate good ABC, it has largely just memorized the titles. The dataset is too small to benefit from the pretraining. Oh well.

Have you trained your model on a single concatenation of all these datafiles?

Yes.

So, what are you trying to do?

Oh, my only real goal here was to see if switching to the RL setting could yield a qualitative improvement over likelihood training using the same data as the starting point, because RL seems to me philosophically the right way to approach the problem, while likelihood training fundamentally optimizes for the wrong thing. Aside from that, I am mildly curious how much a Transformer can improve over an older RNN.