godot: Regex capture stops at first match - doesn't work as expected in Godot (extracting data from <> tags)

Hi, I am trying to separate text from XML tags. The first step is to extract the tags.

Trying this https://pythex.org/?regex=<(.*%3F)>&test_string=Hello%2C <silence 1.0>my name is Jonn%2C I am a <speed 0.2> blah blah blah blah blah&ignorecase=0&multiline=0&dotall=0&verbose=0

Works everywhere else but in Godot. I get only the first match 😦

Here is example code:

func _ready():
	print("TAGS:",extractXmlTags("Hello, <silence 1.0>my name is Jonn, I am a <speed 0.2> blah blah blah blah blah"))
## should return [silence 1.0,speed 0.2], but returns [silence 1.0]

func extractXmlTags(text):
	var NameRegEx = RegEx.new()
	NameRegEx.compile('<(.*?)>') ## also <(.*?)> ## <([^<]+)>
	NameRegEx.find(text)
	var result = NameRegEx.get_captures()
	return result

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 25 (11 by maintainers)

Most upvoted comments

Ah, right. The problem with ([^<]*)(<[^<]+>)? is that it’s valid with zero-length strings, in any implementation, so once it gets to the end of your text it just loops infinitely (because a zero-length match is a valid match, and the search position never advances). And then from there, the array just grows until it runs out of memory and crashes.

And the problem with crashing isn’t regex specific, because it essentially boils down to:

var result = []
while true:
	result.append(1)

Anyways, I’ve written RegEx.search_all() in #12915 that prevents infinite loops such as this by detecting when a result doesn’t move.
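For anyone who wants to see the idea spelled out, here is a minimal sketch of a manual loop that guards against a result that doesn’t move. It assumes the 3.0 API (RegEx.search(), RegExMatch.get_start(), RegExMatch.get_end()); the function name collect_matches is hypothetical, and this is not necessarily how search_all() is implemented internally.

```gdscript
# Hypothetical helper: returns an Array of RegExMatch objects.
func collect_matches(pattern, text):
	var ex = RegEx.new()
	ex.compile(pattern)
	var results = []
	var start = 0
	while start <= text.length():
		var m = ex.search(text, start)
		if m == null:
			break
		results.append(m)
		var end = m.get_end(0)
		if end == m.get_start(0):
			# Zero-length match: step forward by one character,
			# otherwise the same empty match is found forever.
			start = end + 1
		else:
			start = end
	return results
```

With ([^<]*)(<[^<]+>)? the loop ends after the empty match at the end of the text instead of growing the array without bound.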

Ah, wait, the example I gave is with the 3.0 branch. Here’s the solution for the 2.1 branch:

func extractXmlTags(text):
	var NameRegEx = RegEx.new()
	NameRegEx.compile('<(.*?)>')
	var start = 0
	while NameRegEx.find(text, start) >= 0:
		print(NameRegEx.get_captures())
		start = NameRegEx.get_capture_start(0) + NameRegEx.get_capture(0).length()

EDIT: Fixed the typo >=0
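For comparison, a minimal sketch of the same extraction on the 3.0 branch, assuming RegEx.search_all() is available. This is not the example originally posted, and the function name extract_xml_tags is only illustrative.

```gdscript
# Sketch for the 3.0 branch: collect the contents of every <...> tag.
func extract_xml_tags(text):
	var ex = RegEx.new()
	ex.compile("<(.*?)>")
	var tags = []
	for m in ex.search_all(text):
		# Group 1 is the text between the angle brackets.
		tags.append(m.get_string(1))
	return tags
```

Called on the string from the original question, this should return both tags, "silence 1.0" and "speed 0.2", rather than only the first one.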

It feels like you’re piling a few problems together, so let me tackle each problem one-by-one.

It would be really cool if we had a way to extract all the strings in an array without using the get_string(1,2,3…) method. Is that even possible?

You mean like RegExMatch.get_strings()?

It is still much simpler than godot’s gdscript approach, where even with the for loop, you will have to think about which get_string(n)s to get for different regular expressions

I’m not sure what you mean by that. get_string(0) (or alternatively get_string() with no parameters) is the naive I-don’t-care-about-the-structure-of-the-regex match result.
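To make the difference concrete, here is a small sketch assuming the 3.0 RegExMatch API; the pattern and the input string are made up for illustration.

```gdscript
func _ready():
	var ex = RegEx.new()
	ex.compile("(\\d+)-(\\d+)")
	var m = ex.search("range 10-20")
	if m != null:
		# The whole match, regardless of how the pattern is grouped.
		print(m.get_string())
		# The whole match followed by each capture group, as one array.
		print(m.get_strings())
```

get_string() needs no knowledge of the groups, while get_strings() hands back everything at once without calling get_string(n) for each index.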

Perhaps someone could be interested in simplifying it more? Gdscript should be as easy or easier than javascript, not more complicated imo

While there are some extra boilerplate lines necessary, I fail to understand how it’s more complicated. Here’s the first example re-written in GDScript:

var text = "The rain in SPAIN stays mainly in the plain"
var ex = RegEx.new()
ex.compile("ain")
var res = ex.search_all(text)

The only difference is that it’s two lines extra. And those two extra lines are because:

a) Regex is an optional module. Not everyone uses it. Having String.match() creates a hard dependency in the core type.

b) In native modules, Object.new() cannot accept any parameters. It’s a limitation of the engine.

And here’s the second example re-written in gdscript:

var text = "This is cool"
var ex = RegEx.new()
ex.compile("(This is)( cool)$")
print(to_json(ex.search(text).get_strings()))

Eh, to be honest, it wasn’t that bad. Just that I had a stressful day that day and it just added to it. No worries.

Anyways, now that it’s been merged, does that solve your regex issues?

Yeah, I could do something like that. Perhaps something like:

var ex = RegEx.new()
ex.compile("([^<]*)(<[^<]+>)?")
# "match" is a GDScript keyword in 3.0, so use a different loop variable name
for result in ex.search_all(text):
	print(result.get_string(1))
	print(result.get_string(2))

Should be easy enough. I’ll get that done when I’m free.

Ah, sorry, I was just following your pythex link as reference. The following function:

func processTags(text):
	var NameRegEx = RegEx.new()
	NameRegEx.compile('([^<]*)(<[^<]+>)?')
	var start = 0
	while NameRegEx.find(text, start) >= 0:
		# Do stuff with regular text
		print("> ", NameRegEx.get_capture(1))
		# Do stuff with tags
		print("= ", NameRegEx.get_capture(2))
		start = NameRegEx.get_capture_start(0) + NameRegEx.get_capture(0).length()

Should give you the output:

> Hi, 
= <silence 1.0>
> my name is John I am a 
= <speed 0.2>
>  blah blah blah blah blah 
= <speed 0.3>
>  !!! Lets be quiet 
= <silence 2.0>
> .Ok done
= 

Hopefully that’s more useful for you. Just replace the print with the actual functions you want.

Pythex of the RegEx code used

EDIT: Changing ([^<]+) into ([^<]*) should deal with the case of text starting with a tag.

The design behind RegEx.find() was kinda inspired by the C++ string find. You do subsequent searches by specifying the start point, which you can do via RegExMatch.get_end(0)

var text = "ab1ab2ab3ab4"
var ex = RegEx.new()
ex.compile("ab.")
var res = ex.search(text)
while res != null:
	print(res.get_string(0))
	res = ex.search(text, res.get_end(0))

I really need to get a more intuitive API for this, but I’ve been pretty busy lately.