ox: Sax parser doesn't preserve same text whitespace as browser
It seems that SAX parsing drops some newlines/indentation in cases where they change how a browser displays the page. I’ve tried all the whitespace skipping/preservation options available in Ox.sax_parse. In the example below, the SAX parser’s output is identical for two files that differ only in whitespace.
Update: Passing :skip => :skip_none explicitly fixes the issue. It appears that the documentation listing :skip_none as the default value (i.e. here: http://www.rubydoc.info/github/ohler55/ox/Ox#sax_parse-class_method) is no longer up to date?
version: 2.6.0
require 'ox'

class WhitespaceSax < Ox::Sax
  attr_reader :output

  def initialize
    @output = ''
  end

  def start_element(name)
    output << "Start: #{name}\n"
  end

  def end_element(name)
    output << "End: #{name}\n"
  end

  def text(str)
    output << "#{str.inspect}\n"
  end
end
## whitespace_one.html =>
#
# <html>
# <body>
# <span>span</span>
# <span>span2</span>
# hello
# </body>
# </html>
#
# Output in Chrome => "span span2 hello"
sax_parser_one = WhitespaceSax.new
File.open('whitespace_one.html', 'r') do |f|
  Ox.sax_parse(sax_parser_one, f)
end
File.open('whitespace_one.html', 'r') do |f|
  Ox.sax_parse(sax_parser_one, f, :skip => :skip_return)
end
File.open('whitespace_one.html', 'r') do |f|
  Ox.sax_parse(sax_parser_one, f, :skip => :skip_white)
end
## whitespace_two.html =>
#
# <html>
# <body>
# <span>span</span><span>span2</span>
# hello
# </body>
# </html>
#
# Output in Chrome => "spanspan2 hello"
sax_parser_two = WhitespaceSax.new
File.open('whitespace_two.html', 'r') do |f|
  Ox.sax_parse(sax_parser_two, f)
end
File.open('whitespace_two.html', 'r') do |f|
  Ox.sax_parse(sax_parser_two, f, :skip => :skip_return)
end
File.open('whitespace_two.html', 'r') do |f|
  Ox.sax_parse(sax_parser_two, f, :skip => :skip_white)
end
# Output of sax parsers equal?
puts sax_parser_one.output.eql? sax_parser_two.output # => true
About this issue
- State: closed
- Created 7 years ago
- Comments: 18 (11 by maintainers)
My mistake. That was actually an Oj debugging branch for a different issue. I have not made a branch for this issue yet. Since it sounds like a new mode will work I’ll start a branch for that.
Just pushed an ‘odd-chars’ branch for testing purposes.
The default is :skip_white now. It changed in version 2.5.0, as noted in the CHANGELOG.md.