ox: Sax parser doesn't preserve same text whitespace as browser

It seems like sax parsing does not pick up on all newlines/indentation in some cases where it would matter in browser display. I’ve tried all whitespace skipping/preservation options available in Ox.sax_parse. In the example below, it seems the sax parser’s output is identical for two files with different whitespace.

Update: Using :skip => :skip_none, explicitly, fixes the issue. It appears that the documentation listing :skip_none as the default value (i.e. here: http://www.rubydoc.info/github/ohler55/ox/Ox#sax_parse-class_method) is no longer up to date?

version: 2.6.0

require 'ox'

class WhitespaceSax < Ox::Sax
  attr_reader :output
  def initialize
    @output = ''
  end
  def start_element(name)
    output << "Start: #{name}\n"
  end
  def end_element(name)
    output << "End: #{name}\n"
  end
  def text(str)
    output << "#{str.inspect}\n"
  end
end

## whitespace_one.html =>
#
# <html>
#   <body>
#     <span>span</span>
#     <span>span2</span>
#     hello
#   </body>
# </html>
#
# Output in Chrome => "span span2 hello"

sax_parser_one = WhitespaceSax.new
File.open('whitespace_one.html', 'r') do |f|
  Ox.sax_parse(sax_parser_one, f)
end
File.open('whitespace_one.html', 'r') do |f|
  Ox.sax_parse(sax_parser_one, f, :skip => :skip_return)
end
File.open('whitespace_one.html', 'r') do |f|
  Ox.sax_parse(sax_parser_one, f, :skip => :skip_white)
end

## whitespace_two.html =>
#
# <html>
#   <body>
#     <span>span</span><span>span2</span>
#     hello
#   </body>
# </html>
#
# Output in Chrome => "spanspan2 hello"

sax_parser_two = WhitespaceSax.new
File.open('whitespace_two.html', 'r') do |f|
  Ox.sax_parse(sax_parser_two, f)
end
File.open('whitespace_two.html', 'r') do |f|
  Ox.sax_parse(sax_parser_two, f, :skip => :skip_return)
end
File.open('whitespace_two.html', 'r') do |f|
  Ox.sax_parse(sax_parser_two, f, :skip => :skip_white)
end

# Output of sax parsers equal?
puts sax_parser_one.output.eql? sax_parser_two.output # => true
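A rough sketch of why the whitespace-only text matters for display, assuming the browser's default whitespace handling (runs of whitespace collapse to one space, leading and trailing whitespace is dropped). The `rendered_text` helper and both node lists are hand-written assumptions for illustration, not actual Ox output.

```ruby
# Approximates default browser rendering of a sequence of text nodes:
# collapse each run of whitespace to one space, then trim the ends.
def rendered_text(text_nodes)
  text_nodes.join.gsub(/\s+/, ' ').strip
end

# Assumed text nodes for a document with one span per line followed by
# "hello", with every whitespace-only node preserved.
one_preserved = ["\n  ", "\n    ", "span", "\n    ", "span2", "\n    hello\n  ", "\n"]
# The same document once whitespace-only nodes are dropped (assumed).
one_dropped = ["span", "span2", " hello "]

puts rendered_text(one_preserved) # => span span2 hello
puts rendered_text(one_dropped)   # => spanspan2 hello
```

Once the whitespace-only node between the two spans is gone, the first layout can no longer be told apart from the one where the spans sit on the same line.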

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 18 (11 by maintainers)

Most upvoted comments

My mistake. That was actually an Oj debugging branch for a different issue. I have not made a branch for this issue yet. Since it sounds like a new mode will work I’ll start a branch for that.

Just pushed an ‘odd-chars’ branch for testing purposes.

The default is :skip_white now. It changed in version 2.5.0, as noted in the CHANGELOG.md.