OpenRefine: spark-prototype: temp files not being deleted on Windows

Describe the bug

Temp files on Windows are not deleted, which causes a SparkException.

To Reproduce

Steps to reproduce the behavior:

  1. Load a CSV file into OpenRefine (spark-prototype)
  2. Do a few cell edits
  3. Wait for spark.ContextCleaner to clean up the accumulators
  4. Observe the error when Spark tries to remove the temp files

Current Results

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
  Task 0 in stage 13.0 failed 1 times, most recent failure:
    Lost task 0.0 in stage 13.0 (TID 13, localhost, executor driver):
      java.io.IOException:
       (null) entry in command string:
         null chmod 0644 C:\Users\thadg\AppData\Roaming\OpenRefine\2096101862730.project\initial\grid\_temporary\0\_temporary\attempt_20200214082621_0034_m_000000_0\part-00000
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
<snip>

Expected behavior

On Windows, Spark should be able to delete the temp files.

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser Version: Firefox
  • JRE or JDK Version: OpenJDK 11

OpenRefine (please complete the following information):

  • Version: spark-prototype

Additional context

We might need to configure Spark properly for Windows (https://spark.apache.org/docs/latest/configuration.html), either through trial and error or by researching with the Spark community. Alternatively, we could use the Python API for Spark, which reportedly does not have these issues, according to Kingsley Jones.

Possible resolutions? (I found what Kingsley Jones had to say interesting): https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16134527#comment-16134527
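For context on the trace above: the "(null) entry in command string: null chmod …" message is the signature of Hadoop's Shell class failing to locate winutils.exe because HADOOP_HOME is not set. Below is a minimal sketch of a pre-flight check that could run on Windows before a Spark context is created; the class name and error message are illustrative, not existing OpenRefine code.

    import java.io.File;

    // Hypothetical pre-flight check (not existing OpenRefine code): verify that
    // Hadoop's Windows shims are present before creating a Spark context.
    public final class WinutilsCheck {
        static void verify() {
            if (!System.getProperty("os.name").toLowerCase().contains("win")) {
                return; // only relevant on Windows
            }
            String hadoopHome = System.getenv("HADOOP_HOME");
            if (hadoopHome == null) {
                // Hadoop's Shell class also accepts the "hadoop.home.dir" system property.
                hadoopHome = System.getProperty("hadoop.home.dir");
            }
            if (hadoopHome == null || !new File(hadoopHome, "bin/winutils.exe").isFile()) {
                // Without winutils.exe, Hadoop's Shell builds its chmod command with a
                // null executable, which surfaces as the "(null) entry in command
                // string: null chmod ..." IOException shown above.
                throw new IllegalStateException(
                        "HADOOP_HOME is not set or bin\\winutils.exe is missing; "
                        + "Spark file operations will fail on Windows.");
            }
        }
    }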

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 24 (24 by maintainers)

Most upvoted comments

Ok, so after more thinking I decided to just remove the dependency on Hadoop altogether, because I cannot get it to work reliably on Windows, and it will be useful not to be tied to Hadoop for a lot of other reasons (#4394).

So, the Hadoop binaries will no longer be required when running OpenRefine with the local runner (which is the default), only when running Spark.

I will add documentation to explain how to set it up.

Yes, the problem is that it’s not up to us to do this upgrade: as long as Hadoop has not upgraded to those Java NIO libraries, we will not benefit from those performance improvements in OpenRefine without loading the native code ourselves. But I would still prefer trying without it first; we can always add it back if it is critical. Also, techy users should be able to install it themselves without modifying the app (just by configuring the environment variables correctly).
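To illustrate what such an upgrade would buy us (a sketch only, assuming nothing about Hadoop's actual internals): java.nio.file performs permission changes and deletions inside the JVM instead of shelling out to chmod/winutils, so the chmod step that fails in the trace above can simply be skipped on Windows.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.attribute.PosixFilePermissions;

    // Illustrative sketch only: portable file operations via java.nio.file,
    // with no external process and no native Hadoop binaries.
    public final class NioFileOps {
        static void makeWorldReadable(Path file) throws IOException {
            // POSIX permissions are only supported where the file store provides
            // them; on Windows the "posix" view is absent, so we skip the chmod
            // equivalent instead of failing with a shell error.
            if (file.getFileSystem().supportedFileAttributeViews().contains("posix")) {
                Files.setPosixFilePermissions(file, PosixFilePermissions.fromString("rw-r--r--"));
            }
        }

        static void deleteQuietly(Path file) throws IOException {
            Files.deleteIfExists(file); // portable delete, no external process
        }
    }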

@thadguidry Thanks a lot! I hope this is fixed now, could you try it again? Both just ./refine (without Spark) and ./refine -r org.openrefine.model.SparkDatamodelRunner (with Spark).

I just plan to ship all the binaries directly. I should be able to work on this soon, thanks for the offer to test it 😃

Actually, we need the Hadoop binaries not just on Windows but also on other platforms. @thadguidry also reported that even on Windows this does not seem to be set up properly yet.

So I think we need to do 2 things…

  1. Set HADOOP_HOME on Windows.
  2. Package the Hadoop native libraries for Windows as part of the build packaging (latest here: https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.1/bin), drop them into a path, and point HADOOP_HOME at that path (see the sketch below).
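A hedged sketch of what step 1 could look like in a launcher, assuming the binaries packaged in step 2 land in a "hadoop" directory inside the install; that directory name is an assumption for illustration, not OpenRefine's actual layout.

    import java.io.File;

    // Hypothetical launcher snippet: point Hadoop at the bundled winutils
    // binaries before any Spark/Hadoop class is loaded. The "hadoop"
    // subdirectory is an illustrative assumption, not OpenRefine's layout.
    public final class HadoopHomeSetup {
        static void configure(File installDir) {
            File bundledHadoop = new File(installDir, "hadoop"); // expects bin/winutils.exe inside
            if (System.getenv("HADOOP_HOME") == null
                    && System.getProperty("hadoop.home.dir") == null) {
                // Hadoop's Shell class reads "hadoop.home.dir" as a fallback when
                // the HADOOP_HOME environment variable is not set.
                System.setProperty("hadoop.home.dir", bundledHadoop.getAbsolutePath());
            }
        }
    }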