OpenRefine: spark-prototype: temp files not being deleted on Windows

Describe the bug

Temp files on Windows are not deleted, which causes a SparkException.

To Reproduce

Steps to reproduce the behavior:

  1. Load a CSV file into OpenRefine (spark-prototype)
  2. Do a few cell edits
  3. Wait for spark.ContextCleaner to clean up the accumulators
  4. Observe the error when Spark tries to remove the temp files

Current Results

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
  Task 0 in stage 13.0 failed 1 times, most recent failure:
    Lost task 0.0 in stage 13.0 (TID 13, localhost, executor driver):
      java.io.IOException:
       (null) entry in command string:
         null chmod 0644 C:\Users\thadg\AppData\Roaming\OpenRefine\2096101862730.project\initial\grid\_temporary\0\_temporary\attempt_20200214082621_0034_m_000000_0\part-00000
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
<snip>

Expected behavior

On Windows, Spark should be able to delete the temp files.

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser Version: Firefox
  • JRE or JDK Version: OpenJDK 11

OpenRefine (please complete the following information):

  • Version: spark-prototype

Additional context

We might need to configure Spark properly for Windows (https://spark.apache.org/docs/latest/configuration.html), either through trial and error or by researching with the Spark community. Alternatively, we could use the Python API for Spark, which reportedly does not have these issues, according to Kingsley Jones.

Possible resolutions? (I found what Kingsley Jones had to say interesting): https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16134527#comment-16134527
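For context on the trace above: the "(null) entry in command string: null chmod …" message is the signature of Hadoop's Shell class failing to locate winutils.exe because HADOOP_HOME is not set. Below is a minimal sketch of a pre-flight check that could run on Windows before a Spark context is created; the class name and error message are illustrative, not existing OpenRefine code.

    import java.io.File;

    // Hypothetical pre-flight check (not existing OpenRefine code): verify that
    // Hadoop's Windows shims are present before creating a Spark context.
    public final class WinutilsCheck {
        static void verify() {
            if (!System.getProperty("os.name").toLowerCase().contains("win")) {
                return; // only relevant on Windows
            }
            String hadoopHome = System.getenv("HADOOP_HOME");
            if (hadoopHome == null) {
                // Hadoop's Shell class also accepts the "hadoop.home.dir" system property.
                hadoopHome = System.getProperty("hadoop.home.dir");
            }
            if (hadoopHome == null || !new File(hadoopHome, "bin/winutils.exe").isFile()) {
                // Without winutils.exe, Hadoop's Shell builds its chmod command with a
                // null executable, which surfaces as the "(null) entry in command
                // string: null chmod ..." IOException shown above.
                throw new IllegalStateException(
                        "HADOOP_HOME is not set or bin\\winutils.exe is missing; "
                        + "Spark file operations will fail on Windows.");
            }
        }
    }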

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 24 (24 by maintainers)

Most upvoted comments

Ok, so after more thinking I decided to just remove the dependency on Hadoop altogether, because I cannot get it to work reliably on Windows, and it will be useful not to be tied to Hadoop for a lot of other reasons (#4394).

So, the Hadoop binaries will no longer be required when running OpenRefine with the local runner (which is the default), only when running Spark.

I will add documentation to explain how to set it up.

Yes, the problem is that it’s not up to us to do this upgrade: as long as Hadoop has not upgraded to those Java NIO libraries, we will not benefit from those performance improvements in OpenRefine without loading the native code ourselves. But I would still prefer trying without it first; we can always add it back if it is critical. Also, techy users should be able to install it themselves without modifying the app (just by configuring the environment variables correctly).
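To illustrate what such an upgrade would buy us (a sketch only, assuming nothing about Hadoop's actual internals): java.nio.file performs permission changes and deletions inside the JVM instead of shelling out to chmod/winutils, so the chmod step that fails in the trace above can simply be skipped on Windows.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.attribute.PosixFilePermissions;

    // Illustrative sketch only: portable file operations via java.nio.file,
    // with no external process and no native Hadoop binaries.
    public final class NioFileOps {
        static void makeWorldReadable(Path file) throws IOException {
            // POSIX permissions are only supported where the file store provides
            // them; on Windows the "posix" view is absent, so we skip the chmod
            // equivalent instead of failing with a shell error.
            if (file.getFileSystem().supportedFileAttributeViews().contains("posix")) {
                Files.setPosixFilePermissions(file, PosixFilePermissions.fromString("rw-r--r--"));
            }
        }

        static void deleteQuietly(Path file) throws IOException {
            Files.deleteIfExists(file); // portable delete, no external process
        }
    }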

@thadguidry Thanks a lot! I hope this is fixed now, could you try it again? Both just ./refine (without Spark) and ./refine -r org.openrefine.model.SparkDatamodelRunner (with Spark).

I just plan to ship all the binaries directly. I should be able to work on this soon, thanks for the offer to test it 😃

Actually, we need the Hadoop binaries not just on Windows but also on other platforms. @thadguidry also reported that even on Windows this does not seem to be set up properly yet.

So I think we need to do 2 things…

  1. Set HADOOP_HOME on Windows.
  2. Package the Hadoop native libraries for Windows as part of the build packaging (latest here: https://github.com/cdarlint/winutils/tree/master/hadoop-3.2.1/bin), drop them into a path, and point HADOOP_HOME at that path (see the sketch below).
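A hedged sketch of what step 1 could look like in a launcher, assuming the binaries packaged in step 2 land in a "hadoop" directory inside the install; that directory name is an assumption for illustration, not OpenRefine's actual layout.

    import java.io.File;

    // Hypothetical launcher snippet: point Hadoop at the bundled winutils
    // binaries before any Spark/Hadoop class is loaded. The "hadoop"
    // subdirectory is an illustrative assumption, not OpenRefine's layout.
    public final class HadoopHomeSetup {
        static void configure(File installDir) {
            File bundledHadoop = new File(installDir, "hadoop"); // expects bin/winutils.exe inside
            if (System.getenv("HADOOP_HOME") == null
                    && System.getProperty("hadoop.home.dir") == null) {
                // Hadoop's Shell class reads "hadoop.home.dir" as a fallback when
                // the HADOOP_HOME environment variable is not set.
                System.setProperty("hadoop.home.dir", bundledHadoop.getAbsolutePath());
            }
        }
    }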