OpenRefine: spark-prototype: temp files not being deleted on Windows
Describe the bug
Temp files on Windows are not deleted and cause a SparkException.
To Reproduce
Steps to reproduce the behavior:
- Load csv into OpenRefine Spark
- Do a few cell edits
- Wait for spark.ContextCleaner to clean the accumulators
- See the error when Spark tries to remove the temp files.
Current Results
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 13.0 failed 1 times, most recent failure:
Lost task 0.0 in stage 13.0 (TID 13, localhost, executor driver):
java.io.IOException:
(null) entry in command string:
null chmod 0644 C:\Users\thadg\AppData\Roaming\OpenRefine\2096101862730.project\initial\grid\_temporary\0\_temporary\attempt_20200214082621_0034_m_000000_0\part-00000
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
<snip>
Expected behavior
On Windows, Spark should be able to delete the temp files.
Desktop (please complete the following information):
- OS: Windows 10
- Browser Version: Firefox
- JRE or JDK Version: OpenJDK 11
OpenRefine (please complete the following information):
- Version spark-prototype
Additional context
We might need to configure Spark properly for Windows, through trial and error or by asking the Spark community: https://spark.apache.org/docs/latest/configuration.html
We might instead use the Python API for Spark (supposedly it did not have these issues, according to Kingsley Jones).
Possible resolutions? I found what Kingsley Jones had to say interesting: https://issues.apache.org/jira/browse/SPARK-12216?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=16134527#comment-16134527
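For anyone hitting the same trace: a "(null) entry in command string: null chmod ..." message is the pattern commonly seen when Hadoop cannot locate winutils.exe on Windows. Hadoop looks for it under the directory given by the hadoop.home.dir system property, falling back to the HADOOP_HOME environment variable. The following is only a diagnostic sketch under that assumption; the class HadoopHomeCheck and its methods are hypothetical names, not part of OpenRefine, Spark or Hadoop.

```java
import java.io.File;
import java.util.Optional;

/** Hypothetical diagnostic helper; not part of OpenRefine, Spark or Hadoop. */
class HadoopHomeCheck {

    /**
     * Mirrors the lookup order Hadoop uses: the hadoop.home.dir system
     * property first, then the HADOOP_HOME environment variable.
     */
    static Optional<String> resolveHadoopHome(String systemProperty, String envVariable) {
        if (systemProperty != null && !systemProperty.isEmpty()) {
            return Optional.of(systemProperty);
        }
        if (envVariable != null && !envVariable.isEmpty()) {
            return Optional.of(envVariable);
        }
        return Optional.empty();
    }

    /** Location where winutils.exe is expected: <hadoop home>/bin/winutils.exe. */
    static File winutilsFile(String hadoopHome) {
        return new File(new File(hadoopHome, "bin"), "winutils.exe");
    }

    /** True only if winutils.exe exists as a regular file in the expected place. */
    static boolean hasWinutils(String hadoopHome) {
        return winutilsFile(hadoopHome).isFile();
    }

    public static void main(String[] args) {
        Optional<String> home = resolveHadoopHome(
                System.getProperty("hadoop.home.dir"),
                System.getenv("HADOOP_HOME"));
        if (!home.isPresent()) {
            System.out.println("Neither hadoop.home.dir nor HADOOP_HOME is set");
        } else if (!hasWinutils(home.get())) {
            System.out.println("winutils.exe not found at " + winutilsFile(home.get()));
        } else {
            System.out.println("Found " + winutilsFile(home.get()));
        }
    }
}
```

Running this on the affected machine before starting OpenRefine shows whether the Hadoop home is missing altogether or set to a directory that lacks bin\winutils.exe.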
About this issue
- State: closed
- Created 4 years ago
- Comments: 24 (24 by maintainers)
Commits related to this issue
- Add Hadoop Windows binaries, closes #2313 — committed to OpenRefine/OpenRefine by wetneb 4 years ago
- Add Spark native binaries on Windows for tests too, for #2313 — committed to OpenRefine/OpenRefine by wetneb 3 years ago
- Attempt to fix Hadoop path on Windows, for #2313 — committed to OpenRefine/OpenRefine by wetneb 3 years ago
- Revert "Attempt to fix Hadoop path on Windows, for #2313" The previous path specification was actually correct. This reverts commit dda1c554139dd573759a41751a39872c48428351. — committed to OpenRefine/OpenRefine by wetneb 3 years ago
- Enable debug to understand why Spark's native binaries are not found on windows. For #2313. — committed to wetneb/OpenRefine by wetneb 3 years ago
- Fix loading of Hadoop native utils on Windows (#4369) And upgrade to Spark 3.2.0. Fixes #2313. — committed to OpenRefine/OpenRefine by wetneb 3 years ago
- Fix Hadoop binaries path for local runner, for #2313 — committed to OpenRefine/OpenRefine by wetneb 3 years ago
- Fix Hadoop path for local runner in tests, for #2313. — committed to OpenRefine/OpenRefine by wetneb 3 years ago
- Fix packaging of Windows bundle, for #2313 — committed to OpenRefine/OpenRefine by wetneb 3 years ago
- Add Hadoop DLLs to PATH on Windows, for #2313. — committed to OpenRefine/OpenRefine by wetneb 3 years ago
- Remove Hadoop winutils, too brittle (see #2313) — committed to wetneb/OpenRefine by wetneb 3 years ago
- Add Hadoop Windows binaries, closes #2313 — committed to wetneb/OpenRefine by wetneb 4 years ago
OK, so after more thinking I decided to remove the dependency on Hadoop altogether, because I cannot get it to work reliably on Windows, and not being tied to Hadoop will be useful for a lot of other reasons (#4394).
So, those binaries will no longer be required when running OpenRefine with the local runner (which is the default), only when running Spark.
I will add documentation to explain how to set it up.
Yes, the problem is that it’s not up to us to do this upgrade: as long as Hadoop has not upgraded to those Java NIO libraries, we will not benefit from those performance improvements in OpenRefine without loading the native code ourselves. But I would still prefer to try without it first; we can always add it back if it is critical. Also, techy users should be able to install it themselves without modifying the app, just by configuring the environment variables correctly.
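As a rough illustration of what "configuring the environment variables correctly" could look like, the sketch below builds an environment with HADOOP_HOME set and the Hadoop bin directory (where winutils.exe and hadoop.dll normally live) prepended to PATH. The class name HadoopEnvironment, the method withHadoop and the C:\hadoop default are assumptions made for this example, not OpenRefine code.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; not part of OpenRefine. */
class HadoopEnvironment {

    /**
     * Returns a copy of env with HADOOP_HOME set and <hadoopHome>/bin
     * prepended to PATH. The input map is left untouched.
     * Assumption: the variable is spelled "PATH"; on Windows it is often
     * spelled "Path", which this simple sketch does not handle.
     */
    static Map<String, String> withHadoop(Map<String, String> env, String hadoopHome) {
        Map<String, String> result = new HashMap<>(env);
        result.put("HADOOP_HOME", hadoopHome);
        String binDir = hadoopHome + File.separator + "bin";
        String oldPath = result.get("PATH");
        if (oldPath == null || oldPath.isEmpty()) {
            result.put("PATH", binDir);
        } else {
            result.put("PATH", binDir + File.pathSeparator + oldPath);
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder location; pass the real Hadoop directory as the first argument.
        String hadoopHome = args.length > 0 ? args[0] : "C:\\hadoop";
        Map<String, String> env = withHadoop(System.getenv(), hadoopHome);
        System.out.println("HADOOP_HOME=" + env.get("HADOOP_HOME"));
        System.out.println("PATH=" + env.get("PATH"));
    }
}
```

The resulting map could be copied into ProcessBuilder.environment() before launching the application, or the same two values could simply be set by hand in the Windows environment settings.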
@thadguidry Thanks a lot! I hope this is fixed now, could you try it again? Both just ./refine (without Spark) and ./refine -r org.openrefine.model.SparkDatamodelRunner (with Spark).
I just plan to ship all the binaries directly. I should be able to work on this soon, thanks for the offer to test it 😃
Actually, we need Hadoop binaries not just for Windows but also for other platforms. @thadguidry also reported that even on Windows this does not seem to be set up properly yet.
So I think we need to do 2 things…
HADOOP_HOME on Windows.
HADOOP_HOME