fscrawler: content is null when creating the most simple job
Hello. Thank you for this package. I’m trying it out, but the content field is always null, even for a plain text file containing only the text shadi. Could you give me some pointers on how to get the content to show up? Besides plain text, I’d also like to index .xlsx, .xls, and .pdf files.
Here is my job settings file:
{
  "name" : "sic_list",
  "fs" : {
    "url" : "/data/fscrawler/files",
    "update_rate": "1m",
    "indexed_chars": "100%"
  },
  "elasticsearch" : {
    "index" : "sic_list",
    "type": "doc",
    "nodes" : [
      { "host" : "myhost.com", "port" : 9200 }
    ]
  }
}
And here is an excerpt from my --trace output:
10:18:31,678 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [test.txt], includes = [null], excludes = [null]
10:18:31,679 TRACE [f.p.e.c.f.u.FsCrawlerUtil] no rules
10:18:31,679 DEBUG [f.p.e.c.f.FsCrawlerImpl] [test.txt] can be indexed: [true]
10:18:31,679 DEBUG [f.p.e.c.f.FsCrawlerImpl] - file: test.txt
10:18:31,680 DEBUG [f.p.e.c.f.FsCrawlerImpl] fetching content from [/data/fscrawler/files],[test.txt]
10:18:31,680 DEBUG [f.p.e.c.f.FsCrawlerImpl] Indexing in ES sic_list, doc, 57e81419ed4fa6aa86d668bb9e28674
10:18:31,681 TRACE [f.p.e.c.f.FsCrawlerImpl] JSon indexed : {
  "content" : null,
  "attachment" : null,
  "meta" : {
    "author" : null,
    "title" : null,
    "date" : null,
    "keywords" : null,
    "raw" : null
  },
  "file" : {
    "content_type" : null,
    "last_modified" : "2017-01-21T10:10:03Z",
    "indexing_date" : "2017-01-21T10:18:31.680Z",
    "filesize" : null,
    "filename" : "test.txt",
    "url" : "file:///data/fscrawler/files/test.txt",
    "indexed_chars" : null,
    "checksum" : null
  },
  "path" : {
    "encoded" : "6113a2c108ffc50c1fd761817d96ca7",
    "root" : "6113a2c108ffc50c1fd761817d96ca7",
    "virtual" : "",
    "real" : "/data/fscrawler/files/test.txt"
  },
  "attributes" : null
}
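For anyone comparing traces like the one above, a small helper can list which fields came back null. This is only a diagnostic sketch, not part of fscrawler: the function name `null_paths` is made up for illustration, and the sample document is an abridged copy of the traced JSON.

```python
import json


def null_paths(node, prefix=""):
    """Return the dotted path of every field whose value is null (None).

    Only nested objects (dicts) are walked; lists are not inspected.
    """
    if node is None:
        return [prefix]
    if isinstance(node, dict):
        found = []
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else key
            found.extend(null_paths(value, child))
        return found
    return []


# Abridged copy of the document from the trace output.
traced = json.loads("""
{
  "content": null,
  "file": {
    "content_type": null,
    "filesize": null,
    "filename": "test.txt",
    "indexed_chars": null
  },
  "path": {"real": "/data/fscrawler/files/test.txt"}
}
""")

for path in null_paths(traced):
    print(path)
```

Here the file name and path are filled in while `content`, `file.content_type`, `file.filesize` and `file.indexed_chars` are all null, which suggests the file was found but its content was never extracted.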
I’m running fscrawler from a Dockerfile:
FROM openjdk:alpine
RUN apk add --update openssl
RUN wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.1/fscrawler-2.1.zip \
    && unzip fscrawler-2.1.zip
RUN mkdir ~/.fscrawler
WORKDIR ./fscrawler-2.1
ENTRYPOINT cp /data/fscrawler/home/* ~/.fscrawler -r \
    && bin/fscrawler --trace sic_list
with the following docker commands:
docker build -t fscrawler build/fscrawler/
docker run -it --rm --name fscrawler-siclist \
    -v /home/shadi/sic_lists/:/data/fscrawler/files/:ro \
    -v "${PWD}"/home/:/data/fscrawler/home/:ro \
    fscrawler
and all the files are readable by the same user that launches fscrawler.
About this issue
- State: closed
- Created 7 years ago
- Comments: 15 (15 by maintainers)
Commits related to this issue
- NPE when using --rest option without Rest settings From https://github.com/dadoonet/fscrawler/issues/276#issuecomment-274439085 Using: ```json { "name" : "sic_list", "fs" : { "url" : "/root... — committed to dadoonet/fscrawler by dadoonet 7 years ago
- Add better traces in Elasticsearch client From https://github.com/dadoonet/fscrawler/issues/276#issuecomment-274439085 Instead of: ``` 08:05:49,662 TRACE [f.p.e.c.f.c.ElasticsearchClient] create in... — committed to dadoonet/fscrawler by dadoonet 7 years ago
- Fix default values When writing manually a job settings file like the simplest one: ```json { "name": "test" } ``` We would expect that it comes with advertised default values. But because I miss... — committed to dadoonet/fscrawler by dadoonet 7 years ago
Ha! Indeed it’s actually true by default but only if you generate the job with FS Crawler. If you do it manually, it’s actually false.
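Until defaults are applied consistently, one workaround for hand-written job files is to spell out every value you rely on instead of trusting implicit defaults. The sketch below merges a job settings dict with a table of explicit defaults. It is not fscrawler code: `with_defaults` is a hypothetical helper, and the key `index_content` is an assumption, since this thread does not name the affected setting.

```python
import copy


def with_defaults(settings, defaults):
    """Return a new dict where keys missing from settings come from defaults.

    Values already present in settings always win; nested dicts are merged.
    """
    merged = copy.deepcopy(defaults)
    for key, value in settings.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_defaults(value, merged[key])
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# ASSUMPTION: "index_content" is a placeholder for whichever setting
# defaulted to false; the thread does not confirm its name.
ASSUMED_DEFAULTS = {"fs": {"index_content": True}}

job = {
    "name": "sic_list",
    "fs": {"url": "/data/fscrawler/files", "update_rate": "1m"},
}

explicit = with_defaults(job, ASSUMED_DEFAULTS)
print(explicit)
```

The merged result can then be written back to the job settings file, so the crawler never has to fall back on a default that differs between generated and hand-written jobs.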
Thanks a lot for finding this nasty bug.
I know there are some others which I’m going to fix now.
Btw, I took the liberty of opening a couple of separate issues for problems I saw while testing fscrawler. I hope you don’t mind 😃