fscrawler: content is null when creating the most simple job

Hello. Thank you for this package. I’m trying it out, but the content field is always null, even for a plain text file containing only the text "shadi". Could you give me some pointers on how to get the content to show up? Besides plain text files, I’d also like to index .xlsx, .xls, and .pdf files.

Here is my job settings file:

{
  "name" : "sic_list",
  "fs" : {
    "url" : "/data/fscrawler/files",
    "update_rate": "1m",
    "indexed_chars": "100%"
  },
  "elasticsearch" : {
    "index" : "sic_list",
    "type": "doc",
    "nodes" : [
      { "host" : "myhost.com", "port" : 9200 }
    ]
  }
}

Here is an excerpt from my --trace output:

10:18:31,678 DEBUG [f.p.e.c.f.u.FsCrawlerUtil] filename = [test.txt], includes = [null], excludes = [null]
10:18:31,679 TRACE [f.p.e.c.f.u.FsCrawlerUtil] no rules
10:18:31,679 DEBUG [f.p.e.c.f.FsCrawlerImpl] [test.txt] can be indexed: [true]
10:18:31,679 DEBUG [f.p.e.c.f.FsCrawlerImpl]   - file: test.txt
10:18:31,680 DEBUG [f.p.e.c.f.FsCrawlerImpl] fetching content from [/data/fscrawler/files],[test.txt]
10:18:31,680 DEBUG [f.p.e.c.f.FsCrawlerImpl] Indexing in ES sic_list, doc, 57e81419ed4fa6aa86d668bb9e28674
10:18:31,681 TRACE [f.p.e.c.f.FsCrawlerImpl] JSon indexed : {
  "content" : null,
  "attachment" : null,
  "meta" : {
    "author" : null,
    "title" : null,
    "date" : null,
    "keywords" : null,
    "raw" : null
  },
  "file" : {
    "content_type" : null,
    "last_modified" : "2017-01-21T10:10:03Z",
    "indexing_date" : "2017-01-21T10:18:31.680Z",
    "filesize" : null,
    "filename" : "test.txt",
    "url" : "file:///data/fscrawler/files/test.txt",
    "indexed_chars" : null,
    "checksum" : null
  },
  "path" : {
    "encoded" : "6113a2c108ffc50c1fd761817d96ca7",
    "root" : "6113a2c108ffc50c1fd761817d96ca7",
    "virtual" : "",
    "real" : "/data/fscrawler/files/test.txt"
  },
  "attributes" : null
}
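
To see at a glance which fields came back empty in a trace like the one above, a small helper can walk the indexed JSON and list every null field. This is only a sketch: the `null_paths` function is a hypothetical helper and not part of fscrawler, and the sample document is a shortened copy of the trace output.

```python
import json


def null_paths(node, prefix=""):
    """Return the dotted path of every field whose value is null (None)."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else key
            paths.extend(null_paths(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            paths.extend(null_paths(value, f"{prefix}[{index}]"))
    elif node is None:
        paths.append(prefix)
    return paths


# Shortened copy of the "JSon indexed" document from the trace output.
sample = json.loads("""
{
  "content": null,
  "file": {"content_type": null, "filename": "test.txt"},
  "attributes": null
}
""")

print(null_paths(sample))
```

Here the file metadata (filename, dates, paths) is populated while everything that depends on parsing the file body is null, which suggests the file is found but its content is never extracted.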

I’m running fscrawler from a Dockerfile:

FROM openjdk:alpine
RUN apk add --update openssl
RUN  wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler/2.1/fscrawler-2.1.zip \
  && unzip fscrawler-2.1.zip
RUN mkdir ~/.fscrawler
WORKDIR ./fscrawler-2.1
ENTRYPOINT cp /data/fscrawler/home/* ~/.fscrawler -r \
        && bin/fscrawler --trace sic_list

with the following Docker commands:

  docker build -t fscrawler build/fscrawler/
  docker run -it --rm --name fscrawler-siclist \
    -v /home/shadi/sic_lists/:/data/fscrawler/files/:ro \
    -v "${PWD}"/home/:/data/fscrawler/home/:ro \
    fscrawler

All the files are readable by the user that launches fscrawler.

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 15 (15 by maintainers)

Most upvoted comments

Ha! Indeed, it’s true by default, but only if you generate the job with FS Crawler. If you write the job file manually, it defaults to false.

Thanks a lot for finding this nasty bug.

I know there are some others which I’m going to fix now.
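
Since the comment above says a default differs between a job generated by FS Crawler and one written by hand, one way to spot such settings is to compare the two job files and list the keys that only the generated one contains. The following is a minimal sketch under assumptions: `missing_keys` is a hypothetical helper, and `hypothetical_flag` is a placeholder name, not a real fscrawler setting.

```python
def missing_keys(generated, manual, prefix=""):
    """List dotted paths of keys present in `generated` but absent from `manual`."""
    missing = []
    for key, value in generated.items():
        path = f"{prefix}.{key}" if prefix else key
        if not isinstance(manual, dict) or key not in manual:
            missing.append(path)
        elif isinstance(value, dict):
            # Recurse into nested sections such as "fs" or "elasticsearch".
            missing.extend(missing_keys(value, manual[key], path))
    return missing


# Stand-in for a job file generated by the tool (placeholder flag, not a real setting).
generated_job = {
    "name": "sic_list",
    "fs": {
        "url": "/data/fscrawler/files",
        "update_rate": "1m",
        "hypothetical_flag": True,
    },
}

# Hand-written job file, reduced to the "name" and "fs" sections from the question.
manual_job = {
    "name": "sic_list",
    "fs": {
        "url": "/data/fscrawler/files",
        "update_rate": "1m",
        "indexed_chars": "100%",
    },
}

print(missing_keys(generated_job, manual_job))
```

Any key reported this way relies on an implicit default in the hand-written file, so setting it explicitly avoids depending on which default applies.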

Btw, I took the liberty of opening a couple of separate issues for things I noticed while testing fscrawler. I hope you don’t mind 😃