elasticsearch-hadoop: Dots in field names exception

Environment: spark-1.6.2-bin-hadoop2.6, elasticsearch-5.0.0-beta1, elasticsearch-hadoop-5.0.0-beta1

Indexing a document whose field name contains a dot and then reading the index back through the Spark SQL data source fails on df1.show():

curl -XPOST localhost:9200/test4/test -d '{"b":0,"e":{"f.g":"hello"}}'
./bin/pyspark --driver-class-path=../elasticsearch-hadoop-5.0.0-beta1/dist/elasticsearch-hadoop-5.0.0-beta1.jar
>>> df1 = sqlContext.read.format("org.elasticsearch.spark.sql").load("test4/test")
>>> df1.printSchema()
root
 |-- b: long (nullable = true)
 |-- e: struct (nullable = true)
 |    |-- f: struct (nullable = true)
 |    |    |-- g: string (nullable = true)

>>> df1.show()
---8<--- snip ---8<--- 
org.elasticsearch.hadoop.EsHadoopIllegalStateException: Position for 'e.f.g' not found in row; typically this is caused by a mapping inconsistency
    at org.elasticsearch.spark.sql.RowValueReader$class.addToBuffer(RowValueReader.scala:45)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.addToBuffer(ScalaEsRowValueReader.scala:14)
    at org.elasticsearch.spark.sql.ScalaRowValueReader.addToMap(ScalaEsRowValueReader.scala:94)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:806)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:696)
    at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:806)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:696)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:466)
    at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:391)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:286)
    at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:259)
    at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:365)
    at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
    at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:43)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
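
A possible workaround, not part of the original report: es-hadoop exposes source-filtering options (es.read.field.include / es.read.field.exclude), and excluding the subtree that contains the dotted name may let the rest of the document load. Whether this fully avoids the row-position lookup is an assumption, so treat the snippet below as a sketch rather than a confirmed fix.

>>> df2 = (sqlContext.read.format("org.elasticsearch.spark.sql")
...        .option("es.read.field.exclude", "e")  # drop the subtree holding the dotted leaf 'f.g'
...        .load("test4/test"))
>>> df2.show()

The trade-off is that the excluded subtree is simply absent from the DataFrame; this does not make the dotted field itself readable.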

About this issue

  • State: open
  • Created 8 years ago
  • Reactions: 8
  • Comments: 21 (8 by maintainers)

Most upvoted comments

I am going to go ahead and re-open this since it seems like this “problem” of dots in field names is less of a “problem” and more just where things are trending in the data integration space. It would be unwise of us to ignore this issue given recent developments across existing solutions.

That said, this issue is not an easy fix; it requires adjusting invariants that we have treated very carefully over the years - most notably that _source is sacred and should only be changed judiciously. Additionally, the document-update logic will likely need revisiting (just try running a partial document update using normalized JSON in the request against a document containing dotted field names).
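
For illustration only (the document id below and the exact merge outcome are assumptions, not details taken from the issue): against the document indexed above, a partial update that spells the field the normalized way has no obvious relationship to the dotted key already stored in _source, which is the kind of conflict the comment alludes to.

curl -XPOST 'localhost:9200/test4/test/1/_update' -d '{"doc":{"e":{"f":{"g":"world"}}}}'

A literal key-by-key merge of the partial doc into _source would presumably leave both "f.g" and a nested "f": {"g": ...} under "e", even though the mapping treats the two spellings as the same field path e.f.g.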