python-deequ: TypeError: 'JavaPackage' object is not callable when running pydeequ

Describe the bug I get an exception when I try to run PyDeequ: "TypeError: 'JavaPackage' object is not callable".

To Reproduce Steps to reproduce the behavior:

  1. pip install pydeequ==0.1.5
  2. Code:
from pyspark.sql import SparkSession, Row
import pydeequ

# Pull in the deequ JAR via its Maven coordinate and exclude the
# conflicting f2j artifact, as the PyDeequ README recommends.
spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

df = spark.sparkContext.parallelize([
            Row(a="foo", b=1, c=5),
            Row(a="bar", b=2, c=6),
            Row(a="baz", b=3, c=None)]).toDF()

from pydeequ.analyzers import *

# Run two simple analyzers: row count (Size) and completeness of column "b".
analysisResult = AnalysisRunner(spark) \
                    .onData(df) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("b")) \
                    .run()
                    
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()
  3. Execute the code above
  4. See error: TypeError: 'JavaPackage' object is not callable

Expected behavior I was expecting the results of the analyzer.


Desktop (please complete the following information):

  • Apache Spark 3.0.0
  • Scala 2.12
  • PyDeequ 0.1.5

Additional context I’m running it on a Databricks cluster.

Thank you for your help.
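For context, this error usually means the deequ JAR never made it onto the JVM classpath, so py4j resolves com.amazon.deequ.* to an empty JavaPackage instead of a Java class. A quick sanity check, assuming a running SparkSession named spark:

# If the deequ JAR is on the classpath, this prints a py4j JavaClass;
# if it prints a JavaPackage, any constructor call will fail with
# "TypeError: 'JavaPackage' object is not callable".
print(spark.sparkContext._jvm.com.amazon.deequ.VerificationSuite)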

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 10
  • Comments: 33 (6 by maintainers)

Most upvoted comments

Experiencing the same issue. Solved by using pyspark --jars /path-to-the-jar/deequ-1.0.5.jar

More info: Python version 3.7.9, Spark version 2.4.7, Scala version 2.13.4.
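For scripts that build their own session rather than launching the shell, a roughly equivalent sketch is to pass the local JAR via spark.jars (the path below is a placeholder):

from pyspark.sql import SparkSession

# spark.jars takes JAR file paths, unlike spark.jars.packages,
# which takes Maven coordinates.
spark = (SparkSession
    .builder
    .config("spark.jars", "/path-to-the-jar/deequ-1.0.5.jar")
    .getOrCreate())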

I installed the following Maven package directly instead of pydeequ.deequ_maven_coord:

com.amazon.deequ:deequ:1.1.0_spark-3.0-scala-2.12

You need to check whether they have an exact match for your cluster and add it as a Maven package on the Databricks cluster. @anusha610, if you are running it locally (using dbconnect), use the spark object as follows:

spark = (SparkSession
    .builder
    .config("spark.jars.packages", "com.amazon.deequ:deequ:1.1.0_spark-3.0-scala-2.12")
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

We have not tested with Databricks yet, but here is how you'd get started with an Amazon EMR cluster; I presume there may be some overlap here! Copied and pasted below:

Your EMR cluster must be running Spark v2.4.6 in order to work with PyDeequ. Once you have a running cluster that has those components and a SageMaker notebook with the necessary permissions, you can configure a SparkSession object from the below template to connect to your cluster. If you need a refresher on how to connect a SageMaker Notebook to EMR, check out this AWS blogpost on using Sparkmagic.

Once you’re in the SageMaker Notebook, run the following JSON in a cell before you start your SparkSession to configure your EMR cluster.

%%configure -f
{ "conf":{
          "spark.pyspark.python": "python3",
          "spark.pyspark.virtualenv.enabled": "true",
          "spark.pyspark.virtualenv.type":"native",
          "spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv",
          "spark.jars.packages": "com.amazon.deequ:deequ:1.0.3",
          "spark.jars.excludes": "net.sourceforge.f2j:arpack_combined_all"
         }
}

Start your SparkSession in a cell after the above configuration by running spark, then use the SparkContext (named sc by default) to install PyDeequ onto your cluster like so:

sc.install_pypi_package('pydeequ')
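If you want the install to stay reproducible across notebook restarts, pinning the version should also work (the version below is simply the one from this report; adjust as needed):

# install_pypi_package accepts pip-style version specifiers on EMR.
sc.install_pypi_package('pydeequ==0.1.5')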

Same issue on Spark version 2.4.3. I'm using 2.4.3 hoping to load PyDeequ into a Glue ETL job. Do you know if deequ is compatible with Glue v2?

Using pyspark --jars {PATH_TO_DEEQ_JAR} resolves this error for me; I think this should be added to the installation steps.

It works fine with the following configuration.

Use https://mvnrepository.com/artifact/com.amazon.deequ/deequ to pick the deequ version that matches your Spark version for spark.jars.packages:

from pyspark.sql import SparkSession
import pydeequ


def create_spark():
    """Build a SparkSession with the deequ JAR and the f2j exclusion configured."""
    spark = (
        SparkSession.builder.config(
            "spark.jars.packages", "com.amazon.deequ:deequ:2.0.1-spark-3.2"
        )
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
        .getOrCreate()
    )
    return spark
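A minimal usage sketch for the helper above, reusing the analyzer example from the issue body:

from pyspark.sql import Row
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Size, Completeness

spark = create_spark()
df = spark.createDataFrame([Row(a="foo", b=1), Row(a="bar", b=2)])

# Same two analyzers as in the original report: row count and completeness.
result = (AnalysisRunner(spark)
          .onData(df)
          .addAnalyzer(Size())
          .addAnalyzer(Completeness("b"))
          .run())

AnalyzerContext.successMetricsAsDataFrame(spark, result).show()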

Also, if you are using Databricks, make sure you install this as a Maven package under the cluster's libraries.


@MOHACGCG @vinura - Could you please suggest the script changes for this fix?

Issue: I am facing a similar error using Databricks with the PyDeequ version below. Error: TypeError: 'JavaPackage' object is not callable

  • Python version: 3.7.9
  • PySpark: 2.4.0
  • Scala version: 2.13.4
  • PyDeequ: 1.0.1

Tried: I downloaded the suggested JARs, uploaded them to the Databricks FileStore, and passed them to the Spark session:

import pydeequ
import sagemaker_pyspark
from pyspark.sql import SparkSession, Row
classpath = ":".join(sagemaker_pyspark.classpath_jars()) # aws-specific jars
spark = (SparkSession
    .builder
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", '/FileStore/jars/deequ_1_0_5.jar')
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

Attached a screenshot of the error.

Could you please suggest the appropriate version, steps, and scripts for a Databricks implementation?
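One likely problem in the snippet above: spark.jars.packages expects Maven coordinates, while a FileStore path belongs in spark.jars. A corrected sketch under that assumption, keeping the path from the report:

import pydeequ
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    # spark.jars takes JAR file paths; spark.jars.packages takes Maven coordinates.
    .config("spark.jars", "/FileStore/jars/deequ_1_0_5.jar")
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())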

@SerenaLin2020 We have not tested PyDeequ with deequ-1.0.5.jar, so some functionalities may be impaired. Please try with deequ-1.0.3.jar and keep us updated! 😄

Thanks @gucciwang for the insight. However, it was not working automatically as intended; perhaps that was to do with my setup. I therefore followed @MOHACGCG's instruction in the comment above, and it works now. Kindly make note of this in the README in the interest of the larger audience.

@MOHACGCG I'm trying to find the path to the deequ jar so as to try your solution. I'm looking under the virtualenv path /lib/Python3.7/site-packages/pyspark/jars, but I don't see any jar corresponding to deequ in there. So I'm considering downloading the deequ jar and copying it under this path, then running your command pointing at that location. Do you think there is a better alternative?

I just downloaded the jar from here and passed it on.
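For anyone else looking: deequ JARs are published to Maven Central, so one way to fetch a copy before placing it under that pyspark/jars directory is a direct download (the URL below follows the standard Maven repository layout; pick the version matching your Spark/Scala build):

import urllib.request

# Standard Maven Central layout for com.amazon.deequ:deequ:1.0.3;
# adjust the version to match your cluster.
url = ("https://repo1.maven.org/maven2/com/amazon/deequ/deequ/"
       "1.0.3/deequ-1.0.3.jar")
urllib.request.urlretrieve(url, "deequ-1.0.3.jar")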
