splink: ParseError: Required keyword: 'this' missing for <class 'sqlglot.expressions.EQ'>
What happens?
Hi,
I am trying to run the spark example (https://moj-analytical-services.github.io/splink/demos/examples/spark/deduplicate_1k_synthetic.html) and the error I am getting is:

```
ParseError: Required keyword: 'this' missing for <class 'sqlglot.expressions.EQ'>. Line 1, Col: 65.
  l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1
```
Any ideas on what is going wrong and what I can do about it? Thank you!!
Sincerely,
tom
To Reproduce
```python
%pip install pyspark==3.4.1
# %pip install pyspark
%pip install --upgrade --force-reinstall pyarrow
%pip install pyodbc
%pip install duckdb
%pip install splink
%pip install usaddress
%pip install nbformat

import pyodbc
import os
import pandas as pd
import re
import usaddress
import time

from pyspark.sql import SparkSession
import pyspark.sql.functions as pyfuncs
from pyspark.sql.types import *
from pyspark.sql import Window

from splink.spark.jar_location import similarity_jar_location

path = similarity_jar_location()

print("create spark sesh")

spark = (
    SparkSession.builder
    .appName("tomssplinktest")
    .config("spark.master", "spark://ddlas01.hosted.lac.com:7077")
    .config("spark.executor.memory", "45g")
    .config("spark.driver.memory", "10g")
    .config("spark.executor.cores", "1")
    .config("spark.cores.max", "8")
    .config("spark.executor.instances", "1")
    .config("spark.jars", path)
    .config("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
    .getOrCreate()
)

print("created spark sesh!")

# Disable warnings for pyspark - you don't need to include this
import warnings

spark.sparkContext.setLogLevel("ERROR")
warnings.simplefilter("ignore", UserWarning)

from splink.datasets import splink_datasets

pandas_df = splink_datasets.fake_1000
df = spark.createDataFrame(pandas_df)

import splink.spark.comparison_library as cl
import splink.spark.comparison_template_library as ctl
from splink.spark.blocking_rule_library import block_on

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        "l.surname = r.surname",  # alternatively, you can write BRs in their SQL form
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "em_convergence": 0.01,
}

from splink.spark.linker import SparkLinker

linker = SparkLinker(df, settings, spark=spark)

deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)
```
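One way to narrow down a parse error like this is to check each rule string on its own with sqlglot, the SQL parser splink uses for these rules. A minimal sketch, reusing the `deterministic_rules` list above (the `read="spark"` dialect name is sqlglot's, not splink's):

```python
# Sketch: find which rule sqlglot rejects by parsing each one in isolation.
import sqlglot
from sqlglot.errors import ParseError

for rule in deterministic_rules:
    try:
        sqlglot.parse_one(rule, read="spark")
    except ParseError as e:
        print(f"rule fails to parse: {rule!r}")
        print(f"  {e}")
```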
OS:
linux
Splink version:
most recent pypi version
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree
About this issue
- State: closed
- Created 4 months ago
- Comments: 15 (7 by maintainers)
Thanks so much for this @mattjbishop! I think this looks like it is probably the issue - @theimanph, if I look at the source text of your comment, I see that the code block has exactly these character entity refs: `&lt;=` instead of `<=`. I think because the code was not contained in a codeblock, github renders `&lt;` as `<` automatically, so we never spotted the problem with the code, as it appears fine when rendered on github.

Hi - I don't know if this helps, but I got exactly this error (same version of sqlglot) when I copy & pasted this example from the github.io page. That page has the deterministic rules written with `&lt;= 1` at the end of the rule instead of `<= 1`, which causes a parse error. The demo file in the repo (docs/demos/examples/spark/deduplicate_1k_synthetic.ipynb) has the correct text for the rules.