splink: ParseError: Required keyword: 'this' missing for <class 'sqlglot.expressions.EQ'>
What happens?
Hi,
I am trying to run the spark example (https://moj-analytical-services.github.io/splink/demos/examples/spark/deduplicate_1k_synthetic.html) and the error I am getting is:

```
ParseError: Required keyword: 'this' missing for <class 'sqlglot.expressions.EQ'>. Line 1, Col: 65.
  l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1
```
Any ideas on what is going wrong and what I can do about it? Thank you!!
Sincerely,
tom
To Reproduce
```python
%pip install pyspark==3.4.1
# %pip install pyspark
%pip install --upgrade --force-reinstall pyarrow
%pip install pyodbc
%pip install duckdb
%pip install splink
%pip install usaddress
%pip install nbformat

import pyodbc
import os
import pandas as pd
import re
import usaddress
import time

from pyspark.sql import SparkSession
import pyspark.sql.functions as pyfuncs
from pyspark.sql.types import *
from pyspark.sql import Window

from splink.spark.jar_location import similarity_jar_location

path = similarity_jar_location()

print("create spark sesh")

spark = (
    SparkSession.builder
    .appName("tomssplinktest")
    .config("spark.master", "spark://ddlas01.hosted.lac.com:7077")
    .config("spark.executor.memory", "45g")
    .config("spark.driver.memory", "10g")
    .config("spark.executor.cores", "1")
    .config("spark.cores.max", "8")
    .config("spark.executor.instances", "1")
    .config("spark.jars", path)
    .config("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
    .getOrCreate()
)

print("created spark sesh!")

# Disable warnings for pyspark - you don't need to include this
import warnings

spark.sparkContext.setLogLevel("ERROR")
warnings.simplefilter("ignore", UserWarning)

from splink.datasets import splink_datasets

pandas_df = splink_datasets.fake_1000
df = spark.createDataFrame(pandas_df)

import splink.spark.comparison_library as cl
import splink.spark.comparison_template_library as ctl
from splink.spark.blocking_rule_library import block_on

settings = {
    "link_type": "dedupe_only",
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email", include_username_fuzzy_level=False),
    ],
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        "l.surname = r.surname",  # alternatively, you can write BRs in their SQL form
    ],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": True,
    "em_convergence": 0.01,
}

from splink.spark.linker import SparkLinker

linker = SparkLinker(df, settings, spark=spark)

deterministic_rules = [
    "l.first_name = r.first_name and levenshtein(r.dob, l.dob) <= 1",
    "l.surname = r.surname and levenshtein(r.dob, l.dob) <= 1",
    "l.first_name = r.first_name and levenshtein(r.surname, l.surname) <= 2",
    "l.email = r.email",
]

linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)
```
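One way to narrow down a parse error like this is to check each rule string on its own with sqlglot, the SQL parser splink uses for these rules. A minimal sketch, reusing the `deterministic_rules` list above (the `read="spark"` dialect name is sqlglot's, not splink's):

```python
# Sketch: find which rule sqlglot rejects by parsing each one in isolation.
import sqlglot
from sqlglot.errors import ParseError

for rule in deterministic_rules:
    try:
        sqlglot.parse_one(rule, read="spark")
    except ParseError as e:
        print(f"rule fails to parse: {rule!r}")
        print(f"  {e}")
```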
OS:
linux
Splink version:
most recent pypi version
Have you tried this on the latest master branch?
- I agree
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- I agree
About this issue
- State: closed
- Created 4 months ago
- Comments: 15 (7 by maintainers)
Thanks so much for this @mattjbishop! I think this looks like it is probably the issue - @theimanph, if I look at the source text of your comment, I see that the code block has exactly these character entity refs: `&lt;=` instead of `<=`. I think because the code was not contained in a codeblock, github renders `&lt;` as `<` automatically, so we never spotted the problem with the code, as it appears fine when rendered on github.

Hi - I don't know if this helps, but I got exactly this error (same version of sqlglot) when I copy & pasted this example from the github.io page. That page has the deterministic rules written with `&lt;= 1` at the end of the rule instead of `<= 1`, which causes a parse error. The demo file in the repo (docs/demos/examples/spark/deduplicate_1k_synthetic.ipynb) has the correct text for the rules.