splink: INTERNAL Error: Invalid unicode detected in segment statistics update!

Hi,

This is the error I get when I run `clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.95)`:


```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [51], in <cell line: 1>()
----> 1 clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.95)

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:1211, in Linker.cluster_pairwise_predictions_at_threshold(self, df_predict, threshold_match_probability)
   1203 self._initialise_df_concat_with_tf(df_predict)
   1205 edges_table = _cc_create_unique_id_cols(
   1206     self,
   1207     df_predict,
   1208     threshold_match_probability,
   1209 )
-> 1211 cc = solve_connected_components(self, edges_table)
   1213 return cc

File ~\AppData\Roaming\Python\Python310\site-packages\splink\connected_components.py:411, in solve_connected_components(linker, edges_table, _generated_graph)
    409 sql = _cc_update_representatives_first_iter()
    410 # Execute if we have no batching, otherwise add it to our batched process
--> 411 representatives = linker._enqueue_and_execute_sql_pipeline(
    412     sql, "__splink__df_representatives"
    413 )
    414 prev_representatives_table = representatives
    416 # Loop while our representative table still has unsettled nodes

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:364, in Linker._enqueue_and_execute_sql_pipeline(self, sql, output_table_name, materialise_as_hash, use_cache, transpile)
    361 """Wrapper method to enqueue and execute a sql pipeline in a single call."""
    363 self._enqueue_sql(sql, output_table_name)
--> 364 return self._execute_sql_pipeline([], materialise_as_hash, use_cache, transpile)

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:323, in Linker._execute_sql_pipeline(self, input_dataframes, materialise_as_hash, use_cache, transpile)
    319     sql_gen = self._pipeline._generate_pipeline(input_dataframes)
    321     output_tablename_templated = self._pipeline.queue[-1].output_table_name
--> 323     dataframe = self._sql_to_splink_dataframe(
    324         sql_gen,
    325         output_tablename_templated,
    326         materialise_as_hash,
    327         use_cache,
    328         transpile,
    329     )
    330     return dataframe
    331 else:
    332     # In debug mode, we do not pipeline the sql and print the
    333     # results of each part of the pipeline

File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:405, in Linker._sql_to_splink_dataframe(self, sql, output_tablename_templated, materialise_as_hash, use_cache, transpile)
    402     print(sql)
    404 if materialise_as_hash:
--> 405     splink_dataframe = self._execute_sql(
    406         sql, output_tablename_templated, table_name_hash, transpile=transpile
    407     )
    408 else:
    409     splink_dataframe = self._execute_sql(
    410         sql,
    411         output_tablename_templated,
    412         output_tablename_templated,
    413         transpile=transpile,
    414     )

File ~\AppData\Roaming\Python\Python310\site-packages\splink\duckdb\duckdb_linker.py:227, in DuckDBLinker._execute_sql(self, sql, templated_name, physical_name, transpile)
    220 logger.log(5, log_sql(sql))
    222 sql = f"""
    223 CREATE TABLE {physical_name}
    224 AS
    225 ({sql})
    226 """
--> 227 self._con.execute(sql).fetch_df()
    229 return DuckDBLinkerDataFrame(templated_name, physical_name, self)

RuntimeError: INTERNAL Error: INTERNAL Error: Invalid unicode detected in segment statistics update!
```

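The bottom frame shows the failure happens inside DuckDB's `CREATE TABLE ... AS (...)` step, so DuckDB appears to be rejecting string data that is not valid UTF-8. A possible workaround, offered only as a sketch and not an official Splink fix, is to scrub the string columns of the input DataFrame before building the linker; the helper name `scrub_invalid_unicode` is mine:

```python
import pandas as pd

def scrub_invalid_unicode(df: pd.DataFrame) -> pd.DataFrame:
    """Drop code points that cannot round-trip through UTF-8 (e.g. lone
    surrogates from a mis-decoded file) in every object column."""
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].map(
            lambda v: v.encode("utf-8", "ignore").decode("utf-8")
            if isinstance(v, str)
            else v
        )
    return out
```

If this makes the error go away, the offending rows can be found by comparing the scrubbed frame against the original.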
Some extra info: Splink version 3.0.1

```
Name                Version  Build           Channel
anaconda            2022.05  py39_0
anaconda-client     1.9.0    py39haa95532_0
anaconda-navigator  2.1.4    py39haa95532_0
anaconda-project    0.10.2   pyhd3eb1b0_0
```

Windows specifications: Edition Windows 10 Enterprise, Version 20H2
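Since the `INTERNAL Error` is raised by DuckDB itself rather than by Splink, it may also help to report the installed DuckDB version alongside the Splink one; this is my suggestion, not something the thread asked for:

```python
import duckdb
import splink

print("splink:", splink.__version__)
print("duckdb:", duckdb.__version__)
```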

Thank you very much.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Comments: 15 (9 by maintainers)

Most upvoted comments

Yes, I'm on Windows 10. Using the Spark backend instead of DuckDB, it worked.
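For anyone hitting the same thing, here is a hedged sketch of what switching to the Spark backend looks like. I am assuming the Splink 3.0.x import path `splink.spark.spark_linker.SparkLinker` (mirroring the `splink.duckdb.duckdb_linker` path in the traceback) and reusing the same `settings` dict; the constructor argument order changed across early 3.x releases, so check the docs for your version:

```python
# Sketch only: same pipeline, Spark backend instead of DuckDB.
from pyspark.sql import SparkSession
from splink.spark.spark_linker import SparkLinker  # assumed 3.0.x path

spark = SparkSession.builder.appName("splink-clustering").getOrCreate()
spark_df = spark.createDataFrame(df)  # `df` is the same pandas frame as before

linker = SparkLinker(spark_df, settings)  # argument order may differ in your release
df_predict = linker.predict()
clusters = linker.cluster_pairwise_predictions_at_threshold(
    df_predict, threshold_match_probability=0.95
)
```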