splink: INTERNAL Error: Invalid unicode detected in segment statistics update!
Hi,
I get the following error when I run `clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.95)`:
`---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Input In [51], in <cell line: 1>()
----> 1 clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.95)
File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:1211, in Linker.cluster_pairwise_predictions_at_threshold(self, df_predict, threshold_match_probability)
1203 self._initialise_df_concat_with_tf(df_predict)
1205 edges_table = _cc_create_unique_id_cols(
1206 self,
1207 df_predict,
1208 threshold_match_probability,
1209 )
-> 1211 cc = solve_connected_components(self, edges_table)
1213 return cc
File ~\AppData\Roaming\Python\Python310\site-packages\splink\connected_components.py:411, in solve_connected_components(linker, edges_table, _generated_graph)
409 sql = _cc_update_representatives_first_iter()
410 # Execute if we have no batching, otherwise add it to our batched process
--> 411 representatives = linker._enqueue_and_execute_sql_pipeline(
412 sql, "__splink__df_representatives"
413 )
414 prev_representatives_table = representatives
416 # Loop while our representative table still has unsettled nodes
File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:364, in Linker._enqueue_and_execute_sql_pipeline(self, sql, output_table_name, materialise_as_hash, use_cache, transpile)
361 """Wrapper method to enqueue and execute a sql pipeline in a single call."""
363 self._enqueue_sql(sql, output_table_name)
--> 364 return self._execute_sql_pipeline([], materialise_as_hash, use_cache, transpile)
File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:323, in Linker._execute_sql_pipeline(self, input_dataframes, materialise_as_hash, use_cache, transpile)
319 sql_gen = self._pipeline._generate_pipeline(input_dataframes)
321 output_tablename_templated = self._pipeline.queue[-1].output_table_name
--> 323 dataframe = self._sql_to_splink_dataframe(
324 sql_gen,
325 output_tablename_templated,
326 materialise_as_hash,
327 use_cache,
328 transpile,
329 )
330 return dataframe
331 else:
332 # In debug mode, we do not pipeline the sql and print the
333 # results of each part of the pipeline
File ~\AppData\Roaming\Python\Python310\site-packages\splink\linker.py:405, in Linker._sql_to_splink_dataframe(self, sql, output_tablename_templated, materialise_as_hash, use_cache, transpile)
402 print(sql)
404 if materialise_as_hash:
--> 405 splink_dataframe = self._execute_sql(
406 sql, output_tablename_templated, table_name_hash, transpile=transpile
407 )
408 else:
409 splink_dataframe = self._execute_sql(
410 sql,
411 output_tablename_templated,
412 output_tablename_templated,
413 transpile=transpile,
414 )
File ~\AppData\Roaming\Python\Python310\site-packages\splink\duckdb\duckdb_linker.py:227, in DuckDBLinker._execute_sql(self, sql, templated_name, physical_name, transpile)
220 logger.log(5, log_sql(sql))
222 sql = f"""
223 CREATE TABLE {physical_name}
224 AS
225 ({sql})
226 """
--> 227 self._con.execute(sql).fetch_df()
229 return DuckDBLinkerDataFrame(templated_name, physical_name, self)
RuntimeError: INTERNAL Error: INTERNAL Error: Invalid unicode detected in segment statistics update!
`
Some extra info: Splink version 3.0.1
Name                Version   Build            Channel
anaconda            2022.05   py39_0
anaconda-client     1.9.0     py39haa95532_0
anaconda-navigator  2.1.4     py39haa95532_0
anaconda-project    0.10.2    pyhd3eb1b0_0
Windows specifications: Edition Windows 10 Enterprise, Version 20H2
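The traceback paths mention Python310 while the conda builds listed here are py39, so it may be worth confirming which interpreter and package versions the notebook kernel actually uses. A minimal sketch using only the standard library; the helper name `report_versions` is hypothetical, and the package names passed in are just the ones relevant to this report:

```python
import sys
from importlib import metadata


def report_versions(packages=("splink", "duckdb", "pandas")):
    """Return the running Python version and the installed version of each package.

    A package that is not installed in this environment maps to None.
    """
    info = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None
    return info


if __name__ == "__main__":
    # Run this in the same kernel that raised the error.
    for name, version in report_versions().items():
        print(f"{name}: {version}")
```

Running it inside the same notebook that raised the error should show whether the Splink and DuckDB versions match what was expected.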
Thank you very much
About this issue
- State: closed
- Created 2 years ago
- Comments: 15 (9 by maintainers)
Yes, I'm on Windows 10. Using the Spark backend, it worked.
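For anyone who wants to narrow this down on the DuckDB backend: the error message suggests that some string value is not valid Unicode, although the thread does not confirm the root cause. One way to check the input data is to scan it for values that cannot be round-tripped through UTF-8. This is a rough sketch, not part of Splink; `find_invalid_unicode` is a hypothetical helper, and it assumes the input rows are available as dictionaries.

```python
def find_invalid_unicode(records):
    """Return (row_position, column, repr_of_value) for suspicious values.

    A str is flagged if it cannot be encoded as UTF-8 (for example because it
    contains a lone surrogate). A bytes value is flagged if it cannot be
    decoded as UTF-8. Other types are ignored.
    """
    problems = []
    for position, row in enumerate(records):
        for column, value in row.items():
            try:
                if isinstance(value, str):
                    value.encode("utf-8")
                elif isinstance(value, bytes):
                    value.decode("utf-8")
            except UnicodeError:
                # UnicodeEncodeError and UnicodeDecodeError both derive from UnicodeError.
                problems.append((position, column, repr(value)))
    return problems


if __name__ == "__main__":
    # Made-up rows for illustration only.
    rows = [
        {"unique_id": 1, "first_name": "Zoë"},
        {"unique_id": 2, "first_name": "bad\udcff"},
    ]
    for position, column, value in find_invalid_unicode(rows):
        print(f"row {position}, column {column}: {value}")
```

With a pandas DataFrame, the rows can be obtained with `df.to_dict("records")`. If nothing is flagged, the input strings are probably not the cause, and the problem may lie elsewhere.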