Better handling of embeddings with two rare, but not unusual, files in them

I have encountered pickled embeddings with a short byteorder file at the top level, as well as a .data/serialization_id file.

Embeddings containing either file load fine once these two file names are allowed in the archive.

I think adding them to the safe unpickle regular expression is unlikely to be a security risk, but that is for the maintainers to decide.
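
To illustrate what the widened pattern accepts, here is a minimal sketch. The pattern is copied from this commit; the directory name "emb", the sample entry names, and the helper rejected_names are hypothetical and exist only for this example. It is not the project's actual check, just a way to see which archive entries would pass.

```python
import re

# Pattern as it reads after this commit (copied from the diff).
allowed_zip_names_re = re.compile(
    r"^([^/]+)/((data/\d+)|byteorder|(\.data\/serialization_id)|version|(data\.pkl))$"
)

# Hypothetical archive listing; "emb" is an illustrative directory name.
names = [
    "emb/data.pkl",
    "emb/version",
    "emb/data/0",
    "emb/byteorder",                # newly allowed
    "emb/.data/serialization_id",   # newly allowed
    "emb/extra.py",                 # still rejected
    "emb/sub/byteorder",            # still rejected: only allowed directly under the top-level directory
]


def rejected_names(names):
    """Return the entries that the pattern does not allow (illustrative helper)."""
    return [n for n in names if not allowed_zip_names_re.match(n)]


print(rejected_names(names))
```

Because the pattern is anchored at both ends, the two new names are accepted only at those exact paths under the single top-level directory; a file named byteorder anywhere deeper still fails the check.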
Brendan Hoar 2024-04-26 07:55:39 -04:00 committed by GitHub
parent c5b7559856
commit c5ae225418
GPG Key ID: B5690EEEBB952194

@@ -65,7 +65,7 @@ class RestrictedUnpickler(pickle.Unpickler):
 # Regular expression that accepts 'dirname/version', 'dirname/data.pkl', and 'dirname/data/<number>'
-allowed_zip_names_re = re.compile(r"^([^/]+)/((data/\d+)|version|(data\.pkl))$")
+allowed_zip_names_re = re.compile(r"^([^/]+)/((data/\d+)|byteorder|(\.data\/serialization_id)|version|(data\.pkl))$")
 data_pkl_re = re.compile(r"^([^/]+)/data\.pkl$")
 def check_zip_filenames(filename, names):