Programmatically accessing artifact object very slow for first call for large artifacts

Hello!

I came across this issue recently and was wondering whether anything can be done to speed up this process. We use W&B as the source of truth for versioning our datasets. Each dataset is an artifact in a specific project, and the files making up the dataset are added as references (everything is stored on S3).

We sometimes need to retrieve the path (including version) of a specific file in the artifact. This is typically very fast (<1s), but for larger artifacts (made up of >10K references) the process can slow down significantly and take up to 30 seconds. We realized that this holds whenever we access the artifact for the first time (e.g. getting its digest).

Is it expected that artifacts with a large number of files result in a long wait for the first operation when accessing them programmatically in Python?

We typically use the public API to access the artifact (see below for example) but the same happens when using a run.

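import wandb

api = wandb.Api()
# Fetch the artifact through the public API
artifact = api.artifact("my_org/my_project/my_artifact:latest")
# Look up the entry for a single file in the artifact
file_info = artifact.get_path("example_file_in_artifact")
s3_path = file_info.ref  # S3 URI of the referenced file
s3_version = file_info.extra["versionID"]  # S3 object version of that file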
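
To show where the time goes, here is a minimal timing sketch that measures each step of the lookup separately. The helper names `timed` and `profile_first_access` are hypothetical (they are not part of wandb), and the sketch assumes the same artifact and file names as the example above.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def profile_first_access(artifact_name: str, file_name: str) -> Dict[str, float]:
    """Time each step of the lookup on a fresh artifact object (hypothetical helper)."""
    import wandb  # imported lazily so `timed` can be used without wandb installed

    api = wandb.Api()
    artifact, t_fetch = timed(lambda: api.artifact(artifact_name))
    _, t_digest = timed(lambda: artifact.digest)  # first attribute access
    _, t_path = timed(lambda: artifact.get_path(file_name))
    return {"api.artifact": t_fetch, "digest": t_digest, "get_path": t_path}
```

Calling `profile_first_access("my_org/my_project/my_artifact:latest", "example_file_in_artifact")` on a small and a large artifact should make it clear which of the three steps accounts for the difference.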

Thanks!

Nicolas

Hi Nicolas, if the artifact is large, it will take longer to download on the first operation. As you already stated, specifying the path makes the download faster. You can also specify the artifact type to get only a specific type of artifact, using the following code:

runs = api.runs(…)
for run in runs:
    for artifact in run.logged_artifacts():
        if artifact.type == "model":
            artifact.download()

Hi @Leslie,

Thank you for your reply.

In my case, I am not trying to download the artifact, but merely to get the path to one of the files it includes. It is not obvious to me why the duration of this operation should be proportional to the size of the dataset.

Happy to provide more details if useful.

Thanks.

Regards,
Nicolas

Thank you for the clarification. We are currently working on optimizing our artifacts to speed up artifact.get, but yes, currently the time to get these artifacts is correlated with the size of the artifacts.
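
In the meantime, one possible workaround (a sketch, not an official recommendation) is to pay the first-access cost only once per process by reusing the same artifact object for later lookups. The helper `make_cached_loader` below is hypothetical and simply wraps any loader function with Python's `functools.lru_cache`. It assumes that an artifact object stays fast after its first access, as described earlier in this thread, and it does not make the first call itself any faster.

```python
from functools import lru_cache
from typing import Any, Callable


def make_cached_loader(load: Callable[[str], Any], maxsize: int = 32) -> Callable[[str], Any]:
    """Wrap a loader so each artifact name is fetched at most once per process."""
    return lru_cache(maxsize=maxsize)(load)


# Usage with wandb (names are illustrative; "v3" is a made-up version):
#   import wandb
#   get_artifact = make_cached_loader(wandb.Api().artifact)
#   artifact = get_artifact("my_org/my_project/my_artifact:v3")
#   file_info = artifact.get_path("example_file_in_artifact")
```

Pinning an explicit version rather than `latest` is safer with this approach, because a cached entry for a moving alias would not notice when the alias is reassigned to a newer version.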