Programmatically accessing artifact object very slow for first call for large artifacts

Hello!

I came across this issue recently, and I was wondering whether anything can be done to speed up this process. We are using W&B as the source of truth for versioning our datasets. Each dataset is an artifact in a specific project, and the files making up the dataset are added as references (everything is stored on S3).

We sometimes need to retrieve the path (including version) of a specific file in the artifact. This is typically very fast (<1s), but for larger artifacts (made up of >10K references) the process can slow down significantly and take up to 30 seconds. We realized that this holds whenever we access the artifact for the first time (e.g. getting its digest).

Is it expected that artifacts with a large number of files result in a long wait for the first operation when accessing them programmatically in Python?

We typically use the public API to access the artifact (see below for an example), but the same happens when using a run.

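import wandb

# Public API client; no active run is needed.
api = wandb.Api()
artifact = api.artifact("my_org/my_project/my_artifact:latest")
# Look up a single reference entry by its name within the artifact.
file_info = artifact.get_path("example_file_in_artifact")
# S3 URI and S3 object version recorded for that reference.
s3_path = file_info.ref
s3_version = file_info.extra["versionID"]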

Thanks!

Nicolas

Hi Nicolas, if the artifact is large, the first operation will take longer because the download takes longer. As you already noted, specifying the path makes the download faster. You can also filter on the artifact type to fetch only a specific kind of artifact, using the following code:

runs = api.runs(…)
for run in runs:
    for artifact in run.logged_artifacts():
        if artifact.type == "model":
            artifact.download()

Hi @Leslie,

Thank you for your reply.

In my case, I am not trying to download the artifact, but merely to get the path to one of the files it includes. It is not obvious to me why the duration of this operation should be proportional to the size of the dataset.

Happy to provide more details if useful.

Thanks.

Regards,
Nicolas

Thank you for the clarification. We are currently working on optimizing our artifacts to speed up artifact.get, but yes, for now the time it takes to get an artifact is correlated with its size.
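Since the delay is described as a first-access cost, a process that looks up several files could at least avoid paying it more than once by reusing the same artifact object. The following is only a minimal sketch under that assumption; `get_artifact` and `lookup_reference` are hypothetical helper names, not part of the wandb API, and the thread does not confirm that this removes the delay in every case.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_artifact(name):
    """Fetch the artifact once per process and return the same object afterwards."""
    import wandb  # deferred so this module can be imported without wandb installed

    return wandb.Api().artifact(name)


def lookup_reference(name, file_name, fetch=get_artifact):
    """Return (S3 path, S3 version id) for one file of a reference artifact.

    `fetch` is a hypothetical hook that makes the helper usable with a stub.
    """
    entry = fetch(name).get_path(file_name)
    return entry.ref, entry.extra["versionID"]
```

One caveat with this design: caching on a name such as "my_org/my_project/my_artifact:latest" keeps whichever version was resolved first for the lifetime of the process, so long-running jobs may prefer to cache on an explicit version alias.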


Hi again Nicolas, when our engineers tried to reproduce this with two artifacts containing S3 references (one with 10K and one with 100K), the code you gave took 0.5 seconds and 2 seconds, respectively. Could you give us a more detailed script of what you are doing to get this lag?
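A more detailed script for this kind of report could time each step separately, so it is clear whether the lag comes from creating the client, fetching the artifact, or the first `get_path` call. The sketch below is one possible way to do that; `timed`, `timings`, and `profile_artifact_lookup` are hypothetical names introduced here, and the artifact and file names would be the placeholders from the original post.

```python
import time
from contextlib import contextmanager

timings = {}  # label -> elapsed wall-clock seconds


@contextmanager
def timed(label):
    """Record the wall-clock time spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def profile_artifact_lookup(artifact_name, file_name):
    """Time each step of the lookup from the original post separately."""
    import wandb  # deferred so the timing helper works without wandb installed

    with timed("create Api"):
        api = wandb.Api()
    with timed("api.artifact"):
        artifact = api.artifact(artifact_name)
    with timed("first get_path"):
        artifact.get_path(file_name)
    with timed("second get_path"):
        artifact.get_path(file_name)
    return timings
```

Calling `profile_artifact_lookup("my_org/my_project/my_artifact:latest", "example_file_in_artifact")` and printing the returned dictionary would show which step dominates, and comparing the first and second `get_path` timings would show whether the cost is paid only once per artifact object.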

Hi Nicolas,

Is it possible for you to give us a more detailed script so we can fix this issue?

Warmly,
Leslie

Hi Nicolas,

Since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!

Warmly,
Leslie