Programmatically accessing an artifact object is very slow on the first call for large artifacts


I came across this issue recently, and I was wondering whether anything can be done to speed up this process. We are using W&B as the source of truth for versioning our datasets. Each dataset is an artifact in a specific project, and the files making up the dataset are added as references (everything is stored on S3).

We sometimes need to retrieve the path (including the version) to a specific file in the artifact. This is typically very fast (<1 s), but for larger artifacts (made up of >10K references) the process can slow down significantly and take up to 30 seconds. We realized that this happens whenever we access the artifact for the first time (e.g. getting its digest).

Is it expected that artifacts with a large number of files result in a long wait for the first operation when accessed programmatically in Python?

We typically use the public API to access the artifact (see below for example) but the same happens when using a run.

import wandb

api = wandb.Api()
artifact = api.artifact("my_org/my_project/my_artifact:latest")
file_info = artifact.get_path("example_file_in_artifact")  # first access is the slow step
s3_path = file_info.ref
s3_version = file_info.extra["versionID"]
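To pin down where the time goes, a small timing harness around those calls may help; a sketch, assuming the public API from the snippet above (benchmark_artifact_access and its arguments are hypothetical names, not part of wandb):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def benchmark_artifact_access(artifact_path, file_name):
    """Time the initial artifact fetch, the first metadata access,
    and a follow-up lookup. Requires wandb and access to the artifact;
    artifact_path and file_name stand in for your own values."""
    import wandb  # imported here so the timing helper works without wandb
    api = wandb.Api()
    artifact, t_fetch = timed(api.artifact, artifact_path)
    _, t_first = timed(lambda: artifact.digest)        # first access triggers the slow step
    _, t_second = timed(artifact.get_path, file_name)  # should be fast once loaded
    return t_fetch, t_first, t_second
```

If the second lookup is fast while the first access is slow, that would confirm the cost is paid once per artifact object rather than per lookup.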



Hi Nicolas, if the artifact is large, the first operation will take longer because the artifact has to be fetched. As you already stated, indicating the path makes it faster, and you can also filter by artifact type in order to get a specific kind of artifact from a run, for example:

runs = api.runs(...)
for run in runs:
    for artifact in run.logged_artifacts():
        if artifact.type == "model":
            ...  # handle the model artifact

Hi @Leslie,

Thank you for your reply.

In my case, I am not trying to download the artifact, but merely to get the path to one of the files it includes. It is not obvious to me why the duration of this operation should be proportional to the size of the dataset.

Happy to provide more details if useful.



Thank you for the clarification. We are currently working on optimizing our artifacts to speed up artifact.get, but yes, currently the time to get these artifacts is correlated with their size.
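Until that optimization lands, one workaround is to pay the slow first access once per process and reuse the result. A minimal sketch, assuming the lookup from the original snippet (resolve_reference is a hypothetical helper, not part of the wandb API):

```python
import functools

@functools.lru_cache(maxsize=None)
def resolve_reference(artifact_path, file_name):
    """Resolve a file's S3 ref and versionID, memoized per process.

    The slow first access happens only once per (artifact_path, file_name)
    pair; repeated lookups are served from the cache.
    """
    import wandb  # deferred import so the module loads without wandb installed
    artifact = wandb.Api().artifact(artifact_path)
    entry = artifact.get_path(file_name)
    return entry.ref, entry.extra["versionID"]
```

Since, per the behavior described above, only the first access to an artifact object is slow, reusing a single artifact object across many get_path calls should achieve a similar effect.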


Hi again Nicolas, when our engineers tried to reproduce this with two artifacts containing S3 references (one with 10K references, one with 100K), the first access took 0.5 seconds and 2 seconds, respectively, using the code you provided. Is it possible for you to give us a more detailed script of what you are doing so we can see where the lag comes from?

Hi Nicolas,

Is it possible for you to give us a more detailed script so we can fix this issue?


Hi Nicolas,

Since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!