DatasetInfo seems to be missing when I pull my dataset from HFHub

I have a number of datasets, which I create from a dictionary like so:

    info = DatasetInfo(
            description="my happy lil dataset",
            version="0.0.1",
            homepage="https://www.myhomepage.co.uk"
        )
    train_dataset = Dataset.from_dict(prepare_data(data["train"]), info=info)
    test_dataset = Dataset.from_dict(prepare_data(data["test"]), info=info)
    validation_dataset = Dataset.from_dict(prepare_data(data["validation"]), info=info)

I then combine these into a DatasetDict.

    # Create a DatasetDict
    dataset = DatasetDict(
        {"train": train_dataset, "test": test_dataset, "validation": validation_dataset}
    )

So far, so good. If I access dataset['train'].info.description I see the expected result of "my happy lil dataset".

So I push to the hub, like so:

    dataset.push_to_hub(f"{organization}/{repo_name}", commit_message="Some commit message")

And this succeeds too.

However, when I pull the dataset back down from the hub and access the information associated with it, like so:

    pulled_data = load_dataset(f"{organization}/{repo_name}", use_auth_token=True)

    # I expect the following to print out "my happy lil dataset"
    print(pulled_data["train"].info.description)

    # However, instead it prints ''

Am I loading my data in from the hub incorrectly? Am I pushing only my dataset and not the info somehow?
I feel like I’m missing something obvious, but I’m really not sure. Any help would be appreciated.

Did you ever happen to figure this out? I’m seeing exactly the same issue.

It seems that ds.info.description is populated only when the dataset is built by a loading script. Since support for loading scripts (and with it trust_remote_code) was removed from the datasets library in version 4.0.0, an empty ds.info.description is the expected result for most datasets.
If you need information about a dataset, refer to the dataset card in its repository or use the Dataset Viewer API.
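For the Dataset Viewer route, here is a minimal sketch using only the standard library. It assumes the public endpoint https://datasets-server.huggingface.co/info and a public repo; the helper names info_url and fetch_info are my own, not part of any library. A private repo would presumably also need an Authorization header, which is not shown here.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

# Assumed base URL of the public Dataset Viewer API.
VIEWER_BASE = "https://datasets-server.huggingface.co"


def info_url(repo_id: str, config: Optional[str] = None) -> str:
    """Build the /info URL for a dataset repo (hypothetical helper)."""
    params = {"dataset": repo_id}
    if config is not None:
        params["config"] = config
    # urlencode percent-encodes the "/" in "org/name" inside the query string.
    return f"{VIEWER_BASE}/info?{urllib.parse.urlencode(params)}"


def fetch_info(repo_id: str, config: Optional[str] = None, timeout: float = 30.0) -> dict:
    """Fetch and decode the JSON returned by the /info endpoint (needs network access)."""
    with urllib.request.urlopen(info_url(repo_id, config), timeout=timeout) as resp:
        return json.load(resp)
```

Calling fetch_info("databricks/databricks-dolly-15k") should return a JSON document describing the dataset; I'd inspect its keys rather than rely on a fixed layout.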

    # pip install -U "datasets<4.0.0" "huggingface_hub[hf_xet]"
    from datasets import load_dataset_builder, load_dataset
    import datasets
    from huggingface_hub import RepoCard
    import textwrap
    print("datasets version:", datasets.__version__)

    # 1) Scripted dataset: builder.info.description comes from its loading script
    # trust_remote_code is no longer supported in >= 4.0.0: https://github.com/huggingface/datasets/releases/tag/4.0.0
    b = load_dataset_builder("livecodebench/code_generation_lite", trust_remote_code=True)
    print("[scripted] description:", textwrap.shorten(b.info.description or "", 180))

    # 2) File-based repo: read metadata from the dataset card; ds.info.description is usually empty
    repo = "databricks/databricks-dolly-15k"
    card = RepoCard.load(repo, repo_type="dataset")
    print("[file-based] card.license:", card.data.to_dict().get("license"))
    print("[file-based] card.desc:", textwrap.shorten(card.text or "", 180))
    ds = load_dataset(repo, split="train")
    print("[file-based] ds.info.description:", repr(ds.info.description))

    # Output:
    # datasets version: 3.2.0
    # [scripted] description: LiveCodeBench is a temporaly updating benchmark for code generation. Please check the homepage: https://livecodebench.github.io/.
    # [file-based] card.license: cc-by-sa-3.0
    # [file-based] card.desc: # Summary `databricks-dolly-15k` is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral [...]
    # [file-based] ds.info.description: ''
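
For the original question (keeping a description attached to a dataset pushed with push_to_hub), one option that follows from the above is to put the description in the dataset card text instead of DatasetInfo. The sketch below only covers the text manipulation and uses the standard library; the function name add_description and the default heading are my own choices, not a library convention.

```python
def add_description(card_text: str, description: str,
                    heading: str = "## Dataset Description") -> str:
    """Return card_text with a description section appended (hypothetical helper).

    Existing text is kept as-is; only trailing newlines are normalized.
    """
    section = f"{heading}\n\n{description.strip()}\n"
    body = card_text.rstrip("\n")
    if not body:
        # Empty card: the new section is the whole text.
        return section
    return f"{body}\n\n{section}"


existing = "# My dataset\n"
updated = add_description(existing, "my happy lil dataset")
print(updated)
```

To apply it to a real repo, I'd expect something along the lines of loading the card with huggingface_hub's DatasetCard.load(repo_id), setting card.text = add_description(card.text, ...), and calling card.push_to_hub(repo_id) after dataset.push_to_hub has run, but I have not tested that flow against this dataset.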