Hugging Face’s Dataset Release Exposes 1M Bluesky Posts for Research

Key Takeaways

  • A Hugging Face librarian published a dataset of 1 million Bluesky posts, raising privacy issues.
  • The dataset, pulled from Bluesky's Firehose API, was shared for machine learning research.
  • Bluesky is exploring ways for users to signal consent preferences, but it would be up to third parties to honor them.

A Hugging Face librarian released, and later removed, a dataset of 1 million Bluesky posts, sparking concerns over data transparency and consent.

Daniel van Strien extracted the posts using the Firehose API and uploaded the dataset to a public repository for machine learning research, according to 404 Media.

The dataset, which included users’ decentralized identifiers (DIDs), featured typical Bluesky content such as political debates, quirky comments, and adult material, and likely included posts that have since been deleted.

Van Strien announced the release on Bluesky on November 26 but later removed the dataset, citing concerns about transparency and consent in data collection. He apologized, saying he had intended to support tool development but recognized that the release was a mistake.

Bluesky’s Openness Poses Risks to Data Privacy

Bluesky’s Firehose API streams every public data update, including posts, likes, and follows. Because the platform is built on the open AT Protocol, developers can use this data to create tools such as Firesky, visualizers, and bots.
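To make the idea of a public event stream concrete, here is a minimal Python sketch of how a tool might sort such events into posts, likes, and follows. It makes no network calls. The event layout (a "did" field plus a "commit" object with "operation" and "collection") is an assumption, modeled loosely on JSON renderings of Firehose events, and the sample events and "did:example:" identifiers are invented for illustration. The three collection names are the record types Bluesky uses for posts, likes, and follows.

```python
import json
from collections import Counter

# Record-type names for posts, likes, and follows.
KINDS = {
    "app.bsky.feed.post": "post",
    "app.bsky.feed.like": "like",
    "app.bsky.graph.follow": "follow",
}


def summarize(raw_events):
    """Count newly created posts, likes, and follows in a batch of JSON events.

    Assumed event layout (not an official schema):
    {"did": ..., "commit": {"operation": ..., "collection": ...}}
    """
    counts = Counter()
    authors = set()
    for raw in raw_events:
        event = json.loads(raw)
        commit = event.get("commit") or {}
        # Skip deletions, updates, and events that carry no commit at all.
        if commit.get("operation") != "create":
            continue
        kind = KINDS.get(commit.get("collection"))
        if kind is None:
            continue
        counts[kind] += 1
        authors.add(event.get("did"))
    return counts, authors


# Hypothetical sample events, invented for this sketch.
SAMPLE = [
    json.dumps({"did": "did:example:alice",
                "commit": {"operation": "create", "collection": "app.bsky.feed.post"}}),
    json.dumps({"did": "did:example:bob",
                "commit": {"operation": "create", "collection": "app.bsky.feed.like"}}),
    json.dumps({"did": "did:example:alice",
                "commit": {"operation": "create", "collection": "app.bsky.graph.follow"}}),
    json.dumps({"did": "did:example:bob",
                "commit": {"operation": "delete", "collection": "app.bsky.feed.post"}}),
    json.dumps({"did": "did:example:carol", "kind": "identity"}),
]

counts, authors = summarize(SAMPLE)
print(dict(counts), sorted(authors))
```

Every event in the sketch carries the author's identifier alongside the content, which is why a dataset built from the stream can tie posts back to individual accounts.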

Bluesky has stated it is exploring ways for users to express consent preferences externally. While the social media platform isn’t using user content for AI training, it remains up to third parties to honor those preferences.

Many people have recently moved to Bluesky to avoid having their content used for AI training and to gain more control over their data through its decentralized model. However, this openness also leaves Bluesky vulnerable, because anyone can access its public data without restriction.