
ClawHub Security Signals: Large Corpus Multi-Scanner Dataset for Agent Skill Security Research

Blog post from HuggingFace

Post Details
Company: HuggingFace
Date Published: -
Author: Vincent Koc, Patrick Erichsen, Jacob Tomlinson, Agustin Rivera, Mike Appel, and Nir Paz
Word Count: 1,400
Language: -
Hacker News Points: -
Summary

ClawHub Security Signals is a dataset of 67,453 public agent skills from the ClawHub registry, built to support research on agent supply-chain security and multi-signal triage. It combines output from three scanner families (VirusTotal, static heuristic analysis, and NVIDIA SkillSpector) to produce registry verdicts without human annotations. The scanners disagree significantly, which points to the need for ensemble approaches when assessing malware reputation, static patterns, and semantic risks in agent skills. SkillSpector has the broadest scope and often raises advisory signals about authority and data flow, while VirusTotal is strongest at detecting malicious content. The dataset is organized into four splits for training, validation, testing, and evaluation, with the data sanitized and sensitive information redacted. By making scanner disagreements available for study, it aims to support the development of safe agentic systems and to advance research areas such as multi-signal triage, prompt-injection detection, and least-privilege policy learning.
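To make the idea of scanner disagreement concrete, here is a minimal Python sketch of a triage rule over three per-scanner verdicts. Everything in it is an assumption made for illustration: the field names (`virustotal`, `static`, `skillspector`), the verdict strings, and the toy records are hypothetical and do not reflect the dataset's actual schema or how the registry verdicts are computed.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical field names, one per scanner family; not the dataset's real schema.
SCANNERS = ("virustotal", "static", "skillspector")


def triage(record: Dict[str, str]) -> str:
    """Collapse three per-scanner verdicts into a coarse triage label."""
    # Any verdict other than "clean" counts as a flag (assumed convention).
    flagged = sum(record[name] != "clean" for name in SCANNERS)
    if flagged == 0:
        return "clean"
    if flagged == len(SCANNERS):
        return "flagged_by_all"
    return "disagreement"


def summarize(records: Iterable[Dict[str, str]]) -> Counter:
    """Count how many records fall under each triage label."""
    return Counter(triage(record) for record in records)


# Toy records invented for this example.
toy = [
    {"skill": "a", "virustotal": "clean", "static": "clean", "skillspector": "clean"},
    {"skill": "b", "virustotal": "clean", "static": "clean", "skillspector": "advisory"},
    {"skill": "c", "virustotal": "malicious", "static": "suspicious", "skillspector": "advisory"},
    {"skill": "d", "virustotal": "clean", "static": "suspicious", "skillspector": "clean"},
]

counts = summarize(toy)
print(counts)
```

In this toy set, skills `b` and `d` land in the `disagreement` bucket, which is the kind of case an ensemble or a human reviewer would need to resolve. A real analysis would likely weight scanners differently instead of treating every non-clean verdict as equal.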