Finding Needles in a Haystack: PII Detection at Scale with Unstructured, Box, and Elasticsearch
Blog post from Unstructured
Unstructured can identify sensitive information in documents stored in Box by combining its parsing and enrichment workflow with Elasticsearch for search and filtering. Setup requires two connectors: a Box source connector, which securely accesses the documents to be processed, and an Elasticsearch destination connector, which receives the processed data for querying. In Unstructured's interactive workflow builder, users can customize transformations such as image description enrichment and Named Entity Recognition (NER), which is used here to detect personally identifiable information (PII). Once the workflow is configured and run, the results are stored in Elasticsearch, where users can query for sensitive data such as Social Security numbers or credit card information. The platform supports experimentation and prompt tuning, so users can parse, enrich, and search a variety of document types in support of compliance and data protection.
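To make the final querying step more concrete, here is a minimal sketch of a pattern-based pass over search hits that could complement the NER results. It is not Unstructured's method: the `flag_hits`, `find_pii`, and `luhn_valid` helpers are hypothetical, the regular expressions are deliberately simple, and the assumption that each Elasticsearch hit carries its content in `_source["text"]` should be checked against your own index mapping.

```python
from __future__ import annotations

import re

# Illustrative patterns only; real SSN and card formats vary more than this.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched_text) pairs for SSN-like and card-like strings."""
    findings = [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    for m in CARD_RE.finditer(text):
        # The checksum filters out most digit runs that are not card numbers.
        if luhn_valid(m.group()):
            findings.append(("credit_card", m.group()))
    return findings


def flag_hits(hits: list[dict]) -> dict[str, list[tuple[str, str]]]:
    """Map hit IDs to findings, assuming content lives in _source['text']."""
    flagged = {}
    for hit in hits:
        findings = find_pii(hit.get("_source", {}).get("text", ""))
        if findings:
            flagged[hit["_id"]] = findings
    return flagged


# Hand-written sample hits shaped like an Elasticsearch search response.
sample_hits = [
    {"_id": "a", "_source": {"text": "SSN 123-45-6789 on file"}},
    {"_id": "b", "_source": {"text": "Card 4111 1111 1111 1111"}},
    {"_id": "c", "_source": {"text": "Order 1234 5678 9012 3456 shipped"}},
]
flagged = flag_hits(sample_hits)
```

In this sketch, hit `c` is not flagged because its 16-digit string fails the Luhn checksum. A check like this is a cheap second pass on top of NER output, not a substitute for it: pattern matching misses unformatted or partially redacted values and can still flag strings that only look like identifiers.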