MediaPipe KNIFT: Template-based feature matching
Blog post from the Google Developers Blog
MediaPipe's KNIFT (Keypoint Neural Invariant Feature Transform) is a newly released local feature descriptor designed to improve feature matching in computer vision tasks such as image retrieval and template matching. Unlike hand-engineered descriptors such as SIFT or ORB, which rely on handcrafted heuristics, KNIFT takes a data-driven approach, learning its embeddings from a large corpus of real-world video frames. This lets it handle complex spatial transformations and lighting changes more robustly.

Training uses the well-established triplet loss, with triplets extracted from the YouTube UGC Dataset and refined through hard-negative triplet mining. The network itself is a lightweight Inception-style architecture that produces a 40-dimensional feature descriptor.

In benchmarks, KNIFT consistently outperforms ORB on keypoint matching across a range of object types and imaging conditions. To ease adoption, KNIFT ships as part of MediaPipe's template matching solution, which can identify and localize image templates in real time; a demo matches U.S. dollar bills in video frames, showcasing KNIFT's efficiency and precision. MediaPipe also provides tools for users to create custom templates, further extending its use in practical applications.
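To make the training objective concrete, here is a minimal NumPy sketch of the hinge-style triplet loss and of hard-negative selection, the two ingredients mentioned above. The function names, the margin value, and the use of plain Euclidean distance are illustrative assumptions, not the actual KNIFT training code:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pull the anchor toward the positive
    embedding and push it away from the negative by at least `margin`.
    (margin=0.2 is an illustrative choice, not KNIFT's actual value.)"""
    d_pos = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(0.0, d_pos - d_neg + margin)

def hardest_negative(anchor, negatives):
    """Hard-negative mining: among candidate negatives, pick the one
    closest to the anchor, i.e. the most confusable example."""
    dists = np.linalg.norm(negatives - anchor, axis=1)
    return negatives[np.argmin(dists)]
```

The loss is zero whenever the negative is already farther from the anchor than the positive by the margin, so training effort concentrates on the hard, confusable triplets that mining surfaces.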
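At inference time, template matching boils down to comparing descriptors between a template image and incoming frames. As a rough illustration of that step, here is a minimal brute-force matcher with Lowe's ratio test over 40-dimensional vectors standing in for KNIFT embeddings; the function name and parameters are illustrative, not MediaPipe's API:

```python
import numpy as np

def match_descriptors(query, template, ratio=0.8):
    """Brute-force nearest-neighbour matching with a ratio test:
    accept a match only when the best distance is clearly smaller
    than the second-best, filtering out ambiguous keypoints."""
    matches = []
    for i, q in enumerate(query):
        dists = np.linalg.norm(template - q, axis=1)  # distance to every template descriptor
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))  # (query index, template index)
    return matches
```

In a full pipeline, the surviving matches would feed a geometric verification step (e.g. RANSAC homography estimation) to localize the template in the frame.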