MoNetViT: an efficient fusion of CNN and transformer technologies for visual navigation assistance with multi query attention
Aruco markers are crucial for navigation in complex indoor environments, especially for those with visual impairments. Traditional CNNs handle image segmentation well, but transformers excel at capturing long-range dependencies, essential for machine vision tasks. Our study introduces MoNetViT (Mini...
Main Authors: | Liliek Triyono, Rahmat Gernowo, Prayitno |
---|---|
Format: | Article |
Language: | English |
Published: | Frontiers Media S.A. 2025-02-01 |
Series: | Frontiers in Computer Science |
Subjects: | indoor navigation; computer vision; markers; assistive technology; mobile devices |
Online Access: | https://www.frontiersin.org/articles/10.3389/fcomp.2025.1510252/full |
_version_ | 1823860408414371840 |
---|---|
author | Liliek Triyono; Rahmat Gernowo; Prayitno |
author_facet | Liliek Triyono; Rahmat Gernowo; Prayitno |
author_sort | Liliek Triyono |
collection | DOAJ |
description | Aruco markers are crucial for navigation in complex indoor environments, especially for those with visual impairments. Traditional CNNs handle image segmentation well, but transformers excel at capturing long-range dependencies, essential for machine vision tasks. Our study introduces MoNetViT (Mini-MobileNet MobileViT), a lightweight model combining CNNs and MobileViT in a dual-path encoder to optimize global and spatial image details. This design reduces complexity and boosts segmentation performance. The addition of a multi-query attention (MQA) module enhances multi-scale feature integration, allowing end-to-end learning guided by ground truth. Experiments show MoNetViT outperforms other semantic segmentation algorithms in efficiency and effectiveness, particularly in detecting Aruco markers, making it a promising tool to improve navigation aids for the visually impaired. |
format | Article |
id | doaj-art-29131a43a9a44b1182b3dee1f504667b |
institution | Kabale University |
issn | 2624-9898 |
language | English |
publishDate | 2025-02-01 |
publisher | Frontiers Media S.A. |
record_format | Article |
series | Frontiers in Computer Science |
spelling | doaj-art-29131a43a9a44b1182b3dee1f504667b; 2025-02-10T14:10:02Z; eng; Frontiers Media S.A.; Frontiers in Computer Science; 2624-9898; 2025-02-01; 7; 10.3389/fcomp.2025.1510252; 1510252; MoNetViT: an efficient fusion of CNN and transformer technologies for visual navigation assistance with multi query attention; Liliek Triyono [0]; Liliek Triyono [1]; Rahmat Gernowo [2]; Prayitno [3]; [0] Doctoral Program of Information System, Diponegoro University, Central Java, Semarang, Indonesia; [1] Department of Electrical Engineering, Politeknik Negeri Semarang, Semarang, Indonesia; [2] Doctoral Program of Information System, Diponegoro University, Central Java, Semarang, Indonesia; [3] Department of Electrical Engineering, Politeknik Negeri Semarang, Semarang, Indonesia; Aruco markers are crucial for navigation in complex indoor environments, especially for those with visual impairments. Traditional CNNs handle image segmentation well, but transformers excel at capturing long-range dependencies, essential for machine vision tasks. Our study introduces MoNetViT (Mini-MobileNet MobileViT), a lightweight model combining CNNs and MobileViT in a dual-path encoder to optimize global and spatial image details. This design reduces complexity and boosts segmentation performance. The addition of a multi-query attention (MQA) module enhances multi-scale feature integration, allowing end-to-end learning guided by ground truth. Experiments show MoNetViT outperforms other semantic segmentation algorithms in efficiency and effectiveness, particularly in detecting Aruco markers, making it a promising tool to improve navigation aids for the visually impaired.; https://www.frontiersin.org/articles/10.3389/fcomp.2025.1510252/full; indoor navigation; computer vision; markers; assistive technology; mobile devices |
spellingShingle | Liliek Triyono; Rahmat Gernowo; Prayitno; MoNetViT: an efficient fusion of CNN and transformer technologies for visual navigation assistance with multi query attention; Frontiers in Computer Science; indoor navigation; computer vision; markers; assistive technology; mobile devices |
title | MoNetViT: an efficient fusion of CNN and transformer technologies for visual navigation assistance with multi query attention |
title_full | MoNetViT: an efficient fusion of CNN and transformer technologies for visual navigation assistance with multi query attention |
title_fullStr | MoNetViT: an efficient fusion of CNN and transformer technologies for visual navigation assistance with multi query attention |
title_full_unstemmed | MoNetViT: an efficient fusion of CNN and transformer technologies for visual navigation assistance with multi query attention |
title_short | MoNetViT: an efficient fusion of CNN and transformer technologies for visual navigation assistance with multi query attention |
title_sort | monetvit an efficient fusion of cnn and transformer technologies for visual navigation assistance with multi query attention |
topic | indoor navigation; computer vision; markers; assistive technology; mobile devices |
url | https://www.frontiersin.org/articles/10.3389/fcomp.2025.1510252/full |
work_keys_str_mv | AT liliektriyono monetvitanefficientfusionofcnnandtransformertechnologiesforvisualnavigationassistancewithmultiqueryattention AT liliektriyono monetvitanefficientfusionofcnnandtransformertechnologiesforvisualnavigationassistancewithmultiqueryattention AT rahmatgernowo monetvitanefficientfusionofcnnandtransformertechnologiesforvisualnavigationassistancewithmultiqueryattention AT prayitno monetvitanefficientfusionofcnnandtransformertechnologiesforvisualnavigationassistancewithmultiqueryattention |
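
The description in this record names a multi-query attention (MQA) module but does not say how it is built. As a rough illustration only, the sketch below shows the general technique usually called multi-query attention in the transformer literature: several query heads share a single key projection and a single value projection, which reduces parameters and memory compared with full multi-head attention. The function name `multi_query_attention`, the weight names (`w_q`, `w_k`, `w_v`, `w_o`), and all shapes are assumptions made for this example; the module in the MoNetViT article may differ, including in how it integrates multi-scale features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_query_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Generic multi-query attention over a sequence of tokens (hypothetical helper).

    x   : (n, d_model)                  input tokens, e.g. flattened image patches
    w_q : (d_model, num_heads * d_head) one query projection per head
    w_k : (d_model, d_head)             single key projection shared by all heads
    w_v : (d_model, d_head)             single value projection shared by all heads
    w_o : (num_heads * d_head, d_model) output projection
    """
    n = x.shape[0]
    d_head = w_k.shape[1]

    # Per-head queries: (num_heads, n, d_head)
    q = (x @ w_q).reshape(n, num_heads, d_head).transpose(1, 0, 2)
    # Shared keys and values: (n, d_head)
    k = x @ w_k
    v = x @ w_v

    # Scaled dot-product scores; k.T broadcasts across heads: (num_heads, n, n)
    scores = q @ k.T / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    # Weighted sum of the shared values per head: (num_heads, n, d_head)
    heads = weights @ v
    # Concatenate heads and project back to the model dimension: (n, d_model)
    merged = heads.transpose(1, 0, 2).reshape(n, num_heads * d_head)
    return merged @ w_o, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_tokens, d_model, d_head, num_heads = 6, 8, 4, 4
    x = rng.normal(size=(n_tokens, d_model))
    w_q = rng.normal(size=(d_model, num_heads * d_head))
    w_k = rng.normal(size=(d_model, d_head))
    w_v = rng.normal(size=(d_model, d_head))
    w_o = rng.normal(size=(num_heads * d_head, d_model))
    out, attn = multi_query_attention(x, w_q, w_k, w_v, w_o, num_heads)
    print(out.shape, attn.shape)
```

In this sketch only the query projection grows with the number of heads, so the key and value projections stay at `d_model * d_head` parameters each regardless of `num_heads`. That is one plausible reason such a module suits a lightweight model, though the record itself does not state the authors' rationale.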