Links for fulltext (may require subscription)
- Publisher website (DOI): https://doi.org/10.1145/3580818
- Scopus: eid_2-s2.0-85150703793
Citations:
- Scopus: 0
Article: AMIR: Active Multimodal Interaction Recognition from Video and Network Traffic in Connected Environments
| Title | AMIR: Active Multimodal Interaction Recognition from Video and Network Traffic in Connected Environments |
|---|---|
| Authors | Liu, Shinan; Mangla, Tarun; Shaowang, Ted; Zhao, Jinjin; Paparrizos, John; Krishnan, Sanjay; Feamster, Nick |
| Keywords | activity recognition; datasets; multimodal learning |
| Issue Date | 2023 |
| Citation | Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2023, v. 7, n. 1, article no. 21 |
| Abstract | Activity recognition using video data is widely adopted for elder care, monitoring for safety and security, and home automation. Unfortunately, using video data as the basis for activity recognition can be brittle, since models trained on video are often not robust to certain environmental changes, such as camera angle and lighting changes. There has been a proliferation of network-connected devices in home environments. Interactions with these smart devices are associated with network activity, making network data a potential source for recognizing these device interactions. This paper advocates for the synthesis of video and network data for robust interaction recognition in connected environments. We consider machine learning-based approaches for activity recognition, where each labeled activity is associated with both a video capture and an accompanying network traffic trace. We develop a simple but effective framework, AMIR (Active Multimodal Interaction Recognition), that trains independent models for video and network activity recognition respectively, and subsequently combines the predictions from these models using a meta-learning framework. Whether in the lab or at home, this approach reduces the amount of "paired" demonstrations needed to perform accurate activity recognition, where both network and video data are collected simultaneously. Specifically, the method we have developed requires up to 70.83% fewer samples to achieve an 85% F1 score than random data collection, and improves accuracy by 17.76% given the same number of samples. |
| Persistent Identifier | http://hdl.handle.net/10722/363521 |
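
The abstract describes AMIR's core fusion step: modality-specific models are trained independently, and a meta-learner is then trained on paired samples to combine their predictions. Below is a minimal sketch of that stacking step, assuming scikit-learn-style class-probability outputs from each base model; the function and variable names (`fuse_predictions`, `video_probs`, `network_probs`) and the choice of logistic regression as the meta-learner are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of a stacked meta-learner over two modality models.
# Assumes each base model emits per-class probabilities for the same samples.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_predictions(video_probs, network_probs, labels):
    """Train a meta-learner on the concatenated class-probability
    outputs of independently trained video and network models."""
    meta_features = np.hstack([video_probs, network_probs])
    meta_model = LogisticRegression(max_iter=1000)
    meta_model.fit(meta_features, labels)
    return meta_model

# Usage on "paired" demonstrations (both modalities captured simultaneously),
# here with synthetic stand-in data for 5 activity classes:
rng = np.random.default_rng(0)
video_probs = rng.dirichlet(np.ones(5), size=100)
network_probs = rng.dirichlet(np.ones(5), size=100)
labels = rng.integers(0, 5, size=100)
meta = fuse_predictions(video_probs, network_probs, labels)
fused = meta.predict(np.hstack([video_probs, network_probs]))
```

Because only the meta-learner needs paired video-and-network samples, the base models can be trained on unpaired data from each modality, which is consistent with the abstract's claim that the approach reduces the number of paired demonstrations required.
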
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Liu, Shinan | - |
| dc.contributor.author | Mangla, Tarun | - |
| dc.contributor.author | Shaowang, Ted | - |
| dc.contributor.author | Zhao, Jinjin | - |
| dc.contributor.author | Paparrizos, John | - |
| dc.contributor.author | Krishnan, Sanjay | - |
| dc.contributor.author | Feamster, Nick | - |
| dc.date.accessioned | 2025-10-10T07:47:32Z | - |
| dc.date.available | 2025-10-10T07:47:32Z | - |
| dc.date.issued | 2023 | - |
| dc.identifier.citation | Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2023, v. 7, n. 1, article no. 21 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/363521 | - |
| dc.description.abstract | Activity recognition using video data is widely adopted for elder care, monitoring for safety and security, and home automation. Unfortunately, using video data as the basis for activity recognition can be brittle, since models trained on video are often not robust to certain environmental changes, such as camera angle and lighting changes. There has been a proliferation of network-connected devices in home environments. Interactions with these smart devices are associated with network activity, making network data a potential source for recognizing these device interactions. This paper advocates for the synthesis of video and network data for robust interaction recognition in connected environments. We consider machine learning-based approaches for activity recognition, where each labeled activity is associated with both a video capture and an accompanying network traffic trace. We develop a simple but effective framework, AMIR (Active Multimodal Interaction Recognition), that trains independent models for video and network activity recognition respectively, and subsequently combines the predictions from these models using a meta-learning framework. Whether in the lab or at home, this approach reduces the amount of "paired" demonstrations needed to perform accurate activity recognition, where both network and video data are collected simultaneously. Specifically, the method we have developed requires up to 70.83% fewer samples to achieve an 85% F1 score than random data collection, and improves accuracy by 17.76% given the same number of samples. | - |
| dc.language | eng | - |
| dc.relation.ispartof | Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies | - |
| dc.subject | activity recognition | - |
| dc.subject | datasets | - |
| dc.subject | multimodal learning | - |
| dc.title | AMIR: Active Multimodal Interaction Recognition from Video and Network Traffic in Connected Environments | - |
| dc.type | Article | - |
| dc.description.nature | link_to_subscribed_fulltext | - |
| dc.identifier.doi | 10.1145/3580818 | - |
| dc.identifier.scopus | eid_2-s2.0-85150703793 | - |
| dc.identifier.volume | 7 | - |
| dc.identifier.issue | 1 | - |
| dc.identifier.spage | article no. 21 | - |
| dc.identifier.epage | article no. 21 | - |
| dc.identifier.eissn | 2474-9567 | - |
