research-article
Free access
Just Accepted
Audio-visual Saliency Prediction Model with Implicit Neural Representation
Authors: Nana Zhang, Min Xiong, Dandan Zhu, Kun Zhu, Guangtao Zhai, Xiaokang Yang
ACM Transactions on Multimedia Computing, Communications and Applications
Accepted: 20 September 2024
Online AM: 11 October 2024
Abstract
With the remarkable advancement of deep learning techniques and the wide availability of large-scale datasets, the performance of audio-visual saliency prediction has improved drastically. Nevertheless, audio-visual saliency prediction remains at an early stage of exploration, owing to the spatial-temporal complexity and dynamic continuity of video content. To our knowledge, most existing audio-visual saliency prediction approaches represent videos as discrete 3D grids of RGB values processed by convolutional neural networks (CNNs), which leaves them agnostic to video content and neglects its dynamic continuity. This paper proposes a novel parametric audio-visual saliency (PAVS) model built on implicit neural representation (INR) to address these problems. Specifically, the proposed parametric neural network encodes the space-time coordinates of video frames into corresponding saliency values, which significantly enhances the compactness of the feature representation. Meanwhile, a parametric feature fusion method is developed to capture the intrinsic interactions between the audio and visual streams, adaptively fusing audio and visual features to obtain competitive performance. Notably, without resorting to any specialized audio-visual feature fusion strategy, the proposed PAVS model outperforms other state-of-the-art saliency methods by a large margin.
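The full text is not reproduced here, but the abstract's two key ideas — an INR that maps space-time coordinates to saliency values, and a learned, adaptive fusion of audio and visual features — can be illustrated with a short sketch. The PyTorch code below is a hypothetical rendering under assumed design choices (a SIREN-style sine-activated MLP and a sigmoid gating fusion; the names CoordinateSaliencyINR and GatedAVFusion are invented for illustration); it is not the authors' PAVS implementation.

```python
# Minimal sketch of a coordinate-based (INR-style) saliency predictor.
# All architecture choices and names are illustrative assumptions,
# not the authors' actual PAVS model.
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    """Linear layer followed by a sine activation (SIREN-style)."""
    def __init__(self, in_features, out_features, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        return torch.sin(self.omega_0 * self.linear(x))

class CoordinateSaliencyINR(nn.Module):
    """Maps normalized space-time coordinates (x, y, t) to a saliency value,
    conditioned on a fused audio-visual feature vector."""
    def __init__(self, cond_dim=128, hidden=256, depth=4):
        super().__init__()
        layers = [SineLayer(3 + cond_dim, hidden)]
        layers += [SineLayer(hidden, hidden) for _ in range(depth - 1)]
        self.body = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, coords, cond):
        # coords: (N, 3) in [-1, 1]; cond: (N, cond_dim) per-coordinate features
        return self.head(self.body(torch.cat([coords, cond], dim=-1)))

class GatedAVFusion(nn.Module):
    """A simple learned gate that adaptively mixes audio and visual features."""
    def __init__(self, dim=128):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis, aud):
        g = self.gate(torch.cat([vis, aud], dim=-1))
        return g * vis + (1.0 - g) * aud

# Usage: predict saliency at 4096 random space-time locations.
vis = torch.randn(4096, 128)            # visual features sampled at the coords
aud = torch.randn(4096, 128)            # audio features aligned to the coords
coords = torch.rand(4096, 3) * 2 - 1    # (x, y, t) normalized to [-1, 1]
model = CoordinateSaliencyINR()
sal = model(coords, GatedAVFusion()(vis, aud))   # (4096, 1) saliency values
```

Because such a network is queried at continuous (x, y, t) coordinates rather than fixed grid cells, saliency can be evaluated at arbitrary temporal positions, which is one way an INR can capture the dynamic continuity that discrete 3D-grid CNN representations miss.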
Index Terms
Audio-visual Saliency Prediction Model with Implicit Neural Representation
Computing methodologies
Artificial intelligence
Computer vision
Computer vision problems
Interest point and salient region detections
Information
Published In
ACM Transactions on Multimedia Computing, Communications, and Applications (Just Accepted)
EISSN: 1551-6865
Copyright © 2024 held by the owner/author(s).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
Online AM: 11 October 2024
Accepted: 20 September 2024
Revised: 03 June 2024
Received: 29 September 2023
Author Tags
- Implicit neural representation
- audio-visual saliency prediction
- parameterized feature fusion method
- generative model
Qualifiers
- Research-article