About #今日arXiv精选

This is a column under 「AI 学术前沿」: every day, the editors select high-quality papers from arXiv and deliver them to readers.

Scalable pragmatic communication via self-supervision

Comment: Workshop on Self-Supervised Learning @ ICML 2021

Link: http://arxiv.org/abs/2108.05799

Abstract

Models of context-sensitive communication often use the Rational Speech Act framework (RSA; Frank & Goodman, 2012), which formulates listeners and speakers in a cooperative reasoning process. However, the standard RSA formulation can only be applied to small domains, and large-scale applications have relied on imitating human behavior. Here, we propose a new approach to scalable pragmatics, building upon recent theoretical results (Zaslavsky et al., 2020) that characterize pragmatic reasoning in terms of general information-theoretic principles. Specifically, we propose an architecture and learning process in which agents acquire pragmatic policies via self-supervision instead of imitating human data. This work suggests a new principled approach for equipping artificial agents with pragmatic skills via self-supervision, which is grounded both in pragmatic theory and in information theory.
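
The RSA recursion the abstract contrasts with is compact enough to sketch directly. Below is a minimal NumPy illustration of the standard Frank & Goodman (2012) formulation (literal listener, pragmatic speaker, pragmatic listener); the lexicon, prior, and rationality parameter `alpha` are illustrative toy values, and the paper's self-supervised variant replaces this explicit recursion with learned policies.

```python
import numpy as np

def rsa(lexicon, prior, alpha=1.0, depth=1):
    """One or more rounds of Rational Speech Act reasoning.

    lexicon: binary truth-value matrix, shape (num_utterances, num_worlds)
    prior:   prior over worlds, shape (num_worlds,)
    """
    # Literal listener L0: P(w | u) proportional to [[u]](w) * P(w)
    listener = lexicon * prior
    listener = listener / listener.sum(axis=1, keepdims=True)
    for _ in range(depth):
        # Pragmatic speaker S1: P(u | w) proportional to L0(w | u)^alpha
        speaker = np.exp(alpha * np.log(listener + 1e-12))
        speaker = speaker / speaker.sum(axis=0, keepdims=True)
        # Pragmatic listener L1: P(w | u) proportional to S1(u | w) * P(w)
        listener = speaker * prior
        listener = listener / listener.sum(axis=1, keepdims=True)
    return listener

# Classic scalar-implicature toy example: "some" vs. "all".
lexicon = np.array([[1.0, 1.0],   # "some": true in both worlds
                    [0.0, 1.0]])  # "all":  true only in world 2
prior = np.array([0.5, 0.5])
# Hearing "some", the pragmatic listener infers "probably not all".
print(rsa(lexicon, prior, alpha=2.0))
```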

HopfE: Knowledge Graph Representation Learning using Inverse Hopf Fibrations

Comment: CIKM 2021: 30th ACM International Conference on Information and Knowledge Management (full paper)

Link: http://arxiv.org/abs/2108.05774

Abstract

Recently, several Knowledge Graph Embedding (KGE) approaches have been devised to represent entities and relations in dense vector space and employed in downstream tasks such as link prediction. A few KGE techniques address interpretability, i.e., mapping the connectivity patterns of the relations (i.e., symmetric/asymmetric, inverse, and composition) to a geometric interpretation such as rotations. Other approaches model the representations in higher dimensional space such as four-dimensional space (4D) to enhance the ability to infer the connectivity patterns (i.e., expressiveness). However, modeling relation and entity in a 4D space often comes at the cost of interpretability. This paper proposes HopfE, a novel KGE approach aiming to achieve the interpretability of inferred relations in the four-dimensional space. We first model the structural embeddings in 3D Euclidean space and view the relation operator as an SO(3) rotation. Next, we map the entity embedding vector from a 3D space to a 4D hypersphere using the inverse Hopf Fibration, in which we embed the semantic information from the KG ontology. Thus, HopfE considers the structural and semantic properties of the entities without losing expressivity and interpretability. Our empirical results on four well-known benchmarks achieve state-of-the-art performance for the KG completion task.
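
For readers unfamiliar with the construction, the standard Hopf map and its circle fibers are summarized below; this is textbook differential geometry, not the paper's exact parameterization of the inverse fibration.

```latex
% Standard Hopf map \pi : S^3 \subset \mathbb{C}^2 \to S^2 and its circle
% fibers (textbook form; HopfE's exact parameterization may differ).
\[
  \pi(z_1, z_2) = \bigl( 2\,\mathrm{Re}(z_1\bar{z}_2),\;
                         2\,\mathrm{Im}(z_1\bar{z}_2),\;
                         |z_1|^2 - |z_2|^2 \bigr),
  \qquad |z_1|^2 + |z_2|^2 = 1.
\]
% Since \pi(e^{i\theta}z_1, e^{i\theta}z_2) = \pi(z_1, z_2), the preimage of
% every point of S^2 is a circle in S^3. An inverse Hopf map therefore lifts
% a 3D structural embedding together with one extra phase-like degree of
% freedom \theta (the slot HopfE uses for semantic information) to a point
% on the 4D hypersphere S^3.
```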

AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing

Comment: Preprint under review

Link: http://arxiv.org/abs/2108.05542

Abstract

Transformer-based pretrained language models (T-PTLMs) have achieved great success in almost every NLP task. The evolution of these models started with GPT and BERT. These models are built on top of transformers, self-supervised learning and transfer learning. Transformer-based PTLMs learn universal language representations from large volumes of text data using self-supervised learning and transfer this knowledge to downstream tasks. These models provide good background knowledge to downstream tasks, which avoids training downstream models from scratch. In this comprehensive survey paper, we initially give a brief overview of self-supervised learning. Next, we explain various core concepts like pretraining, pretraining methods, pretraining tasks, embeddings and downstream adaptation methods. Next, we present a new taxonomy of T-PTLMs and then give a brief overview of various benchmarks, both intrinsic and extrinsic. We present a summary of various useful libraries for working with T-PTLMs. Finally, we highlight some of the future research directions which will further improve these models. We strongly believe that this comprehensive survey paper will serve as a good reference to learn the core concepts as well as to stay updated with recent developments in T-PTLMs.

MicroNet: Improving Image Recognition with Extremely Low FLOPs

Comment: ICCV 2021

Code: https://github.com/liyunsheng13/micronet

Link: http://arxiv.org/abs/2108.05894

Abstract

This paper aims at addressing the problem of substantial performance degradation at extremely low computational cost (e.g. 5M FLOPs on ImageNet classification). We found that two factors, sparse connectivity and dynamic activation function, are effective in improving the accuracy. The former avoids the significant reduction of network width, while the latter mitigates the detriment of reduction in network depth. Technically, we propose micro-factorized convolution, which factorizes a convolution matrix into low-rank matrices, to integrate sparse connectivity into convolution. We also present a new dynamic activation function, named Dynamic Shift Max, to improve the non-linearity via maxing out multiple dynamic fusions between an input feature map and its circular channel shift. Building upon these two new operators, we arrive at a family of networks, named MicroNet, that achieves significant performance gains over the state of the art in the low FLOP regime. For instance, under the constraint of 12M FLOPs, MicroNet achieves 59.4% top-1 accuracy on ImageNet classification, outperforming MobileNetV3 by 9.6%. Source code is at https://github.com/liyunsheng13/micronet.
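
A minimal PyTorch sketch of the Dynamic Shift Max idea follows: max out several input-dependent fusions of a feature map with circular shifts of its own channels. The grouping, shift pattern, and squeeze-and-excitation-style coefficient branch here are simplifications of our own; the official operator in the linked repository is the reference.

```python
import torch
import torch.nn as nn

class DynamicShiftMaxSketch(nn.Module):
    """Simplified sketch of Dynamic Shift Max: take the max over several
    input-dependent fusions of a feature map with circular shifts of its own
    channels. The grouping and coefficient branch of the official MicroNet
    operator are more elaborate; see the linked repository."""

    def __init__(self, channels, num_shifts=2, reduction=4):
        super().__init__()
        self.num_shifts = num_shifts
        # SE-style branch predicting one coefficient per channel and shift.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels * num_shifts),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        # Dynamic coefficients a_j(x), shape (b, num_shifts, c, 1, 1).
        a = self.fc(self.pool(x).flatten(1)).view(b, self.num_shifts, c, 1, 1)
        # Fuse x with circular channel shifts of itself, then max out.
        fused = [a[:, j] * torch.roll(x, shifts=j * (c // self.num_shifts), dims=1)
                 for j in range(self.num_shifts)]
        return torch.max(torch.stack(fused, dim=0), dim=0).values
```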

PixelSynth: Generating a 3D-Consistent Experience from a Single Image

Comment: In ICCV 2021

Link: http://arxiv.org/abs/2108.05892

Abstract

Recent advancements in differentiable rendering and 3D reasoning have driven exciting results in novel view synthesis from a single image. Despite realistic results, methods are limited to relatively small view change. In order to synthesize immersive scenes, models must also be able to extrapolate. We present an approach that fuses 3D reasoning with autoregressive modeling to outpaint large view changes in a 3D-consistent manner, enabling scene synthesis. We demonstrate considerable improvement in single image large-angle view synthesis results compared to a variety of methods and possible variants across simulated and real datasets. In addition, we show increased 3D consistency compared to alternative accumulation methods. Project website: https://crockwell.github.io/pixelsynth/

Towards Interpretable Deep Metric Learning with Structural Matching

Comment: Accepted to ICCV 2021

Link: http://arxiv.org/abs/2108.05889

Abstract

How do neural networks distinguish two images? It is of critical importance to understand the matching mechanism of deep models for developing reliable intelligent systems for many risky visual applications such as surveillance and access control. However, most existing deep metric learning methods match images by comparing feature vectors, which ignores the spatial structure of images and thus lacks interpretability. In this paper, we present a deep interpretable metric learning (DIML) method for more transparent embedding learning. Unlike conventional metric learning methods based on feature vector comparison, we propose a structural matching strategy that explicitly aligns the spatial embeddings by computing an optimal matching flow between the feature maps of the two images. Our method enables deep models to learn metrics in a more human-friendly way, where the similarity of two images can be decomposed into several part-wise similarities and their contributions to the overall similarity. Our method is model-agnostic and can be applied to off-the-shelf backbone networks and metric learning methods. We evaluate our method on three major benchmarks of deep metric learning, including CUB200-2011, Cars196, and Stanford Online Products, and achieve substantial improvements over popular metric learning methods with better interpretability. Code is available at https://github.com/wl-zhao/DIML
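
An "optimal matching flow" of this kind can be sketched with entropy-regularized optimal transport between the two sets of spatial embeddings. The PyTorch snippet below is a hedged approximation under assumed uniform marginals and plain Sinkhorn iterations, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.05, iters=50):
    """Entropy-regularized optimal transport with uniform marginals."""
    n, m = cost.shape
    K = torch.exp(-cost / eps)
    r = torch.full((n,), 1.0 / n)
    c = torch.full((m,), 1.0 / m)
    v = torch.ones(m)
    for _ in range(iters):
        u = r / (K @ v)
        v = c / (K.t() @ u)
    return u[:, None] * K * v[None, :]   # the matching flow

def structural_similarity(feat_a, feat_b):
    """feat_*: (C, H, W) feature maps. Returns the flow-weighted sum of
    part-wise cosine similarities plus the flow itself, whose entries give
    the interpretable part-to-part correspondences."""
    a = F.normalize(feat_a.flatten(1).t(), dim=1)   # (HW, C) spatial embeddings
    b = F.normalize(feat_b.flatten(1).t(), dim=1)
    sim = a @ b.t()                                 # pairwise cosine similarity
    flow = sinkhorn(1.0 - sim)                      # optimal matching flow
    return (flow * sim).sum(), flow
```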

Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)

Comment: ACM MM 2021

Link: http://arxiv.org/abs/2108.05888

Abstract

Multiview detection incorporates multiple camera views to deal with occlusions, and its central problem is multiview aggregation. Given feature map projections from multiple views onto a common ground plane, the state-of-the-art method addresses this problem via convolution, which applies the same calculation regardless of object locations. However, such translation-invariant behaviors might not be the best choice, as object features undergo various projection distortions according to their positions and cameras. In this paper, we propose a novel multiview detector, MVDeTr, that adopts a newly introduced shadow transformer to aggregate multiview information. Unlike convolutions, shadow transformer attends differently at different positions and cameras to deal with various shadow-like distortions. We propose an effective training scheme that includes a new view-coherent data augmentation method, which applies random augmentations while maintaining multiview consistency. On two multiview detection benchmarks, we report new state-of-the-art accuracy with the proposed system. Code is available at https://github.com/hou-yz/MVDeTr.

Unconditional Scene Graph Generation

Comment: accepted for publication at ICCV 2021

Link: http://arxiv.org/abs/2108.05884

Abstract

Despite recent advancements in single-domain or single-object image generation, it is still challenging to generate complex scenes containing diverse, multiple objects and their interactions. Scene graphs, composed of nodes as objects and directed-edges as relationships among objects, offer an alternative representation of a scene that is more semantically grounded than images. We hypothesize that a generative model for scene graphs might be able to learn the underlying semantic structure of real-world scenes more effectively than images, and hence, generate realistic novel scenes in the form of scene graphs. In this work, we explore a new task for the unconditional generation of semantic scene graphs. We develop a deep auto-regressive model called SceneGraphGen which can directly learn the probability distribution over labelled and directed graphs using a hierarchical recurrent architecture. The model takes a seed object as input and generates a scene graph in a sequence of steps, each step generating an object node, followed by a sequence of relationship edges connecting to the previous nodes. We show that the scene graphs generated by SceneGraphGen are diverse and follow the semantic patterns of real-world scenes. Additionally, we demonstrate the application of the generated graphs in image synthesis, anomaly detection and scene graph completion.

Improving Ranking Correlation of Supernet with Candidates Enhancement and Progressive Training

Comment: 5 pages, 2 figures. CVPR 2021 NAS challenge

Link: http://arxiv.org/abs/2108.05866

Abstract

One-shot neural architecture search (NAS) applies weight-sharing supernet to reduce the unaffordable computation overhead of automated architecture designing. However, the weight-sharing technique worsens the ranking consistency of performance due to the interferences between different candidate networks. To address this issue, we propose a candidates enhancement method and progressive training pipeline to improve the ranking correlation of supernet. Specifically, we carefully redesign the sub-networks in the supernet and map the original supernet to a new one of high capacity. In addition, we gradually add narrow branches of supernet to reduce the degree of weight sharing, which effectively alleviates the mutual interference between sub-networks. Finally, our method ranks the 1st place in the Supernet Track of the CVPR 2021 1st Lightweight NAS Challenge.

Towers of Babel: Combining Images, Language, and 3D Geometry for Learning Multimodal Vision

Comment: Published in ICCV 2021

Project: https://www.cs.cornell.edu/projects/babel

Link: http://arxiv.org/abs/2108.05863

Abstract

The abundance and richness of Internet photos of landmarks and cities has led to significant progress in 3D vision over the past two decades, including automated 3D reconstructions of the world's landmarks from tourist photos. However, a major source of information available for these 3D-augmented collections---namely language, e.g., from image captions---has been virtually untapped. In this work, we present WikiScenes, a new, large-scale dataset of landmark photo collections that contains descriptive text in the form of captions and hierarchical category names. WikiScenes forms a new testbed for multimodal reasoning involving images, text, and 3D geometry. We demonstrate the utility of WikiScenes for learning semantic concepts over images and 3D models. Our weakly-supervised framework connects images, 3D structure, and semantics---utilizing the strong constraints provided by 3D geometry---to associate semantic concepts to image pixels and 3D points.

m-RevNet: Deep Reversible Neural Networks with Momentum

Comment: Accepted to ICCV 2021

Link: http://arxiv.org/abs/2108.05862

Abstract

In recent years, the connections between deep residual networks and first-order Ordinary Differential Equations (ODEs) have been disclosed. In this work, we further bridge deep neural architecture design with second-order ODEs and propose a novel reversible neural network, termed m-RevNet, that is characterized by inserting a momentum update into residual blocks. The reversible property allows us to perform the backward pass without access to activation values of the forward pass, greatly relieving the storage burden during training. Furthermore, the theoretical foundation based on second-order ODEs endows m-RevNet with stronger representational power than vanilla residual networks, which potentially explains its performance gains. For certain learning scenarios, we analytically and empirically reveal that our m-RevNet succeeds while standard ResNet fails. Comprehensive experiments on various image classification and semantic segmentation benchmarks demonstrate the superiority of our m-RevNet over ResNet, concerning both memory efficiency and recognition performance.
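
A momentum residual update of this kind can be written in a few lines. The sketch below follows the generic momentum (second-order) formulation and shows why it is invertible; the exact m-RevNet block may differ in detail.

```python
import torch
import torch.nn as nn

class MomentumResidualBlockSketch(nn.Module):
    """Momentum residual update and its algebraic inverse. Because the block
    is exactly invertible, forward activations need not be stored for the
    backward pass. `f` is any residual branch (e.g., a conv-BN-ReLU stack)."""

    def __init__(self, f, momentum=0.9):
        super().__init__()
        self.f = f
        self.mu = momentum

    def forward(self, x, v):
        v_next = self.mu * v + self.f(x)   # velocity (momentum) update
        x_next = x + v_next                # position update
        return x_next, v_next

    @torch.no_grad()
    def inverse(self, x_next, v_next):
        # Exactly reconstruct the block's inputs from its outputs.
        x = x_next - v_next
        v = (v_next - self.f(x)) / self.mu
        return x, v
```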

Continual Neural Mapping: Learning An Implicit Scene Representation from Sequential Observations

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.05851

Abstract

Recent advances have enabled a single neural network to serve as an implicit scene representation, establishing the mapping function between spatial coordinates and scene properties. In this paper, we make a further step towards continual learning of the implicit scene representation directly from sequential observations, namely Continual Neural Mapping. The proposed problem setting bridges the gap between batch-trained implicit neural representations and commonly used streaming data in robotics and vision communities. We introduce an experience replay approach to tackle an exemplary task of continual neural mapping: approximating a continuous signed distance function (SDF) from sequential depth images as a scene geometry representation. We show for the first time that a single network can represent scene geometry over time continually without catastrophic forgetting, while achieving promising trade-offs between accuracy and efficiency.
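
The replay recipe is simple to sketch: train an SDF MLP on samples from the current depth frame while mixing in a buffer of past samples. The buffer capacity, 50/50 mixing ratio, and reservoir-style insertion below are illustrative choices of ours, not the paper's.

```python
import random
import torch
import torch.nn as nn

# An MLP approximates the SDF; a bounded buffer of past (point, sdf) samples
# is replayed alongside every new depth frame to avoid catastrophic forgetting.
sdf_net = nn.Sequential(nn.Linear(3, 256), nn.ReLU(),
                        nn.Linear(256, 256), nn.ReLU(),
                        nn.Linear(256, 1))
opt = torch.optim.Adam(sdf_net.parameters(), lr=1e-4)
replay_buffer, capacity = [], 100_000

def train_on_frame(points, sdf_targets, steps=100, batch=1024):
    """points: (N, 3) samples around the current frame's surface;
    sdf_targets: (N,) signed distances derived from the depth image."""
    for _ in range(steps):
        idx = torch.randint(len(points), (batch // 2,))
        xs, ys = points[idx], sdf_targets[idx]
        if replay_buffer:  # mix current-frame samples with replayed ones
            old = random.sample(replay_buffer, min(batch // 2, len(replay_buffer)))
            xs = torch.cat([xs, torch.stack([p for p, _ in old])])
            ys = torch.cat([ys, torch.stack([s for _, s in old])])
        loss = (sdf_net(xs).squeeze(-1) - ys).abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Keep a random subset of this frame for future replay (reservoir-style).
    for p, s in zip(points, sdf_targets):
        if len(replay_buffer) < capacity:
            replay_buffer.append((p, s))
        elif random.random() < 0.01:
            replay_buffer[random.randrange(capacity)] = (p, s)
```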

DiagViB-6: A Diagnostic Benchmark Suite for Vision Models in the Presence of Shortcut and Generalization Opportunities

Comment: Accepted for publication at IEEE International Conference on Computer Vision (ICCV) 2021

Link: http://arxiv.org/abs/2108.05779

Abstract

Common deep neural networks (DNNs) for image classification have been shown to rely on shortcut opportunities (SO) in the form of predictive and easy-to-represent visual factors. This is known as shortcut learning and leads to impaired generalization. In this work, we show that common DNNs also suffer from shortcut learning when predicting only basic visual object factors of variation (FoV) such as shape, color, or texture. We argue that besides shortcut opportunities, generalization opportunities (GO) are also an inherent part of real-world vision data and arise from partial independence between predicted classes and FoVs. We also argue that it is necessary for DNNs to exploit GO to overcome shortcut learning. Our core contribution is to introduce the Diagnostic Vision Benchmark suite DiagViB-6, which includes datasets and metrics to study a network's shortcut vulnerability and generalization capability for six independent FoV. In particular, DiagViB-6 allows controlling the type and degree of SO and GO in a dataset. We benchmark a wide range of popular vision architectures and show that they can exploit GO only to a limited extent.

Correlate-and-Excite: Real-Time Stereo Matching via Guided Cost Volume Excitation

Comment: To appear at IROS 2021.

Code: https://github.com/antabangun/coex

Link: http://arxiv.org/abs/2108.05773

Abstract

The volumetric deep learning approach to stereo matching aggregates a cost volume computed from the input left and right images using 3D convolutions. Recent works showed that utilizing extracted image features and a spatially varying cost volume aggregation complements 3D convolutions. However, existing methods with spatially varying operations are complex, cost considerable computation time, and cause memory consumption to increase. In this work, we construct Guided Cost volume Excitation (GCE) and show that simple channel excitation of the cost volume guided by image features can improve performance considerably. Moreover, we propose a novel method of using top-k selection prior to soft-argmin disparity regression for computing the final disparity estimate. Combining our novel contributions, we present an end-to-end network that we call Correlate-and-Excite (CoEx). Extensive experiments on the SceneFlow, KITTI 2012, and KITTI 2015 datasets demonstrate the effectiveness and efficiency of our model and show that it outperforms other speed-based algorithms while also being competitive with other state-of-the-art algorithms. Code will be made available at https://github.com/antabangun/coex.
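
The top-k trick is easy to state precisely: restrict the softmax-weighted disparity average to the k best-matching disparities per pixel instead of a full soft-argmin over all candidates. A minimal PyTorch sketch, assuming a (B, D, H, W) cost volume where larger values mean better matches:

```python
import torch

def topk_soft_argmin(cost_volume, k=2):
    """cost_volume: (B, D, H, W), larger = better match. Restrict the
    softmax-weighted disparity average to the k best candidates per pixel."""
    topk_vals, topk_idx = cost_volume.topk(k, dim=1)     # (B, k, H, W)
    weights = torch.softmax(topk_vals, dim=1)            # renormalize over k
    return (weights * topk_idx.float()).sum(dim=1)       # (B, H, W) disparity
```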

MT-ORL: Multi-Task Occlusion Relationship Learning

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.05722

Abstract

Retrieving occlusion relations among objects in a single image is challenging due to the sparsity of boundaries in the image. We observe two key issues in existing works: first, the lack of an architecture which can exploit the limited amount of coupling in the decoder stage between the two subtasks, namely occlusion boundary extraction and occlusion orientation prediction; and second, the improper representation of occlusion orientation. In this paper, we propose a novel architecture called Occlusion-shared and Path-separated Network (OPNet), which solves the first issue by exploiting rich occlusion cues in shared high-level features and structured spatial information in task-specific low-level features. We then design a simple but effective orthogonal occlusion representation (OOR) to tackle the second issue. Our method surpasses the state-of-the-art methods by 6.1%/8.3% Boundary-AP and 6.5%/10% Orientation-AP on the standard PIOD/BSDS ownership datasets. Code is available at https://github.com/fengpanhe/MT-ORL.

Semantic Concentration for Domain Adaptation

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.05720

Abstract

Domain adaptation (DA) addresses label annotation and dataset bias issues through knowledge transfer from a label-rich source domain to a related but unlabeled target domain. A mainstream of DA methods is to align the feature distributions of the two domains. However, the majority of them focus on entire image features, where irrelevant semantic information, e.g., a messy background, is inevitably embedded. Enforcing feature alignment in such cases will negatively influence the correct matching of objects and consequently lead to semantically negative transfer due to the confusion of irrelevant semantics. To tackle this issue, we propose Semantic Concentration for Domain Adaptation (SCDA), which encourages the model to concentrate on the most principal features via pair-wise adversarial alignment of prediction distributions. Specifically, we train the classifier to class-wisely maximize the prediction distribution divergence of each sample pair, which enables the model to find the regions with large differences among samples of the same class. Meanwhile, the feature extractor attempts to minimize that discrepancy, which suppresses the features of dissimilar regions among samples of the same class and accentuates the features of principal parts. As a general method, SCDA can be easily integrated into various DA methods as a regularizer to further boost their performance. Extensive experiments on cross-domain benchmarks show the efficacy of SCDA.
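
The pair-wise adversarial alignment can be sketched as a min-max game over a divergence between the prediction distributions of same-class pairs. The symmetric KL below is one plausible choice of divergence, not necessarily the paper's exact measure.

```python
import torch.nn.functional as F

def pairwise_divergence(logits_i, logits_j):
    """Symmetric KL between the prediction distributions of a sample pair."""
    log_p = F.log_softmax(logits_i, dim=-1)
    log_q = F.log_softmax(logits_j, dim=-1)
    return 0.5 * (F.kl_div(log_q, log_p.exp(), reduction='batchmean')
                  + F.kl_div(log_p, log_q.exp(), reduction='batchmean'))

# Adversarial schedule (sketch): the classifier ascends this divergence on
# same-class pairs, exposing regions where they differ; the feature extractor
# descends it, e.g., through a gradient reversal layer.
```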

Oriented R-CNN for Object Detection

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.05699

Abstract

Current state-of-the-art two-stage detectors generate oriented proposals through time-consuming schemes. This diminishes the detectors' speed, thereby becoming the computational bottleneck in advanced oriented object detection systems. This work proposes an effective and simple oriented object detection framework, termed Oriented R-CNN, which is a general two-stage oriented detector with promising accuracy and efficiency. To be specific, in the first stage, we propose an oriented Region Proposal Network (oriented RPN) that directly generates high-quality oriented proposals in a nearly cost-free manner. The second stage is an oriented R-CNN head for refining oriented Regions of Interest (oriented RoIs) and recognizing them. Without tricks, Oriented R-CNN with ResNet50 achieves state-of-the-art detection accuracy on two commonly-used datasets for oriented object detection, DOTA (75.87% mAP) and HRSC2016 (96.50% mAP), while running at 15.1 FPS with an image size of 1024×1024 on a single RTX 2080Ti. We hope our work can inspire rethinking the design of oriented detectors and serve as a baseline for oriented object detection. Code is available at https://github.com/jbwang1997/OBBDetection.

Memory-based Semantic Segmentation for Off-road Unstructured Natural Environments

Comment: 8 pages, 10 figures, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021. (Accepted)

Link: http://arxiv.org/abs/2108.05635

Abstract

With the availability of many datasets tailored for autonomous driving in real-world urban scenes, semantic segmentation for urban driving scenes achieves significant progress. However, semantic segmentation for off-road, unstructured environments is not widely studied. Directly applying existing segmentation networks often results in performance degradation as they cannot overcome intrinsic problems in such environments, such as illumination changes. In this paper, a built-in memory module for semantic segmentation is proposed to overcome these problems. The memory module stores significant representations of training images as memory items. In addition to the encoder embedding like items together, the proposed memory module is specifically designed to cluster together instances of the same class even when there are significant variances in embedded features. Therefore, it makes segmentation networks better deal with unexpected illumination changes. A triplet loss is used in training to minimize redundancy in storing discriminative representations of the memory module. The proposed memory module is general so that it can be adopted in a variety of networks. We conduct experiments on the Robot Unstructured Ground Driving (RUGD) dataset and RELLIS dataset, which are collected from off-road, unstructured natural environments. Experimental results show that the proposed memory module improves the performance of existing segmentation networks and contributes to capturing unclear objects over various off-road, unstructured natural scenes with equivalent computational cost and network parameters. As the proposed method can be integrated into compact networks, it presents a viable approach for resource-limited small autonomous platforms.

Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.05617

Abstract

Open-set semi-supervised learning (open-set SSL) investigates a challenging but practical scenario where out-of-distribution (OOD) samples are contained in the unlabeled data. While the mainstream technique seeks to completely filter out the OOD samples for semi-supervised learning (SSL), we propose a novel training mechanism that could effectively exploit the presence of OOD data for enhanced feature learning while avoiding its adverse impact on the SSL. We achieve this goal by first introducing a warm-up training that leverages all the unlabeled data, including both the in-distribution (ID) and OOD samples. Specifically, we perform a pretext task that enforces our feature extractor to obtain a high-level semantic understanding of the training images, leading to more discriminative features that can benefit the downstream tasks. Since the OOD samples are inevitably detrimental to SSL, we propose a novel cross-modal matching strategy to detect OOD samples. Instead of directly applying binary classification, we train the network to predict whether the data sample is matched to an assigned one-hot class label. The appeal of the proposed cross-modal matching over binary classification is the ability to generate a compatible feature space that aligns with the core classification task. Extensive experiments show that our approach substantially lifts the performance on open-set SSL and outperforms the state-of-the-art by a large margin.
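
The cross-modal matching head can be sketched as a scorer over (image feature, candidate one-hot label) pairs; the fusion architecture and dimensions below are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class CrossModalMatchingSketch(nn.Module):
    """Scores whether an image feature matches an assigned one-hot label."""

    def __init__(self, feat_dim, num_classes, hidden=256):
        super().__init__()
        self.label_embed = nn.Linear(num_classes, hidden)
        self.head = nn.Sequential(
            nn.Linear(feat_dim + hidden, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat, one_hot_label):
        z = torch.cat([feat, self.label_embed(one_hot_label)], dim=1)
        return self.head(z).squeeze(-1)   # high score = feature/label match

# Training pairs: (feature, true label) -> 1, (feature, wrong label) -> 0.
# At test time, an unlabeled sample whose best score over all class labels
# stays low is flagged as OOD.
```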

perf4sight: A toolflow to model CNN training performance on Edge GPUs

Comment: Accepted into the Workshop on Embedded and Real-World Computer Vision in Autonomous Driving (ERCVAD), ICCV 2021

Link: http://arxiv.org/abs/2108.05580

Abstract

The increased memory and processing capabilities of today's edge devices create opportunities for greater edge intelligence. In the domain of vision, the ability to adapt a Convolutional Neural Network's (CNN) structure and parameters to the input data distribution leads to systems with lower memory footprint, latency and power consumption. However, due to the limited compute resources and memory budget on edge devices, it is necessary for the system to be able to predict the latency and memory footprint of the training process in order to identify favourable training configurations of the network topology and device combination for efficient network adaptation. This work proposes perf4sight, an automated methodology for developing accurate models that predict CNN training memory footprint and latency given a target device and network. This enables rapid identification of network topologies that can be retrained on the edge device with low resource consumption. With PyTorch as the framework and NVIDIA Jetson TX2 as the target device, the developed models predict training memory footprint and latency with 95% and 91% accuracy respectively for a wide range of networks, opening the path towards efficient network adaptation on edge GPUs.

iButter: Neural Interactive Bullet Time Generator for Human Free-viewpoint Rendering

Comment: Accepted by ACM MM 2021

Link: http://arxiv.org/abs/2108.05577

Abstract

Generating "bullet-time" effects of human free-viewpoint videos is critical for immersive visual effects and VR/AR experience. Recent neural advances still lack the controllable and interactive bullet-time design ability for human free-viewpoint rendering, especially under the real-time, dynamic and general setting for our trajectory-aware task. To fill this gap, in this paper we propose a neural interactive bullet-time generator (iButter) for photo-realistic human free-viewpoint rendering from dense RGB streams, which enables flexible and interactive design for human bullet-time visual effects. Our iButter approach consists of a real-time preview and design stage as well as a trajectory-aware refinement stage. During preview, we propose an interactive bullet-time design approach by extending the NeRF rendering to a real-time and dynamic setting and getting rid of the tedious per-scene training. To this end, our bullet-time design stage utilizes a hybrid training set, light-weight network design and an efficient silhouette-based sampling strategy. During refinement, we introduce an efficient trajectory-aware scheme within 20 minutes, which jointly encodes the spatial, temporal consistency and semantic cues along the designed trajectory, achieving a photo-realistic bullet-time viewing experience of human activities. Extensive experiments demonstrate the effectiveness of our approach for convenient interactive bullet-time design and photo-realistic human free-viewpoint video generation.

LabOR: Labeling Only if Required for Domain Adaptive Semantic Segmentation

Comment: Accepted to ICCV 2021 (Oral)

Link: http://arxiv.org/abs/2108.05570

Abstract

Unsupervised Domain Adaptation (UDA) for semantic segmentation has been actively studied to mitigate the domain gap between label-rich source data and unlabeled target data. Despite these efforts, UDA still has a long way to go to reach fully supervised performance. To this end, we propose a Labeling Only if Required strategy, LabOR, where we introduce a human-in-the-loop approach to adaptively give scarce labels to points that a UDA model is uncertain about. In order to find the uncertain points, we generate an inconsistency mask using the proposed adaptive pixel selector, and we label these segment-based regions to achieve near supervised performance with only a small fraction (about 2.2%) of ground truth points, which we call "Segment based Pixel-Labeling (SPL)". To further reduce the efforts of the human annotator, we also propose "Point-based Pixel-Labeling (PPL)", which finds the most representative points for labeling within the generated inconsistency mask. This reduces the labeling effort from the 2.2% of segment-based labels to 40 labeled points while minimizing performance degradation. Through extensive experimentation, we show the advantages of this new framework for domain adaptive semantic segmentation while minimizing human labor costs.
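
One plausible minimal reading of the inconsistency mask is pixel-level disagreement between two prediction heads; the sketch below uses a plain argmax comparison, whereas the paper's adaptive pixel selector is more involved.

```python
import torch

def inconsistency_mask(logits_a, logits_b):
    """logits_*: (B, num_classes, H, W) from two segmentation heads. Pixels
    where the heads disagree become candidates for human labeling."""
    return logits_a.argmax(dim=1) != logits_b.argmax(dim=1)   # (B, H, W) bool
```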

Vision-Language Transformer and Query Generation for Referring Segmentation

Comment: ICCV 2021

Link: http://arxiv.org/abs/2108.05565

Abstract

In this work, we address the challenging task of referring segmentation. The query expression in referring segmentation typically indicates the target object by describing its relationship with others. Therefore, to find the target one among all instances in the image, the model must have a holistic understanding of the whole image. To achieve this, we reformulate referring segmentation as a direct attention problem: finding the region in the image where the query language expression is most attended to. We introduce transformer and multi-head attention to build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression. Furthermore, we propose a Query Generation Module, which produces multiple sets of queries with different attention weights that represent the diversified comprehensions of the language expression from different aspects. At the same time, to find the best way from these diversified comprehensions based on visual clues, we further propose a Query Balance Module to adaptively select the output features of these queries for a better mask generation. Without bells and whistles, our approach is light-weight and achieves new state-of-the-art performance consistently on three referring segmentation datasets, RefCOCO, RefCOCO+, and G-Ref. Our code is available at https://github.com/henghuiding/Vision-Language-Transformer.

HandFoldingNet: A 3D Hand Pose Estimation Network Using Multiscale-Feature Guided Folding of a 2D Hand Skeleton

Comment: Accepted as a conference paper at International Conference on Computer Vision (ICCV) 2021

Link: http://arxiv.org/abs/2108.05545

Abstract

With increasing applications of 3D hand pose estimation in various human-computer interaction scenarios, convolutional neural network (CNN)-based estimation models have been actively explored. However, the existing models require complex architectures or redundant computational resources to achieve acceptable accuracy. To tackle this limitation, this paper proposes HandFoldingNet, an accurate and efficient hand pose estimator that regresses the hand joint locations from a normalized 3D hand point cloud input. The proposed model utilizes a folding-based decoder that folds a given 2D hand skeleton into the corresponding joint coordinates. For higher estimation accuracy, folding is guided by multi-scale features, which include both global and joint-wise local features. Experimental results show that the proposed model outperforms the existing methods on three hand pose benchmark datasets with the lowest model parameter requirement. Code is available at https://github.com/cwc1260/HandFold.

Distilling Holistic Knowledge with Graph Neural Networks

Comment: Accepted by ICCV 2021

Link: http://arxiv.org/abs/2108.05507

Abstract

Knowledge Distillation (KD) aims at transferring knowledge from a larger well-optimized teacher network to a smaller learnable student network. Existing KD methods have mainly considered two types of knowledge, namely the individual knowledge and the relational knowledge. However, these two types of knowledge are usually modeled independently while the inherent correlations between them are largely ignored. It is critical for sufficient student network learning to integrate both individual knowledge and relational knowledge while reserving their inherent correlation. In this paper, we propose to distill the novel holistic knowledge based on an attributed graph constructed among instances. The holistic knowledge is represented as a unified graph-based embedding by aggregating individual knowledge from relational neighborhood samples with graph neural networks; the student network is learned by distilling the holistic knowledge in a contrastive manner. Extensive experiments and ablation studies are conducted on benchmark datasets; the results demonstrate the effectiveness of the proposed method. The code has been published at https://github.com/wyc-ruiker/HKD

Page-level Optimization of e-Commerce Item Recommendations

Comment: Accepted by RecSys 2021

Link: http://arxiv.org/abs/2108.05891

Abstract

The item details page (IDP) is a web page on an e-commerce website that provides information on a specific product or item listing. Just below the details of the item on this page, the buyer can usually find recommendations for other relevant items. These are typically in the form of a series of modules or carousels, with each module containing a set of recommended items. The selection and ordering of these item recommendation modules are intended to increase discoverability of relevant items and encourage greater user engagement, while simultaneously showcasing diversity of inventory and satisfying other business objectives. Item recommendation modules on the IDP are often curated and statically configured for all customers, ignoring opportunities for personalization. In this paper, we present a scalable end-to-end production system to optimize the personalized selection and ordering of item recommendation modules on the IDP in real-time by utilizing deep neural networks. Through extensive offline experimentation and online A/B testing, we show that our proposed system achieves significantly higher click-through and conversion rates compared to other existing methods. In our online A/B test, our framework improved click-through rate by 2.48% and purchase-through rate by 7.34% over a static configuration.

How Nonconformity Functions and Difficulty of Datasets Impact the Efficiency of Conformal Classifiers

Comment: Workshop on Distribution-Free Uncertainty Quantification at ICML 2021

Link: http://arxiv.org/abs/2108.05677

Abstract

The property of conformal predictors to guarantee the required accuracy rate makes this framework attractive in various practical applications. However, this property is achieved at a price of reduction in precision. In the case of conformal classification, the systems can output multiple class labels instead of one. It is also known from the literature that the choice of nonconformity function has a major impact on the efficiency of conformal classifiers. Recently, it was shown that different model-agnostic nonconformity functions result in conformal classifiers with different characteristics. For a Neural Network-based conformal classifier, the inverse probability (or hinge loss) allows minimizing the average number of predicted labels, and margin results in a larger fraction of singleton predictions. In this work, we aim to further extend this study. We perform an experimental evaluation using 8 different classification algorithms and discuss when the previously observed relationship holds or not. Additionally, we propose a successful method to combine the properties of these two nonconformity functions. The experimental evaluation is done using 11 real and 5 synthetic datasets.
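
The two nonconformity functions under comparison have standard one-line definitions. A minimal NumPy sketch, with `probs` an (n, num_classes) array of predicted probabilities and `y` the integer labels:

```python
import numpy as np

def inverse_probability(probs, y):
    """Inverse-probability (hinge) nonconformity: 1 - p_hat(y | x)."""
    return 1.0 - probs[np.arange(len(y)), y]

def margin(probs, y):
    """Margin nonconformity: best competing score minus the true-class score."""
    p_true = probs[np.arange(len(y)), y]
    rivals = probs.copy()
    rivals[np.arange(len(y)), y] = -np.inf
    return rivals.max(axis=1) - p_true

# Calibration scores from either function define the conformal p-values that
# determine how many labels end up in each prediction set.
```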

Conditional Sequential Slate Optimization

Comment: 8 pages, 4 figures, SIGIR eCom'21

Link: http://arxiv.org/abs/2108.05618

Abstract

The top search results matching a user query that are displayed on the first page are critical to the effectiveness and perception of a search system. A search ranking system typically orders the results by independent query-document scores to produce a slate of search results. However, such unilateral scoring methods may fail to capture inter-document dependencies that users are sensitive to, thus producing a sub-optimal slate. Further, in practice, many real-world applications such as e-commerce search require enforcing certain distributional criteria at the slate level, due to business objectives or long-term user retention goals. Unilateral scoring of results does not explicitly support optimizing for such objectives with respect to a slate. Hence, solutions to the slate optimization problem must consider the optimal selection and order of the documents, along with adherence to slate-level distributional criteria. To that end, we propose a hybrid framework extended from traditional slate optimization to solve the conditional slate optimization problem. We introduce conditional sequential slate optimization (CSSO), which jointly learns to optimize for traditional ranking metrics as well as prescribed distribution criteria of documents within the slate. The proposed method can be applied to practical real-world problems such as enforcing diversity in e-commerce search results, mitigating bias in top results and personalization of results. Experiments on public datasets and real-world data from e-commerce datasets show that CSSO outperforms popular comparable ranking methods in terms of adherence to distributional criteria while producing comparable or better relevance metrics.

Fog Simulation on Real LiDAR Point Clouds for 3D Object Detection in Adverse Weather

Comment: Accepted at IEEE International Conference on Computer Vision (ICCV) 2021

Link: http://arxiv.org/abs/2108.05249

Abstract

This work addresses the challenging task of LiDAR-based 3D object detection in foggy weather. Collecting and annotating data in such a scenario is very time, labor and cost intensive. In this paper, we tackle this problem by simulating physically accurate fog into clear-weather scenes, so that the abundant existing real datasets captured in clear weather can be repurposed for our task. Our contributions are twofold: 1) We develop a physically valid fog simulation method that is applicable to any LiDAR dataset. This unleashes the acquisition of large-scale foggy training data at no extra cost. These partially synthetic data can be used to improve the robustness of several perception methods, such as 3D object detection and tracking or simultaneous localization and mapping, on real foggy data. 2) Through extensive experiments with several state-of-the-art detection approaches, we show that our fog simulation can be leveraged to significantly improve the performance for 3D object detection in the presence of fog. Thus, we are the first to provide strong 3D object detection baselines on the Seeing Through Fog dataset. Our code is available at www.trace.ethz.ch/lidar_fog_simulation.
