图像分类，物体检测，语义分割，实例分割等概念

图像分类 Image Classification

The task of object classification requires binary labels indicating whether objects are present in an image.[1] 图像分类，该任务需要我们对出现在某幅图像中的物体做标注。比如一共有1000个物体类，对一幅图中所有物体来说，某个物体要么有，要么没有。可实现：输入一幅测试图片，输出该图片中物体类别的候选集。

物体检测 Object detection

Detecting an object entails both stating that an object belonging to a specified class is present, and localizing it in the image. The location of an object is typically represented by a bounding box. 物体检测，包含两个问题，一是判断属于某个特定类的物体是否出现在图中；二是对该物体定位，定位常用表征就是物体的边界框。可实现：输入测试图片，输出检测到的物体类别和位置。

语义分割 Semantic scene labeling

The task of labeling semantic objects in a scene requires that each pixel of an image be labeled as belonging to a category, such as sky, chair, floor, street, etc. In contrast to the detection task, individual instances of objects do not need to be segmented. 语义标注/分割：该任务需要将图中每一点像素标注为某个物体类别。同一物体的不同实例不需要单独分割出来。对下图，标注为人，羊，狗，草地。而不需要羊1，羊2，羊3，羊4，羊5.

实例分割 Instance segment

实例分割是物体检测+语义分割的综合体。相对物体检测的边界框，实例分割可精确到物体的边缘；相对语义分割，实例分割可以标注出图上同一物体的不同个体（羊1，羊2，羊3…）

4种任务的数据集标注示例如图示。可以看到，标注越来越复杂，但是处理效果越来越有用。

Object Segmentation

one of the reasons that this has fallen out of favor in the research community is because it is problematically vague. Object segmentation used to simply mean finding a single or small number of objects in an image and draw a boundary around them, and for most purposes you can still assume it means this. However, it also began to be used to mean segmentation of blobs that might be objects, segmentation of objects from the background (more commonly now called background subtraction or background segmentation or foreground detection), and even in some cases used interchangeably with object recognition using bounding boxes (this quickly stopped with the advent of deep neural network approaches to object recognition, but beforehand object recognition could also mean simply labeling an entire image with the object in it).

What makes “segmentation” “semantic”?

Simpy, each segment, or in the case of deep methods each pixel, is given a class label based on a category. Segmentation in general is just the division of the image by some rule. Meanshift segmentation, for example, from a very high level divide the data according to the changes in the energy of the image. Graph cut based segmentation is similarly not learned but directly derived from the properties of each image separate from the rest. More recent (neural network based) methods use pixels that are labeled to learn to identify the local features which are associated with specific classes, and then classify each pixel based on which class has the highest confidence for that pixel. In this way, “pixel-labeling” is actually more honest name for the task, and the “segmentation” component is emergent.

Instance Segmentation

Arguably the most difficult, relevant, and original meaning of Object Segmentation, “instance segmentation” means the segmentation of the individual objects within a scene, regardless of if they are the same type. However, one of the reason this is so difficult is because from a vision perspective (and in some ways a philosophical one) what makes an “object” instance is not entirely clear. Are body parts objects? Should such “part-objects” be segmented at all by an instance segmentation algorithm? Should they be only segmented if they are seen separate from the whole? What about compound objects should two things clearly adjoined but separable be one object or two (is a rock glued to the top of a stick an ax, a hammer, or just a stick and a rock unless properly made?). Also, it isn’t clear how to distinguish instances. Is a will a separate instance from the other walls it is attached to? What order should instances be counted in? As they appear? Proximity to the viewpoint? In spite of these difficulties, segmentation of objects is still a big deal because as humans we interact with objects all the time regardless of their “class label” (using random objects around you as paper weights, sitting on things that are not chairs), and so some dataset do attempt to get at this problem, but the main reason there isn’t much attention given to the problem yet is because it isn’t well enough defined.

Scene Parsing/Scene labeling

Scene Parsing is the strictly segmentation approach to scene labeling, which also has some vagueness problems of its own. Historically, scene labeling meant to divide the entire “scene” (image) up into segments and give them all a class label. However, it was also used to mean giving class labels to areas of the image without explicitly segmenting them. With respect to segmentation, “semantic segmentation” does not imply dividing the entire scene. For semantic segmentation, the algorithm is intended to segment only the objects it knows, and will be penalized by its loss function for labeling pixels that don’t have any label. For example the MS-COCO dataset is a dataset for semantic segmentation where only some objects are segmented.