In this page, we introduce the Market-1501 dataset as well as the implementation of the BoW baseline.

If you use this dataset in your research, please kindly cite our work as,
@inproceedings{zheng2015scalable, title={Scalable Person Re-identification: A Benchmark}, author={Zheng, Liang and Shen, Liyue and Tian, Lu and Wang, Shengjin and Wang, Jingdong and Tian, Qi}, booktitle={Computer Vision, IEEE International Conference on}, year={2015} }

Market-1501 Dataset

The Market-1501 dataset is collected in front of a supermarket in Tsinghua University. A total of six cameras were used, including 5 high-resolution cameras, and one low-resolution camera. Field-of-view overlap exists among different cameras. Overall, this dataset contains 32,668 annotated bounding boxes of 1,501 identities. In this open system, images of each identity are captured by at most six cameras. We make sure that each annotated identity is present in at least two cameras, so that cross-camera search can be performed. The Market-1501 dataset has three featured properties:

First, our dataset uses the Deformable Part Model (DPM) as pedestrian detector.
Second, in addition to the true positive bounding boxes, we also provde false alarm detection results.
Third, each identify may have multiple images under each camera. During cross-camera search, there are multiple queries and multiple ground truths for each identity.

The Market-1501 dataset is annotated using the following rules. For each detected bounding box to be annotated, we manually draw a ground truth bounding box that contains the pedestrian. Then, for the detected and hand-drawn bounding boxes, we calculate the ratio of the overlapping area to the union area. If the ratio is larger than 50%, the DPM bounding box is marked as "good"; if the ratio is smaller than 20%, the bounding boxe is marked as "distractor"; otherwise, it is marked as "junk", meaning that this image is of zero influence to the re-identification accuracy.

Dataset Market-1501 RAiD [1] CUHK03 [2] VIPeR [3] i-LIDS [4] CUHK01 [5] CUHK02 [6] CAVIAR [7]

# identities 1,501 43 1,360 632 119 971 1,816 72

# BBoxes 32,668 6,920 13,164 1,264 476 1,942 7,264 610

# distractors 2,793 0 0 0 0 0 0 0

# cam. per ID 6 4 2 2 2 2 2 2

DPM/Hand DPM hand DPM hand hand hand hand hand

Evaluation mAP CMC CMC CMC CMC CMC CMC CMC

New!
We are organizing 2nd Workshop and Challenge on Target Re-identification and Multi-Target Multi-Camera Tracking in CVPR 2019. Market-1501 will be used as the source training set.

We have annotated 27 ID-level attributes for Market-1501.

We have summarized current state of the art methods on Market-1501.

The dataset package is can be downloaded from any of the following links:
[Link 1] (Google Drive)
[Link 2] (Baidu Disk)
[Link 3] (A Server, Many thanks to Julian Tanke)

The package contains four folders.
1) "bounding_box_test". There are 19,732 images in this folder used for testing.
2) "bounding_box_train". There are 12,936 images in this folder used for training.
3) "query". There are 750 identities. We randomly select one query image for each camera. So the maximum number of query images is 6 for an identity. In total, there are 3,368 query images in this folder.
4) "gt_query". This folder contains the ground truth annotations. For each query, the relevant images are marked as "good" or "junk". "junk" has zero impact on search accuracy. "junk" images also include those in the same camera with the query.
5) "gt_bbox". We also provide the hand-drawn bounding boxes. They are used to judge whether a DPM bounding box is good.

Naming Rule of the bboxes
In bbox "0001_c1s1_001051_00.jpg", "c1" is the first camera (there are totally 6 cameras).

"s1" is sequence 1 of camera 1. Here, a sequence was defined automatically by the camera. We suppose that the camera cannot store a whole video that is quite large, so it splits the video into equally large sequences. Two sequences, namely, "c1s1" and "c2s1" do not happen exactly at the same time. This is mainly because the starting time of the 6 cameras are not exactly the same (it takes time to turn on them). But, "c1s1" and "c2s1" are roughly at the same time period.

"001051" is the 1051th frame in the sequence "c1s1". The frame rate is 25 frames per sec.

As with the last two digits, remember we use the DPM detector. Then, for identity "0001", there may be multiple detected bounding boxes in the frame "c1s1_001051". In other words, a pedestrian in the image may have several bboxes by DPM. So, "00" means that this bounding box is the first one among the several.

Market-1501+500k Dataset

We have released the 500k bboxes as distractors. Please download it from BaiduDisk or GoogleDrive. The results used in Fig. 10 is also provided here.

We have summarized current state of the art methods on Market-1501+500k.

Please feel free contact us (liangzheng06@gmail.com) if you have any comment on our dataset.

Baseline Codes

For the baseline, the Bag-of-Words (BoW) model is used. Then, a number of improvements are added. In the following, we provide codes for 1) feature extraction given a pedestrian image 2) performance evaluation.

1) Feature Extraction
A classic BoW model is constructed. We use dense sampling and extract a 11-dim Color Names [8] vector for each patch. The descriptor is l1-normalized followed by square root operator [10]. A codebook is trained on the irrelevant TUD-Brussels dataset [9]. Then, a given feature vector is quantized to its nearest neighbor under Euclidean distance. We employ Multiple Assignment (MA) [11] and set MA = 10. Moreover, we also integrate Burstiness weighting [12] and Negative Evidence [13] into the BoW model.
MATLAB code for feature extraction is provided on GoogleDrive here, or Baidu Disk here.

To repeat our result on the VIPeR dataset, we provide the code of feature extraction with different parameters from the code above. This code is also used in our CVPR15 paper. Code on Baidu Disk is here, and on GoogleDrive here.

2) Performance Evaluation
We evaluate our algorithm on Market-1501 dataset. The evaluation package in Google Drive is here, and in Baidu Disk is here.
In this implementation, we have the following components:

a. baseline: bow descriptor + linear scan. You will obtain mAP = 14.75%, r1 accuracy = 35.84%;

b. baseline + multiple query by max/avg pooling;

c. baseline + multiple query by max pooling + re-ranking;

d. baseline + pairwise evaluation: re-id between camera pairs is evaluated, and a confusion matrix is drawn.

e. metric learning: Using the code provide by [14], we provide the training and testing protocol on our dataset.

Note that, we updated the evaluation package (Thanks Ying-Cong!). In the current version, we do not use the "gt_query" folder provided in the dataset. Instead, the ground truths are generated automatically during evaluation.

References

[1] A. Das et al. Consistent re-identification in a camera network. In ECCV, 2014.
[2] W. Li et al. Deepreid: deep filter pairing neural network for person re- identification. In CVPR, 2014.
[3] D. Gray et al. Evaluating appearance models for recognition, reacquisition, and tracking. In IEEE International workshop on performance evaluation of tracking and surveillance, 2007.
[4] W.-S. Zheng et al. Associating groups of people. In BMVC, 2009.
[5] W. Li et al. Human reidentification with transferred metric learning. In ACCV, 2013.
[6] W. Li et al. Locally aligned feature transforms across views. In CVPR, 2013.
[7] D. S. Cheng et al. Custom pictorial structures for re- identification. In BMVC, 2011. [8] J. Van De Weijer et al. Learning color names for real-world applications. IEEE Transactions on Image Processing, 18(7): 1512- 1533, 2009.
[9] C. Wojek et al. Multi-cue onboard pedestrian detection. In CVPR, 2009.
[10] A. Relja et al. Three things everyone should know to improve object retrieval. In CVPR, 2012.
[11] H. Jegou et al. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, 2008.
[12] H. Jegou et al. On the burstiness of visual elements. In CVPR, 2009. [13] H. Jegou et al. Negative evidences and co-ocurrences in image retrieval: the benefit of PCA and whitening. In ECCV, 2012.
[14] M. Koestinger et al, Large scale metric learning from equivalence constraints. In CVPR, 2012.

Frequently Asked Questions

1. What are images beginning with "0000" and "-1"?
Ans: Names beginning with "0000" are distractors produced by DPM false detection.
Names beginning with "-1" are junks that are neither good nor bad DPM bboxes.
So "0000" will have a negative impact on accuracy, while "-1" will have no impact.
During testing, we rank all the images in "bounding_box_test". Then, the junk images are just neglected; distractor images are not neglected.

2. How exactly do we use images in the "gt_bboxes" during training and testing?
Ans: gt_bboxes are not used in training and testing now. These hand-drawn images are used to calculate the IOU (intersection over union) with the DPM bboxes, so that "good", "distractor", and "junk" can be assigned to the DPM bboxes.

3. Do you have some results for full hand labeled version of the dataset (images from gt_bboxes)?
Ans: Not yet. We only have the results on the DPM bboxes. One may want to try it. It is pretty interesting.

Thank you, Niki and Jenya!

Dataset	Market-1501	RAiD [1]	CUHK03 [2]	VIPeR [3]	i-LIDS [4]	CUHK01 [5]	CUHK02 [6]	CAVIAR [7]
# identities	1,501	43	1,360	632	119	971	1,816	72
# BBoxes	32,668	6,920	13,164	1,264	476	1,942	7,264	610
# distractors	2,793	0	0	0	0	0	0	0
# cam. per ID	6	4	2	2	2	2	2	2
DPM/Hand	DPM	hand	DPM	hand	hand	hand	hand	hand
Evaluation	mAP	CMC	CMC	CMC	CMC	CMC	CMC	CMC