Dataset Overviews

This page presents a brief overview of the datasets used to evaluate 1- and 2-dimensional range queries.

| Name | Dimensionality | Original Scale | % Zero Counts | Source Dataset |
| --- | --- | --- | --- | --- |
| Adult | 1D | 32,558 | 97.8 | Adult Census Dataset: describing the “capital loss” attribute of each person |
| HepPh | 1D | 347,414 | 21.17 | arXiv High Energy Physics paper citation network: describing the number of citations for each paper (scale is not consistent with the source) |
| Income | 1D | 20,787,122 | 44.97 | IPUMS USA census microdata: describing the “personal income” attribute of each person surveyed from 2001 to 2011 |
| Medcost | 1D | 98,415 | 74.8 | National home and hospice care 2007 survey data: describing personal medical expenses |
| Trace | 1D | 25,714 | 96.61 | Bipartite connections in an IP-level network trace collected at a major university: describing the number of external connections made by each internal host |
| Patent | 1D | 27,948,226 | 6.2 | US patent dataset (from 1/1/1963 to 12/30/1999) maintained by the National Bureau of Economic Research: describing the number of citations for each patent |
| Search | 1D | 335,889 | 51.03 | Search query logs dataset: describing the frequency of the search term “Obama” over time (from 2004 to 2010) |
| BIDS-FJ | 1D | 1,901,799 | 0 | Kaggle competition testing dataset: describing the number of individual bids where “merchandise=jewelry” per IP address |
| BIDS-FM | 1D | 2,126,344 | 0 | Kaggle competition testing dataset: describing the number of individual bids where “merchandise=mobile” per IP address |
| BIDS-ALL | 1D | 7,655,502 | 0 | Kaggle competition testing dataset: describing the total number of individual bids per IP address |
| MD-Sal | 1D | 135,727 | 83.12 | Maryland salary database (2012 state employees): describing the “YTD-gross-compensation” attribute |
| MD-Sal-FA | 1D | 100,534 | 83.17 | Maryland salary database (2012 state employees): describing the “YTD-gross-compensation” attribute, filtered on the condition “pay-type=Annually” |
| LC-REQ-F1 | 1D | 3,737,472 | 61.57 | Lending Club (online credit market) statistics database on rejected loan applications: describing the “Amount Requested” attribute, filtered on the condition “Employment between 0 and 5” |
| LC-REQ-F2 | 1D | 198,045 | 67.69 | Lending Club (online credit market) statistics database on rejected loan applications: describing the “Amount Requested” attribute, filtered on the condition “Employment between 5 and 10 (5 not included)” |
| LC-REQ-ALL | 1D | 3,999,425 | 60.15 | Lending Club (online credit market) statistics database on rejected loan applications: describing the “Amount Requested” attribute |
| LC-DTIR-F1 | 1D | 3,336,740 | 0 | Lending Club (online credit market) statistics database on rejected loan applications: describing the “Debt-To-Income Ratio” attribute, filtered on the condition “Employment between 0 and 5” |
| LC-DTIR-F2 | 1D | 189,827 | 11.91 | Lending Club (online credit market) statistics database on rejected loan applications: describing the “Debt-To-Income Ratio” attribute, filtered on the condition “Employment between 5 and 10 (5 not included)” |
| LC-DTIR-ALL | 1D | 3,589,119 | 0 | Lending Club (online credit market) statistics database on rejected loan applications: describing the “Debt-To-Income Ratio” attribute |
| BJ-CABS-S | 2D | 4,268,780 | 78.17 | GPS trajectories (by 8602 taxi cabs in Beijing during May 2009): describing the GPS location of the start point of each trajectory |
| BJ-CABS-E | 2D | 4,268,780 | 76.83 | GPS trajectories (by 8602 taxi cabs in Beijing during May 2009): describing the GPS location of the end point of each trajectory |
| GOWALLA | 2D | 6,442,863 | 88.92 | Check-in dataset from the Gowalla social networking website: describing the check-in locations of users |
| ADULT-2D | 2D | 32,561 | 99.30 | Adult Census Dataset: describing the “capital-gain” and “capital-loss” attributes of each person |
| MD-SAL-2D | 2D | 70,526 | 97.89 | Maryland salary database (2012 state employees): describing the “Annual Salary” and “Overtime earnings” attributes |
| SF-CABS-S | 2D | 464,040 | 95.04 | San Francisco taxi cabs mobility traces dataset: describing the location of the start point of each trace |
| SF-CABS-E | 2D | 464,040 | 97.31 | San Francisco taxi cabs mobility traces dataset: describing the location of the end point of each trace |
| LC-2D | 2D | 550,559 | 92.66 | Lending Club (online credit market) statistics database on rejected loan applications: describing the “Funded Amount” and “Annual Income” attributes |
| STROKE | 2D | 19,435 | 79.02 | International Stroke Trial Database: describing the “Age” and “Systolic blood pressure” attributes |
| TWITTER | 2D | 189,943 | 98.12 | Twitter Location Dataset: describing tweet locations in the western USA |
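
The page does not define the "Original Scale" and "% Zero Counts" columns, so the following is only a minimal sketch under two assumptions: that each dataset is represented as a histogram of counts over a discretized domain, that "Original Scale" is the total number of records (the sum of all cell counts), and that "% Zero Counts" is the percentage of histogram cells whose count is zero. The function names and the toy histograms are hypothetical and are not taken from any of the listed datasets. The sketch also shows what a 1D and a 2D range query over such a histogram could look like.

```python
from itertools import chain
from typing import List, Sequence


def flatten(hist_2d: Sequence[Sequence[int]]) -> List[int]:
    """Flatten a 2D grid of cell counts into a single list of counts."""
    return list(chain.from_iterable(hist_2d))


def scale(cells: Sequence[int]) -> int:
    """Assumed meaning of 'Original Scale': total number of records."""
    return sum(cells)


def percent_zero(cells: Sequence[int]) -> float:
    """Assumed meaning of '% Zero Counts': share of cells with count 0."""
    return 100.0 * sum(1 for c in cells if c == 0) / len(cells)


def range_query_1d(hist: Sequence[int], lo: int, hi: int) -> int:
    """Number of records in cells lo..hi (both ends inclusive)."""
    return sum(hist[lo:hi + 1])


def range_query_2d(hist: Sequence[Sequence[int]],
                   r0: int, r1: int, c0: int, c1: int) -> int:
    """Number of records in the rectangle rows r0..r1, columns c0..c1."""
    return sum(sum(row[c0:c1 + 1]) for row in hist[r0:r1 + 1])


# Toy histograms (hypothetical data, for illustration only).
hist_1d = [0, 3, 0, 0, 5, 2, 0, 0]
hist_2d = [[0, 1, 0],
           [0, 0, 4],
           [2, 0, 0]]

print(scale(hist_1d), percent_zero(hist_1d))   # 10 62.5
print(range_query_1d(hist_1d, 1, 4))           # 8
print(scale(flatten(hist_2d)))                 # 7
print(range_query_2d(hist_2d, 0, 1, 1, 2))     # 5
```

Under these assumptions, the percentage of zero cells depends on how finely the domain is discretized, so a figure like those in the table is only meaningful for a fixed domain size.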