Diabetes

machine-learning

Files: 3
Size: 132kB
Format: arff, csv, zip
Created: 5 years ago
Updated: 5 years ago
License: Open Data Commons Public Domain Dedication and License

Data Files

Download files in this dataset

| File | Description | Size | Last changed | Download |
|---|---|---|---|---|
| diabetes_arff | | 37kB | | arff (37kB) |
| diabetes | | 34kB | | csv (34kB), json (105kB) |
| diabetes_zip | Compressed versions of dataset. Includes normalized CSV and JSON data with original data and datapackage.json. | 46kB | | zip (46kB) |

diabetes  

Field information

| Field Name | Order | Type (Format) | Description |
|---|---|---|---|
| preg | 1 | number (default) | |
| plas | 2 | number (default) | |
| pres | 3 | number (default) | |
| skin | 4 | number (default) | |
| insu | 5 | number (default) | |
| mass | 6 | number (default) | |
| pedi | 7 | number (default) | |
| age | 8 | number (default) | |
| class | 9 | string (default) | |
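
The field list above can be turned into a small loader that checks the header and converts types. This is a minimal sketch using only the Python standard library. It assumes the CSV header uses the field names in the listed order; the function name `parse_rows` and the two sample rows are made up for illustration and are not taken from the real file.

```python
import csv
import io

# Field names and types as listed in the field information above.
NUMERIC_FIELDS = ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age"]
LABEL_FIELD = "class"


def parse_rows(text):
    """Parse CSV text into dicts: eight floats plus a string class label."""
    reader = csv.DictReader(io.StringIO(text))
    expected = NUMERIC_FIELDS + [LABEL_FIELD]
    if reader.fieldnames != expected:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    rows = []
    for raw in reader:
        row = {name: float(raw[name]) for name in NUMERIC_FIELDS}
        row[LABEL_FIELD] = raw[LABEL_FIELD]
        rows.append(row)
    return rows


# Two hypothetical rows for illustration only (not copied from the dataset).
SAMPLE = """preg,plas,pres,skin,insu,mass,pedi,age,class
2,110,70,25,90,30.5,0.4,29,tested_negative
5,160,80,30,150,35.1,0.7,45,tested_positive
"""

rows = parse_rows(SAMPLE)
print(len(rows), rows[0]["plas"], rows[1]["class"])
```

To use it on the real data, pass the text of the downloaded CSV file to `parse_rows` instead of `SAMPLE`.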

Integrate this dataset into your favourite tool

Use our data-cli tool designed for data wranglers:

data get https://datahub.io/machine-learning/diabetes
data info machine-learning/diabetes
tree machine-learning/diabetes

Or fetch the files directly with cURL:

# Get a list of the dataset's resources
curl -L -s https://datahub.io/machine-learning/diabetes/datapackage.json | grep path

# Get the resources
curl -L https://datahub.io/machine-learning/diabetes/r/0.arff
curl -L https://datahub.io/machine-learning/diabetes/r/1.csv
curl -L https://datahub.io/machine-learning/diabetes/r/2.zip
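
The `grep path` step above can also be done by parsing the descriptor as JSON rather than matching text. The following is a rough sketch, assuming the descriptor follows the usual Data Package layout in which each entry under `resources` may carry a `path` that is either a string or a list of strings. The helper names `resource_paths` and `fetch_descriptor` and the example descriptor are hypothetical.

```python
import json
from urllib.request import urlopen


def resource_paths(descriptor):
    """Collect every resource path found in a Data Package descriptor dict."""
    paths = []
    for resource in descriptor.get("resources", []):
        path = resource.get("path")
        if isinstance(path, str):
            paths.append(path)
        elif isinstance(path, list):
            paths.extend(path)
    return paths


def fetch_descriptor(url):
    """Download and decode a datapackage.json file (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


# Hypothetical descriptor used only to show the shape of the result.
example = {
    "resources": [
        {"name": "a", "path": "data/a.csv"},
        {"name": "b", "path": ["data/b1.csv", "data/b2.csv"]},
        {"name": "inline", "data": [1, 2]},  # inline data, no path
    ]
}
print(resource_paths(example))
```

With network access, `resource_paths(fetch_descriptor('https://datahub.io/machine-learning/diabetes/datapackage.json'))` would list the paths for this dataset.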

If you are using R, here is how to load the data quickly:

install.packages("jsonlite", repos="https://cran.rstudio.com/")
library("jsonlite")

json_file <- 'https://datahub.io/machine-learning/diabetes/datapackage.json'
json_data <- fromJSON(paste(readLines(json_file), collapse=""))

# get list of all resources:
print(json_data$resources$name)

# print all tabular data (if any exists)
for(i in seq_along(json_data$resources$datahub$type)){
  if(json_data$resources$datahub$type[i]=='derived/csv'){
    path_to_file = json_data$resources$path[i]
    data <- read.csv(url(path_to_file))
    print(data)
  }
}

Note: You might need to run the script with root permissions if you are running it on a Linux machine.

Install the Frictionless Data `datapackage` library and pandas:

pip install datapackage
pip install pandas

Now you can load the Data Package with pandas:

import datapackage
import pandas as pd

data_url = 'https://datahub.io/machine-learning/diabetes/datapackage.json'

# load the Data Package descriptor
package = datapackage.Package(data_url)

# read only the tabular resources into DataFrames
for resource in package.resources:
    if resource.tabular:
        data = pd.read_csv(resource.descriptor['path'])
        print(data)

For Python, first install the `datapackage` library (all the datasets on DataHub are Data Packages):

pip install datapackage

To load the Data Package into your Python environment, run the following code:

from datapackage import Package

package = Package('https://datahub.io/machine-learning/diabetes/datapackage.json')

# print list of all resources:
print(package.resource_names)

# print processed tabular data (if any exists)
for resource in package.resources:
    if resource.descriptor['datahub']['type'] == 'derived/csv':
        print(resource.read())
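
The loop above indexes `resource.descriptor['datahub']['type']` directly, which would raise a `KeyError` for any resource whose descriptor has no `datahub` entry. A more defensive check might look like the sketch below. It works on plain dictionaries; the helper name `is_derived_csv` and the sample descriptors are hypothetical and are not the actual contents of this dataset's `datapackage.json`.

```python
def is_derived_csv(descriptor):
    """Return True if a resource descriptor is tagged as a derived CSV."""
    datahub = descriptor.get("datahub") or {}
    return datahub.get("type") == "derived/csv"


# Hypothetical descriptors for illustration only.
descriptors = [
    {"name": "example_arff", "format": "arff"},  # no 'datahub' key at all
    {"name": "example_csv", "format": "csv", "datahub": {"type": "derived/csv"}},
    {"name": "example_zip", "format": "zip", "datahub": {"type": "derived/zip"}},
]

tabular = [d["name"] for d in descriptors if is_derived_csv(d)]
print(tabular)
```

In the loop above, the condition could then be written as `if is_derived_csv(resource.descriptor):`.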

If you are using JavaScript, follow the instructions below:

Install the data.js module using npm:

  $ npm install data.js

Once the package is installed, use the following code snippet:

const {Dataset} = require('data.js')

const path = 'https://datahub.io/machine-learning/diabetes/datapackage.json'

// Use a self-invoking async function so that we can use await:
;(async () => {
  const dataset = await Dataset.load(path)
  // get list of all resources:
  for (const id in dataset.resources) {
    console.log(dataset.resources[id]._descriptor.name)
  }
  // get all tabular data (if any exists)
  for (const id in dataset.resources) {
    if (dataset.resources[id]._descriptor.format === "csv") {
      const file = dataset.resources[id]
      // Get a raw stream
      const stream = await file.stream()
      // entire file as a buffer (be careful with large files!)
      const buffer = await file.buffer
      // print data
      stream.pipe(process.stdout)
    }
  }
})()

Read me

The resources for this dataset can be found at https://www.openml.org/d/37

Author: Vincent Sigillito

Source: Obtained from UCI

Please cite: UCI citation policy

  1. Title: Pima Indians Diabetes Database

  2. Sources:

    (a) Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
    (b) Donor of database: Vincent Sigillito ([email protected]), Research Center, RMI Group Leader, Applied Physics Laboratory, The Johns Hopkins University, Johns Hopkins Road, Laurel, MD 20707, (301) 953-6231
    (c) Date received: 9 May 1990

  3. Past Usage:

    1. Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.

      The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.

      Results: Their ADAP algorithm makes a real-valued prediction between 0 and 1. This was transformed into a binary decision using a cutoff of 0.448. Using 576 training instances, the sensitivity and specificity of their algorithm were 76% on the remaining 192 instances.

  4. Relevant Information: Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP is an adaptive learning routine that generates and executes digital analogs of perceptron-like devices. It is a unique algorithm; see the paper for details.

  5. Number of Instances: 768

  6. Number of Attributes: 8 plus class

  7. For Each Attribute: (all numeric-valued)

    1. Number of times pregnant
    2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
    3. Diastolic blood pressure (mm Hg)
    4. Triceps skin fold thickness (mm)
    5. 2-Hour serum insulin (mu U/ml)
    6. Body mass index (weight in kg/(height in m)^2)
    7. Diabetes pedigree function
    8. Age (years)
    9. Class variable (0 or 1)
  8. Missing Attribute Values: None

  9. Class Distribution: (class value 1 is interpreted as “tested positive for diabetes”)

    Class Value    Number of instances
    0              500
    1              268

  10. Brief statistical analysis:

    Attribute number:    Mean:    Standard Deviation:
    1                      3.8      3.4
    2                    120.9     32.0
    3                     69.1     19.4
    4                     20.5     16.0
    5                     79.8    115.2
    6                     32.0      7.9
    7                      0.5      0.3
    8                     33.2     11.8
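
The Past Usage entry above describes turning a real-valued prediction into a binary decision with a cutoff of 0.448 and then reporting sensitivity and specificity. The sketch below shows how those two figures are computed in general. It is not the ADAP algorithm. Whether a score exactly equal to the cutoff counts as positive is an assumption made here, and the scores and labels are hypothetical.

```python
def sensitivity_specificity(scores, labels, cutoff=0.448):
    """Threshold scores at `cutoff` and return (sensitivity, specificity).

    `labels` holds 0 (negative) or 1 (positive). A score >= cutoff is
    treated as a positive prediction (an assumption of this sketch).
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= cutoff else 0
        if label == 1:
            if predicted == 1:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == 0:
                tn += 1
            else:
                fp += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical predictions and true labels, for illustration only.
scores = [0.9, 0.5, 0.3, 0.2, 0.6, 0.1]
labels = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(scores, labels)
print(round(sens, 3), round(spec, 3))
```

Sensitivity is the fraction of true positives that are predicted positive, and specificity is the fraction of true negatives that are predicted negative. Moving the cutoff trades one against the other.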
      

Relabeled values in attribute ‘class’:
From: 0 To: tested_negative
From: 1 To: tested_positive
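
The relabeling above means the `class` column holds strings rather than the original 0/1 values. A small mapping, sketched below, converts the labels back to integers and relates them to the class distribution given in the README. The names `LABEL_TO_INT` and `encode` are made up for this example.

```python
# Inverse of the relabeling described above.
LABEL_TO_INT = {"tested_negative": 0, "tested_positive": 1}

# Class counts as stated in the README's class distribution.
COUNTS = {"tested_negative": 500, "tested_positive": 268}


def encode(labels):
    """Map string class labels back to the original 0/1 values."""
    return [LABEL_TO_INT[label] for label in labels]


total = sum(COUNTS.values())  # 768 instances in all
for label, count in COUNTS.items():
    print(LABEL_TO_INT[label], label, count, f"{count / total:.1%}")

print(encode(["tested_positive", "tested_negative"]))
```

The negative class makes up roughly 65% of the instances, so a classifier that always predicts `tested_negative` would already reach about that accuracy. This is a useful baseline when evaluating models on this data.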
