Dataset | Description | # Compounds | # Tasks | Recommended Metric* | Task Type | Reference | |
---|---|---|---|---|---|---|---|
logP | Partition coefficient datasets, including a training set (8199 compounds) and three test sets: the Food and Drug Administration (FDA), Star, and Nonstar sets. | 8199(train), 406(test-FDA), 223(test-Star), 43(test-Nonstar) | 3 | R2 | Regression | 1, 3 | |
logS(1) | A diverse dataset of 1708 molecules. | 1708 | 1 | R2 | Regression | 1 | |
| | Small aqueous solubility datasets. SMILES are provided. | 1290(train), 21(test-1), 120(test-2) | 2 | R | Regression | 1 | |
LD50 (quantitative toxicity) | The oral rat LD50 dataset (LD50). SMILES are provided. | 5931(train), 1482(test) | 1 | R2 | Regression | 2, 3 | |
IGC50 | Tetrahymena pyriformis IGC50 dataset (IGC50). SMILES are provided. | 1434(train), 358(test) | 1 | R2 | Regression | 2, 3 | |
LC50 | 96 h fathead minnow LC50 dataset. SMILES are provided. | 659(train), 164(test) | 1 | R2 | Regression | 2, 3 | |
LC50DM | Daphnia magna LC50 dataset (LC50DM). SMILES are provided. | 283(train), 70(test) | 1 | R2 | Regression | 2, 3 | |
Tox21 (qualitative toxicity) | The Tox21 Data Challenge 2014 is designed to help scientists understand the potential of the chemicals and compounds being tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects. It includes 12 datasets. The official website is https://tripod.nih.gov/tox21/challenge/ | 1. NR-AhR:8162(train),609(test) 2. NR-AR:9353(train),585(test) 3. NR-AR-LBD:8591(train),581(test) 4. NR-Aromatase:7220(train),527(test) 5. NR-ER:7689(train),515(test) 6. NR-ER-LBD:8743(train),599(test) 7. NR-ppar-gamma:8176(train),604(test) 8. SR-ARE:7163(train),554(test) 9. SR-ATAD5:9085(train),621(test) 10. SR-HSE:8144(train),609(test) 11. SR-MMP:7314(train),542(test) 12. SR-p53:8626(train),615(test) | 12 | ROC-AUC | Classification | 4 | |
FreeSolv | Solvation free energy (FreeSolv). SMILES are provided. | 642 | 1 | RMSE | Regression | 3, 5 | |
Lipophilicity | SMILES strings are provided. | 4200 | 1 | RMSE | Regression | 3, 5 | |
DPP4 | DPP-4 inhibitors (DPP4) were extracted from ChEMBL for the DPP-4 target. The data were processed by removing salts, normalizing molecular structures, and checking for duplicate molecules, leaving 3933 molecules. | 3933 | 1 | RMSE | Regression | 5 | |
LogS | The LogS dataset originates from https://admetmesh.scbdd.com/resources/DA. | 4801 | 1 | RMSE | Regression | 5 | |
ESOL | ESOL (delaney) is a standard regression dataset containing structures and water solubility data for 1128 compounds. | 1128 | 1 | RMSE | Regression | 5 | |
BBBP | Blood–brain barrier penetration (BBBP). SMILES strings are provided. | 2039 | 1 | ROC-AUC | Classification | 3, 5 | |
Ames | Ames mutagenicity. The dataset includes 6512 compounds and corresponding binary labels from Ames Mutagenicity results. | 6512 | 1 | ROC-AUC | Classification | 5 | |
bace | A collection of 1522 compounds with their 2D structures and properties is provided. | 1513 | 1 | ROC-AUC | Classification | 5 | |
beet | The toxicity in honey bees (beet) dataset was extracted from a study on the prediction of acute contact toxicity of pesticides in honeybees. The dataset contains 254 compounds with their experimental values. | 254 | 2 | ROC-AUC | Classification | 5 | |
ClinTox | The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures. | 1491 | 2 | ROC-AUC | Classification | 5 | |
DUD | A Directory of Useful Decoys (DUD). A total of 21 targets. | 1. 'ace':46(actives),1796(decoys) 2. 'ache':99(actives),3859(decoys) 3. 'ar':68(actives),2848(decoys) 4. 'cdk2':47(actives),2070(decoys) 5. 'cox2':212(actives),12606(decoys) 6. 'dhfr':190(actives),8350(decoys) 7. 'egfr':365(actives),15560(decoys) 8. 'agonist':63(actives),2568(decoys) 9. 'fgfr1':71(actives),3462(decoys) 10. 'fxa':64(actives),2092(decoys) 11. 'gpb':49(actives),2132(decoys) 12. 'gr':32(actives),2585(decoys) 13. 'hivrt':34(actives),1494(decoys) 14. 'inha':57(actives),2707(decoys) 15. 'na':49(actives),1713(decoys) 16. 'p38':137(actives),6779(decoys) 17. 'parp':31(actives),1350(decoys) 18. 'pdgfrb':124(actives),5603(decoys) 19. 'sahh':33(actives),1344(decoys) 20. 'src':98(actives),5679(decoys) 21. 'vegfr2':48(actives),2712(decoys) | 21 | ROC-AUC | Rank | 5 | |
MUV | Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening. A total of 17 targets. | 1. '466':30(actives),15000(decoys) 2. '548':30(actives),15000(decoys) 3. '600':30(actives),15000(decoys) 4. '644':30(actives),15000(decoys) 5. '652':30(actives),15000(decoys) 6. '689':30(actives),15000(decoys) 7. '692':30(actives),15000(decoys) 8. '712':30(actives),15000(decoys) 9. '713':30(actives),15000(decoys) 10. '733':30(actives),15000(decoys) 11. '737':30(actives),15000(decoys) 12. '810':30(actives),15000(decoys) 13. '832':30(actives),15000(decoys) 14. '846':30(actives),15000(decoys) 15. '852':30(actives),15000(decoys) 16. '858':30(actives),15000(decoys) 17. '859':30(actives),15000(decoys) | 17 | ROC-AUC | Rank | 5 | |
Cocaine addiction datasets | The 36 cocaine-addiction related datasets were collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and the literature (references 1 and 2 in the README file), and involve 32 cocaine-addiction protein targets. The labels are binding affinities to these targets. SMILES strings are provided. | 1. DAT-binding: 1189 2. DAT-uptake: 350 3. Extended-DAT: 2877 4. D3R: 4685 5. D2R: 3721 6. Extended-D2R: 6923 7. D4R: 2411 8. HDAC: 1925 9. Sigma1: 2388 10. Activin receptor 1: 257 11. VMAT2: 248 12. CDK1: 1253 13. CACNA1D: 137 14. CAPN1: 639 15. CNR1: 3922 16. CNR2: 4336 17. EGFR: 6693 18. EPHA2: 490 19. GRM2: 748 20. GRM3: 114 21. HGF: 529 22. IGF1R: 2450 23. ITGB7: 416 24. LRRK2: 1871 25. MET: 3347 26. MMP3: 1909 27. MMP7: 482 28. MMP9: 2523 29. PSEN1: 117 30. SPR: 1026 31. SRC: 3268 32. SSTR5: 788 33. YES1: 121 34. GRK5: 262 35. hERG: 2043 36. Extended-hERG: 6298 | 36 | R | Regression | 6 | |
Cocaine addiction datasets 2 | The 30 additional cocaine-addiction related datasets were collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and involve 30 cocaine-addiction protein targets. The labels are binding affinities to these targets. SMILES strings are provided. | 1. SERT: 4327 2. NET: 2981 3. AKR1B1: 725 4. AKT1: 2859 5. APP: 1141 6. CACNA1B: 352 7. CSNK2A1: 737 8. DHFR: 960 9. DPP4: 3883 10. FGFR1: 2060 11. FYN: 459 12. GBA: 398 13. HDAC1: 4608 14. HTR1A: 4342 15. HTR2A: 4307 16. LCK: 1855 17. LYN: 468 18. MAPKAPK2: 789 19. MDM2: 1745 20. MINK1: 364 21. NTRK1: 2783 22. NTRK2: 566 23. NTRK3: 355 24. PLG: 937 25. PRKCD: 792 26. SLC5A2: 1231 27. STAT3: 670 28. SYK: 3175 29. TDO2: 291 30. VCP: 323 | 30 | R | Regression | 7 | |
Drug_addiction_related | Receptors related to opioid or cocaine addiction. SMILES strings are provided. | mu-ext(6541), 5HT2A-ext(5765), 5HT2C-ext(4044), 5HT6-ext(5096), D2-ext(11297), NMDA-ext(815), NOP-ext(2063), catB-ext(2285), catL-ext(2655), delta-ext(6338), and kappa-ext(6139) | 11 | R2, RMSE, MAE | Regression | 8 | |
hERG blocker/non-blocker datasets | Seven datasets are provided for the classification of hERG blockers and non-blockers. These datasets are taken from the literature, and the original datasets are included. | 1. Braga: 6824(train) 2. C. Zhang: 927(train), 407(test) 3. Li: 3721(train), 1092(test) 4. Cai: 954(train), 493(test) 5. Doddaredy: 2389(train), 255(test) 6. Ogura: 203853(train), 87366(test) 7. X. Zhang: 10859(train), 2570(test) | 7 | ROC-AUC, MCC, ACC | Classification | 9 | |
Opioid use disorder datasets | 75 datasets collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and used in the machine-learning study of opioid use disorder. The labels are binding affinities to these targets. SMILES strings are provided. | 1. MOR: 4667; 2. KOR: 4249; 3. DOR: 4033; 4. NOR: 1494; 5. ACE: 621; 6. ACKR3: 268; 7. ADRB1: 844; 8. ADORA1: 4266; 9. ADRB2: 1024; 10. AGTR1: 1036; 11. AKT1: 3140; 12. BDKRB2: 500; 13. CNR1: 4016; 14. CREBBP: 482; 15. CXCR4: 826; 16. HGF: 3620; 17. JAK2: 6202; 18. KLKB1: 1158; 19. MAPK1: 3159; 20. MC1R: 626; 21. MC3R: 780; 22. MC5R: 637; 23. MC4R: 2597; 24. MDM2: 2255; 25. NOS3: 675; 26. NTRK1: 3114; 27. PDE4A: 655; 28. SMO: 663; 29. SRC: 3344; 30. JAK1: 4402; 31. HDAC1: 5264; 32. HDAC2: 1768; 33. MAPK10: 1191; 34. BDKRB1: 785; 35. TOP2A: 308; 36. PI4KB: 294; 37. ITGB1: 1300; 38. CDK1: 1260; 39. ADAM17: 1767; 40. ADAMTS4: 288; 41. ADAMTS5: 482; 42. ADRBK1: 321; 43. AR: 2196; 44. ATG4B: 401; 45. AVPR2: 530; 46. CCR5: 2095; 47. DNMT1: 355; 48. EGFR: 7180; 49. ERBB2: 2021; 50. ERBB4: 280; 51. F11: 964; 52. JAK3: 3408; 53. KRAS: 306; 54. MAP3K5: 355; 55. MMP1: 2511; 56. MMP2: 3538; 57. MMP7: 489; 58. MMP8: 1242; 59. MMP9: 2620; 60. PDGFRB: 1371; 61. PPARG: 1863; 62. TYK2: 1505; 63. CRHR1: 1935; 64. CASR: 351; 65. FYN: 490; 66. ESR1: 2711; 67. S1PR1: 748; 68. PTPN2: 749; 69. F2R: 843; 70. CXCR2: 967; 71. CXCR1: 312; 72. REN: 2930; 73. P2RY12: 1066; 74. TBXA2R: 802; 75. hERG: 6298 | 75 | R | Regression | 10, 12, 14 | |
sodium_channel_111_datasets | 111 datasets collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and used in the machine-learning study of pain treatment. The labels are binding affinities to these targets. SMILES strings are provided. | The details of the 111 targets are given in Table S3 of the Supporting Information of Ref. 13. | 111 | R | Regression | 13 | |
SVS datasets | The 9 datasets for biomolecular interactions, comprising 4 regression datasets and 5 classification datasets. | 1. PL: 3767(train), 290(test) 2. PP: 1795 3. PN: 186 4. iPPI: 1694(train), 565(test) 5. S. cerevisiae: 11188 6. H. sapiens: 2434 7. D. melanogaster: 2140 8. H. pylori: 2916 9. M. musculus: 694 | 9 | R (datasets 1-4), ROC-AUC (datasets 5-9) | Regression (datasets 1-4), Classification (datasets 5-9) | 11 | |
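* Metrics: R - Pearson correlation coefficient; R2 - Squared Pearson correlation coefficient; RMSE - Root Mean Square Error; MAE - Mean Absolute Error; ROC-AUC - Area Under the Receiver Operating Characteristic Curve; MCC - Matthews Correlation Coefficient; ACC - Accuracy.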
Note: Each dataset includes a README file that gives the source or reference of the data.
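
The recommended metrics in the table can be computed from a model's predictions with a few lines of code. The following is a minimal sketch in plain Python, not the evaluation code used in the referenced papers. The function names and the toy inputs are hypothetical and chosen only for illustration. Note that R2 here follows the footnote (the squared Pearson correlation coefficient), which differs in general from the coefficient of determination. ROC-AUC is computed as the probability that a randomly chosen positive (for example, an active) is scored higher than a randomly chosen negative (for example, a decoy), with ties counted as one half.

```python
import math


def pearson_r(y_true, y_pred):
    """R: Pearson correlation coefficient between labels and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r_squared(y_true, y_pred):
    """R2 as defined in the footnote: the squared Pearson correlation."""
    return pearson_r(y_true, y_pred) ** 2


def rmse(y_true, y_pred):
    """RMSE: root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """MAE: mean absolute error."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positives (1) and negatives (0).

    This is O(n_pos * n_neg), which is fine for a sketch but slow for
    large datasets; a rank-based implementation scales better.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy regression example (hypothetical values).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(y_true, y_pred), r_squared(y_true, y_pred))
print(rmse(y_true, y_pred), mae(y_true, y_pred))

# Toy classification example (hypothetical values).
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))
```

In the toy regression example the predictions are perfectly correlated with the labels but systematically too large, so R and R2 are high while RMSE and MAE are not zero. This is why correlation-based and error-based metrics should not be compared across rows of the table.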
[1] Wu, Kedi, Zhixiong Zhao, Renxiao Wang, and Guo‐Wei Wei. "TopP–S: Persistent homology‐based multi‐task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility." Journal of computational chemistry 39, no. 20 (2018): 1444-1454.
[2] Wu, Kedi, and Guo-Wei Wei. "Quantitative toxicity prediction using topology based multitask deep neural networks." Journal of chemical information and modeling 58, no. 2 (2018): 520-531.
[3] Chen, Dong, Kaifu Gao, Duc Duy Nguyen, Xin Chen, Yi Jiang, Guo-Wei Wei, and Feng Pan. "Algebraic graph-assisted bidirectional transformers for molecular property prediction." Nature Communications 12, no. 1 (2021): 1-9.
[4] Jiang, Jian, Rui Wang, and Guo-Wei Wei. "GGL-Tox: Geometric Graph Learning for Toxicity Prediction." Journal of Chemical Information and Modeling (2021).
[5] Chen, Dong, Guowei Wei, and Feng Pan. "Extracting Predictive Representations from Hundreds of Millions of Molecules".
[6] Kaifu Gao, Dong Chen, Alfred J Robison, and Guo-Wei Wei. "Proteome-informed machine learning studies of cocaine addiction".
[7] Hongsong Feng, Kaifu Gao, Dong Chen, Alfred J Robison, Edmund Ellsworth and Guo-Wei Wei. "Machine learning analysis of cocaine addiction informed by DAT, SERT, and NET-based interactome networks".
[8] Bozheng Dou, Zailiang Zhu, Yucang Cao, Jian Jiang, Yueying Zhu, Dong Chen, Hongsong Feng, Jie Liu, Bengong Zhang, Tianshou Zhou, and Guo-Wei Wei. "TIDAL: Topology-Inferred Drug Addiction Learning", in print, 2022.
[9] Hongsong Feng and Guo-Wei Wei. "Virtual screening of DrugBank database for hERG blockers using topological Laplacian-assisted AI models". Computers in Biology and Medicine (2023).
[10] Hongsong Feng, Rana Elladki, Jian Jiang, and Guo-Wei Wei. "Machine-learning Analysis of Opioid Use Disorder Informed by MOR, DOR, KOR, NOR and ZOR-Based Interactome Networks". Computers in Biology and Medicine (2023).
[11] Li Shen, Hongsong Feng, Yuchi Qiu, and Guo-Wei Wei. "SVSBI: Sequence-based virtual screening of biomolecular interactions".
[12] Hongsong Feng, Jian Jiang, and Guo-Wei Wei. "Machine-learning Repurposing of DrugBank Compounds for Opioid Use Disorder".
[13] Long Chen, Jian Jiang, Bozheng Dou, Hongsong Feng, Jie Liu, Yueying Zhu, Bengong Zhang, Tianshou Zhou, and Guo-Wei Wei. "Machine Learning Study of the Extended Drug-target Interaction Network informed by Pain Related Voltage-Gated Sodium Channels", in submission, 2023.