2D Benchmarks for Molecular Machine Learning




Dataset Description # Compounds # Tasks Recommended Metric* Task Type Reference
logP Partition coefficient datasets, including a training set (8199 compounds) and the Food and Drug Administration (FDA), Star, and Nonstar test sets. 8199(train),
406(test-FDA),
223(test-Star),
43(test-Nonstar)
3 R2 Regression 1, 3
logS(1) A diverse dataset of 1708 molecules. 1708 1 R2 Regression 1
Small aqueous solubility datasets. SMILES are provided. 1290(train),
21(test-1),
120(test-2)
2 R Regression 1
Quantitative toxicity LD50 The oral rat LD50 dataset (LD50). SMILES are provided. 5931(train),
1482(test)
1 R2 Regression 2, 3
IGC50 Tetrahymena pyriformis IGC50 dataset (IGC50). SMILES are provided. 1434(train),
358(test)
1 R2 Regression 2, 3
LC50 96 h fathead minnow LC50 dataset. SMILES are provided. 659(train),
164(test)
1 R2 Regression 2, 3
LC50DM Daphnia magna LC50 dataset (LC50DM). SMILES are provided. 283(train),
70(test)
1 R2 Regression 2, 3
Qualitative toxicity Tox21 The Tox21 Data Challenge 2014 is designed to help scientists understand the potential of the chemicals and compounds tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects. It includes 12 data sets. The official website is https://tripod.nih.gov/tox21/challenge/ 1. NR-AhR:8162(train),609(test)
2. NR-AR:9353(train),585(test)
3. NR-AR-LBD:8591(train),581(test)
4. NR-Aromatase:7220(train),527(test)
5. NR-ER:7689(train),515(test)
6. NR-ER-LBD:8743(train),599(test)
7. NR-ppar-gamma:8176(train),604(test)
8. SR-ARE:7163(train),554(test)
9. SR-ATAD5:9085(train),621(test)
10. SR-HSE:8144(train),609(test)
11. SR-MMP:7314(train),542(test)
12. SR-p53:8626(train),615(test)
12 ROC-AUC Classification 4
FreeSolv Solvation free energy (FreeSolv). SMILES are provided. 642 1 RMSE Regression 3, 5
Lipophilicity SMILES strings are provided. 4200 1 RMSE Regression 3, 5
DPP4 The DPP-4 inhibitor dataset (DPP4) was extracted from ChEMBL for the DPP-4 target. The data were processed by removing salts, normalizing molecular structures, and checking for duplicate molecules, leaving 3933 molecules. 3933 1 RMSE Regression 5
LogS The LogS dataset, originally from https://admetmesh.scbdd.com/resources/DA. 4801 1 RMSE Regression 5
ESOL ESOL (Delaney) is a standard regression dataset containing structures and water solubility data for 1128 compounds. 1128 1 RMSE Regression 5
BBBP Blood–brain barrier penetration (BBBP). SMILES strings are provided. 2039 1 ROC-AUC Classification 3, 5
Ames Ames mutagenicity. The dataset includes 6512 compounds and corresponding binary labels from Ames Mutagenicity results. 6512 1 ROC-AUC Classification 5
bace A collection of 1522 compounds with their 2D structures and properties is provided. 1513 1 ROC-AUC Classification 5
beet The toxicity in honey bees (beet) dataset was extracted from a study on the prediction of acute contact toxicity of pesticides in honeybees. The dataset contains 254 compounds with their experimental values. 254 2 ROC-AUC Classification 5
ClinTox The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures. 1491 2 ROC-AUC Classification 5
DUD A Directory of Useful Decoys (DUD). A total of 21 targets. 1. 'ace':46(actives),1796(decoys)
2. 'ache':99(actives),3859(decoys)
3. 'ar':68(actives),2848(decoys)
4. 'cdk2':47(actives),2070(decoys)
5. 'cox2':212(actives),12606(decoys)
6. 'dhfr':190(actives),8350(decoys)
7. 'egfr':365(actives),15560(decoys)
8. 'agonist':63(actives),2568(decoys)
9. 'fgfr1':71(actives),3462(decoys)
10. 'fxa':64(actives),2092(decoys)
11. 'gpb':49(actives),2132(decoys)
12. 'gr':32(actives),2585(decoys)
13. 'hivrt':34(actives),1494(decoys)
14. 'inha':57(actives),2707(decoys)
15. 'na':49(actives),1713(decoys)
16. 'p38':137(actives),6779(decoys)
17. 'parp':31(actives),1350(decoys)
18. 'pdgfrb':124(actives),5603(decoys)
19. 'sahh':33(actives),1344(decoys)
20. 'src':98(actives),5679(decoys)
21. 'vegfr2':48(actives),2712(decoys)
21 ROC-AUC Rank 5
MUV Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening. A total of 17 targets. 1. '466':30(actives),15000(decoys)
2. '548':30(actives),15000(decoys)
3. '600':30(actives),15000(decoys)
4. '644':30(actives),15000(decoys)
5. '652':30(actives),15000(decoys)
6. '689':30(actives),15000(decoys)
7. '692':30(actives),15000(decoys)
8. '712':30(actives),15000(decoys)
9. '713':30(actives),15000(decoys)
10. '733':30(actives),15000(decoys)
11. '737':30(actives),15000(decoys)
12. '810':30(actives),15000(decoys)
13. '832':30(actives),15000(decoys)
14. '846':30(actives),15000(decoys)
15. '852':30(actives),15000(decoys)
16. '858':30(actives),15000(decoys)
17. '859':30(actives),15000(decoys)
17 ROC-AUC Rank 5
Cocaine addiction datasets The 36 cocaine-addiction-related datasets were collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and the literature (references 1 and 2 in the README file) and involve 32 cocaine-addiction protein targets. The labels are binding affinities to these targets. SMILES strings are provided. 1. DAT-binding: 1189
2. DAT-uptake: 350
3. Extended-DAT: 2877
4. D3R: 4685
5. D2R: 3721
6. Extended-D2R: 6923
7. D4R: 2411
8. HDAC: 1925
9. Sigma1: 2388
10. Activin receptor 1: 257
11. VMAT2: 248
12. CDK1: 1253
13. CACNA1D: 137
14. CAPN1: 639
15. CNR1: 3922
16. CNR2: 4336
17. EGFR: 6693
18. EPHA2: 490
19. GRM2: 748
20. GRM3: 114
21. HGF: 529
22. IGF1R: 2450
23. ITGB7: 416
24. LRRK2: 1871
25. MET: 3347
26. MMP3: 1909
27. MMP7: 482
28. MMP9: 2523
29. PSEN1: 117
30. SPR: 1026
31. SRC: 3268
32. SSTR5: 788
33. YES1: 121
34. GRK5: 262
35. hERG: 2043
36. Extended-hERG: 6298
36 R Regression 6
Cocaine addiction datasets 2 The 30 additional cocaine-addiction-related datasets were collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and involve 30 cocaine-addiction protein targets. The labels are binding affinities to these targets. SMILES strings are provided. 1. SERT: 4327
2. NET: 2981
3. AKR1B1: 725
4. AKT1: 2859
5. APP: 1141
6. CACNA1B: 352
7. CSNK2A1: 737
8. DHFR: 960
9. DPP4: 3883
10. FGFR1: 2060
11. FYN: 459
12. GBA: 398
13. HDAC1: 4608
14. HTR1A: 4342
15. HTR2A: 4307
16. LCK: 1855
17. LYN: 468
18. MAPKAPK2: 789
19. MDM2: 1745
20. MINK1: 364
21. NTRK1: 2783
22. NTRK2: 566
23. NTRK3: 355
24. PLG: 937
25. PRKCD: 792
26. SLC5A2: 1231
27. STAT3: 670
28. SYK: 3175
29. TDO2: 291
30. VCP: 323
30 R Regression 7
Drug_addiction_related Receptors related to opioid or cocaine addiction. SMILES strings are provided. mu-ext(6541), 5HT2A-ext(5765),
5HT2C-ext(4044), 5HT6-ext(5096),
D2-ext(11297), NMDA-ext(815),
NOP-ext(2063), catB-ext(2285),
catL-ext(2655), delta-ext(6338),
and kappa-ext(6139)
11 R2, RMSE, MAE Regression 8
hERG blocker/non-blocker datasets Seven datasets are provided for the classification of hERG blockers/non-blockers. These datasets are taken from the literature, and the original datasets are included. 1. Braga: 6824 (train)
2. C. Zhang: 927(train), 407(test)
3. Li: 3721(train), 1092(test)
4. Cai: 954(train), 493(test)
5. Doddaredy: 2389(train), 255(test)
6. Ogura: 203853(train), 87366(test)
7. X. Zhang: 10859(train), 2570(test)
7 ROC-AUC, MCC, ACC Classification 9
Opioid use disorder datasets 75 datasets collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and used in the machine-learning study of opioid use disorder. The labels are binding affinities to these targets. SMILES strings are provided. 1. MOR: 4667; 2. KOR: 4249;
3. DOR: 4033; 4. NOR: 1494;
5. ACE: 621; 6. ACKR3: 268;
7. ADRB1: 844; 8. ADORA1: 4266;
9. ADRB2: 1024; 10. AGTR1: 1036;
11. AKT1: 3140; 12. BDKRB2: 500;
13. CNR1: 4016; 14. CREBBP: 482;
15. CXCR4: 826; 16. HGF: 3620;
17. JAK2: 6202; 18. KLKB1: 1158;
19. MAPK1: 3159; 20. MC1R: 626;
21. MC3R: 780; 22. MC5R: 637;
23. MC4R: 2597; 24. MDM2: 2255;
25. NOS3: 675; 26. NTRK1: 3114;
27. PDE4A: 655; 28. SMO: 663;
29. SRC: 3344; 30. JAK1: 4402;
31. HDAC1: 5264; 32. HDAC2: 1768;
33. MAPK10: 1191; 34. BDKRB1: 785;
35. TOP2A: 308; 36. PI4KB: 294;
37. ITGB1: 1300; 38. CDK1: 1260;
39. ADAM17: 1767; 40. ADAMTS4: 288;
41. ADAMTS5: 482; 42. ADRBK1: 321;
43. AR: 2196; 44. ATG4B: 401;
45. AVPR2: 530; 46. CCR5: 2095;
47. DNMT1: 355; 48. EGFR: 7180;
49. ERBB2: 2021; 50. ERBB4: 280;
51. F11: 964; 52. JAK3: 3408;
53. KRAS: 306; 54. MAP3K5: 355;
55. MMP1: 2511; 56. MMP2: 3538;
57. MMP7: 489; 58. MMP8: 1242;
59. MMP9: 2620; 60. PDGFRB: 1371;
61. PPARG: 1863; 62. TYK2: 1505;
63. CRHR1: 1935; 64. CASR: 351;
65. FYN: 490; 66. ESR1: 2711;
67. S1PR1: 748; 68. PTPN2: 749;
69. F2R: 843; 70. CXCR2: 967;
71. CXCR1: 312; 72. REN: 2930;
73. P2RY12: 1066; 74. TBXA2R: 802;
75. hERG: 6298;
75 R Regression 10
SVS datasets Nine datasets for biomolecular interactions, including 4 regression and 5 classification datasets. 1. PL: 3767(train), 290(test)
2. PP: 1795
3. PN: 186
4. iPPI: 1694(train), 565(test)
9 R Regression 11
5. S. cerevisiae: 11188
6. H. sapiens: 2434
7. D. melanogaster: 2140
8. H. pylori: 2916
9. M. musculus: 694
ROC-AUC Classification

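* Metrics: R - Pearson correlation coefficient; R2 - squared Pearson correlation coefficient; RMSE - root mean square error; MAE - mean absolute error; ROC-AUC - area under the receiver operating characteristic curve; MCC - Matthews correlation coefficient; ACC - accuracy.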
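
As a rough illustration of the regression metrics defined in the footnote, the sketch below computes R, R2, RMSE, and MAE with the Python standard library only. The function names and the sample values are hypothetical and are not taken from any of the datasets or from the referenced papers; it is a minimal sketch of the definitions, not the evaluation code used by the authors.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def regression_metrics(y_true, y_pred):
    """Return R, R2 (squared Pearson R), RMSE, and MAE as a dictionary."""
    n = len(y_true)
    r = pearson_r(y_true, y_pred)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"R": r, "R2": r * r, "RMSE": rmse, "MAE": mae}


# Made-up experimental values and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]
scores = regression_metrics(y_true, y_pred)
print(scores)
```

In this toy case every prediction is offset by 0.5, so R and R2 equal 1 while RMSE and MAE equal 0.5. This shows why a correlation-based metric and an error-based metric can disagree: R2 as defined here is the squared Pearson correlation, not the coefficient of determination, so it does not penalize a constant bias.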

Note: Each dataset includes a README file that gives the source or reference for the data.
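
The classification and ranking rows in the table recommend ROC-AUC, and the hERG row also lists MCC and ACC. The sketch below shows one way to compute them in plain Python, reading ROC-AUC as the area under the ROC curve, MCC as the Matthews correlation coefficient, and ACC as accuracy (the usual meanings of these abbreviations). The function names, the sample labels and scores, and the 0.5 decision threshold are all assumptions made for illustration.

```python
import math


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_and_mcc(labels, preds):
    """Return (ACC, MCC) for binary labels and binary predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    acc = (tp + tn) / len(labels)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


# Made-up labels (1 = active/blocker) and model scores, for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

auc = roc_auc(labels, scores)
acc, mcc = accuracy_and_mcc(labels, preds)
print(auc, acc, mcc)
```

The pairwise loop is quadratic in the number of compounds, which is fine for a small example but slow for the larger sets in the table. A rank-based computation or an established library routine would be a more practical choice there.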





References


[1] Wu, Kedi, Zhixiong Zhao, Renxiao Wang, and Guo-Wei Wei. "TopP–S: Persistent homology-based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility." Journal of Computational Chemistry 39, no. 20 (2018): 1444-1454.

[2] Wu, Kedi, and Guo-Wei Wei. "Quantitative toxicity prediction using topology based multitask deep neural networks." Journal of Chemical Information and Modeling 58, no. 2 (2018): 520-531.

[3] Chen, Dong, Kaifu Gao, Duc Duy Nguyen, Xin Chen, Yi Jiang, Guo-Wei Wei, and Feng Pan. "Algebraic graph-assisted bidirectional transformers for molecular property prediction." Nature Communications 12, no. 1 (2021): 1-9.

[4] Jiang, Jian, Rui Wang, and Guo-Wei Wei. "GGL-Tox: Geometric Graph Learning for Toxicity Prediction." Journal of Chemical Information and Modeling (2021).

[5] Chen, Dong, Guowei Wei, and Feng Pan. "Extracting Predictive Representations from Hundreds of Millions of Molecules."

[6] Gao, Kaifu, Dong Chen, Alfred J Robison, and Guo-Wei Wei. "Proteome-informed machine learning studies of cocaine addiction."

[7] Feng, Hongsong, Kaifu Gao, Dong Chen, Alfred J Robison, Edmund Ellsworth, and Guo-Wei Wei. "Machine learning analysis of cocaine addiction informed by DAT, SERT, and NET-based interactome networks."

[8] Dou, Bozheng, Zailiang Zhu, Yucang Cao, Jian Jiang, Yueying Zhu, Dong Chen, Hongsong Feng, Jie Liu, Bengong Zhang, Tianshou Zhou, and Guo-Wei Wei. "TIDAL: Topology-Inferred Drug Addiction Learning." In print (2022).

[9] Feng, Hongsong, and Guo-Wei Wei. "Virtual screening of DrugBank database for hERG blockers using topological Laplacian-assisted AI models." Computers in Biology and Medicine (2023).

[10] Feng, Hongsong, Rana Elladki, Jian Jiang, and Guo-Wei Wei. "Machine-learning Analysis of Opioid Use Disorder Informed by MOR, DOR, KOR, NOR and ZOR-Based Interactome Networks." In print (2023).

[11] Shen, Li, Hongsong Feng, Yuchi Qiu, and Guo-Wei Wei. "SVSBI: Sequence-based virtual screening of biomolecular interactions."