Welcome

2D Benchmarks for Molecular Machine Learning

Dataset		Description	# Compounds	# Tasks	Recommended Metric*	Task Type	Reference
logP		Partition coefficient datasets, including a training set (8199 compounds) and the Food and Drug Administration (FDA), Star, and Nonstar test sets.	8199(train), 406(test-FDA), 223(test-Star), 43(test-Nonstar)	3	R²	Regression	1, 3
logS(1)		A diverse dataset of 1708 molecules.	1708	1	R²	Regression	1
logS(1)		Small aqueous solubility datasets. SMILES are provided.	1290(train), 21(test-1), 120(test-2)	2	R	Regression	1
Quantitative toxicity	LD50	The oral rat LD50 dataset (LD50). SMILES are provided.	5931(train), 1482(test)	1	R²	Regression	2, 3
	IGC50	Tetrahymena pyriformis IGC50 dataset (IGC50). SMILES are provided.	1434(train), 358(test)	1	R²	Regression	2, 3
	LC50	96 h fathead minnow LC50 dataset. SMILES are provided.	659(train), 164(test)	1	R²	Regression	2, 3
	LC50DM	Daphnia magna LC50 dataset (LC50DM). SMILES are provided.	283(train), 70(test)	1	R²	Regression	2, 3
Qualitative toxicity	Tox21	The Tox21 Data Challenge 2014 is designed to help scientists understand the potential of the chemicals and compounds tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects. It includes 12 data sets. The official website is https://tripod.nih.gov/tox21/challenge/	1. NR-AhR: 8162(train), 609(test) 2. NR-AR: 9353(train), 585(test) 3. NR-AR-LBD: 8591(train), 581(test) 4. NR-Aromatase: 7220(train), 527(test) 5. NR-ER: 7689(train), 515(test) 6. NR-ER-LBD: 8743(train), 599(test) 7. NR-ppar-gamma: 8176(train), 604(test) 8. SR-ARE: 7163(train), 554(test) 9. SR-ATAD5: 9085(train), 621(test) 10. SR-HSE: 8144(train), 609(test) 11. SR-MMP: 7314(train), 542(test) 12. SR-p53: 8626(train), 615(test)	12	ROC-AUC	Classification	4
FreeSolv		Solvation free energy (FreeSolv). SMILES are provided.	642	1	RMSE	Regression	3, 5
Lipophilicity		SMILES strings are provided.	4200	1	RMSE	Regression	3, 5
DPP4		DPP-4 inhibitors (DPP4) were extracted from ChEMBL for the DPP-4 target. The data were processed by removing salts, normalizing molecular structures, and checking for duplicate molecules, leaving 3933 molecules.	3933	1	RMSE	Regression	5
LogS		The LogS dataset originates from https://admetmesh.scbdd.com/resources/DA.	4801	1	RMSE	Regression	5
ESOL		ESOL (Delaney) is a standard regression dataset containing structures and water solubility data for 1128 compounds.	1128	1	RMSE	Regression	5
BBBP		Blood–brain barrier penetration (BBBP). SMILES strings are provided.	2039	1	ROC-AUC	Classification	3, 5
Ames		Ames mutagenicity. The dataset includes 6512 compounds and corresponding binary labels from Ames Mutagenicity results.	6512	1	ROC-AUC	Classification	5
bace		A collection of 1522 compounds with their 2D structures and properties is provided.	1513	1	ROC-AUC	Classification	5
beet		The toxicity in honey bees (beet) dataset was extracted from a study on the prediction of acute contact toxicity of pesticides in honeybees. The dataset contains 254 compounds with their experimental values.	254	2	ROC-AUC	Classification	5
ClinTox		The ClinTox dataset compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons. The dataset includes two classification tasks for 1491 drug compounds with known chemical structures.	1491	2	ROC-AUC	Classification	5
DUD		A Directory of Useful Decoys (DUD). A total of 21 targets.	1. 'ace':46(actives),1796(decoys) 2. 'ache':99(actives),3859(decoys) 3. 'ar':68(actives),2848(decoys) 4. 'cdk2':47(actives),2070(decoys) 5. 'cox2':212(actives),12606(decoys) 6. 'dhfr':190(actives),8350(decoys) 7. 'egfr':365(actives),15560(decoys) 8. 'agonist':63(actives),2568(decoys) 9. 'fgfr1':71(actives),3462(decoys) 10. 'fxa':64(actives),2092(decoys) 11. 'gpb':49(actives),2132(decoys) 12. 'gr':32(actives),2585(decoys) 13. 'hivrt':34(actives),1494(decoys) 14. 'inha':57(actives),2707(decoys) 15. 'na':49(actives),1713(decoys) 16. 'p38':137(actives),6779(decoys) 17. 'parp':31(actives),1350(decoys) 18. 'pdgfrb':124(actives),5603(decoys) 19. 'sahh':33(actives),1344(decoys) 20. 'src':98(actives),5679(decoys) 21. 'vegfr2':48(actives),2712(decoys)	21	ROC-AUC	Rank	5
MUV		Maximum Unbiased Validation (MUV) Data Sets for Virtual Screening. A total of 17 targets.	1. '466':30(actives),15000(decoys) 2. '548':30(actives),15000(decoys) 3. '600':30(actives),15000(decoys) 4. '644':30(actives),15000(decoys) 5. '652':30(actives),15000(decoys) 6. '689':30(actives),15000(decoys) 7. '692':30(actives),15000(decoys) 8. '712':30(actives),15000(decoys) 9. '713':30(actives),15000(decoys) 10. '733':30(actives),15000(decoys) 11. '737':30(actives),15000(decoys) 12. '810':30(actives),15000(decoys) 13. '832':30(actives),15000(decoys) 14. '846':30(actives),15000(decoys) 15. '852':30(actives),15000(decoys) 16. '858':30(actives),15000(decoys) 17. '859':30(actives),15000(decoys)	17	ROC-AUC	Rank	5
Cocaine addiction datasets		The 36 cocaine-addiction related datasets are collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and the literature (references 1 and 2 in the README file), which involve 32 cocaine-addiction protein targets. The labels are binding affinities to these targets. SMILES strings are provided.	1. DAT-binding: 1189 2. DAT-uptake: 350 3. Extended-DAT: 2877 4. D3R: 4685 5. D2R: 3721 6. Extended-D2R: 6923 7. D4R: 2411 8. HDAC: 1925 9. Sigma1: 2388 10. Activin receptor 1: 257 11. VMAT2: 248 12. CDK1: 1253 13. CACNA1D: 137 14. CAPN1: 639 15. CNR1: 3922 16. CNR2: 4336 17. EGFR: 6693 18. EPHA2: 490 19. GRM2: 748 20. GRM3: 114 21. HGF: 529 22. IGF1R: 2450 23. ITGB7: 416 24. LRRK2: 1871 25. MET: 3347 26. MMP3: 1909 27. MMP7: 482 28. MMP9: 2523 29. PSEN1: 117 30. SPR: 1026 31. SRC: 3268 32. SSTR5: 788 33. YES1: 121 34. GRK5: 262 35. hERG: 2043 36. Extended-hERG: 6298	36	R	Regression	6
Cocaine addiction datasets 2		The 30 additional cocaine-addiction related datasets are collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/), which involve 30 cocaine-addiction protein targets. The labels are binding affinities to these targets. SMILES strings are provided.	1. SERT: 4327 2. NET: 2981 3. AKR1B1: 725 4. AKT1: 2859 5. APP: 1141 6. CACNA1B: 352 7. CSNK2A1: 737 8. DHFR: 960 9. DPP4: 3883 10. FGFR1: 2060 11. FYN: 459 12. GBA: 398 13. HDAC1: 4608 14. HTR1A: 4342 15. HTR2A: 4307 16. LCK: 1855 17. LYN: 468 18. MAPKAPK2: 789 19. MDM2: 1745 20. MINK1: 364 21. NTRK1: 2783 22. NTRK2: 566 23. NTRK3: 355 24. PLG: 937 25. PRKCD: 792 26. SLC5A2: 1231 27. STAT3: 670 28. SYK: 3175 29. TDO2: 291 30. VCP: 323	30	R	Regression	7
Drug_addiction_related		Receptors related to opioid or cocaine addiction. SMILES strings are provided.	mu-ext(6541), 5HT2A-ext(5765), 5HT2C-ext(4044), 5HT6-ext(5096), D2-ext(11297), NMDA-ext(815), NOP-ext(2063), catB-ext(2285), catL-ext(2655), delta-ext(6338), and kappa-ext(6139)	11	R², RMSE, MAE	Regression	8
hERG blocker/non-blocker datasets		Seven datasets are provided for the classification of hERG blockers/non-blockers. These datasets are taken from the literature, and the original datasets are included.	1. Braga: 6824 (train) 2. C. Zhang: 927(train), 407(test) 3. Li: 3721(train), 1092(test) 4. Cai: 954(train), 493(test) 5. Doddaredy: 2389(train), 255(test) 6. Ogura: 203853(train), 87366(test) 7. X. Zhang: 10859(train), 2570(test)	7	ROC-AUC, MCC, ACC	Classification	9
Opioid use disorder datasets		75 datasets collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and used in the machine-learning study of opioid use disorder. The labels are binding affinities to these targets. SMILES strings are provided.	1. MOR: 4667; 2. KOR: 4249; 3. DOR: 4033; 4. NOR: 1494; 5. ACE: 621; 6. ACKR3: 268; 7. ADRB1: 844; 8. ADORA1: 4266; 9. ADRB2: 1024; 10. AGTR1: 1036; 11. AKT1: 3140; 12. BDKRB2: 500; 13. CNR1: 4016; 14. CREBBP: 482; 15. CXCR4: 826; 16. HGF: 3620; 17. JAK2: 6202; 18. KLKB1: 1158; 19. MAPK1: 3159; 20. MC1R: 626; 21. MC3R: 780; 22. MC5R: 637; 23. MC4R: 2597; 24. MDM2: 2255; 25. NOS3: 675; 26. NTRK1: 3114; 27. PDE4A: 655; 28. SMO: 663; 29. SRC: 3344; 30. JAK1: 4402; 31. HDAC1: 5264; 32. HDAC2: 1768; 33. MAPK10: 1191; 34. BDKRB1: 785; 35. TOP2A: 308; 36. PI4KB: 294; 37. ITGB1: 1300; 38. CDK1: 1260; 39. ADAM17: 1767; 40. ADAMTS4: 288; 41. ADAMTS5: 482; 42. ADRBK1: 321; 43. AR: 2196; 44. ATG4B: 401; 45. AVPR2: 530; 46. CCR5: 2095; 47. DNMT1: 355; 48. EGFR: 7180; 49. ERBB2: 2021; 50. ERBB4: 280; 51. F11: 964; 52. JAK3: 3408; 53. KRAS: 306; 54. MAP3K5: 355; 55. MMP1: 2511; 56. MMP2: 3538; 57. MMP7: 489; 58. MMP8: 1242; 59. MMP9: 2620; 60. PDGFRB: 1371; 61. PPARG: 1863; 62. TYK2: 1505; 63. CRHR1: 1935; 64. CASR: 351; 65. FYN: 490; 66. ESR1: 2711; 67. S1PR1: 748; 68. PTPN2: 749; 69. F2R: 843; 70. CXCR2: 967; 71. CXCR1: 312; 72. REN: 2930; 73. P2RY12: 1066; 74. TBXA2R: 802; 75. hERG: 6298;	75	R	Regression	10, 12, 14
sodium_channel_111_datasets		111 datasets collected from the ChEMBL database (https://www.ebi.ac.uk/chembl/) and used in the machine-learning study of pain treatment. The labels are binding affinities to these targets. SMILES strings are provided.	The details of the 111 targets are given in Table S3 of the Supporting Information of Ref. 13	111	R	Regression	13
SVS datasets		Nine datasets for biomolecular interactions, including 4 regression and 5 classification datasets.	1. PL: 3767(train), 290(test) 2. PP: 1795 3. PN: 186 4. iPPI: 1694(train), 565(test)	9	R	Regression	11
SVS datasets			5. S. cerevisiae: 11188 6. H. sapiens: 2434 7. D. melanogaster: 2140 8. H. pylori: 2916 9. M. musculus: 694	9	ROC-AUC	Classification	11

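* Metrics: R - Pearson correlation coefficient; R² - Squared Pearson correlation coefficient; RMSE - Root Mean Square Error; MAE - Mean Absolute Error; ROC-AUC - Area Under the Receiver Operating Characteristic Curve; MCC - Matthews Correlation Coefficient; ACC - Accuracy.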
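
The regression metrics named in the footnote can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the evaluation code used in the referenced papers: the function names and the toy values are hypothetical, and R² is computed as the square of the Pearson R, following the footnote's definition.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient (R) between labels and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values (hypothetical): every prediction is offset by +0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]

r = pearson_r(y_true, y_pred)
r_squared = r ** 2  # squared Pearson R, as defined in the footnote
print(r, r_squared, rmse(y_true, y_pred), mae(y_true, y_pred))
```

In this toy case the predictions carry a constant bias, so RMSE and MAE are nonzero while the squared Pearson R is still 1. The squared Pearson R measures linear association only, which is why it can differ from the coefficient of determination that some libraries also call R².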

Note: Each dataset includes a README file that gives the source or reference for the data.
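
The classification and ranking rows of the table recommend ROC-AUC, and the hERG row also lists MCC and ACC. The sketch below shows one way to compute these three metrics in plain Python, assuming binary labels coded as 0/1 and a hypothetical decision threshold of 0.5 for MCC and ACC. The function names and toy values are illustrative, not taken from the referenced papers.

```python
import math


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_counts(labels, preds):
    """Return (tp, tn, fp, fn) for binary labels and binary predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(labels, preds):
    """Fraction of predictions that match the labels (ACC)."""
    tp, tn, fp, fn = confusion_counts(labels, preds)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(labels, preds):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(labels, preds)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy values (hypothetical): 1 = active/blocker, 0 = decoy/non-blocker.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(roc_auc(labels, scores), accuracy(labels, preds), mcc(labels, preds))
```

ROC-AUC depends only on how the scores rank actives against inactives, which suits the decoy-heavy DUD and MUV sets, whereas MCC and ACC depend on the chosen threshold. The pairwise loop is quadratic in the number of compounds; for the larger datasets a rank-based implementation from an established library would be the more practical choice.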

References

[1] Wu, Kedi, Zhixiong Zhao, Renxiao Wang, and Guo‐Wei Wei. "TopP–S: Persistent homology‐based multi‐task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility." Journal of computational chemistry 39, no. 20 (2018): 1444-1454.

[2] Wu, Kedi, and Guo-Wei Wei. "Quantitative toxicity prediction using topology based multitask deep neural networks." Journal of chemical information and modeling 58, no. 2 (2018): 520-531.

[3] Chen, Dong, Kaifu Gao, Duc Duy Nguyen, Xin Chen, Yi Jiang, Guo-Wei Wei, and Feng Pan. "Algebraic graph-assisted bidirectional transformers for molecular property prediction." Nature Communications 12, no. 1 (2021): 1-9.

[4] Jiang, Jian, Rui Wang, and Guo-Wei Wei. "GGL-Tox: Geometric Graph Learning for Toxicity Prediction." Journal of Chemical Information and Modeling (2021).

[5] Chen, Dong, Guowei Wei, and Feng Pan. "Extracting Predictive Representations from Hundreds of Millions of Molecules".

[6] Kaifu Gao, Dong Chen, Alfred J Robison, and Guo-Wei Wei. "Proteome-informed machine learning studies of cocaine addiction".

[7] Hongsong Feng, Kaifu Gao, Dong Chen, Alfred J Robison, Edmund Ellsworth and Guo-Wei Wei. "Machine learning analysis of cocaine addiction informed by DAT, SERT, and NET-based interactome networks".

[8] Bozheng Dou, Zailiang Zhu, Yucang Cao, Jian Jiang, Yueying Zhu, Dong Chen, Hongsong Feng, Jie Liu, Bengong Zhang, Tianshou Zhou, and Guo-Wei Wei. "TIDAL: Topology-Inferred Drug Addiction Learning", in print, 2022.

[9] Hongsong Feng and Guo-Wei Wei. "Virtual screening of DrugBank database for hERG blockers using topological Laplacian-assisted AI models". Computers in Biology and Medicine (2023).

[10] Hongsong Feng, Rana Elladki, Jian Jiang, and Guo-Wei Wei. "Machine-learning Analysis of Opioid Use Disorder Informed by MOR, DOR, KOR, NOR and ZOR-Based Interactome Networks". Computers in Biology and Medicine (2023).

[11] Li Shen, Hongsong Feng, Yuchi Qiu, and Guo-Wei Wei. "SVSBI: Sequence-based virtual screening of biomolecular interactions".

[12] Hongsong Feng, Jian Jiang, and Guo-Wei Wei. "Machine-learning Repurposing of DrugBank Compounds for Opioid Use Disorder".

[13] Long Chen, Jian Jiang, Bozheng Dou, Hongsong Feng, Jie Liu, Yueying Zhu, Bengong Zhang, Tianshou Zhou, and Guo-Wei Wei. "Machine Learning Study of the Extended Drug-target Interaction Network informed by Pain Related Voltage-Gated Sodium Channels", in submission, 2023.
