3D Benchmarks for Molecular Machine Learning




| Dataset | Description | # Compounds | # Tasks | Recommended Metric* | Task Type | Reference |
|---|---|---|---|---|---|---|
| logP | Partition coefficient datasets, including a training set (8199 compounds) and the Food and Drug Administration (FDA), Star, and Nonstar test sets. SMILES and 3D coordinates are provided. | 8199 (train), 406 (test-FDA), 223 (test-Star), 43 (test-Nonstar) | 3 | R2 | Regression | 1, 3 |
| logS | A diverse dataset of 1708 molecules. 3D coordinates are provided. | 1708 | 1 | R2 | Regression | 1 |
| logS | Small aqueous solubility datasets. 3D coordinates are provided. | 1290 (train), 21 (test-1), 120 (test-2) | 2 | R | Regression | 1 |
| Quantitative toxicity: LD50 | The oral rat LD50 dataset (LD50). SMILES and 3D coordinates are provided. | 5931 (train), 1482 (test) | 1 | R2 | Regression | 2, 3 |
| Quantitative toxicity: IGC50 | Tetrahymena pyriformis IGC50 dataset (IGC50). SMILES and 3D coordinates are provided. | 1434 (train), 358 (test) | 1 | R2 | Regression | 2, 3 |
| Quantitative toxicity: LC50 | 96 h fathead minnow LC50 dataset. SMILES and 3D coordinates are provided. | 659 (train), 164 (test) | 1 | R2 | Regression | 2, 3 |
| Quantitative toxicity: LC50DM | Daphnia magna LC50 dataset (LC50DM). SMILES and 3D coordinates are provided. | 283 (train), 70 (test) | 1 | R2 | Regression | 2, 3 |
| Qualitative toxicity: Tox21 | The Tox21 Data Challenge 2014 is designed to help scientists understand the potential of the chemicals and compounds tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects. It includes 12 data sets. The official website is https://tripod.nih.gov/tox21/challenge/ | 1. NR-AhR: 8162 (train), 609 (test); 2. NR-AR: 9353 (train), 585 (test); 3. NR-AR-LBD: 8591 (train), 581 (test); 4. NR-Aromatase: 7220 (train), 527 (test); 5. NR-ER: 7689 (train), 515 (test); 6. NR-ER-LBD: 8743 (train), 599 (test); 7. NR-ppar-gamma: 8176 (train), 604 (test); 8. SR-ARE: 7163 (train), 554 (test); 9. SR-ATAD5: 9085 (train), 621 (test); 10. SR-HSE: 8144 (train), 609 (test); 11. SR-MMP: 7314 (train), 542 (test); 12. SR-p53: 8626 (train), 615 (test) | 12 | ROC-AUC | Classification | 4 |
| FreeSolv | Solvation free energy (FreeSolv). SMILES and 3D coordinates are provided. | 642 | 1 | RMSE | Regression | 3 |
| Lipophilicity | SMILES and 3D coordinates are provided. | 4200 | 1 | RMSE | Regression | 3 |
| BBBP | Blood-brain barrier penetration (BBBP). SMILES and 3D coordinates are provided. | 2039 | 1 | ROC-AUC | Classification | 3 |
| Drug_addiction_related | Receptors related to opioid or cocaine addiction. SMILES and 3D coordinates are provided. | mu (3010), 5HT2A (2787), 5HT2C (1723), 5HT6 (2373), D2 (3720), NMDA (246), NOP (431), catB (1183), catL (905), delta (2679), kappa (2409) | 11 | ROC-AUC, MCC | Classification | 4 |
| DAT | In the DAT data set, the majority of the data points are from human DAT (hDAT) and rat DAT (rDAT). The all-DAT data set includes hDAT, rDAT, and a few other species. The data are sorted by how they were acquired: from radioligand binding assays ("binding") or from inhibition of dopamine uptake ("uptake"). SMILES and 3D coordinates are provided. | all-DAT_binding (887), all-DAT_uptake (219), hDAT_binding (503), hDAT_uptake (45), rDAT_binding (424), rDAT_uptake (177) | 6 | ROC-AUC, MCC, Accuracy, Sensitivity, Specificity, F1 score | Classification | 4 |
| DAT | | all-DAT_binding (1189), all-DAT_uptake (350), hDAT_binding (684), hDAT_uptake (126), rDAT_binding (541), rDAT_uptake (229) | 6 | R2, RMSE, D | Regression | 4 |
| hERG | The human ether-a-go-go (hERG) potassium channel. The data are sorted by how they were acquired: from patch-clamp electrophysiology ("clamp") or from radioligand binding assays ("binding"). SMILES and 3D coordinates are provided. | hERG_binding (1137), hERG_clamp (783) | 2 | ROC-AUC, MCC, Accuracy, Sensitivity, Specificity, F1 score | Classification | 4 |
| hERG | | hERG_binding (2043), hERG_clamp (1405) | 2 | R2, RMSE, D | Regression | 4 |

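* Metrics: R - Pearson correlation coefficient; R2 - squared Pearson correlation coefficient; RMSE - root mean square error; MAE - mean absolute error; ROC-AUC - area under the receiver operating characteristic curve; MCC - Matthews correlation coefficient.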
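The regression metrics defined above can be sketched in a few lines of plain Python. Note that R2 here is the squared Pearson correlation, which is not the same quantity as the coefficient of determination: it stays at 1 for predictions that are perfectly correlated with the targets but offset from them. The function names and the sample values below are illustrative assumptions, not part of the datasets or of the referenced papers.

```python
import math

def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient R between targets and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

def r2_pearson(y_true, y_pred):
    """R2 as defined in the footnote: the square of Pearson R."""
    return pearson_r(y_true, y_pred) ** 2

def rmse(y_true, y_pred):
    """Root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    """Mean absolute error."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# Hypothetical values: every prediction is offset from its target by 0.5.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.5, 3.5, 4.5]

print("R    =", pearson_r(y_true, y_pred))
print("R2   =", r2_pearson(y_true, y_pred))
print("RMSE =", rmse(y_true, y_pred))
print("MAE  =", mae(y_true, y_pred))
```

In this toy case R and R2 are both 1 while RMSE and MAE are both 0.5, which is why a correlation-based metric and an error-based metric can rank the same model differently.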

Note: Each dataset includes a README file that gives the source or reference for the data.
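The classification rows in the table recommend ROC-AUC, MCC, accuracy, sensitivity, specificity, and F1 score. A minimal sketch of these metrics for binary labels (1 = active, 0 = inactive) follows. It assumes both classes are present, uses a hypothetical decision threshold of 0.5, and computes ROC-AUC by pairwise comparison of positive and negative scores. The function names, labels, and scores are illustrative assumptions and are not taken from the datasets.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    """Threshold-based metrics computed from the confusion matrix."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "f1": 2 * tp / (2 * tp + fp + fn),
        # MCC is conventionally set to 0 when the denominator vanishes.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive has the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted scores for six compounds.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
metrics["roc_auc"] = roc_auc(y_true, scores)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

ROC-AUC is computed from the raw scores, so it does not depend on the threshold, whereas the other five metrics change whenever the threshold does. The pairwise loop is quadratic in the number of compounds; for the larger sets in the table, a rank-based implementation would be the more practical choice.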





References


[1] Wu, Kedi, Zhixiong Zhao, Renxiao Wang, and Guo-Wei Wei. "TopP-S: Persistent homology-based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility." Journal of Computational Chemistry 39, no. 20 (2018): 1444-1454.

[2] Wu, Kedi, and Guo-Wei Wei. "Quantitative toxicity prediction using topology based multitask deep neural networks." Journal of Chemical Information and Modeling 58, no. 2 (2018): 520-531.

[3] Chen, Dong, Kaifu Gao, Duc Nguyen, Xin Chen, Yi Jiang, Guo-Wei Wei, and Feng Pan. "Algebraic Graph-assisted Bidirectional Transformers for Molecular Prediction." (2021).

[4] Jiang, Jian, Rui Wang, and Guo-Wei Wei. "GGL-Tox: Geometric Graph Learning for Toxicity Prediction." Journal of Chemical Information and Modeling (2021).

[5] Dou, Bozheng, Zailiang Zhu, Yucang Cao, Jian Jiang, Yueying Zhu, Dong Chen, Hongsong Feng, Jie Liu, Bengong Zhang, Tianshou Zhou, and Guo-Wei Wei. "TIDAL: Topology-Inferred Drug Addiction Learning." Submitted to Nature Communications (2022).