Dataset | Description | # Compounds | # Tasks | Recommended Metric* | Task Type | Reference
---|---|---|---|---|---|---
logP | Partition coefficient datasets, including a training set (8199 compounds) and the Food and Drug Administration (FDA), Star, and Nonstar test sets. SMILES and 3D coordinates are provided. | 8199(train), 406(test-FDA), 223(test-Star), 43(test-Nonstar) | 3 | R2 | Regression | 1, 3
logS | A diverse dataset of 1708 molecules. 3D coordinates are provided. | 1708 | 1 | R2 | Regression | 1
logS | Small aqueous solubility datasets. 3D coordinates are provided. | 1290(train), 21(test-1), 120(test-2) | 2 | R | Regression | 1
Quantitative toxicity: LD50 | The oral rat LD50 dataset (LD50). SMILES and 3D coordinates are provided. | 5931(train), 1482(test) | 1 | R2 | Regression | 2, 3
Quantitative toxicity: IGC50 | Tetrahymena pyriformis IGC50 dataset (IGC50). SMILES and 3D coordinates are provided. | 1434(train), 358(test) | 1 | R2 | Regression | 2, 3
Quantitative toxicity: LC50 | 96 h fathead minnow LC50 dataset. SMILES and 3D coordinates are provided. | 659(train), 164(test) | 1 | R2 | Regression | 2, 3
Quantitative toxicity: LC50DM | Daphnia magna LC50 dataset (LC50DM). SMILES and 3D coordinates are provided. | 283(train), 70(test) | 1 | R2 | Regression | 2, 3
Qualitative toxicity: Tox21 | The Tox21 Data Challenge 2014 is designed to help scientists understand the potential of the chemicals and compounds being tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects. It includes 12 data sets. The official website is https://tripod.nih.gov/tox21/challenge/ | 1. NR-AhR: 8162(train), 609(test); 2. NR-AR: 9353(train), 585(test); 3. NR-AR-LBD: 8591(train), 581(test); 4. NR-Aromatase: 7220(train), 527(test); 5. NR-ER: 7689(train), 515(test); 6. NR-ER-LBD: 8743(train), 599(test); 7. NR-ppar-gamma: 8176(train), 604(test); 8. SR-ARE: 7163(train), 554(test); 9. SR-ATAD5: 9085(train), 621(test); 10. SR-HSE: 8144(train), 609(test); 11. SR-MMP: 7314(train), 542(test); 12. SR-p53: 8626(train), 615(test) | 12 | ROC-AUC | Classification | 4
FreeSolv | Solvation free energy (FreeSolv). SMILES and 3D coordinates are provided. | 642 | 1 | RMSE | Regression | 3
Lipophilicity | SMILES and 3D coordinates are provided. | 4200 | 1 | RMSE | Regression | 3
BBBP | Blood-brain barrier penetration (BBBP). SMILES and 3D coordinates are provided. | 2039 | 1 | ROC-AUC | Classification | 3
Drug_addiction_related | Receptors related to opioid or cocaine addiction. SMILES and 3D coordinates are provided. | mu(3010), 5HT2A(2787), 5HT2C(1723), 5HT6(2373), D2(3720), NMDA(246), NOP(431), catB(1183), catL(905), delta(2679), kappa(2409) | 11 | ROC-AUC, MCC | Classification | 4
DAT | In the DAT data set, the majority of the data points are for human DAT (hDAT) and rat DAT (rDAT). The all-DAT data set includes hDAT, rDAT, and a few other species. Filters were developed to sort the data by how they were acquired: from radioligand binding assays ("binding") or from inhibition of dopamine uptake ("uptake"). SMILES and 3D coordinates are provided. | all-DAT_binding(887), all-DAT_uptake(219), hDAT_binding(503), hDAT_uptake(45), rDAT_binding(424), rDAT_uptake(177) | 6 | ROC-AUC, MCC, Accuracy, Sensitivity, Specificity, F1 score | Classification | 4
DAT | As above. | all-DAT_binding(1189), all-DAT_uptake(350), hDAT_binding(684), hDAT_uptake(126), rDAT_binding(541), rDAT_uptake(229) | 6 | R2, RMSE, D | Regression | 4
hERG | The human ether-a-go-go (hERG) potassium channel. Filters were developed to sort the data by how they were acquired: from patch-clamp electrophysiology ("clamp") or from radioligand binding assays ("binding"). SMILES and 3D coordinates are provided. | hERG_binding(1137), hERG_clamp(783) | 2 | ROC-AUC, MCC, Accuracy, Sensitivity, Specificity, F1 score | Classification | 4
hERG | As above. | hERG_binding(2043), hERG_clamp(1405) | 2 | R2, RMSE, D | Regression | 4
* Metrics: R - Pearson correlation coefficient; R2 - squared Pearson correlation coefficient; RMSE - root mean square error; MAE - mean absolute error.
Note: Each dataset includes a README file that gives the source or reference for the data.
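
The regression metrics named in the footnote above can be computed with a few lines of NumPy. The following is a minimal sketch, not the authors' evaluation code: the function name `regression_metrics` and the toy arrays are hypothetical, chosen only for illustration. Note that R2 here is the squared Pearson correlation, as the footnote defines it, which is not the same quantity as the coefficient of determination.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return R, R2 (squared Pearson R), RMSE, and MAE for two 1-D arrays.

    Hypothetical helper for illustration; not part of the datasets.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Pearson correlation coefficient between labels and predictions.
    r = np.corrcoef(y_true, y_pred)[0, 1]

    # Pointwise prediction errors.
    err = y_pred - y_true

    return {
        "R": r,
        "R2": r ** 2,  # squared Pearson correlation, per the footnote
        "RMSE": np.sqrt(np.mean(err ** 2)),
        "MAE": np.mean(np.abs(err)),
    }


# Toy example (made-up values): predictions offset from the labels by 0.5.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
print(metrics)
```

In this toy case the predictions are perfectly correlated with the labels, so R and R2 are 1 even though every prediction is off by 0.5. This is why the correlation-based metrics are usefully read alongside RMSE or MAE.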
[1] Wu, Kedi, Zhixiong Zhao, Renxiao Wang, and Guo-Wei Wei. "TopP-S: Persistent homology-based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility." Journal of Computational Chemistry 39, no. 20 (2018): 1444-1454.
[2] Wu, Kedi, and Guo-Wei Wei. "Quantitative toxicity prediction using topology based multitask deep neural networks." Journal of Chemical Information and Modeling 58, no. 2 (2018): 520-531.
[3] Chen, Dong, Kaifu Gao, Duc Nguyen, Xin Chen, Yi Jiang, Guowei Wei, and Feng Pan. "Algebraic Graph-assisted Bidirectional Transformers for Molecular Prediction." (2021).
[4] Jiang, Jian, Rui Wang, and Guo-Wei Wei. "GGL-Tox: Geometric Graph Learning for Toxicity Prediction." Journal of Chemical Information and Modeling (2021).
[5] Dou, Bozheng, Zailiang Zhu, Yucang Cao, Jian Jiang, Yueying Zhu, Dong Chen, Hongsong Feng, Jie Liu, Bengong Zhang, Tianshou Zhou, and Guo-Wei Wei. "TIDAL: Topology-Inferred Drug Addiction Learning." Submitted to Nature Communications (2022).