RDKit | Exploring Chemical Space with PCA

Author: Wang Jianmin

Affiliation: Hunan University

Research interests: drug design, biomedical big data


We explore chemical space using Principal Component Analysis (PCA) and cluster analysis.

The goal is to analyze a compound library and find the descriptors (physicochemical properties) that the compounds have in common.


Import libraries

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, Crippen
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans

Load the molecular data

The library contains more than 10,000,000 SMILES. The .smi file can be read as a plain text file, keeping the first 10,000 molecules in a pandas DataFrame.

# Loading the molecules
data = pd.DataFrame()
with open('mol_parent.smi', 'r') as file:
    for index, line in enumerate(file):
        if 0 < index <= 10000:
            data.loc[index, 'SMILES'] = line.split()[0]
            data.loc[index, 'Id'] = line.split()[1]

data.head(10)

Calculate molecular descriptors

We use RDKit to calculate a set of 2D molecular descriptors for each compound.

# Calculate the descriptors and add them to the dataframe
for i in data.index:
    mol = Chem.MolFromSmiles(data.loc[i, 'SMILES'])
    data.loc[i, 'MolWt'] = Descriptors.ExactMolWt(mol)
    data.loc[i, 'TPSA'] = Chem.rdMolDescriptors.CalcTPSA(mol)  # Topological polar surface area
    data.loc[i, 'nRotB'] = Descriptors.NumRotatableBonds(mol)  # Number of rotatable bonds
    data.loc[i, 'HBD'] = Descriptors.NumHDonors(mol)  # Number of H-bond donors
    data.loc[i, 'HBA'] = Descriptors.NumHAcceptors(mol)  # Number of H-bond acceptors
    data.loc[i, 'LogP'] = Descriptors.MolLogP(mol)  # LogP

data.head(10)
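A note on performance: writing one cell at a time with .loc is convenient but slow for thousands of molecules. A minimal sketch of a faster drop-in replacement for the loop above (not from the original post) builds all rows in memory first and attaches them in one step:

# Sketch: compute all descriptor rows first, then attach them at once.
# Use instead of (not in addition to) the .loc loop above.
rows = []
for smi in data['SMILES']:
    mol = Chem.MolFromSmiles(smi)
    rows.append({'MolWt': Descriptors.ExactMolWt(mol),
                 'TPSA': Chem.rdMolDescriptors.CalcTPSA(mol),
                 'nRotB': Descriptors.NumRotatableBonds(mol),
                 'HBD': Descriptors.NumHDonors(mol),
                 'HBA': Descriptors.NumHAcceptors(mol),
                 'LogP': Descriptors.MolLogP(mol)})
data = data.join(pd.DataFrame(rows, index=data.index))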

PCA of the molecular descriptors

Select the values that will be used in the PCA calculation, i.e. the computed descriptors.

descriptors = data.loc[:, ['MolWt', 'TPSA', 'nRotB', 'HBD', 'HBA', 'LogP']].values

A very important step is scaling the descriptors: PCA is driven by variance, so descriptors on large scales such as MolWt would otherwise dominate the components.

descriptors_std = StandardScaler().fit_transform(descriptors)
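As a quick sanity check (a sketch added here, not part of the original post), each standardized column should now have mean close to 0 and standard deviation close to 1:

# Sketch: verify the standardization worked column by column.
print(np.round(descriptors_std.mean(axis=0), 3))  # ~[0. 0. 0. 0. 0. 0.]
print(np.round(descriptors_std.std(axis=0), 3))   # ~[1. 1. 1. 1. 1. 1.]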

Run the PCA

pca = PCA()
descriptors_2d = pca.fit_transform(descriptors_std)

descriptors_pca = pd.DataFrame(descriptors_2d)
descriptors_pca.index = data.index
descriptors_pca.columns = ['PC{}'.format(i+1) for i in descriptors_pca.columns]
descriptors_pca.head(10)
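PCA() with no arguments keeps all six components, one per descriptor. If only the first two are wanted up front, n_components can be fixed at construction; a minimal sketch:

# Sketch: restrict the decomposition to PC1 and PC2 only.
pca_2d = PCA(n_components=2)
descriptors_2d_only = pca_2d.fit_transform(descriptors_std)
print(descriptors_2d_only.shape)  # (10000, 2)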

Check the explained variance to see how much of the total variance each principal component accounts for. Here the first two components together explain roughly 71% of the variance.

print(pca.explained_variance_ratio_)
print(sum(pca.explained_variance_ratio_))

[0.40340106 0.31163646 0.15708923 0.07870552 0.03381616 0.01535157]

1.0

plt.rcParams['axes.linewidth'] = 1.5
fig, ax = plt.subplots(figsize=(8, 6))
var = np.cumsum(np.round(pca.explained_variance_ratio_, decimals=3) * 100)
plt.plot([i + 1 for i in range(len(var))], var, 'k-', linewidth=2)
plt.xticks([i + 1 for i in range(len(var))])
plt.ylabel('% Variance Explained', fontsize=16, fontweight='bold')
plt.xlabel('Principal Component (PC)', fontsize=16, fontweight='bold')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.tick_params('both', width=2, labelsize=12)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
ax.plot(descriptors_pca['PC1'], descriptors_pca['PC2'], 'o', color='k')
ax.set_title('Principal Component Analysis', fontsize=16, fontweight='bold', family='sans-serif')
ax.set_xlabel('PC1', fontsize=14, fontweight='bold')
ax.set_ylabel('PC2', fontsize=14, fontweight='bold')
plt.tick_params('both', width=2, labelsize=12)
plt.tight_layout()
plt.show()

K-means clustering and main feature identification

We rescale the PC values toward the -1 to 1 range, because we want to analyze the data within the covariance circle of the features (descriptors).

# This normalization is performed just for PC1 and PC2, but could be done for all the components.
scale1 = 1.0 / (max(descriptors_pca['PC1']) - min(descriptors_pca['PC1']))
scale2 = 1.0 / (max(descriptors_pca['PC2']) - min(descriptors_pca['PC2']))

# Add the new values as PCA columns
descriptors_pca['PC1_normalized'] = [i * scale1 for i in descriptors_pca['PC1']]
descriptors_pca['PC2_normalized'] = [i * scale2 for i in descriptors_pca['PC2']]

descriptors_pca.head(10)
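Note that dividing by the range makes each rescaled component span an interval of exactly width 1 (roughly -0.5 to 0.5 for data centered near zero) rather than strictly -1 to 1, which is still well inside the unit circle drawn later. A quick check (a sketch, not from the original post):

# Sketch: confirm each normalized component spans a range of 1.0.
for col in ['PC1_normalized', 'PC2_normalized']:
    print(col, descriptors_pca[col].max() - descriptors_pca[col].min())  # ~1.0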

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111)
ax.plot(descriptors_pca['PC1_normalized'], descriptors_pca['PC2_normalized'], 'o', color='k')
ax.set_title('Principal Component Analysis', fontsize=16, fontweight='bold', family='sans-serif')
ax.set_xlabel('PC1', fontsize=14, fontweight='bold')
ax.set_ylabel('PC2', fontsize=14, fontweight='bold')
plt.tick_params('both', width=2, labelsize=12)
plt.tight_layout()
plt.show()

K-means clustering

K-means is a clustering algorithm in which the number of clusters must be specified in advance. However, different methods can be applied to choose a mathematically justified number of clusters for a set of points based on their distribution; here we use silhouette analysis.

range_n_clusters = [2, 3, 4, 5, 6, 7, 8, 9, 10]

for n_clusters in range_n_clusters:
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(8, 4)
    kmeans = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = kmeans.fit_predict(descriptors_pca[['PC1_normalized', 'PC2_normalized']])
    silhouette_avg = silhouette_score(descriptors_pca[['PC1_normalized', 'PC2_normalized']], cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    sample_silhouette_values = silhouette_samples(descriptors_pca[['PC1_normalized', 'PC2_normalized']], cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for the next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # The vertical line for the average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    # 2nd plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(descriptors_pca['PC1_normalized'], descriptors_pca['PC2_normalized'],
                marker='.', s=30, lw=0, alpha=0.7, c=colors, edgecolor='k')

    # Labeling the clusters
    centers = kmeans.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("PC1")
    ax2.set_ylabel("PC2")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')
    plt.show()
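An alternative way to pick the number of clusters (not used in the original post) is the elbow method, which plots the K-means inertia (within-cluster sum of squared distances) against k and looks for the bend in the curve; a minimal sketch reusing the variables above:

# Sketch: elbow method as a cross-check on the silhouette analysis.
inertias = []
for k in range_n_clusters:
    km = KMeans(n_clusters=k, random_state=10).fit(
        descriptors_pca[['PC1_normalized', 'PC2_normalized']])
    inertias.append(km.inertia_)  # within-cluster sum of squares
plt.plot(range_n_clusters, inertias, 'ko-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()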

The higher the silhouette score, the better the cluster separation.

For n_clusters = 2 the average silhouette_score is: 0.4205650178373521

kmeans = KMeans(n_clusters=2, random_state=10)  # We define the best number of clusters
clusters = kmeans.fit(descriptors_pca[['PC1_normalized', 'PC2_normalized']])  # PC1 vs PC2 (normalized values)

descriptors_pca['Cluster_PC1_PC2'] = pd.Series(clusters.labels_, index=data.index)
descriptors_pca.head(10)
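To see how the 10,000 molecules split between the two clusters, the label counts can be inspected (a quick sketch, not from the original post):

# Sketch: size of each cluster.
print(descriptors_pca['Cluster_PC1_PC2'].value_counts())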

Now plot PC1 vs. PC2: each cluster gets its own color, and we identify the main features of each principal component.

plt.rcParams['axes.linewidth'] = 1.5
fig, ax = plt.subplots(figsize=(7, 7))

color_code = {0.0: 'magenta',
              1.0: 'orange',
              2.0: 'cyan',
              3.0: 'c',
              4.0: 'm',
              5.0: 'y',
              6.0: 'darkorange',
              7.0: 'k',
              }

for i in descriptors_pca.index:
    ax.plot(descriptors_pca.loc[i].at['PC1_normalized'], descriptors_pca.loc[i].at['PC2_normalized'],
            c=color_code[descriptors_pca.loc[i].at['Cluster_PC1_PC2']],
            marker='o', markersize=8, markeredgecolor='k', alpha=0.3)

plt.xlabel('PC1', fontsize=14, fontweight='bold')
ax.xaxis.set_label_coords(0.98, 0.45)
plt.ylabel('PC2', fontsize=14, fontweight='bold')
ax.yaxis.set_label_coords(0.45, 0.98)
plt.tick_params('both', width=2, labelsize=12)
ax.spines['left'].set_position(('data', 0))
ax.spines['bottom'].set_position(('data', 0))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

lab = ['MolWt', 'TPSA', 'nRotB', 'HBD', 'HBA', 'LogP']  # Feature labels
l = np.transpose(pca.components_[0:2, :])  # The component eigenvectors (main features) for PC1 and PC2
n = l.shape[0]
for i in range(n):
    plt.arrow(0, 0, l[i, 0], l[i, 1], color='k', alpha=0.6, linewidth=1.2, head_width=0.025)
    plt.text(l[i, 0] * 1.25, l[i, 1] * 1.25, lab[i], color='k', va='center', ha='center', fontsize=11)

circle = plt.Circle((0, 0), 1, color='gray', fill=False, clip_on=True, linewidth=1.5, linestyle='--')
ax.add_artist(circle)
plt.xlim(-1.2, 1.2)
plt.ylim(-1.2, 1.2)
plt.tight_layout()
plt.savefig("fig1.png", dpi=300)
plt.show()

We can now identify the features that correlate positively and negatively with PC1 (MolWt, nRotB, LogP) and with PC2 (HBD, TPSA, HBA). In addition, judging by the vector lengths, LogP is the "most important" feature (descriptor).
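The same information can be read numerically from the loadings; a sketch (not in the original post) that tabulates the PC1/PC2 eigenvector components per descriptor:

# Sketch: sign indicates direction of correlation, magnitude indicates importance.
loadings = pd.DataFrame(np.transpose(pca.components_[0:2, :]),
                        columns=['PC1', 'PC2'],
                        index=['MolWt', 'TPSA', 'nRotB', 'HBD', 'HBA', 'LogP'])
print(loadings.round(3))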

Merge and save the data

data = data.join(descriptors_pca)
data.head(10)
data.to_csv('moldscPCA.csv', index=None)

References

http://www.rdkit.org

http://www.rdkit.org/docs/index.html

