Python from Scratch, Chapter 5, Bioinformatics 6: GEO Database Hands-on Analysis (1)
GEO, the Gene Expression Omnibus, is a gene expression database created and maintained by NCBI, the National Center for Biotechnology Information. It was launched in 2000 and holds high-throughput gene expression data submitted by research institutions around the world, so for most published papers the expression data behind the paper can be found there. It is a natural database for bioinformatics beginners to mine: papers based on it are estimated to number in the thousands per year, and it hosts a wealth of sequencing and expression files covering oncology, non-oncology and almost everything else, all free to download. There is already plenty of material about GEO online, so I won't go into detail here. If you are interested, take a look at the skill tree site linked below.
Bioinformatics Skill Tree: http://www.biotrainee.com/
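# -*- coding: utf-8 -*-
"""
Created on Fri Dec 14 00:47:52 2018
@author: czh
"""
# IPython magics (Spyder/IPython only): clear the console and reset the namespace
%clear
%reset -f
# In[*]
# Load Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Raw string, so that "\t" in the Windows path is not read as a tab character
os.chdir(r'D:\train')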
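# In[*]
# Series matrix files are tab-delimited; pandas infers gzip compression from the .gz suffix
data = pd.read_csv('GSE18388_series_matrix.txt.gz', delimiter='\t', skiprows=31)
# In[*]
# Drop the leading metadata rows and rename the first column
data = data.drop(data.index[0:34])
data.rename(columns={'!Sample_title': 'gene_id'}, inplace=True)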
The leading rows read from the file are introductory metadata rather than expression values, so they are removed (the slice data.index[0:34] drops the first 34 rows). The first column is also renamed, from !Sample_title to gene_id.
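Hard-coded row counts such as 31 or 34 only fit one particular file. As a hedged alternative, not the author's method, the sketch below relies on the fact that metadata lines in a series matrix file start with "!" and lets pandas skip them through the `comment` parameter. The function name `read_series_matrix` and the tiny in-memory file with its values are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for a series matrix file (made-up values, tab-separated).
sample_text = (
    '!Series_title\t"example series"\n'
    '!Sample_title\t"ctrl 1"\t"treated 1"\n'
    '!series_matrix_table_begin\n'
    '"ID_REF"\t"GSM0001"\t"GSM0002"\n'
    '"probe_a"\t7.1\t7.4\n'
    '"probe_b"\t5.0\t9.2\n'
    '!series_matrix_table_end\n'
)

def read_series_matrix(source):
    """Hypothetical helper: read only the expression table."""
    # Lines that start with "!" are metadata; comment="!" makes pandas skip them,
    # so the "ID_REF" line becomes the header row.
    table = pd.read_csv(source, sep="\t", comment="!")
    return table.rename(columns={"ID_REF": "gene_id"})

expr = read_series_matrix(io.StringIO(sample_text))
print(expr)
print(expr.dtypes)
```

For a real download, the same function could be given the path to the .txt.gz file instead of the `StringIO` object. One caveat of this approach is that any "!" inside a data line would also cut that line short, which is unlikely for a purely numeric table.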
# In[*]
# Count missing values per column, drop incomplete rows, then inspect the column types
data.isna().sum()
data = data.dropna(axis=0)
data.dtypes
At this point the DataFrame still holds many values that were read in as strings, so the columns have dtype object. These columns need to be converted to a numeric type.
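# In[*]
# Which columns are still of dtype object?
data.dtypes.eq(object)
# In[*]
cols = data.columns[data.dtypes.eq(object)]
# In[*]
# Convert them to numbers; values that cannot be parsed become NaN
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=0)
data.dtypes.eq(object)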
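To make the effect of `errors='coerce'` concrete, here is a small sketch on a made-up table (the probe names, sample IDs and values are invented). It also leaves the ID column out of the conversion, which is an assumption of this sketch and not something the original code does: if the IDs were not purely numeric, coercing them would turn every ID into NaN.

```python
import pandas as pd

# Made-up table in which every value was read in as a string.
toy = pd.DataFrame({
    "gene_id": ["probe_a", "probe_b", "probe_c"],
    "GSM0001": ["7.1", "5.0", "null"],
    "GSM0002": ["7.4", "9.2", "6.3"],
})

# Convert only the sample columns and keep the ID column as text.
value_cols = [c for c in toy.columns if c != "gene_id"]
toy[value_cols] = toy[value_cols].apply(pd.to_numeric, errors="coerce")

# Unparseable entries such as "null" are now NaN and can be counted.
n_bad = toy[value_cols].isna().sum()
print(toy.dtypes)
print(n_bad)
```

Counting the NaN values after the conversion is a cheap way to see how much data the coercion discarded before calling `dropna`.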
From the above we can see that the samples do not differ from one another overall, so a differential expression analysis can be carried out.
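The text does not show how the samples were compared. One common check, offered here only as a sketch, is to compare the quartiles of each sample column: if the medians and spreads are close, the samples are broadly comparable. The matrix below is randomly generated and the sample names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up expression matrix: 200 probes x 4 hypothetical samples.
expr = pd.DataFrame(
    rng.normal(loc=8.0, scale=1.5, size=(200, 4)),
    columns=["GSM0001", "GSM0002", "GSM0003", "GSM0004"],
)

# Quartiles per sample; similar values across the columns suggest
# that the overall distributions are comparable.
summary = expr.quantile([0.25, 0.5, 0.75])

# Gap between the largest and smallest sample median.
median_spread = summary.loc[0.5].max() - summary.loc[0.5].min()

print(summary.round(2))
print("spread of sample medians:", round(median_spread, 3))
```

The same comparison is often drawn as one box per sample, for example with `sns.boxplot(data=expr)`.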
# Load the saved analysis results (tab-delimited text)
GEO2R = pd.read_table("geo_result.txt", sep="\t")
This DataFrame contains the results of a differential expression analysis that has already been run, including the gene name and probe ID as well as the more interesting values: the fold changes and p-values.
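A typical next step with such a table is to keep only the probes with a large fold change and a small p-value. The sketch below assumes column names of the kind found in GEO2R exports (`ID`, `Gene.symbol`, `logFC`, `adj.P.Val`); check them against the actual file. The rows, the helper `select_significant` and the thresholds are illustrative choices, not values from the original.

```python
import pandas as pd

# Made-up results table; the column names are assumed, not taken from the file.
results = pd.DataFrame({
    "ID": ["p1", "p2", "p3", "p4"],
    "Gene.symbol": ["GeneA", "GeneB", "GeneC", "GeneD"],
    "logFC": [2.3, -0.2, -1.8, 1.4],
    "adj.P.Val": [0.001, 0.60, 0.03, 0.20],
})

def select_significant(table, min_abs_logfc=1.0, max_adj_p=0.05):
    """Hypothetical helper: keep rows passing both thresholds."""
    # abs() keeps both up-regulated and down-regulated probes.
    mask = (table["logFC"].abs() >= min_abs_logfc) & (table["adj.P.Val"] < max_adj_p)
    return table.loc[mask].sort_values("adj.P.Val")

hits = select_significant(results)
print(hits)
```

Here only p1 and p3 pass both filters: p2 has a small fold change and p4 has an adjusted p-value above the cutoff.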