Python from Scratch, Chapter 5: Bioinformatics (6) GEO Database Hands-on Analysis (1)



The GEO database, short for Gene Expression Omnibus, is a gene expression database created and maintained by NCBI, the National Center for Biotechnology Information. It was launched in 2000 and holds high-throughput gene expression data submitted by research institutions around the world, so for most published papers the gene expression data they rely on can be found here. GEO is usually the first database that newcomers to bioinformatics learn to mine: the number of papers published from it is estimated to be in the thousands per year, and it hosts a wealth of sequencing data covering oncology, non-oncology and almost everything else, all of which can be mined for free. There is already a lot of material about this database online, so I won't go into it. If you are interested, take a look at the Bioinformatics Skill Tree site.

Bioinformatics Skill Tree: http://www.biotrainee.com/

This post and the next few walk through a common analysis workflow for GEO data.

  • Import the required Python packages and change the default working directory.
# -*- coding: utf-8 -*-
"""
Created on Fri Dec 14 00:47:52 2018

@author: czh
"""
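# IPython/Spyder magics: clear the console and reset the namespace
%clear
%reset -f
# In[*]
# Load the required Python packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
# Raw string, so that '\t' in the path is not read as a tab character
os.chdir(r'D:\train')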
  • Trimming data
# In[*]

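# Read the tab-separated series matrix, skipping the first 31 lines of the file
data = pd.read_csv('GSE18388_series_matrix.txt.gz',
                   delimiter='\t', skiprows=31)

# In[*]
# Drop the remaining leading rows of sample information
data = data.drop(data.index[0:34])
# Rename the first column to something meaningful
data.rename(columns={'!Sample_title': 'gene_id'}, inplace=True)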

The leading rows of introductory sample information are not needed, so they are removed (`data.index[0:34]` drops the first 34 rows that remain after `skiprows`). The first column is also renamed to `gene_id`.
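Hardcoding `skiprows=31` and a fixed number of rows to drop ties the code to this one file. As a hedged alternative: GEO series matrix files mark the expression table with the lines `!series_matrix_table_begin` and `!series_matrix_table_end`, so the table can be located from those markers instead. The sketch below runs on a tiny made-up in-memory file; the helper name `read_series_matrix_table`, the sample names, IDs and values are all invented for illustration, and the first column comes out as `ID_REF` rather than the `!Sample_title` column seen above.

```python
import io
import pandas as pd

# Tiny made-up stand-in for a GEO series matrix file (all values invented).
toy_matrix = "\n".join([
    '!Series_title\t"toy series"',
    '!Sample_title\t"ctrl_1"\t"treat_1"',
    '!series_matrix_table_begin',
    '"ID_REF"\t"GSM1"\t"GSM2"',
    '10338001\t7.1\t7.3',
    '10338002\t5.0\t4.8',
    '!series_matrix_table_end',
])

def read_series_matrix_table(handle):
    """Hypothetical helper: return only the table between the two markers."""
    lines = handle.read().splitlines()
    start = lines.index('!series_matrix_table_begin') + 1
    end = lines.index('!series_matrix_table_end')
    table_text = "\n".join(lines[start:end])
    return pd.read_csv(io.StringIO(table_text), sep='\t')

table = read_series_matrix_table(io.StringIO(toy_matrix))
table = table.rename(columns={'ID_REF': 'gene_id'})
print(table)
```

For the real compressed file, a text handle from `gzip.open(path, 'rt')` could be passed to the same helper.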

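  • Remove genes that contain missing values
# In[*]
# Count the missing values in each column
data.isna().sum()
# Drop every row (gene) that has at least one missing value
data = data.dropna(axis=0)
# Inspect the column types
data.dtypes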
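To make the effect of these two calls concrete, here is a minimal sketch on a made-up three-gene table (the gene IDs, sample names and values are invented): `isna().sum()` counts the missing entries per column, and `dropna(axis=0)` removes every row that has at least one.

```python
import numpy as np
import pandas as pd

# Made-up expression table with one missing value in GSM1.
toy = pd.DataFrame({
    'gene_id': ['g1', 'g2', 'g3'],
    'GSM1': [7.1, np.nan, 6.2],
    'GSM2': [7.3, 5.1, 6.0],
})

missing_per_column = toy.isna().sum()   # number of NaN values in each column
complete = toy.dropna(axis=0)           # keep only rows without any NaN

print(missing_per_column)
print(complete)
```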
  • The later differential analysis and plotting require all expression data to be numeric.

When the file is read, many columns contain strings, so their dtype is `object`; these columns need to be converted to a numeric dtype.

# In[*]

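# Which columns are still stored as object (string) dtype?
data.dtypes.eq(object)
# In[*]
# Collect the names of those columns
cols = data.columns[data.dtypes.eq(object)]
# In[*]
# Convert them to numbers; values that cannot be parsed become NaN.
# Note: if gene_id is an object column it is converted too, so
# non-numeric IDs would turn into NaN.
data[cols] = data[cols].apply(pd.to_numeric,
                              errors='coerce', axis=0)
# Confirm that no object columns remain
data.dtypes.eq(object)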
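The conversion step can be sketched on a small made-up table. As an assumption that differs from the code above, this variant leaves the `gene_id` column out of the conversion so that text IDs are preserved; the IDs, sample names and the unparseable `'null'` entry are invented for illustration.

```python
import pandas as pd

# Made-up table in which the expression values were read as strings.
toy = pd.DataFrame({
    'gene_id': ['g1', 'g2', 'g3'],
    'GSM1': ['7.1', '5.0', 'null'],
    'GSM2': ['7.3', '4.8', '6.0'],
})

# Convert every column except the identifier column.
value_cols = toy.columns.drop('gene_id')
toy[value_cols] = toy[value_cols].apply(pd.to_numeric, errors='coerce')

print(toy.dtypes)
print(toy)
```

With `errors='coerce'`, the entry that cannot be parsed becomes `NaN` instead of raising an error, so a second `dropna` may be needed afterwards.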
  • Check whether the samples differ as a whole

Comparing the overall distribution of expression values across the samples shows that no sample differs as a whole, so the differential analysis can go ahead.
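One common way to make this kind of check is a box plot with one box per sample. The following is only a sketch under assumptions, not the plot from the original analysis: the expression values are randomly generated stand-ins, the sample names are invented, and the `Agg` backend is chosen so that the code also runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in for a numeric expression table (200 genes, 4 samples).
rng = np.random.default_rng(0)
expr = pd.DataFrame(
    rng.normal(loc=7.0, scale=1.5, size=(200, 4)),
    columns=['GSM1', 'GSM2', 'GSM3', 'GSM4'],
)

# Per-sample medians: similar values suggest comparable distributions.
sample_medians = expr.median()
print(sample_medians)

# One box per sample (a 2-D array is drawn column by column).
fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(expr.to_numpy())
ax.set_xticks(range(1, expr.shape[1] + 1))
ax.set_xticklabels(expr.columns, rotation=90)
ax.set_ylabel('expression value')
fig.tight_layout()
```

If the boxes sit at roughly the same height and have a similar spread, the samples are comparable; a box that is clearly shifted would point to a sample that needs normalization or a closer look.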

GEO ships with its own differential analysis tool, GEO2R. It is relatively simple, has many tutorials and requires no programming, but it is not particularly accurate. If you only need a quick bioinformatics sanity check it is good enough, so here we read the GEO2R results directly:

GEO2R = pd.read_table("geo_result.txt", sep="\t")

This DataFrame holds the finished differential analysis results, including the gene name and probe ID as well as the more interesting fold change and p-value.
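A typical next step is to keep only the genes that pass a fold-change and p-value cutoff. The sketch below works on a made-up result table; the column names `ID`, `adj.P.Val`, `logFC` and `Gene.symbol` are assumed to match what the exported GEO2R file contains and should be checked against the real file, and the cutoffs (|logFC| > 1, adjusted p < 0.05) are common conventions rather than values taken from this analysis.

```python
import pandas as pd

# Made-up stand-in for a GEO2R result table (column names are assumptions).
toy_result = pd.DataFrame({
    'ID': ['p1', 'p2', 'p3', 'p4'],
    'adj.P.Val': [0.001, 0.20, 0.03, 0.04],
    'logFC': [2.1, 1.5, -1.8, 0.3],
    'Gene.symbol': ['GeneA', 'GeneB', 'GeneC', 'GeneD'],
})

LOGFC_CUTOFF = 1.0   # assumed cutoff on the absolute log fold change
PADJ_CUTOFF = 0.05   # assumed cutoff on the adjusted p-value

# Keep rows that pass both cutoffs.
significant = toy_result[
    (toy_result['logFC'].abs() > LOGFC_CUTOFF)
    & (toy_result['adj.P.Val'] < PADJ_CUTOFF)
]

# Split into up- and down-regulated genes by the sign of logFC.
up = significant[significant['logFC'] > 0]
down = significant[significant['logFC'] < 0]

print(significant)
```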

