Academician Chen Runsheng, Chinese Academy of Sciences: Challenges in Precision Medicine Data Processing


Chen Runsheng: Dear experts and friends, it is a great honor to be invited to this conference. Today I will mainly talk about big data and precision medicine, and I welcome your criticism and corrections.

◆ ◆ ◆

What is Precision Medicine

I'll start with what precision medicine is. Its core can be put in one phrase: the application of omics big data in medicine, especially clinical medicine. As you know, since work on deciphering the genetic code began in the 1990s, a huge amount of molecular-level data, represented by the genetic code, or genome, has been generated continuously. We call this omics data, and it is now growing very rapidly, faster than any other kind of data we know of. Thanks to advances in research technology, measuring a person's genetic code has become very simple. Any of us can spend very little money, about six or seven thousand RMB at the moment as far as I know, and in three or four days obtain our own genetic code and find out which parts of it are highly correlated with certain diseases.

In recent years, molecular-level information has been used in medicine, especially clinical medicine, to improve the efficiency of diagnosis and treatment, and this trend is what gave rise to precision medicine. So from the big data side, the core of precision medicine is applying omics big data to medicine. As you know, omics datasets are very large, so medical experts and molecular biologists cannot make sense of them directly. Big data scientists must mine them with specific theories, methods, and techniques before any clinically relevant knowledge can be obtained. In short, precision medicine is the application of omics big data in the clinic.

As you know, the most basic thing each of us has is our genetic code. Measuring this genetic code is no longer a problem, but mining the data to find its correlations with disease is now a very urgent one. Since the study of the human genetic code began, terms such as translational medicine and individualized medicine have emerged, and in 2011 the term "precision medicine" appeared internationally, which is really a general summary of this trend.

◆ ◆ ◆

What essential changes have been brought about by precision medicine?

Precision medicine has received the attention of many national leaders because it has the potential to bring about essential changes. The most significant one can be summarized in one sentence: precision medicine could lead to a fundamental change in the concept of health care. What is that change? The health care system shifts from being primarily about diagnosis and treatment to being primarily about health assurance. Medicine today is all about the patient, diagnosis, and treatment; that is, the medical system is conceived as consisting of the patient, the hospital, and the doctor. With the development of precision medicine, we can analyze big data to understand a person's health status before they fall ill and predict how their health will develop. In that case, the target of health care is no longer the patient but the whole population, all people.

At that point, the purpose of the health care system is no longer treatment but health prediction, health assessment, and health intervention. The whole system undergoes a conceptual change, from today's focus on medical care to a future focus on prediction and assurance. Such a fundamental change will inevitably drive the development of corresponding industries. Some estimates are that the industries built around the new concept could reach perhaps $200 billion or more by 2018, a figure large enough to affect GDP to some extent. Precision medicine has therefore become a strategic high ground leading international development. Whether in medical concepts or in industry, it will bring essential changes, which is why it has attracted the attention of leaders of various countries.

The United States is also promoting precision medicine, which I won't go into in detail. Its most important initiative is to measure the genetic code of a million people, and a million is a large number. The EU is also conducting precision medicine research and plans to measure the genetic code of 100,000 tumor and rare disease patients. Japan has a corresponding plan as well. So where exactly will precision medicine create new growth among new industries? I think in at least four areas.

  • Precision medicine can drive the development of massive biospecimen repositories and databases. It can lead to the measurement of biological samples from 100,000 to millions of people, which involves collecting, preserving, preparing, extracting, and supplying biological samples on a massive scale. Once these samples are measured, the million-scale data need corresponding databases to hold them; without a million-scale database, the field cannot develop. Some people estimate that this business of samples and data can reach ten billion dollars in the next year or two.
  • It can drive the genome sequencing industry, which some estimate could reach $11.7 billion in 2018. From my own discussions with sequencing experts, and since sequencing is now so cheap, I think the real figure must be higher than that.
  • It yields a large number of targets for new drug design. This third industry is directly involved in medical diagnostics and drug design.
  • It creates a large health industry circle of health facilities and health practitioners, which is estimated to reach $200 billion in 2018.

These are the substantial new industries that can be foreseen as a result of precision medicine. The goals of our precision medicine are consistent with the international ones above.

◆ ◆ ◆

What are the conditions that must be in place to achieve precision medicine?

I think at least two conditions must be in place before precision medicine can be carried out.

The first is to collect a large amount of omics data and to mine these data deeply with big data technology. So the first foundation is the intersection and integration of two current international frontiers, omics and big data. This gives us a large number of molecular-level variants that are relevant to disease. The second is a further piece of basic research: building associations, or bridges, between molecular-level information and macroscopic disease. That means developing bioinformatics, biological networks, systems biology, and a host of related fields. With these two bridges in place and with information at the molecular level, we are well on our way to precision medicine.

One point needs to be made: precision medicine is complementary to traditional medicine, imaging, biochemistry, and doctors' experience, and they promote and reinforce one another. This is unlike some of the overselling of precision medicine that I have come across, which says that sequencing will solve everything.

◆ ◆ ◆

Precision medicine is just getting started

Although it brings a good conceptual change and shows us a bright future for the health care system, there are huge obstacles on the road to precision medicine, in both omics measurement and big data analysis. So I think precision medicine is only just getting started, and we still have a great deal to do.

Where exactly are the opportunities for innovation? What are the challenges? There are many. Today I will briefly mention just one or two difficulties in omics and in big data processing, and you will see that the road to precision medicine is still quite long.

  • The first thing I would like to talk about is the great challenges and difficulties in omics measurement.

Precision medicine today is based on the genetic code, so let us first ask: how much do we currently know about our own genetic code? If we understood all of it, precision would have a molecular basis. If we know very little, we have a great deal left to do. In fact the latter is the case. This is a piece of the human genetic code. Everyone here has it, and so do I; anyone who had it removed would not survive. Each person's genetic code has 3 × 10^9 characters. Bound into a book, it would be about forty stories high, and I believe no one could read it. With all of human intelligence combined, we can currently read only 3% of it, and that is the challenge. Let me state it again: anyone can have their genetic code measured for about 7,000 RMB, but only about 3% of it can be read. That 3% is the part you know from high school, the part that encodes proteins and obeys the central dogma, which we call the coding sequence. The other 97% does not encode proteins and is the part we cannot read so far. Since we do not understand what it does, we naturally cannot tell what it means when it changes. Under these circumstances, there are of course enormous difficulties and obstacles in using it for omics research.
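As a rough illustration of the proportions quoted above, the short Python sketch below converts the 3% and 97% shares into character counts. It uses only the two figures from the talk (3 × 10^9 characters, about 3% readable); the variable names are chosen here for illustration.

```python
# Figures quoted in the talk; the variable names are illustrative only.
GENOME_LENGTH = 3 * 10**9   # characters in one person's genetic code
READABLE_PERCENT = 3        # protein-coding share that can currently be read

# Integer arithmetic keeps the counts exact.
readable = GENOME_LENGTH * READABLE_PERCENT // 100
unreadable = GENOME_LENGTH - readable

print(f"readable:   {readable:,} characters")
print(f"unreadable: {unreadable:,} characters")
```

On these figures, roughly ninety million characters fall in the readable part and about 2.9 billion in the part that cannot yet be interpreted.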

Let me cite the December 17, 2010 issue of Science magazine, which selected two sets of top ten scientific breakthroughs. One was the top ten scientific breakthroughs of 2010 in the natural sciences. The other covered the first decade of the new century, 2001 to 2010, taken together: in the decade closest to us, which ten items in the natural sciences most deserve our attention? The first was the subject I was just talking about, dark matter in the genome. I myself prefer to call it dark information, because it is not that the matter has not been measured, only that it has not been read. This means that more than 90% of the human genetic code is still unreadable, so we cannot be precise. This is the most basic and important challenge in omics: we still do not understand about 97% of the genetic code.

Let me expand on this a little. First the genetic code itself, that is, genomic research. We know that 97% of the human genetic code is still unreadable, so of course we cannot be precise. Now let us compare organisms from lower to higher. Take E. coli and represent its genetic code as a disc: 85% of it is red, that is, the part that encodes proteins and whose rules we know, and it makes up the vast majority. Yeast, a single-celled eukaryote, is a bit higher and has a smaller protein-coding part and more non-coding sequence. The nematode, the simplest multicellular organism, uses only 28% for coding proteins and 71% for non-coding. In Drosophila the coding fraction, the part whose rules we know, is only 17% and the non-coding fraction is over 80%, whereas in humans 97% to 98% is non-coding. So there may be a conventional notion that going from simple to complex organisms means more and more proteins, but that is not true. The increase in function is accompanied by growth in non-coding sequences whose rules we do not yet grasp. In other words, non-coding sequences are associated with higher organisms, and of course must also be associated with disease.

  • Transcriptome study.

This result is 100% certain: laboratories all over the world, without exception, find that non-coding sequences express their information to make functional components. Such work fully proves that this 97% carries out important biological functions. Although the full picture of the 97% is not understood, let me give a few simple examples. One product of the 97% can cause prostate cancer, another can cause leukemia, and another can cause non-small cell lung cancer. What do these three examples show? They show that the 97%, whose rules we do not know, can still cause tumors. The clinicians here will know that when we diagnose and treat tumors in hospitals today, everything we use comes from only the 3% of the information, never from the 97%. There are now ample examples showing that the 97% can also lead to very serious diseases, so how can precision be achieved if it is not included in diagnosis and treatment?

Of course there are also very good things in the 97%. Please remember H19, a very important non-coding component that can drive cells that have already become cancerous to die off through a certain pathway. So how many such components have yet to be discovered? Those of you interested in biology may know that experiments in mice in Japan found about 160,000 functional components from the 97%, as important as proteins, that had not been known before. So there are plenty of opportunities to discover important new functional components and understand how they relate to health, development, and disease. In this field, these two scientists won its first Nobel Prize in 2006. Some people joke that we now understand 3% of the human genetic code, and you can count how many Nobel Prize winners this 3% has produced. I counted no fewer than 50.

Now we have found the huge 97% as well, which suggests that more than a thousand further Nobel Prizes are waiting in this vast field, and only one of them has been claimed so far, which is negligible. So there is a very wide opportunity in front of everyone to produce major scientific results.

So non-coding research as a whole is a huge obstacle in omics. For precision medicine we have only the 3%; we are just starting, and there is still a long way to go. On the other hand, non-coding research will certainly give us great opportunities. Mining the information in this 97% will provide new directions for the diagnosis and treatment of disease and a new platform for drug design and development. It will also provide new opportunities for breeding new varieties and traits in plants and animals. This is the example I wanted to give from omics, and it shows that precision medicine is really only just on its way.

Since this is a conference on big data, I would also like to talk briefly about some of the challenges in data processing. In the interest of time I will only go through the outline, because everyone here is an expert.

  • Large volume of data. A person's genetic code is 3 × 10^9 characters, and these data are now very easy to produce. A commercial sequencer can generate 1T of data in a single run, and such machines are readily available as commodities. I have one in my own lab, and it can produce 1T of genetic code data in a single sequencing run. There are countless people around the world doing the same, so you can imagine how fast the data are growing.
  • Little analysis. This is Watson holding a little box containing his genetic code. This was about ten years after the sequencing of the human genetic code had been carried out. By then sequencing had become cheaper, but it still cost a million dollars and took two months. Another ten years later it takes only six or seven thousand RMB and three days. Unfortunately, holding his little box, he did not know how much of it he could analyze himself.

Then there is the international microbiome genome project. We know that a person does not live alone. If we consider a person's health, we must also consider the microorganisms that live with them, and the genetic code of these microorganisms is now estimated to be a hundred times that of the person. If we study the human being in this broad sense, together with the associated microorganisms, the sequencing per person increases by two orders of magnitude. But the data are not of good quality at the source: they have high noise and therefore a low signal-to-noise ratio, and they also contain many missing values. So the data are growing very fast, the data quality is not high, and there are missing values. This is the first difficulty in data mining, the difficulty of data sources.

  • Small sample size. We always need samples. To study liver cancer, for example, we need liver cancer patients, and it is particularly difficult to collect samples for a specific disease, often for a specific tumor stage. Collecting two or three hundred samples is already very good. But the systems we need to model mathematically often have thousands or even tens of thousands of independent variables. If we can only obtain a few hundred samples, our boundary conditions are not enough to pin down the internal independent variables, and our solution will not converge. This is the second problem. Because samples are hard to collect, in many cases we do not have enough of them to constrain the variation of the independent variables within the system. There are two ways out. One is to increase the sample size. This is why the United States wants to measure the genetic code of a million people, and why China's precision medicine program also plans to measure a million people: when the sample size is much larger than the number of independent variables in the system, you can of course get good convergence. But this is usually something only governments can do. An individual research group like ours cannot, because the cost is huge. The other way is mathematical modeling: decomposing the system into subsystems so that the external boundary conditions match the internal independent variables. This is one of the most prominent kinds of mathematical analysis that omics data demand from big data processing.
  • Low frequency of effective events. Not only are samples hard to come by, but the molecular basis also varies from sample to sample, which raises the demands on the samples even further. This leads to a very important philosophy-of-science question in precision medicine: what is the common variation in a common disease, and what is the specific variation? In the interest of time I cannot discuss it further here.
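The small-sample problem above can be made concrete with a toy example. The sketch below is only an illustration under assumed numbers (2 samples, 3 variables, made-up values), not a method from the talk. It shows that when there are fewer samples than independent variables, two different sets of coefficients explain the data equally well, so the data alone cannot fix the solution.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def predict(X, w):
    """Linear model: one weighted sum per sample."""
    return [sum(x * wi for x, wi in zip(row, w)) for row in X]

# Hypothetical toy data: 2 samples (rows), 3 independent variables (columns).
X = [[1, 2, 3],
     [2, 1, 0]]
y = [14, 4]

# Two different coefficient vectors that both reproduce y exactly.
w_a = [1, 2, 3]
w_b = [2, 0, 4]

n_samples, n_variables = len(X), len(X[0])
free_directions = n_variables - matrix_rank(X)

print(predict(X, w_a) == y, predict(X, w_b) == y)
print("undetermined directions:", free_directions)
```

With more samples than variables (and samples that are not redundant), the rank reaches the number of variables and the free directions vanish, which is the reasoning behind million-person cohorts. Splitting the system into subsystems attacks the same gap from the other side by reducing the number of variables each model must fix.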

Everything above concerns changes in individual genes, but genes do not work independently. They often form networks, so we face the further problem of functional analysis, and the problem of precision medicine becomes a problem of complex networks. We are all mathematicians here, and you know what this means. A biological network is dynamic. It is directed: each component acts on another in a particular direction. Its components are not of a single kind: there are both proteins and nucleic acids. And the interactions are, to a large extent, nonlinear. A network that is dynamic, directed, heterogeneous, and nonlinear is certainly complex.
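The four properties listed above (dynamic, directed, mixed component types, nonlinear) can be sketched in a few lines of Python. Everything in this sketch is an assumption made for illustration: the component names, the edge weights, the labeling of components as protein or nucleic acid, and the choice of a logistic function as the nonlinear response. It is not a model from the talk.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    source: str    # component that acts
    target: str    # component that is acted upon (direction matters)
    weight: float  # positive = activation, negative = repression

# Hypothetical components of two different kinds.
KINDS = {
    "geneA_rna": "nucleic_acid",
    "proteinB": "protein",
    "proteinC": "protein",
}

# Hypothetical directed interactions forming a small feedback loop.
EDGES = [
    Edge("geneA_rna", "proteinB", 1.0),
    Edge("proteinB", "proteinC", 1.0),
    Edge("proteinC", "geneA_rna", -1.0),
]

def saturate(x):
    """Nonlinear response: squashes any input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def step(state, edges):
    """Advance the network by one time step (the dynamic part)."""
    incoming = {name: 0.0 for name in state}
    for e in edges:
        incoming[e.target] += e.weight * state[e.source]
    return {name: saturate(total) for name, total in incoming.items()}

state = {"geneA_rna": 1.0, "proteinB": 0.0, "proteinC": 0.0}
for t in range(3):
    state = step(state, EDGES)
    print(t + 1, {name: round(level, 3) for name, level in state.items()})
```

Reversing the edges or doubling the inputs changes the trajectory in ways a linear, undirected model would not capture, which is why such networks are hard to analyze even at this tiny scale.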

In addition, we use not only omics data but also other data such as imaging, so there is the question of how to handle data from an MRI or CT scan. Finally, there is a question that goes beyond the academic community: how to share data effectively across China. Every hospital has data now, and if we cannot share data across the whole picture, we will be doing small data work in the era of big data and will lose the context and significance of big data.

So there are still very tough problems in data sharing as well. I have been rather brief, just sharing a few concepts of precision medicine with you. I think it is an important direction worthy of your attention, but for all these reasons there are difficulties to overcome, and precision medicine is just getting started. These difficulties are also precisely our opportunities, and seizing them offers the chance to do outstandingly original and important work.

