How to Play with Google TensorFlow? | Cowboy Talk
AI is not a simple discipline, and there is no unified platform or language for developing and debugging AI algorithms that bundles a large number of ready-to-call APIs; current AI development platforms are still in a semi-primitive state, and many features have to be built and implemented by yourself.
Fortunately, the field has received enough attention that many giants have developed their own platforms for it, Google's TensorFlow among them. Google is well known for its achievements in AI, and the development platform it has released deserves serious attention. So how suitable is TensorFlow for development, and could it open up better opportunities for your research or product?
For this open session, we invited Dr. Gabo Li, head of the machine learning lab at the technology company Nielsen. He led his team to build improved algorithms on top of TensorFlow and successfully applied them to the company's targeted ad delivery business. Throughout more than ten years in industry, Dr. Li has insisted on combining academic research with industrial applications, maintaining close cooperation with academia and bringing academic results into software innovation.
He is responsible for leading the research and development of deep-learning-based intelligent products, mainly using the TensorFlow framework to build new deep neural networks, training various user classification models on GPUs, and applying these models to targeted ad delivery. Before joining Nielsen last year, he worked at biotech, pharmaceutical, and financial technology software companies including Accelrys, Schrodinger, and TD Ameritrade.
Dr. Li is currently a visiting professor at the School of Pharmacy, Sun Yat-sen University, where he supervises PhD students' research projects and gives advanced seminars on algorithms and high-performance computing. He has published more than sixty papers in international journals, takes a keen interest in a wide variety of complex scientific computing problems, and has invented a range of excellent algorithms across different disciplines.
▎Why did you choose TensorFlow as your platform in the first place?
At first we were unsure which deep learning platform to choose; TensorFlow was not yet available. The main considerations at the time were the maturity of the platform, the programming languages supported, GPU support and efficiency, how easy it was to build neural networks and to get started, the platform's future development, its development ecosystem, its runtime efficiency, and other factors. Although we collected some reviews, weighing so many factors did not make the choice easy, and trying each platform individually was not practical. Shortly afterwards Google open-sourced TensorFlow, and we picked it up without hesitation.
For one thing, TF had all the features we required (C++/Python support, GPU, and so on). More importantly, we believed that a platform launched by Google would quickly be accepted by everyone, form a development ecosystem, and keep receiving active updates. Subsequent events confirmed our expectations. The table below compares several popular platforms, with data from a paper published on arXiv in February of this year.
At that time Caffe and Theano still had the most active open-source development communities, but by now TensorFlow has the most developers. See the following table (GitHub data as of Sept. 12, 2016).
Overall, TF feels good to me, and I believe in the latecomer advantage of a Google product.
Overall, the API provided by TensorFlow gives enough freedom for building neural networks. Constructing loss functions is also straightforward, since the TF framework can automatically compute the derivatives of arbitrarily constructed loss functions. For training the model parameters it also offers a selection of up-to-date algorithms. The TF user forum is very active as well, and you can get help quickly when you run into a hard question.
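As an illustration, here is a minimal sketch against the TF 1.x-style graph API (the shapes, the regularization coefficients, and the choice of the Adam optimizer are all made up for the example): you write the loss by hand, and TF derives the gradients and runs the optimizer for you.

```python
import tensorflow as tf

# Hypothetical inputs: 10 features, 1 regression target.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])

W = tf.Variable(tf.truncated_normal([10, 1], stddev=0.1))
b = tf.Variable(tf.zeros([1]))
pred = tf.matmul(x, W) + b

# Any hand-built loss works; here squared error plus L2 and L1 penalties.
loss = (tf.reduce_mean(tf.square(pred - y))
        + 1e-3 * tf.nn.l2_loss(W)
        + 1e-4 * tf.reduce_sum(tf.abs(W)))

# TF differentiates the loss automatically and applies the chosen optimizer.
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```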
There are certainly some shortcomings. For example, if you want to construct arbitrarily connected neural networks, TF does not provide a direct tool for building them; there is a workaround via vector transposition, but at the cost of very slow training and scoring.
Another shortcoming is that for real-time applications (where scores must be computed for individual inputs one at a time), it is inefficient. For example, handing a thousand input records to TF as one batch to score is about 100 times faster than computing them one at a time in a thousand separate calls. In other words, real-time, one-at-a-time scoring is two orders of magnitude less efficient than batch scoring.
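A rough sketch of how this shows up in code, continuing the hypothetical x/pred graph from the sketch above: the per-record loop pays the session-call overhead a thousand times.

```python
import time
import numpy as np
import tensorflow as tf

sess = tf.Session()
sess.run(tf.global_variables_initializer())      # assumes the x/pred graph sketched earlier

data = np.random.rand(1000, 10).astype(np.float32)  # 1000 made-up records

t0 = time.time()
sess.run(pred, feed_dict={x: data})               # one batched call
print("batch:", time.time() - t0)

t0 = time.time()
for row in data:                                  # 1000 separate calls
    sess.run(pred, feed_dict={x: row.reshape(1, -1)})
print("one-at-a-time:", time.time() - t0)
```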
The ultra-deep 152-layer residual network proposed by Kaiming He at the end of last year won the 2015 ImageNet competition; the paper was subsequently posted on arXiv ("Deep Residual Learning for Image Recognition", by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Dec. 10, 2015, http://arxiv.org/abs/1512.03385).
A follow-up was posted on arXiv this July ("Identity Mappings in Deep Residual Networks", by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, July 2016, http://arxiv.org/abs/1603.05027v3). Here we will call the former ResNet I and the latter ResNet II. The starting point of residual networks is simple, and that simplicity is exactly what makes them so popular.
The core idea consists of three elements: a shortcut on the information path, a residual unit, and the merging of the residual unit's output back into the information path. The mathematical expression is y_l = h(x_l) + F(x_l, W_l), followed by x_{l+1} = f(y_l),
where F is the operation of the residual unit and h(x_l) = x_l (the identity shortcut). The difference between ResNet I and ResNet II lies in the function f: in ResNet I, f is a nonlinear transformation (a ReLU applied after the merge), while in ResNet II, f is the identity mapping. See the following figures.
Figure 2a. The tensor flow chart of ResNet I.
Figure 2b. The tensor flow chart of ResNet II.
In summary: ResNet I has a nonlinear transformation immediately after information merging, while ResNet II has a number of nonlinear transformations before merging.
First, there is no off-the-shelf network architecture that perfectly fits the problem we need to solve in practice. At the same time, to get the best results you must absorb the latest research findings broadly and innovate on top of them. To explain why we use ResNet in our applications, we have to start with why ResNet works. Although Kaiming He gives some explanation, here I offer a different perspective. The accuracy of a neural network model is influenced by two competing factors: the higher the complexity of the network, the more expressive the model, and the higher its potential to approximate the best result.
On the other hand, the higher the complexity, the harder it is for SGD methods to find the best solution in a high-dimensional parameter space. SGD optimization can be compared to descending a mountain trail to the valley at the bottom, as shown in the figure (image from the web).
Figure 3. Stochastic Gradient Descent (SGD) method
It can be understood this way: the optimization problem amounts to finding a way down a rugged, trap-filled mountain path to reach the valley. The difference is that for a neural network model the mountain lives in an ultra-high-dimensional space (millions of dimensions rather than three), and the ruggedness of the trail and the number of traps far exceed the three-dimensional case. To descend successfully to the valley (or close to it), you must try to avoid going astray and falling into a trap on the way down, and if you do fall in, you need a chance to escape. ResNet adds an information-path shortcut to the original network, so that information can skip forward across several layers, merge with the original output at a particular layer, and serve as the input to the next residual unit. Mathematical intuition suggests that this should make the potential-energy surface a little flatter there, so that even if you fall into a trap on the way down, there is a better chance of escaping it.
Once the idea behind ResNet is understood, it can be applied to other network architectures, and this is easy to implement in TensorFlow.
TensorFlow provides a Python API that can be used directly to build the network. These APIs are very intuitive and translate the mathematical description of the network structure directly into calls to the corresponding tf functions (e.g. tf.matmul, tf.nn.relu, tf.nn.l2_loss for L2 regularization, and tf.reduce_sum for L1 regularization). Since TF can automatically compute derivatives for any loss function, you have great flexibility in designing any form of loss function and any network structure.
The introduction of ResNet allows us to build ultra-deep neural networks without worrying too much about convergence when training the model. Even for networks that are not very deep, ResNet can still speed up convergence and improve accuracy.
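To make this concrete, here is a minimal sketch of a ResNet II-style residual unit written against the TF 1.x-style Python API (fully connected case; the layer width and initialization are made up, and biases and batch normalization are omitted for brevity):

```python
import tensorflow as tf

def residual_unit(x, dim):
    """ResNet II-style unit: nonlinear transforms before the merge, identity shortcut after it."""
    W1 = tf.Variable(tf.truncated_normal([dim, dim], stddev=0.1))
    W2 = tf.Variable(tf.truncated_normal([dim, dim], stddev=0.1))
    f = tf.matmul(tf.nn.relu(x), W1)   # first pre-activation transform
    f = tf.matmul(tf.nn.relu(f), W2)   # second transform
    return x + f                       # identity shortcut merges with F(x)
```

Stacking such units gives an arbitrarily deep network whose shortcut paths keep training well-conditioned.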
▎What problems should you watch out for when using TensorFlow?
Using TensorFlow is not fundamentally different from using other frameworks. Some issues, though, are common to training neural networks in general.
First, training neural networks involves many high-level control parameters: the depth and width of the network structure, the choice of learning rate and how to adjust it dynamically, the type and strength of regularization, the number of iterations, and so on. There is no simple standard for picking these; usually the best answer can only be found through constant trial and error. The initialization of the model parameters is also critical, since iterations can stall if it is not chosen properly. For example, if the optimization stalls, a simple remedy is to rescale all of the initial parameters.
Figure 4. Convergence is very sensitive to both learning rate and model initialization.
In the first graph, training starts with a learning rate of 0.002 and converges normally. In the second graph it starts with a learning rate of 0.02 and does not converge at all (haste makes waste).
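As a small illustration of these two knobs (the shapes and the stand-in loss below are made up), both the learning rate and the scale of the initial parameters are things you set explicitly and may need to shrink when training stalls:

```python
import tensorflow as tf

init_scale = 0.1        # shrink this if the optimization stalls right away
learning_rate = 0.002   # 0.02 failed to converge in the runs shown above

x = tf.placeholder(tf.float32, [None, 256])
W = tf.Variable(init_scale * tf.truncated_normal([256, 128]))  # scaled initialization
loss = tf.nn.l2_loss(tf.matmul(x, W))                          # stand-in loss for the sketch
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
```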
Building neural networks on TensorFlow is relatively straightforward, since TensorFlow provides a very rich API (Python and C++) and various components for building networks, including convolutional building blocks, various optimization methods, various loss functions and their combinations, regularization controls, and so on. Much of the software development can therefore be done against the Python interface TensorFlow provides, which makes it possible to experiment quickly with different architectures.
However, precisely because we use the Python interface, it can become an efficiency bottleneck for certain applications. I have not yet dug into the underlying C++ to modify TensorFlow, but going down into the lower layers will be necessary in the long run, especially for application-specific efficiency optimization. For example, a real-time application must respond quickly to each individual input and return a result. We found that while the GPU version of TensorFlow can do high-speed batch scoring, processing the same amount of data one record at a time can be two orders of magnitude slower, which is a problem for real-time applications. One solution is to write a separate, efficient scoring function that does not rely on TensorFlow.
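One way to do that, sketched below with hypothetical variable names for a two-layer network, is to pull the trained weights out of the session once and then score single records in plain NumPy:

```python
import numpy as np

# Export the trained parameters once (W1, b1, W2, b2 are hypothetical
# TensorFlow variables from a two-layer network; sess is the trained session).
W1_np, b1_np, W2_np, b2_np = sess.run([W1, b1, W2, b2])

def score_one(record):
    """Score a single input record with plain NumPy, avoiding per-call TF overhead."""
    h = np.maximum(np.dot(record, W1_np) + b1_np, 0.0)  # ReLU hidden layer
    return np.dot(h, W2_np) + b2_np                     # output layer
```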
I look at the question this way: residual networks (ResNet) and convolutional neural networks (ConvNet) are parallel concepts, not mutually exclusive ones.
The basic idea of ResNet can be combined with convolutional neural networks or with any other type of neural network. The core idea of ResNet is to change the condition of the potential energy surface of the loss function without changing the expressive power and complexity of the network, so that the path to the optimal point of optimization becomes a little smoother.
This is a big and somewhat general question. In general, not all applications need deep learning, and you should never use a complex deep model for a problem that a simple model can solve. For example, if a linear model already gives good results, there is no need for a deep neural network, because the simple model is much more efficient. However, if the problem is highly nonlinear and there is strong coupling between the variables, a neural network may be a good choice. Even then, it is important to start with a simple network, say 3-6 layers, and then gradually increase the number of layers while carefully checking whether there is room for improvement. Since SGD results carry some error and uncertainty, each TensorFlow optimization run will differ somewhat, so be careful when judging whether adding layers improves accuracy: repeat the run several times, take the average, and then compare, as sketched below. This is particularly important when the sample size is small.
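The comparison procedure might look like the following sketch, where train_and_eval is a hypothetical helper that trains a model of the given depth and returns its validation accuracy:

```python
import numpy as np

def mean_accuracy(depth, runs=5):
    """Average several training runs before judging whether an extra layer helps.
    train_and_eval is a hypothetical helper: trains a model of the given depth
    and returns its validation accuracy."""
    return np.mean([train_and_eval(depth) for _ in range(runs)])

if mean_accuracy(depth=7) <= mean_accuracy(depth=6):
    print("The extra layer does not help; keep the simpler model.")
```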
This is a big question that should really be put to leading figures like Geoff Hinton or Yann LeCun, but some of the usual ways of thinking about innovation are still worthwhile. I think it starts with a thorough understanding of the problem. For example, we would like to understand how convolutional neural networks work and how the algorithmic improvements of recent years have raised model accuracy; beyond that, we also want to understand research results that point in the opposite direction. For example, Szegedy et al. showed that adding noise that is barely perceptible to the human eye to a picture that a DNN model recognizes perfectly correctly (e.g., a lion) can make the computer classify it as something completely different.
There is also recent work by Yosinski showing that, starting from a completely random TV-static noise image (which means nothing at all), a trained DNN model can be used to optimize the image until the DNN believes with high confidence (99.99%) that it is a particular animal (a panda or a woodpecker, say), while the resulting image is still pure noise and does not look like anything meaningful.
If we can understand these phenomena mathematically, we are bound to come up with new ideas, which will lead us to try new network structures and thus discover more efficient, accurate, and robust algorithms. At present, however, we do not understand these issues well enough, and precisely because of that there is a lot of room for innovation waiting to be discovered. Also, almost all current image-recognition algorithms are based on convolutional neural networks. Are there other approaches that would let us train good models from far fewer samples, as human cognition does? That is something worth thinking about.
The question of how to improve the network architecture to fit your needs is rather general. Roughly speaking, it has to be adapted to the specifics of your own application. For example, convolutional neural networks are used mainly for image recognition because each pixel in an image is associated with its neighboring pixels, and the full set of these associations and spatial relationships determines the representation of the image. Convolutional neural networks are designed to extract these features, and the model is trained with a large number of examples.
There are also problems where the interactions between the variables are unclear, and convolutional neural networks cannot be applied to them. In such cases you can use a fully connected network, or create the network connections based on known or guessed interactions (which can greatly reduce the number of parameters), as in the sketch below. Practical applications also involve the efficiency (speed) of the model: if a neural network is too large it will be slow both to train and to score, and the size of the model must be reduced if speed is needed.
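One simple way to encode known or guessed interactions, sketched here with a made-up 0/1 interaction matrix, is to multiply the weight matrix by a fixed mask so that only the chosen connections are active:

```python
import numpy as np
import tensorflow as tf

# Hypothetical 0/1 matrix: entry (i, j) is 1 only if input variable i is
# believed to interact with hidden unit j.
known_interactions = np.random.randint(0, 2, size=(20, 8)).astype(np.float32)

x = tf.placeholder(tf.float32, [None, 20])
mask = tf.constant(known_interactions)
W = tf.Variable(tf.truncated_normal([20, 8], stddev=0.1))
b = tf.Variable(tf.zeros([8]))
h = tf.nn.relu(tf.matmul(x, W * mask) + b)   # only the unmasked connections contribute
```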
▎How can I use GPU acceleration in my program? Please give examples.
GPU acceleration in TensorFlow is already implemented in the lower layers of its core architecture. All the relevant neural network operations have GPU versions, so from a developer's perspective TensorFlow has spared us the pain of GPU programming. Using GPU acceleration therefore becomes mostly a matter of installation.
If you have a GPU machine, just install the GPU-enabled version of TensorFlow. The CPU and GPU versions are transparent from the API's point of view, and the same Python code runs on both.
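For example, the standard device-placement logging option is enough to check where your ops actually run; the graph-building code itself does not change:

```python
import tensorflow as tf

# The same graph code runs on either build; logging device placement shows
# whether each op landed on the CPU or the GPU.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
```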
One thing to note when installing the GPU version: by default TensorFlow requires a GPU card with compute capability 3.5 or higher. If your GPU has compute capability 3.0 (fairly common), the default installation will have problems. The workaround is to compile and install from source, setting the compute capability option to 3.0 at compile time. The GPUs currently provided by Amazon's cloud are still 3.0, so to install TensorFlow on Amazon you have to build it from source. TensorFlow supports multiple GPUs, but the code has to be modified accordingly, because the assignment of tasks to devices must be programmed explicitly. We compared the speed of TensorFlow on a 32-core CPU machine against a single-GPU machine: the GPU machine is about 4 times faster.
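The manual task assignment for multiple GPUs looks like the sketch below (x0, x1, W0, W1 are hypothetical tensors); each with-block pins its subgraph to one device:

```python
import tensorflow as tf

# With more than one GPU you assign the work yourself by pinning subgraphs
# to devices; x0, x1, W0, W1 are hypothetical tensors defined elsewhere.
with tf.device('/gpu:0'):
    h0 = tf.nn.relu(tf.matmul(x0, W0))
with tf.device('/gpu:1'):
    h1 = tf.nn.relu(tf.matmul(x1, W1))
```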
▎How did you persevere in research and introduce the results into software innovation over the years?
It takes a lot of enthusiasm to consistently bring academic results into software innovation over many years.
I have always maintained an interest in academic research, especially on problems related to practical applications, with the aim of incorporating breakthrough research into new products. For example, I was awarded an academic sabbatical in 2006, which gave me several months of free research time; it was during this period that I invented the CAESAR algorithm, which improved the efficiency of 3D molecular structure simulation by a factor of ten or more and is widely used by major pharmaceutical companies as a core module for drug molecule design in drug discovery. I have been working with Sun Yat-sen University since 2008; in addition to giving remote lectures to graduate students in China, I also return once or twice a year to lecture and supervise graduate students' projects.
Another success story: the WEGA (Gaussian Weights for Comparison of 3D Geometric Shapes) algorithm.
Industrial application: computer-aided design of drug molecules.
Problem pain point: comparing the 3D shapes of billions of molecules in an ultra-large molecular library is computationally huge and time-consuming.
Collaborative research: a research group was formed with the School of Pharmacy of Sun Yat-sen University, including the PhD supervisor and his students.
Solution: three steps: 1) new algorithms, 2) GPU acceleration, and 3) massive parallelism on a GPU cluster.
Research results:
1) On the algorithm side, we proposed the WEGA (weighted Gaussian) algorithm for comparing molecular 3D shapes, which greatly improves computational accuracy while retaining simplicity and high efficiency.
2) I guided the Sun Yat-sen University PhD students in developing GWEGA, a GPU-accelerated implementation that achieves nearly 100x speedup on a single GPU.
3) Using the GPU cluster at the Guangzhou Supercomputing Center, we ported the 3D structure retrieval of terabyte-scale ultra-large molecular libraries to large-scale GPU parallelism, achieving virtual screening of drug molecules at a rate of 100 million per second, nearly two orders of magnitude faster than international competitors. We applied for Chinese and international patents for this work.
Difficulty: the hard part of this project was the GPU programming. The bar for GPU programming is high, especially if you want high efficiency, and achieving nearly 100x acceleration demands serious programming skill. For this reason I ran a special GPU programming course for the Sun Yat-sen graduate students so they could get started quickly and then gain a deep understanding of the GPU architecture, discussing the key parts with them, analyzing and optimizing the code line by line, and pushing efficiency to the limit. This process not only trained the students but also got the problem solved.
You can, but it is more awkward, because it is not obvious how to organize the data. For audio it works perfectly well: time and frequency make up a two-dimensional image.
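For instance, a short-time Fourier transform turns a one-dimensional audio signal into a time-frequency "image" that a convolutional network can consume; in the sketch below, signal is a hypothetical 1-D NumPy array:

```python
import numpy as np

# signal is a hypothetical 1-D audio array; windowed FFTs over successive
# frames produce a (time, frequency) matrix, i.e. a spectrogram "image".
frame, hop = 256, 128
windows = [signal[i:i + frame] * np.hanning(frame)
           for i in range(0, len(signal) - frame, hop)]
spectrogram = np.abs(np.fft.rfft(windows, axis=1))   # shape: (time, frequency)
```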