Tutorial | 4 Faster and Easier Ways to Visualize Data in Python
By George Seif
Participation: Geek AI, Xiaokun Liu
Source: heart of the machine (almosthuman2014)
Heat maps, two-dimensional density plots, spider plots, and tree diagrams: have you used any of these visualization methods?
Data visualization is a very important part of a data science or machine learning project. Often, you need to do exploratory data analysis (EDA) early in the project so that you have some understanding of the data, and creating visualizations can really make the task of analysis clearer and easier to understand, especially for large, high-dimensional datasets. It's also important to present the end result near the end of the project in a clear, concise and compelling way that your audience (usually non-technical clients) can understand.
Readers may have seen my previous post, "Tutorial | 5 Fast and Easy to Use Python Matplotlib Data Visualization Methods", which introduced five basic data visualization methods: scatter plots, line plots, histograms, bar charts, and box plots. These are simple but powerful methods for gaining insight into your dataset. In this article, we'll look at 4 more ways to visualize data! These methods go into a little more detail and can be used after the basic methods from the previous article to extract more in-depth information from the data.
Heat Map
A heat map is a matrix representation of data in which the value of each matrix element is represented by a color. Different colors represent different magnitudes, and the matrix indices link the two items or features being compared. Heat maps are ideal for showing relationships between multiple feature variables, because you can read the magnitude at a given position directly from its color. By looking at other points in the heat map, you can also see how each relationship compares to the other relationships in the dataset. Color is so intuitive that it gives us a very simple way of interpreting the data.
Now let's take a look at the implementation code. Compared to "matplotlib", "seaborn" can be used to draw more advanced graphs, which often involve more components, such as multiple colors, shapes, or variables. Here "matplotlib" is used to display the graph, "NumPy" to generate the data, and "pandas" to manipulate it! The plot itself is just one simple "seaborn" function call.
# Import libs
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a random dataset
data = pd.DataFrame(np.random.random((10, 6)),
                    columns=["Iron Man", "Captain America", "Black Widow",
                             "Thor", "Hulk", "Hawkeye"])
print(data)

# Plot the heatmap
heatmap_plot = sns.heatmap(data, center=0, cmap='gist_ncar')
plt.show()
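One common way to use a heat map for relationships between feature variables is to plot a correlation matrix. The following is a minimal sketch under stated assumptions, not the article's own code: the data is synthetic, and the column names (attack, defense, speed, luck) and the helpers make_features and plot_correlation_heatmap are hypothetical.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


def make_features(n_rows=200, seed=0):
    """Build a synthetic table (hypothetical columns) with known relationships."""
    rng = np.random.default_rng(seed)
    attack = rng.normal(size=n_rows)
    return pd.DataFrame({
        "attack": attack,
        # Strongly positively related to attack
        "defense": 0.8 * attack + 0.2 * rng.normal(size=n_rows),
        # Moderately negatively related to attack
        "speed": -0.5 * attack + rng.normal(size=n_rows),
        # Independent noise
        "luck": rng.normal(size=n_rows),
    })


def plot_correlation_heatmap(df):
    """Draw the pairwise Pearson correlations of df as an annotated heat map."""
    corr = df.corr()
    # Fix the color scale to [-1, 1] and center it at 0 so that
    # positive and negative correlations get visually distinct colors.
    ax = sns.heatmap(corr, vmin=-1, vmax=1, center=0,
                     cmap="coolwarm", annot=True, fmt=".2f", square=True)
    ax.set_title("Feature correlations")
    return corr, ax


if __name__ == "__main__":
    plot_correlation_heatmap(make_features())
    plt.show()
```

With a diverging color map centered at 0, strongly related pairs stand out at a glance, while unrelated pairs fade toward the neutral middle color.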
2D Density Plot
The 2D density plot is an intuitive extension of the 1D density plot, with the advantage of showing the joint probability distribution of two variables. For example, in the 2D density plot below, the color bar on the right maps each color to the probability density at a point. The region where our data is most likely to occur (i.e., where the data points are most concentrated) seems to be around size=0.5 and speed=1.4. A 2D density plot is therefore useful for quickly identifying where data with two variables is most concentrated, whereas a 1D density plot covers only one variable. When you have two variables that are important to the output and want to understand how they jointly shape its distribution, a 2D density plot is a very effective view of the data.
Once again, writing the code with "seaborn" is very convenient! This time, we'll create a skewed distribution to make the visualization more interesting. You can tweak most of the optional parameters to make the result look clearer.
# Import libs
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import skewnorm

# Create the data
speed = skewnorm.rvs(4, size=50)
size = skewnorm.rvs(4, size=50)

# Create and show the 2D density plot
# (seaborn 0.11+ keywords; older releases took positional x, y with shade= and bw=)
ax = sns.kdeplot(x=speed, y=size, cmap="Reds", fill=False, bw_method=.15, cbar=True)
ax.set(xlabel='speed', ylabel='size')
plt.show()
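The "most concentrated" region that the plot shows can also be located numerically. This is a rough sketch, assuming SciPy's gaussian_kde with its default bandwidth, so the estimate will not exactly match the seaborn settings above. The helper densest_point, the grid size, the sample size, and the seeds are illustrative choices, not part of the original code.

```python
import numpy as np
from scipy.stats import gaussian_kde, skewnorm


def densest_point(x, y, grid_size=100):
    """Return (x, y, density) at the grid point where the 2D KDE is highest."""
    kde = gaussian_kde(np.vstack([x, y]))  # expects an array of shape (2, n_samples)
    xs = np.linspace(x.min(), x.max(), grid_size)
    ys = np.linspace(y.min(), y.max(), grid_size)
    xx, yy = np.meshgrid(xs, ys)  # rows follow ys, columns follow xs
    density = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)
    row, col = np.unravel_index(np.argmax(density), density.shape)
    return xs[col], ys[row], density[row, col]


# Same skewed distribution as in the article, with more samples and fixed seeds
speed = skewnorm.rvs(4, size=500, random_state=0)
size = skewnorm.rvs(4, size=500, random_state=1)

peak_speed, peak_size, peak_density = densest_point(speed, size)
print(f"Densest region near speed={peak_speed:.2f}, size={peak_size:.2f}")
```

Reading the peak off a grid is only as precise as the grid spacing, which is usually sufficient for exploratory work.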
Spider Plot
A spider plot is one of the best ways to show one-to-many relationships. In other words, you can plot and view the values of multiple variables associated with a particular item or category. In a spider plot, the prominence of one variable relative to another is obvious, because the covered area and the distance from the center grow in that variable's direction. If you want to see how several different classes of objects described by these variables differ, you can plot them side by side. In the chart below, it's easy to compare the different attributes of the Avengers and see where each one excels! (Please note that these numbers are set randomly and I am not biased against any member of the Avengers.)
Here, we can use "matplotlib" directly instead of "seaborn" to create the visualization. We need to space the attributes equally around the circumference of the circle. We set a label at each corner and then plot each value as a point whose distance from the center depends on its magnitude. Finally, for a clearer display, we fill the area enclosed by the lines joining the attribute points with a translucent color.
# Import libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Get the data
df = pd.read_csv("avengers_data.csv")
print(df)
"""
   #             Name  Attack  Defense  Speed  Range  Health
0  1         Iron Man      83       80     75     70      70
1  2  Captain America      60       62     63     80      80
2  3             Thor      80       82     83    100     100
3  3             Hulk      80      100     67     44      92
4  4      Black Widow      52       43     60     50      65
5  5          Hawkeye      58       64     58     80      65
"""

# Get the data for Iron Man
labels = np.array(["Attack", "Defense", "Speed", "Range", "Health"])
stats = df.loc[0, labels].values

# One angle per attribute, then repeat the first point to close the polygon
angles = np.linspace(0, 2*np.pi, len(labels), endpoint=False)
stats = np.concatenate((stats, [stats[0]]))
angles = np.concatenate((angles, [angles[0]]))

# Plot the outline and fill it with a translucent color
fig = plt.figure()
ax = fig.add_subplot(111, polar=True)
ax.plot(angles, stats, 'o-', linewidth=2)
ax.fill(angles, stats, alpha=0.25)
# Label only the original angles; the repeated closing angle has no label of its own
ax.set_thetagrids(angles[:-1] * 180/np.pi, labels)
ax.set_title(df.loc[0, "Name"])
ax.grid(True)
plt.show()
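The side-by-side comparison mentioned above could look like the following sketch. It assumes the attribute values printed in the table above, typed inline so that no CSV file is needed. The helpers closed_polygon and plot_spiders and the fixed radial limit of 100 are hypothetical choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

LABELS = ["Attack", "Defense", "Speed", "Range", "Health"]
# Values copied from the avengers_data.csv printout in the article
HEROES = {
    "Iron Man": [83, 80, 75, 70, 70],
    "Captain America": [60, 62, 63, 80, 80],
    "Hulk": [80, 100, 67, 44, 92],
}


def closed_polygon(values):
    """Return equally spaced angles and values, with the first point repeated."""
    angles = np.linspace(0, 2 * np.pi, len(values), endpoint=False)
    return np.append(angles, angles[0]), np.append(values, values[0])


def plot_spiders(heroes, labels):
    """Draw one polar subplot per hero, all sharing the same radial scale."""
    fig, axes = plt.subplots(1, len(heroes),
                             subplot_kw={"projection": "polar"},
                             figsize=(4 * len(heroes), 4))
    axes = np.atleast_1d(axes)
    for ax, (name, stats) in zip(axes, heroes.items()):
        angles, values = closed_polygon(stats)
        ax.plot(angles, values, "o-", linewidth=2)
        ax.fill(angles, values, alpha=0.25)
        ax.set_thetagrids(np.degrees(angles[:-1]), labels)
        ax.set_ylim(0, 100)  # identical scale keeps the shapes comparable
        ax.set_title(name)
    return fig, axes


if __name__ == "__main__":
    plot_spiders(HEROES, LABELS)
    plt.show()
```

Using the same radial limit on every subplot matters here: without it, each chart rescales to its own maximum and the areas can no longer be compared by eye.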
Tree Diagram
We've been using tree diagrams since elementary school! Tree diagrams are natural and intuitive, which makes them easy to interpret. Nodes that are directly connected are closely related, while nodes separated by multiple connections are less similar. In the visualization below, I've plotted a tree diagram for a small subset of a Pokémon stats dataset from Kaggle, based on each Pokémon's stats (HP, Attack, Defense, Special Attack, Special Defense, Speed).
Thus, the Pokémon with the most closely matched stats are tightly connected. For example, at the top of the graph, Arbok and Fearow are directly connected, and if we look at the data, Arbok has a total score of 438 and Fearow has 442, which is very close! But if we look at Raticate, its total score is 413, which is quite different from Arbok and Fearow, so it is separated from them in the tree diagram! As we move up the tree, the Pokémon in the green group are more similar to each other than they are to any of the Pokémon in the red group, even though there is no direct green connection here.
For the tree diagram, we actually need to use "SciPy" to draw it! After reading in the dataset, we delete the string columns. This is just to keep the visualization straightforward and easy to understand; in practice, converting these strings to categorical variables would give better results and comparisons. We also set the index of the data frame so that it can be used to label each node. Finally, computing and plotting the tree diagram in "SciPy" each take just one simple line of code.
# Import libs
import pandas as pd
from matplotlib import pyplot as plt
from scipy.cluster import hierarchy
import numpy as np

# Read in the dataset
# Drop any fields that are strings
# Only get the first 40 because this dataset is big
df = pd.read_csv('Pokemon.csv')
df = df.set_index('Name')
df.index.name = None  # clear the index name (`del df.index.name` fails on recent pandas)
df = df.drop(["Type 1", "Type 2", "Legendary"], axis=1)
df = df.head(n=40)

# Build the hierarchical clustering from the distances between samples (Ward linkage)
Z = hierarchy.linkage(df, 'ward')

# Orient and draw the tree
hierarchy.dendrogram(Z, orientation="left", labels=df.index.tolist())
plt.show()
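The colored groups in the tree correspond to cutting it at some height. As a hedged sketch of how the same linkage matrix can be turned into explicit group labels, the example below uses SciPy's fcluster on synthetic data. The row names (weak_0, strong_0, and so on), the stat values, and the choice of two groups are assumptions for illustration and do not come from the Pokémon dataset.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

STAT_COLUMNS = ["HP", "Attack", "Defense", "Sp. Atk", "Sp. Def", "Speed"]

# Synthetic stand-in data: five low-stat rows and five high-stat rows
rng = np.random.default_rng(0)
weak = rng.normal(loc=40, scale=5, size=(5, len(STAT_COLUMNS)))
strong = rng.normal(loc=90, scale=5, size=(5, len(STAT_COLUMNS)))
names = [f"weak_{i}" for i in range(5)] + [f"strong_{i}" for i in range(5)]
df = pd.DataFrame(np.vstack([weak, strong]), index=names, columns=STAT_COLUMNS)

# Same Ward linkage as in the tree diagram code
Z = hierarchy.linkage(df.to_numpy(), "ward")

# Cut the tree so that at most two flat clusters remain
groups = hierarchy.fcluster(Z, t=2, criterion="maxclust")
assignment = pd.Series(groups, index=df.index, name="cluster")
print(assignment)
```

Rows that share a cluster label here are the ones that would sit under the same colored branch if the tree were drawn and cut at the matching height.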
Link to original article: https://towardsdatascience.com/4-more-quick-and-easy-data-visualizations-in-python-with-code-da9030ab3429