A method of visualizing data distribution

Last Update:2018-08-22 Source: Internet

Author: User


Draw a simple histogram

Problem

How to draw a histogram.

Method

Use geom_histogram() and map a continuous variable to x (see Figure 6-1):

library(ggplot2)

ggplot(faithful, aes(x=waiting)) + geom_histogram()

Discussion

All geom_histogram() needs is one column of a data frame or a single vector of data. This example uses the faithful data set, which describes the Old Faithful geyser in two columns: eruptions, the length of each eruption, and waiting, the waiting time until the next eruption. Only the waiting column is used here:

faithful

eruptions waiting
    3.600      79
    1.800      54
    3.333      74
...

If you want a quick look at a histogram of data that is not in a data frame, pass NULL as the data frame and map a vector of values to x in ggplot(). The following code produces the same result as the previous example:

# Store the values in a simple vector
w <- faithful$waiting

ggplot(NULL, aes(x=w)) + geom_histogram()

By default, the data is grouped into 30 bins, which may be too fine or too coarse depending on the data. You can change the bin size with the binwidth parameter, or divide the range of the data into a specific number of bins. The default fill is dark with no outline, which makes it hard to see which bar corresponds to which value, so the examples below also change the colors (see Figure 6-2):

# Set the bin width to 5
ggplot(faithful, aes(x=waiting)) +
    geom_histogram(binwidth=5, fill="white", colour="black")

# Divide the range of x into 15 bins
binsize <- diff(range(faithful$waiting))/15

ggplot(faithful, aes(x=waiting)) +
    geom_histogram(binwidth=binsize, fill="white", colour="black")
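As a rough check on the arithmetic above, the sketch below computes the bin width for a requested number of bins using base R only. The helper name binwidth_for() is hypothetical and is not part of ggplot2.

```r
# Hypothetical helper (not a ggplot2 function): the width that splits
# the range of x into n equal bins.
binwidth_for <- function(x, n) {
  diff(range(x, na.rm = TRUE)) / n
}

binsize <- binwidth_for(faithful$waiting, 15)

# The 16 edges of the 15 bins, running from the minimum to the maximum
edges <- min(faithful$waiting) + binsize * 0:15
print(binsize)
print(range(edges))
```

The resulting value can be passed to geom_histogram() as binwidth=binsize, exactly as in the example above.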

Sometimes the appearance of a histogram depends heavily on the bin width and on where the bin boundaries fall. In Figure 6-3, the bin width is set to 8. In the left-hand plot, the origin parameter places the bin boundaries at 31, 39, 47, and so on. In the right-hand plot, the origin is shifted by 4, so the boundaries fall at 35, 43, 51, and so on. (Newer ggplot2 releases deprecate origin in favor of the boundary parameter.)

h <- ggplot(faithful, aes(x=waiting))  # Save the base plot object for reuse

h + geom_histogram(binwidth=8, fill="white", colour="black", origin=31)

h + geom_histogram(binwidth=8, fill="white", colour="black", origin=35)

The two plots use the same bin width, yet they look quite different. The faithful data set is not particularly small, with 272 observations; with smaller data sets, the choice of bin boundaries matters even more. It is therefore a good idea to experiment with different bin widths and bin boundaries when drawing histograms.
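To see how much the boundaries matter without drawing anything, the counts per bin can be tabulated directly. This is a minimal sketch in base R that mimics the two origin settings with cut(). It assumes left-closed bins of the form [a, b), and the helper count_bins() is made up for illustration.

```r
# Hypothetical helper: count observations in bins of a given width,
# starting from a given origin. Bins are left-closed: [a, b).
count_bins <- function(x, origin, width) {
  # Generate enough breaks that the last one lies above max(x)
  breaks <- seq(origin, max(x) + width, by = width)
  table(cut(x, breaks = breaks, right = FALSE))
}

counts31 <- count_bins(faithful$waiting, origin = 31, width = 8)
counts35 <- count_bins(faithful$waiting, origin = 35, width = 8)

# Same data and same bin width, but different bins and counts
print(counts31)
print(counts35)
```

Both tables account for every observation; only the way the observations are divided among the bins changes.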

When the data contains discrete values, the asymmetry of the bins can matter. In the ggplot2 version described here, bins are closed on the lower bound and open on the upper bound. For example, with bin boundaries at 1, 2, 3, and so on, the bins are [1,2), [2,3), [3,4), and so on. In other words, the first bin contains 1 but not 2, and the second bin contains 2 but not 3. Newer ggplot2 releases control this with the closed argument, which defaults to "right", so check the behavior of the version you have installed.
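The interval convention can be checked with base R's cut(), which supports both conventions through its right argument. This sketch only illustrates the idea of closed and open bin edges; it is not how ggplot2 bins data internally.

```r
x <- c(1, 2, 3)

# Left-closed, right-open bins: each value sits at the lower edge of its bin
left_closed <- cut(x, breaks = c(1, 2, 3, 4), right = FALSE)
print(as.character(left_closed))   # "[1,2)" "[2,3)" "[3,4)"

# Right-closed bins (the default for cut): each value sits at the upper edge
right_closed <- cut(x, breaks = c(0, 1, 2, 3))
print(as.character(right_closed))  # "(0,1]" "(1,2]" "(2,3]"
```

With integer-valued data, switching conventions moves every value that lies exactly on a boundary into the neighboring bin, which is why the choice is visible in the histogram.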

The same result can be obtained with geom_bar(stat="bin"), but geom_histogram() makes it clearer what is being computed.

See Also

Frequency polygons are a better way to show several distributions at once, because the bars do not interfere with each other. See section 6.5.

Plotting grouped histograms based on grouped data

Problem

How to draw a histogram of multiple sets of data.

Method

Use geom_histogram() together with facets, as shown in Figure 6-4:

library(ggplot2)
library(MASS)  # For the birthwt data set

# Use smoke as the faceting variable
ggplot(birthwt, aes(x=bwt)) + geom_histogram(fill="white", colour="black") +
    facet_grid(smoke ~ .)

Discussion

To draw this plot, all of the data must be in a single data frame, and one column of that data frame must be a categorical variable that can be used for grouping.

This example uses the birthwt data set. It contains data on infant birth weight and a number of risk factors for low birth weight:

birthwt

low age lwt race smoke ptl ht ui ftv  bwt
  0  19 182    2     0   0  0  1   0 2523
  0  33 155    3     0   0  0  0   3 2551
  0  20 105    1     1   0  0  0   1 2557
...

One problem with the faceted plot is that the facet labels are just 0 and 1, with nothing to indicate that they are values of smoke. To change the labels, change the names of the factor levels. First look at the existing factor levels, then assign new names in the same order:

birthwt1 <- birthwt  # Make a copy of the data

# Convert smoke to a factor
birthwt1$smoke <- factor(birthwt1$smoke)

levels(birthwt1$smoke)

[1] "0" "1"

library(plyr)  # For the revalue() function

birthwt1$smoke <- revalue(birthwt1$smoke, c("0"="No Smoke", "1"="Smoke"))

Plotting the data again now shows the new facet labels (see Figure 6-4, right):

ggplot(birthwt1, aes(x=bwt)) + geom_histogram(fill="white", colour="black") +
    facet_grid(smoke ~ .)
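If plyr is not available, the relabeling step can also be done in base R. The following is a minimal sketch, assuming the MASS package is installed; the name birthwt2 is introduced here only for illustration.

```r
library(MASS)  # For the birthwt data set

birthwt2 <- birthwt  # Work on a copy of the data

# Convert smoke to a factor and relabel its levels in one step.
# The labels are matched to the levels in the order given.
birthwt2$smoke <- factor(birthwt2$smoke, levels = c(0, 1),
                         labels = c("No Smoke", "Smoke"))

print(levels(birthwt2$smoke))
print(table(birthwt2$smoke))
```

Spelling out levels explicitly makes the mapping from 0 and 1 to the new labels visible in the code, instead of relying on the default sort order.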

With facets, every panel uses the same y-axis scale. When the groups have different numbers of observations, this can make it hard to compare the shapes of their distributions. To see this, facet the birth weights by race (see Figure 6-5, left):

ggplot(birthwt, aes(x=bwt)) + geom_histogram(fill="white", colour="black") +
    facet_grid(race ~ .)

To let the y-axis scale of each facet be set independently, use scales="free". Note that here this only frees the y scales. The x scale stays fixed, because the histograms in the facets are aligned along the x-axis.

ggplot(birthwt, aes(x=bwt)) + geom_histogram(fill="white", colour="black") +
    facet_grid(race ~ ., scales="free")

Another way to show the groups is to map the grouping variable to fill, as shown in Figure 6-6. The grouping variable must be a factor or a character vector. In the birthwt data set, the desired grouping variable, smoke, is stored as a number, so we use the birthwt1 data set created earlier, in which smoke is a factor:

# Convert smoke to a factor
birthwt1$smoke <- factor(birthwt1$smoke)

# Map smoke to fill, turn off bar stacking, and make the bars semitransparent
ggplot(birthwt1, aes(x=bwt, fill=smoke)) +
    geom_histogram(position="identity", alpha=0.4)

The position="identity" setting is important. Without it, ggplot() stacks the histogram bars on top of each other vertically, which makes it much harder to see the distribution of each group.

Draw a density curve

Problem

How to draw a kernel density curve.

Method

Use geom_density() and map a continuous variable to x (see Figure 6-7):

ggplot(faithful, aes(x=waiting)) + geom_density()
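A kernel density curve is an estimate of the underlying distribution, scaled so that the area under the curve is 1. As a rough illustration of what is being drawn, the sketch below uses base R's density() with its default settings (a Gaussian kernel); it assumes these defaults are close to what geom_density() uses and makes no claim about ggplot2's internals.

```r
# Kernel density estimate of the waiting times, using the defaults
# of density() (Gaussian kernel, automatic bandwidth)
d <- density(faithful$waiting)

# d$x is an evenly spaced grid and d$y the estimated density on it.
# Approximate the area under the curve with a Riemann sum.
area <- sum(d$y) * diff(d$x[1:2])
print(area)  # should be close to 1
```

Because the curve is normalized to unit area, its y-axis values are densities, not counts, and are not directly comparable to histogram bar heights.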

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
