A method of visualizing data distribution


Draw a simple histogram

Problem


You want to draw a histogram.


Method

Use geom_histogram() and map a continuous variable to x (see Figure 6-1):

library(ggplot2)
ggplot(faithful, aes(x=waiting)) + geom_histogram()


Discussion

All that geom_histogram() requires is one column of a data frame or a single vector of data. This example uses the faithful data set, which describes the Old Faithful geyser in two columns: eruptions, the duration of each eruption, and waiting, the waiting time until the next eruption (both in minutes). Only the waiting column is used here:

faithful

 eruptions waiting
     3.600      79
     1.800      54
     3.333      74
 ...

To take a quick look at a histogram of data that is not in a data frame, pass NULL as the data frame argument to ggplot() and map a vector to x. The following code produces the same result as the previous command:

# Store the values in a simple vector
w <- faithful$waiting

ggplot(NULL, aes(x=w)) + geom_histogram()


By default, the data is divided into 30 bins, which may be too fine or too coarse depending on the data. You can change the bin size with the binwidth argument, or divide the range of the data into a specific number of bins. The default appearance, a dark fill with no outline, can make it hard to see which bar corresponds to which value, so it often helps to change the colors as well, as shown in Figure 6-2:

# Set the bin width to 5
ggplot(faithful, aes(x=waiting)) +
    geom_histogram(binwidth=5, fill="white", colour="black")

# Divide the range of x into 15 bins
binsize <- diff(range(faithful$waiting))/15
ggplot(faithful, aes(x=waiting)) +
    geom_histogram(binwidth=binsize, fill="white", colour="black")
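As a hedged aside, more recent ggplot2 releases also accept a bins argument in geom_histogram(), which requests a number of bins directly instead of deriving a width by hand. The sketch below assumes such a version is installed; the names p_width, p_bins, and n_bins are illustrative only and do not come from the original text.

```r
library(ggplot2)

# Option 1: derive a bin width that splits the range of waiting into 15 parts
binsize <- diff(range(faithful$waiting)) / 15
p_width <- ggplot(faithful, aes(x = waiting)) +
    geom_histogram(binwidth = binsize, fill = "white", colour = "black")

# Option 2 (assumes a ggplot2 version that supports `bins`): ask for 15 bins
p_bins <- ggplot(faithful, aes(x = waiting)) +
    geom_histogram(bins = 15, fill = "white", colour = "black")

# layer_data() returns the computed bins, one row per bar
n_bins <- nrow(layer_data(p_bins))
```

The two options need not produce identical bars, because the placement of the bin edges can differ even when the widths are similar.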


Sometimes the appearance of a histogram depends heavily on the bin width and on where the bin boundaries fall. In Figure 6-3, the bin width is 8. In the left plot, the boundary argument (called origin in older ggplot2 releases, where it is now deprecated) places the bin boundaries at 31, 39, 47, and so on. In the right plot, boundary is shifted by 4, so the bin boundaries fall at 35, 43, 51, and so on:

h <- ggplot(faithful, aes(x=waiting))  # Save the base plot object for reuse

h + geom_histogram(binwidth=8, fill="white", colour="black", boundary=31)
h + geom_histogram(binwidth=8, fill="white", colour="black", boundary=35)

The two plots use the same bin width, yet they look quite different. The faithful data set is not especially small, with 272 observations; with smaller data sets, the choice of bin boundaries matters even more. It is therefore a good idea to experiment with different bin widths and boundaries when exploring data.
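One way to check where the bin edges actually land is to inspect the data that ggplot2 computes for the layer. This is only a sketch: it assumes a ggplot2 version in which the alignment argument is named boundary (older releases used origin), and offset_from_grid() is a hypothetical helper written for this illustration.

```r
library(ggplot2)

h <- ggplot(faithful, aes(x = waiting))

# Same bin width, two different alignments of the bin edges
p31 <- h + geom_histogram(binwidth = 8, boundary = 31)
p35 <- h + geom_histogram(binwidth = 8, boundary = 35)

# layer_data() exposes the computed bins; xmin is the left edge of each bar
edges31 <- layer_data(p31)$xmin
edges35 <- layer_data(p35)$xmin

# Hypothetical helper: distance of each edge from the grid boundary + k * width
offset_from_grid <- function(edges, boundary, width) {
    r <- (edges - boundary) %% width
    pmin(r, width - r)
}

print(edges31)
print(edges35)
```

Comparing the count column of the two layer_data() results shows how the same observations are redistributed when only the edges move.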


When the data contains discrete values, the asymmetry of the bins can matter. As described here, each bin is closed on the left and open on the right: with bin boundaries at 1, 2, 3, and so on, the bins are [1, 2), [2, 3), [3, 4), and so on. In other words, the first bin contains 1 but not 2, and the second bin contains 2 but not 3. Recent ggplot2 releases control which side is closed through the closed argument accepted by geom_histogram(), so check the behavior of your installed version.
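The left-closed, right-open convention can be illustrated without ggplot2 at all. The following base R sketch uses cut() with right = FALSE on a small made-up vector; it demonstrates the interval convention only and is not a description of how ggplot2 bins data internally.

```r
# Made-up discrete values that fall exactly on the bin boundaries 1, 2, 3
x <- c(1, 2, 2, 3)

# right = FALSE makes each interval closed on the left and open on the right
bins <- cut(x, breaks = 1:4, right = FALSE)

# Count the observations that fall into each interval
counts <- table(bins)
print(counts)
```

Each value sitting on a boundary is counted in the bin that starts at that boundary, not in the bin that ends there.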


The same result can be obtained with geom_bar(stat="bin"), but a call to geom_histogram() is easier to interpret.


See Also

A frequency polygon is a better choice when comparing the distributions of several groups, because its lines do not obscure one another the way bars do. See Section 6.5.


Plotting grouped histograms based on grouped data

Problem

You want to draw histograms for several groups of data.


Method

Use geom_histogram() with facets, as shown in Figure 6-4:

library(MASS)  # For the birthwt data set

# Use smoke as the faceting variable
ggplot(birthwt, aes(x=bwt)) + geom_histogram(fill="white", colour="black") +
    facet_grid(smoke ~ .)



Discussion

To make these plots, all of the data must be in one data frame, with one column holding a categorical variable that is used for grouping.


This example uses the birthwt data set, which contains data on infant birth weights and a number of risk factors for low birth weight:


birthwt

 low age lwt race smoke ptl ht ui ftv  bwt
   0  19 182    2     0   0  0  1   0 2523
   0  33 155    3     0   0  0  0   3 2551
   0  20 105    1     1   0  0  0   1 2557
 ...

One problem with the faceted plot is that the facet labels are just 0 and 1, with nothing to indicate that they are values of smoke. To change the labels, change the names of the factor levels. First look at the existing factor levels, then assign new names in the same order:

birthwt1 <- birthwt  # Make a copy of the data

# Convert smoke to a factor
birthwt1$smoke <- factor(birthwt1$smoke)
levels(birthwt1$smoke)

"0" "1"

library(plyr)  # For the revalue() function
birthwt1$smoke <- revalue(birthwt1$smoke, c("0"="No Smoke", "1"="Smoke"))

Redrawing the plot now shows the new facet labels (see Figure 6-4, right):

ggplot(birthwt1, aes(x=bwt)) + geom_histogram(fill="white", colour="black") +
    facet_grid(smoke ~ .)
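If plyr is not available, the same relabeling can probably be done in base R by supplying labels when the factor is created. This is a swapped-in alternative rather than the method used above, and birthwt2 is an illustrative name chosen so that the earlier objects are left untouched.

```r
library(MASS)  # For the birthwt data set

birthwt2 <- birthwt  # Work on a copy (birthwt2 is an illustrative name)

# Convert smoke to a factor and rename its levels in one step:
# the value 0 becomes "No Smoke" and the value 1 becomes "Smoke"
birthwt2$smoke <- factor(birthwt2$smoke,
                         levels = c(0, 1),
                         labels = c("No Smoke", "Smoke"))

levels(birthwt2$smoke)
```

The labels are matched to the levels by position, so the two vectors must be listed in the same order.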


With facets, every panel shares the same y-axis scale. When the groups contain different numbers of observations, it can be hard to compare the shapes of their distributions. For example, see what happens when birth weights are faceted by race (Figure 6-5, left):

ggplot(birthwt, aes(x=bwt)) + geom_histogram(fill="white", colour="black") +
    facet_grid(race ~ .)



To let the y-axis scale of each facet be set independently, use scales="free". Note that in this layout only the y scales change; the x scale stays fixed because the histograms in the facets are aligned along the x-axis.

ggplot(birthwt, aes(x=bwt)) + geom_histogram(fill="white", colour="black") +
    facet_grid(race ~ ., scales="free")
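To see why a shared y-axis can hide the shape of the smaller groups, it may help to check how many observations each race group contains. The sketch below does that with table(), and it also uses scales = "free_y", a facet_grid() option that frees only the y-axis and so states the intent a little more explicitly. The names group_sizes and p_free are illustrative.

```r
library(ggplot2)
library(MASS)  # For the birthwt data set

# Number of observations in each race group
group_sizes <- table(birthwt$race)
print(group_sizes)

# Free only the y-axis; the x-axis stays shared across the facets
p_free <- ggplot(birthwt, aes(x = bwt)) +
    geom_histogram(bins = 30, fill = "white", colour = "black") +
    facet_grid(race ~ ., scales = "free_y")
```

When the group sizes differ a lot, free y scales make the shapes easier to compare, at the cost of bar heights no longer being comparable between panels.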


Another approach is to map the grouping variable to fill, as shown in Figure 6-6. The grouping variable must be a factor or a character vector. In the birthwt data set, the desired grouping variable, smoke, is stored as a number, so this example uses the birthwt1 data set created earlier, in which smoke is a factor:

# Convert smoke to a factor
birthwt1$smoke <- factor(birthwt1$smoke)

# Map smoke to fill, do not stack the bars, and make them semitransparent
ggplot(birthwt1, aes(x=bwt, fill=smoke)) +
    geom_histogram(position="identity", alpha=0.4)


Specifying position="identity" is important. Without it, ggplot2 stacks the bars of the groups on top of each other vertically, making it much harder to see the distribution of each group.
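The difference between the two positions can be checked numerically by looking at where each bar starts. This is a minimal sketch under the assumption that layer_data() reports the bottom of each bar in a ymin column; the object names are illustrative.

```r
library(ggplot2)
library(MASS)  # For the birthwt data set

birthwt1 <- birthwt
birthwt1$smoke <- factor(birthwt1$smoke)

base <- ggplot(birthwt1, aes(x = bwt, fill = smoke))

# Default position ("stack"): bars of one group sit on top of the other group
p_stack <- base + geom_histogram(bins = 30, alpha = 0.4)

# position = "identity": every bar starts at zero, so the groups overlap
p_ident <- base + geom_histogram(bins = 30, position = "identity", alpha = 0.4)

# ymin is the bottom of each bar in the computed layer data
ymin_stack <- layer_data(p_stack)$ymin
ymin_ident <- layer_data(p_ident)$ymin
```

With stacking, some bars start above zero, so their heights cannot be read directly from the y-axis; with identity positioning, all bars share the same baseline.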



Draw a density curve


Problem

You want to draw a kernel density curve.


Method

Use geom_density() and map a continuous variable to x (see Figure 6-7):


ggplot(faithful, aes(x=waiting)) + geom_density()
