A histogram is an approximate representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous (quantitative) variable and was first introduced by Karl Pearson. It differs from a bar graph in that a bar graph relates two variables, whereas a histogram relates only one. To construct a histogram, the first step is to "bin" the range of values, that is, divide the entire range of values into a series of intervals, and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent, and are often (but are not required to be) of equal size.
If the bins are of equal size, a rectangle is erected over each bin with height proportional to the frequency, that is, the number of cases in that bin. A histogram may also be normalized to display "relative" frequencies. It then shows the proportion of cases that fall into each of several categories, with the sum of the heights equaling 1.
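The binning and normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `equal_width_counts` and `relative_frequencies` and the sample data are assumptions made for this example, and counting the maximum value in the last bin is one common convention among several.

```python
def equal_width_counts(values, k):
    """Count how many values fall into each of k equal-width bins.

    Assumes the values are not all identical (so the bin width is > 0).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # Bins are half-open [left, right); the maximum goes into the last bin.
        i = min(int((v - lo) / width), k - 1)
        counts[i] += 1
    return counts


def relative_frequencies(counts):
    """Normalize counts so that the heights sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


data = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0]  # hypothetical sample
counts = equal_width_counts(data, 4)   # bins [1,3), [3,5), [5,7), [7,9]
rel = relative_frequencies(counts)
print(counts, rel)
```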
However, bins need not be of equal width; in that case, the erected rectangle is defined to have its area proportional to the frequency of cases in the bin. The vertical axis is then not the frequency but the frequency density: the number of cases per unit of the variable on the horizontal axis. An example of variable bin width is shown in the Census Bureau data below.
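For unequal bin widths, the frequency density can be computed by dividing each count by its bin width. The sketch below assumes a hypothetical helper `frequency_density` and made-up edges and data; it only illustrates that the area of each rectangle (density times width) recovers the count.

```python
from bisect import bisect_right


def frequency_density(values, edges):
    """Return (counts, densities) for bins defined by sorted edges.

    Bins are half-open [left, right), except the last, which also
    includes its right edge. Values outside the edges are ignored.
    """
    k = len(edges) - 1
    counts = [0] * k
    for v in values:
        if edges[0] <= v <= edges[-1]:
            i = min(bisect_right(edges, v) - 1, k - 1)
            counts[i] += 1
    densities = [c / (edges[i + 1] - edges[i]) for i, c in enumerate(counts)]
    return counts, densities


edges = [0, 5, 10, 30]                          # bin widths 5, 5 and 20
values = [1, 2, 3, 4, 6, 7, 12, 18, 25, 30]     # hypothetical sample
counts, dens = frequency_density(values, edges)
areas = [d * (edges[i + 1] - edges[i]) for i, d in enumerate(dens)]
print(counts, dens, areas)
```

The last bin holds as many cases as the first, but because it is four times as wide its rectangle is only a quarter as tall.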
Because adjacent bins leave no gaps, the rectangles of a histogram touch each other to indicate that the original variable is continuous.
A histogram gives a rough sense of the density of the underlying distribution of the data, and is often used for density estimation: estimating the probability density function of the underlying variable. The total area of a histogram used for probability density is always normalized to 1. If the intervals on the x-axis all have length 1, then the histogram is identical to a relative frequency plot.
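The normalization to unit area can be checked numerically. In this sketch, `density_heights` is a hypothetical helper and the counts and edges are invented for illustration; it divides each count by the total number of cases times the bin width.

```python
def density_heights(counts, edges):
    """Heights of a probability-density histogram: count / (n * bin width)."""
    n = sum(counts)
    return [c / (n * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]


# Unequal bin widths: the areas (height * width) add up to 1.
counts = [4, 2, 4]
edges = [0, 5, 10, 30]
heights = density_heights(counts, edges)
area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))

# Unit-width bins: the heights coincide with the relative frequencies.
unit_counts = [2, 5, 3]
unit_edges = [0, 1, 2, 3]
unit_heights = density_heights(unit_counts, unit_edges)
unit_rel = [c / sum(unit_counts) for c in unit_counts]
print(area, unit_heights, unit_rel)
```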
A histogram can be thought of as a simplistic kernel density estimate, which uses a kernel to smooth frequencies over the bins. This yields a smoother probability density function, which will in general more accurately reflect the distribution of the underlying variable. The density estimate can be plotted as an alternative to the histogram, and is usually drawn as a curve rather than a set of boxes.
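As a rough illustration of kernel density estimation, the sketch below places a Gaussian kernel on every data point and averages them. The Gaussian kernel, the function name `gaussian_kde`, the bandwidth value and the sample data are all assumptions chosen for this example; other kernels and bandwidth selectors are equally valid.

```python
import math


def gaussian_kde(values, bandwidth):
    """Return a function x -> estimated density using Gaussian kernels."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


f = gaussian_kde([1.0, 2.0, 2.5, 4.0], bandwidth=0.5)  # hypothetical sample
# Evaluate the smooth curve on a small grid instead of drawing boxes.
curve = [(x / 2, f(x / 2)) for x in range(0, 11)]
print(curve)
```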
Another alternative is the average shifted histogram, which is fast to compute and gives a smooth curve estimate of the density without using kernels.
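One way to read the average shifted histogram is as the mean of several ordinary density histograms that share a bin width but have staggered origins. The sketch below follows that reading; the name `ash_density`, its parameters and the data are assumptions for illustration, and the direct per-point evaluation is written for clarity rather than speed.

```python
import math


def ash_density(values, h, m, origin=0.0):
    """Average of m density histograms of bin width h, shifted by h/m each."""
    n = len(values)

    def density(x):
        total = 0
        for j in range(m):
            start = origin + j * h / m
            b = math.floor((x - start) / h)  # index of the bin containing x
            # Count the data points that share that bin in this shifted grid.
            total += sum(1 for v in values if math.floor((v - start) / h) == b)
        return total / (m * n * h)

    return density


data = [0.1, 0.2, 0.7, 1.4]            # hypothetical sample
f1 = ash_density(data, h=1.0, m=1)     # m = 1 is an ordinary histogram
f2 = ash_density(data, h=1.0, m=2)     # average of two shifted histograms
print(f1(0.5), f2(0.5))
```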
The histogram is one of the seven basic tools of quality control.
Histograms are sometimes confused with bar charts. A histogram is used for continuous data, where the bins represent ranges of data, while a bar chart is a plot of categorical variables. Some authors recommend that bar charts have gaps between the rectangles to clarify the distinction.
Etymology
The etymology of the word histogram is uncertain. It is sometimes said to be derived from the Ancient Greek ἱστός (histos), "anything set upright" (such as the mast of a ship, the bar of a loom, or the vertical bars of a histogram), and γράμμα (gramma), "drawing, record, writing". It is also said that Karl Pearson, who introduced the term in 1891, derived the name from "historical diagram".
Example
This is the data for the histogram to the right, using 500 items:
The words used to describe the patterns in a histogram are: "symmetric", "skewed left" or "skewed right", "unimodal", "bimodal" or "multimodal".
It is a good idea to plot the data using several different bin widths to learn more about it. Here is an example using tips given in a restaurant.
Here are some other examples:
The U.S. Census Bureau found that there were 124 million people who work outside of their homes. Using their data on the time occupied by travel to work, the table below shows that the absolute number of people who responded with travel times of "at least 30 but less than 35 minutes" is higher than the numbers for the categories above and below it. This is likely because people round their reported travel time. Reporting values as somewhat arbitrarily rounded numbers is a common phenomenon when collecting data from people.
[Table and histogram: travel time to work, number of people per interval]
This histogram shows the number of cases per unit interval as the height of each block, so that the area of each block is equal to the number of people in the survey who fall into its category. The area under the curve represents the total number of cases (124 million). This type of histogram shows absolute numbers, with Q in thousands.
[Histogram: travel time to work, unit-area version]
This histogram differs from the first only in the vertical scale. The area of each block is the fraction of the total that its category represents, and the total area of all the bars is equal to 1 (the fraction meaning "all"). The curve shown is a simple density estimate. This version shows proportions, and is also known as a unit area histogram.
In other words, a histogram represents a frequency distribution by means of rectangles whose widths represent class intervals and whose areas are proportional to the corresponding frequencies: the height of each rectangle is the average frequency density for its interval. The intervals are placed together in order to show that the data represented by the histogram, while exclusive, is also contiguous. (For example, a histogram can have two adjoining intervals of 10.5-20.5 and 20.5-33.5, but not two intervals of 10.5-20.5 and 22.5-32.5. Empty intervals are shown as empty and are not skipped.)
Mathematical definition
In a more general mathematical sense, a histogram is a function $m_i$ that counts the number of observations that fall into each of the disjoint categories (known as bins), whereas the graph of a histogram is merely one way to represent a histogram. Thus, if we let $n$ be the total number of observations and $k$ be the total number of bins, the histogram $m_i$ meets the following condition:
$$n = \sum_{i=1}^{k} m_i$$
Cumulative histogram
A cumulative histogram is a mapping that counts the cumulative number of observations in all of the bins up to the specified bin. That is, the cumulative histogram $M_i$ of a histogram $m_j$ is defined as:
$$M_i = \sum_{j=1}^{i} m_j$$
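The two definitions can be illustrated with a running sum over bin counts. The counts below are hypothetical; the sketch only shows that the counts add up to the number of observations and that the last cumulative value equals that total.

```python
from itertools import accumulate

m = [4, 3, 0, 1]            # hypothetical bin counts m_1 .. m_k
n = sum(m)                  # total number of observations
M = list(accumulate(m))     # cumulative histogram M_1 .. M_k
print(n, M)
```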
Number of bins and width
There is no "best" number of bins, and different bin sizes can reveal different features of the data. Grouping data is at least as old as Graunt's work in the 17th century, but no systematic guidelines were given until Sturges's work in 1926.
Using wider bins where the density of the underlying data points is low reduces noise due to sampling randomness; using narrower bins where the density is high (so the signal drowns out the noise) gives greater precision to the density estimation. Thus varying the bin width within a histogram can be beneficial. Nonetheless, equal-width bins are widely used.
Some theoreticians have attempted to determine an optimal number of bins, but these methods generally make strong assumptions about the shape of the distribution. Depending on the actual data distribution and the goals of the analysis, different bin widths may be appropriate, so experimentation is usually needed to determine an appropriate width. There are, however, various useful guidelines and rules of thumb.
The number of bins $k$ can be assigned directly or can be calculated from a suggested bin width $h$ as:
$$k = \left\lceil \frac{\max x - \min x}{h} \right\rceil$$
The braces indicate the ceiling function.
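Converting a suggested bin width into a number of bins is a one-line computation. The helper name `bins_from_width` and the sample values are assumptions for this sketch.

```python
import math


def bins_from_width(values, h):
    """Number of bins needed to cover the data range with bins of width h."""
    return math.ceil((max(values) - min(values)) / h)


sample = [1.0, 4.2, 6.5, 9.0]          # hypothetical data, range 8
print(bins_from_width(sample, 3.0), bins_from_width(sample, 2.0))
```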
Square-root choice
$$k = \left\lceil \sqrt{n} \right\rceil$$
which takes the square root of the number of data points in the sample (used by Excel histograms and many others).
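A minimal sketch of the square-root choice, with `sqrt_bins` as a hypothetical helper name; the result is rounded up so that it is a whole number of bins.

```python
import math


def sqrt_bins(n):
    """Square-root choice: ceiling of the square root of the sample size."""
    return math.ceil(math.sqrt(n))


print(sqrt_bins(500), sqrt_bins(100))
```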
Sturges' formula
Sturges' formula is derived from a binomial distribution and implicitly assumes an approximately normal distribution.
$$k = \left\lceil \log_2 n \right\rceil + 1$$
It implicitly bases the bin sizes on the range of the data and can perform poorly if n < 30, because the number of bins will be small (fewer than seven) and unlikely to show trends in the data well. It may also perform poorly if the data are not normally distributed.
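Sturges' formula can be sketched as follows, assuming the common form with a ceiling applied to the base-2 logarithm; `sturges_bins` is a hypothetical helper name. The second call illustrates the small bin count for a sample of 30.

```python
import math


def sturges_bins(n):
    """Sturges' formula: ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1


print(sturges_bins(500), sturges_bins(30))
```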
Rice Rule
$$k = \left\lceil 2 n^{1/3} \right\rceil$$
The Rice Rule is presented as a simple alternative to Sturges' rule.
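A short sketch of the Rice Rule, with `rice_bins` as a hypothetical helper name. The cube root is computed in floating point, so for sample sizes that are perfect cubes the ceiling may be affected by rounding error.

```python
import math


def rice_bins(n):
    """Rice Rule: ceiling of twice the cube root of the sample size."""
    return math.ceil(2 * n ** (1 / 3))


print(rice_bins(500), rice_bins(100))
```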
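Doane's formula
Doane's formula is a modification of Sturges' formula which attempts to improve its performance with non-normal data.
$$k = 1 + \log_2(n) + \log_2\left(1 + \frac{|g_1|}{\sigma_{g_1}}\right)$$
where $g_1$ is the estimated 3rd-moment skewness of the distribution and
$$\sigma_{g_1} = \sqrt{\frac{6(n-2)}{(n+1)(n+3)}}$$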
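Doane's formula can be sketched as below. The helper name `doane_bins` and the sample data are assumptions; the skewness is taken as the third central moment divided by the 1.5 power of the second central moment, and the function returns the unrounded value, leaving any rounding to the caller. It assumes at least three observations that are not all identical.

```python
import math


def doane_bins(values):
    """Doane's formula; returns an unrounded number of bins."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    g1 = m3 / m2 ** 1.5                             # moment skewness
    sigma_g1 = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return 1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1)


symmetric = [1, 2, 3, 4, 5, 6, 7, 8]     # zero skewness
skewed = [1, 1, 1, 1, 2, 2, 3, 10]       # long right tail
print(doane_bins(symmetric), doane_bins(skewed))
```

For symmetric data the correction term vanishes and the value reduces to 1 + log2(n); skewed data receives additional bins.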
Scott's normal reference rule
$$h = \frac{3.5\,\hat{\sigma}}{n^{1/3}}$$
where $\hat{\sigma}$ is the sample standard deviation. Scott's normal reference rule is optimal for random samples of normally distributed data, in the sense that it minimizes the integrated mean squared error of the density estimate.
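A minimal sketch of Scott's rule using the standard library. The helper name `scott_width` and the data are assumptions; the constant 3.5 follows the text here, and `statistics.stdev` computes the sample standard deviation (with n - 1 in the denominator).

```python
import statistics


def scott_width(values):
    """Scott's normal reference rule for the bin width h."""
    n = len(values)
    return 3.5 * statistics.stdev(values) / n ** (1 / 3)


sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data
h = scott_width(sample)
print(h)
```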
Freedman-Diaconis' choice
The Freedman-Diaconis rule is:
$$h = 2\,\frac{\operatorname{IQR}(x)}{n^{1/3}}$$
which is based on the interquartile range, denoted by IQR. It replaces the $3.5\hat{\sigma}$ of Scott's rule with 2 IQR, which is less sensitive than the standard deviation to outliers in the data.
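The Freedman-Diaconis width can be sketched with `statistics.quantiles` (available in Python 3.8 and later). The helper name `freedman_diaconis_width` and the data are assumptions, and the quartiles use that function's default "exclusive" method; other quantile conventions give slightly different IQR values.

```python
import statistics


def freedman_diaconis_width(values):
    """Freedman-Diaconis rule: h = 2 * IQR / n^(1/3)."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) / len(values) ** (1 / 3)


sample = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical data
h = freedman_diaconis_width(sample)
print(h)
```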
Minimizing cross-validation estimated squared error
This approach of minimizing the integrated mean squared error from Scott's rule can be generalized beyond normal distributions by using leave-one-out cross-validation:
$$\underset{h}{\operatorname{arg\,min}}\ \hat{J}(h) = \underset{h}{\operatorname{arg\,min}} \left( \frac{2}{(n-1)h} - \frac{n+1}{n^2(n-1)h} \sum_k N_k^2 \right)$$
Here, $N_k$ is the number of data points in the $k$th bin, and choosing the value of $h$ that minimizes $\hat{J}$ will minimize the integrated mean squared error.
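The cross-validation criterion can be evaluated over a list of candidate widths and the smallest value picked. In this sketch the names `cv_risk` and `best_width`, the candidate list and the data are assumptions, and bins are anchored at the sample minimum, which is one possible choice of origin.

```python
import math
from collections import Counter


def cv_risk(values, h):
    """Leave-one-out cross-validation estimate J(h) for bin width h."""
    n = len(values)
    lo = min(values)
    # N_k: number of data points in each occupied bin (origin at the minimum).
    counts = Counter(math.floor((v - lo) / h) for v in values)
    sum_sq = sum(c * c for c in counts.values())
    return 2 / ((n - 1) * h) - (n + 1) / (n ** 2 * (n - 1) * h) * sum_sq


def best_width(values, candidates):
    """Candidate width with the smallest estimated risk."""
    return min(candidates, key=lambda h: cv_risk(values, h))


data = [0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3]   # hypothetical sample
chosen = best_width(data, [0.25, 0.5, 1.0, 2.0])
print(chosen, cv_risk(data, chosen))
```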
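Shimazaki and Shinomoto's choice
The choice is based on minimization of an estimated $L^2$ risk function
$$\underset{h}{\operatorname{arg\,min}}\ \frac{2\bar{m} - v}{h^2}$$
where $\bar{m}$ and $v$ are the mean and biased variance of the bin counts of a histogram with bin width $h$, $\bar{m} = \frac{1}{k}\sum_{i=1}^{k} m_i$ and $v = \frac{1}{k}\sum_{i=1}^{k} (m_i - \bar{m})^2$.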
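A sketch of this selection, assuming the cost (2 * mean - variance) / h^2 with the mean and biased variance taken over the bin counts, and a search over equal-width binnings of the data range. The names `ss_cost` and `ss_best_bins`, the search limit and the data are assumptions for illustration.

```python
def ss_cost(counts, h):
    """Estimated risk (2 * mean - biased variance) / h^2 for given bin counts."""
    k = len(counts)
    mean = sum(counts) / k
    var = sum((c - mean) ** 2 for c in counts) / k   # biased variance
    return (2 * mean - var) / h ** 2


def ss_best_bins(values, max_bins):
    """Number of equal-width bins (1..max_bins) with the smallest cost."""
    lo, hi = min(values), max(values)
    best_k, best_cost = None, None
    for k in range(1, max_bins + 1):
        h = (hi - lo) / k
        counts = [0] * k
        for v in values:
            # The maximum value is counted in the last bin.
            counts[min(int((v - lo) / h), k - 1)] += 1
        cost = ss_cost(counts, h)
        if best_cost is None or cost < best_cost:
            best_k, best_cost = k, cost
    return best_k


data = [0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3]   # hypothetical sample
best = ss_best_bins(data, 5)
print(best)
```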
Source of the article : Wikipedia