[Search] The Porter stemming algorithm explained in detail (2)
The stemmer works in five steps, each applying the substitution rules and conditions introduced earlier.
Each line gives the rule on the left and, on the right, examples in lowercase letters showing where the rule succeeds or fails. An empty right-hand side in a rule means the suffix is simply deleted.
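The conditions that appear in the rules (m, *v*, *d, *o) can be sketched in Python as follows. This is a minimal illustration that follows Porter's published definitions; the function names are hypothetical and chosen only for this example.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """True if word[i] is a consonant; 'y' counts as a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m: the number of vowel-to-consonant transitions (VC pairs) in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m

def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    """*d: the stem ends with a double consonant (e.g. -tt, -ss)."""
    return len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1)

def ends_cvc(stem):
    """*o: the stem ends consonant-vowel-consonant, the last one not w, x or y."""
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

print(measure("tree"), measure("trouble"), measure("troubles"))  # 0 1 2
```

A condition such as (m>0) is always evaluated on the stem, that is, on the part of the word that remains once the suffix named in the rule has been taken off.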
Step 1
SSES -> SS                          caresses  -> caress
IES  -> I                           ponies    -> poni
                                    ties      -> ti
SS   -> SS                          caress    -> caress
S    ->                             cats      -> cat

(m>0) EED -> EE                     feed      -> feed
                                    agreed    -> agree
(*v*) ED  ->                        plastered -> plaster
                                    bled      -> bled
(*v*) ING ->                        motoring  -> motor
                                    sing      -> sing

If the ED or ING rule succeeds, the following rules are then applied to the result:

AT -> ATE                           conflat(ed) -> conflate
BL -> BLE                           troubl(ed)  -> trouble
IZ -> IZE                           siz(ed)     -> size
(*d and not (*L or *S or *Z)) -> single letter
                                    hopp(ing)   -> hop
                                    tann(ed)    -> tan
                                    fall(ing)   -> fall
                                    hiss(ing)   -> hiss
                                    fizz(ed)    -> fizz
(m=1 and *o) -> E                   fail(ing)   -> fail
                                    fil(ing)    -> file

(*v*) Y -> I                        happy     -> happi
                                    sky       -> sky
Step 1 deals with plurals and past participles.
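The Step 1 rules can be sketched in Python as below. This is an illustrative sketch, not a reference implementation: the helper and function names are assumptions made for this example, and within each group only the longest matching suffix is tested, as in Porter's description.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":  # 'y' is a vowel only when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m: number of vowel-to-consonant transitions in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m

def contains_vowel(stem):  # *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):  # *d
    return len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1)

def ends_cvc(stem):  # *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def step1_plurals(word):
    if word.endswith("sses") or word.endswith("ies"):
        return word[:-2]          # SSES -> SS, IES -> I
    if word.endswith("ss"):
        return word               # SS -> SS
    if word.endswith("s"):
        return word[:-1]          # S -> (deleted)
    return word

def cleanup_after_ed_ing(stem):
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"         # AT -> ATE, BL -> BLE, IZ -> IZE
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]          # double consonant -> single letter
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"         # (m=1 and *o) -> E
    return stem

def step1_participles(word):
    if word.endswith("eed"):      # EED is matched first, so ED is not tried on "feed"
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return cleanup_after_ed_ing(stem) if contains_vowel(stem) else word
    return word

def step1_y_to_i(word):
    if word.endswith("y") and contains_vowel(word[:-1]):
        return word[:-1] + "i"    # (*v*) Y -> I
    return word

def step1(word):
    return step1_y_to_i(step1_participles(step1_plurals(word)))

for w in ("caresses", "ponies", "agreed", "hopping", "filing", "happy", "sky"):
    print(w, "->", step1(w))
```

The loop at the end runs a few of the example words from the rule table through the three sub-steps in order.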
Step 2
(m>0) ATIONAL -> ATE                relational     -> relate
(m>0) TIONAL  -> TION               conditional    -> condition
                                    rational       -> rational
(m>0) ENCI    -> ENCE               valenci        -> valence
(m>0) ANCI    -> ANCE               hesitanci      -> hesitance
(m>0) IZER    -> IZE                digitizer      -> digitize
(m>0) ABLI    -> ABLE               conformabli    -> conformable
(m>0) ALLI    -> AL                 radicalli      -> radical
(m>0) ENTLI   -> ENT                differentli    -> different
(m>0) ELI     -> E                  vileli         -> vile
(m>0) OUSLI   -> OUS                analogousli    -> analogous
(m>0) IZATION -> IZE                vietnamization -> vietnamize
(m>0) ATION   -> ATE                predication    -> predicate
(m>0) ATOR    -> ATE                operator       -> operate
(m>0) ALISM   -> AL                 feudalism      -> feudal
(m>0) IVENESS -> IVE                decisiveness   -> decisive
(m>0) FULNESS -> FUL                hopefulness    -> hopeful
(m>0) OUSNESS -> OUS                callousness    -> callous
(m>0) ALITI   -> AL                 formaliti      -> formal
(m>0) IVITI   -> IVE                sensitiviti    -> sensitive
(m>0) BILITI  -> BLE                sensibiliti    -> sensible
Step 3
(m>0) ICATE -> IC                   triplicate  -> triplic
(m>0) ATIVE ->                      formative   -> form
(m>0) ALIZE -> AL                   formalize   -> formal
(m>0) ICITI -> IC                   electriciti -> electric
(m>0) ICAL  -> IC                   electrical  -> electric
(m>0) FUL   ->                      hopeful     -> hope
(m>0) NESS  ->                      goodness    -> good
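Steps 2 and 3 share one mechanism: find the longest suffix in the table that matches the word, then replace it only if the remaining stem has m > 0. A table-driven Python sketch might look like this; the names `STEP2_RULES`, `STEP3_RULES` and `apply_rules` are made up for the example.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":  # 'y' is a vowel only when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m: number of vowel-to-consonant transitions in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m

STEP2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]

STEP3_RULES = [
    ("icate", "ic"), ("ative", ""), ("alize", "al"), ("iciti", "ic"),
    ("ical", "ic"), ("ful", ""), ("ness", ""),
]

def apply_rules(word, rules):
    """Test only the longest matching suffix; replace it if the stem has m > 0."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word

def step2(word):
    return apply_rules(word, STEP2_RULES)

def step3(word):
    return apply_rules(word, STEP3_RULES)

print(step2("relational"), step2("rational"), step3(step2("hopefulness")))
```

Trying the longest suffix first is what keeps "rational" unchanged: ATIONAL matches, its stem "r" has m = 0, and the shorter TIONAL rule is never tried.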
Step 4
(m>1) AL    ->                      revival     -> reviv
(m>1) ANCE  ->                      allowance   -> allow
(m>1) ENCE  ->                      inference   -> infer
(m>1) ER    ->                      airliner    -> airlin
(m>1) IC    ->                      gyroscopic  -> gyroscop
(m>1) ABLE  ->                      adjustable  -> adjust
(m>1) IBLE  ->                      defensible  -> defens
(m>1) ANT   ->                      irritant    -> irrit
(m>1) EMENT ->                      replacement -> replac
(m>1) MENT  ->                      adjustment  -> adjust
(m>1) ENT   ->                      dependent   -> depend
(m>1 and (*S or *T)) ION ->         adoption    -> adopt
(m>1) OU    ->                      homologou   -> homolog
(m>1) ISM   ->                      communism   -> commun
(m>1) ATE   ->                      activate    -> activ
(m>1) ITI   ->                      angulariti  -> angular
(m>1) OUS   ->                      homologous  -> homolog
(m>1) IVE   ->                      effective   -> effect
(m>1) IZE   ->                      bowdlerize  -> bowdler
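Step 4 only deletes suffixes, and it does so only when the remaining stem has m > 1; ION additionally requires the stem to end in s or t. The following Python sketch illustrates this under the same longest-match assumption; `STEP4_SUFFIXES` and `step4` are hypothetical names.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":  # 'y' is a vowel only when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m: number of vowel-to-consonant transitions in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m

STEP4_SUFFIXES = [
    "al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement", "ment",
    "ent", "ion", "ou", "ism", "ate", "iti", "ous", "ive", "ize",
]

def step4(word):
    # Longest suffix first, so "replacement" is tested against EMENT, not MENT or ENT.
    for suffix in sorted(STEP4_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if suffix == "ion" and not stem.endswith(("s", "t")):
                return word       # ION needs a stem ending in S or T
            return stem if measure(stem) > 1 else word
    return word

for w in ("revival", "replacement", "adjustment", "adoption", "relate"):
    print(w, "->", step4(w))
```

The last word, "relate", is left alone because the stem "rel" has m = 1, which shows why the threshold here is stricter than in steps 2 and 3.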
After the previous four steps the suffixes have been removed; the last step only does some final tidying up.
Step 5
(m>1) E ->                          probate  -> probat
                                    rate     -> rate
(m=1 and not *o) E ->               cease    -> ceas

(m>1 and *d and *L) -> single letter
                                    controll -> control
                                    roll     -> roll
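A Python sketch of Step 5 follows. It is an illustration under the same definitions of m and *o as in the rules; the function names are assumptions for this example.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":  # 'y' is a vowel only when it follows a consonant
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """m: number of vowel-to-consonant transitions in the stem."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m

def ends_cvc(stem):  # *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def step5_final_e(word):
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        # (m>1) E -> deleted, or (m=1 and not *o) E -> deleted
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word

def step5_double_l(word):
    # (m>1 and *d and *L) -> single letter
    if word.endswith("ll") and measure(word) > 1:
        return word[:-1]
    return word

def step5(word):
    return step5_double_l(step5_final_e(word))

for w in ("probate", "rate", "cease", "controll", "roll"):
    print(w, "->", step5(w))
```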
Researchers who have evaluated the Porter algorithm found that stemming can significantly improve recall. Light stemming has little effect on precision, but deep (aggressive) stemming can seriously reduce it. They therefore recommend using light stemming first and falling back to deep stemming only when the query returns too few results.
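The light-first, deep-as-fallback strategy could be wired up roughly as follows. Everything in this sketch is hypothetical: the two toy stemmers are crude stand-ins (not the Porter algorithm), and the in-memory `docs` dictionary and the `min_results` threshold exist only to make the example runnable.

```python
def light_stem(word):
    """Toy light stemmer (assumption): only strips a plural 's'."""
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def deep_stem(word):
    """Toy deep stemmer (assumption): also strips a few derivational suffixes."""
    for suffix in ("ion", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def search(docs, query, stem):
    """Ids of documents sharing at least one stemmed term with the query."""
    wanted = {stem(term) for term in query.lower().split()}
    return [doc_id for doc_id, text in docs.items()
            if wanted & {stem(term) for term in text.lower().split()}]

def search_with_fallback(docs, query, min_results):
    """Try light stemming first; use deep stemming only if too few hits come back."""
    hits = search(docs, query, light_stem)
    if len(hits) >= min_results:
        return hits
    return search(docs, query, deep_stem)

docs = {
    1: "connected devices",
    2: "connection lost",
    3: "connect the cable",
}
print(search_with_fallback(docs, "connect", min_results=1))  # [3]
print(search_with_fallback(docs, "connect", min_results=2))  # [1, 2, 3]
```

With a threshold of 1 the light pass already finds the exact match; with a threshold of 2 the deeper pass is used, which raises recall by also matching "connected" and "connection".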
[Search] The Porter stemming algorithm explained in detail (3)