Modified Naive Bayes with Hurst Exponent as Quantitative Measure of Data Mutual Dependence

Yandex School of Data Analysis Conference
Machine Learning: Prospects and Applications

yandexdataschool.com/conference

The so-called naive Bayes classifier is based on the assumption that the
characteristics (features) in question are independent. That is, the estimated
probability of an event with a given set of features is, according to Bayes,
based on the product of the conditional probabilities of the event given each
of those features.

The practical usefulness of the Bayes estimate is determined mostly not by its
precision but by its covariance with the observed probability: the higher the
observed event probability, the higher the estimate.

An essential condition for the Bayes estimate to keep this covariance property
is that the number of features defining it stays constant.

For example, consider the important task of CTR prediction, that is, predicting
the clickthrough rate of an online ad banner from the known CTR statistics
for banners of a given company, a given type of goods, etc. Almost all
conditional probabilities in the Bayes product are low (lower than 1%).
Therefore, a banner with many characteristics will receive an even more
conservative estimate than a banner with few, and the estimate loses its
covariance.
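
The effect can be illustrated with a minimal sketch. The two banners and their per-feature CTR values below are hypothetical numbers chosen for illustration, and `bayes_product` is a name introduced here, not one from the talk.

```python
import math

def bayes_product(cond_probs):
    """Product of per-feature conditional probabilities (the 'Bayes product')."""
    return math.prod(cond_probs)

# Hypothetical banners: each value is an assumed CTR conditional on one feature.
banner_few = [0.004, 0.005]                  # two known features, lower CTRs
banner_many = [0.009, 0.009, 0.009, 0.009]   # four known features, higher CTRs

score_few = bayes_product(banner_few)
score_many = bayes_product(banner_many)

# Every feature of banner_many has a higher CTR, yet its product is smaller
# simply because it has more factors below 1.
print(score_few, score_many)
```

In this sketch the banner with the better per-feature statistics ends up with the smaller score, which is the loss of covariance described above.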

The simplest remedy when the number of characteristics varies is to move
from the product to the geometric mean of the conditional probabilities.
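
A minimal sketch of the geometric-mean variant follows. The per-feature CTR values are hypothetical and `geometric_mean_score` is an illustrative name; the talk does not prescribe an implementation.

```python
import math

def geometric_mean_score(cond_probs):
    """Geometric mean of conditional probabilities, computed as exp(mean(log p))."""
    if not cond_probs:
        raise ValueError("at least one conditional probability is required")
    return math.exp(sum(math.log(p) for p in cond_probs) / len(cond_probs))

# Hypothetical banners with different numbers of known features.
banner_few = [0.004, 0.005]
banner_many = [0.009, 0.009, 0.009, 0.009]

gm_few = geometric_mean_score(banner_few)
gm_many = geometric_mean_score(banner_many)

# The score no longer shrinks merely because a banner has more features.
print(gm_few, gm_many)
```

Working in log space avoids underflow when many small probabilities are combined, and dividing by the number of factors removes the direct dependence on feature count.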

To further improve the covariance of the classifier when the number of
characteristics varies, it is necessary to analyze how the Bayesian product
depends on the number of factors.

If the total number of characteristics is large (thousands) while only a small
part of them (tens) is used to classify each event, it is logical to expect
that the logarithm of the Bayesian product grows asymptotically linearly
with the number of factors.
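
One way to see why linear behavior is plausible is sketched below. The symbols $p_i$ (the $i$-th conditional probability), $n$ (the number of factors), and $\mu$ (the average log conditional probability) are notation assumed here for illustration and do not come from the talk.

```latex
\log \prod_{i=1}^{n} p_i \;=\; \sum_{i=1}^{n} \log p_i \;\approx\; n\,\mu ,
\qquad \mu = \text{average of } \log p_i .
```

Since each $p_i < 1$, $\mu$ is negative, so the linear trend is in the magnitude of the logarithm. Dividing by $n$ recovers the logarithm of the geometric mean.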

This work presents a modification of the Bayes classifier based on
the next term of the asymptotic expansion of the logarithm of the Bayes
product and on extracting the Hurst exponent.

The Hurst exponent arises from the self-similarity of the data and is a
quantitative measure of their mutual dependence.
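
The abstract does not say how the exponent is extracted. As a generic illustration only, the sketch below uses the standard aggregated-variance estimator, which assumes that the variance of block means scales as m^(2H - 2) for block size m. The function name, block sizes, and synthetic data are all assumptions made here.

```python
import math
import random
import statistics

def hurst_aggregated_variance(series, block_sizes):
    """Estimate H from Var(block means) ~ m**(2H - 2) with a log-log line fit."""
    xs, ys = [], []
    for m in block_sizes:
        n_blocks = len(series) // m
        if n_blocks < 2:
            continue  # not enough blocks to estimate a variance
        means = [sum(series[i * m:(i + 1) * m]) / m for i in range(n_blocks)]
        xs.append(math.log(m))
        ys.append(math.log(statistics.pvariance(means)))
    if len(xs) < 2:
        raise ValueError("need at least two usable block sizes")
    # Ordinary least-squares slope of log-variance against log-block-size.
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return 1.0 + slope / 2.0

# Synthetic independent data: the estimate should land near H = 0.5.
rng = random.Random(0)
independent = [rng.gauss(0.0, 1.0) for _ in range(16384)]
h_independent = hurst_aggregated_variance(independent, [4, 8, 16, 32, 64, 128, 256])
print(h_independent)
```

Independent data give a value near 0.5, so a departure from 0.5 is what signals mutual dependence in this kind of estimate.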

The experimental results confirmed the assumption underlying the
research: the additional information, in the form of a fractal dimension,
makes a positive contribution to prediction.