Barbara Żogała-Siudem, Szymon Jaroszewicz. Variable screening for Lasso based on multidimensional indexing. Data Mining and Knowledge Discovery, 38(1), pages 49-78, 2024.

Abstract:
In this paper we present a correlation-based safe screening technique for building the complete Lasso path. Unlike many other Lasso screening approaches, we do not consider prespecified values of the regularization parameter but instead prune variables which cannot be the next best feature to be added to the model. Based on those results we present a modified homotopy algorithm for computing the regularization path. We demonstrate that, even though our algorithm provides the complete Lasso path, its performance is competitive with state-of-the-art algorithms which, however, only provide solutions at a prespecified sample of regularization parameters. We also address problems of extremely high dimensionality, where the variables may not fit into main memory and are assumed to be stored on disk. A multidimensional index is used to quickly retrieve potentially relevant variables. We apply the approach to the important case when multiple models are built against a fixed set of variables, frequently encountered in statistical databases. We perform experiments using the complete Eurostat database as predictors and demonstrate that our approach allows for practical and efficient construction of Lasso models, which remain accurate and interpretable even when millions of highly correlated predictors are present.
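
The screening idea rests on the basic homotopy/LARS fact that the next variable to enter the Lasso model is the one most correlated (in absolute value) with the current residual. Below is a minimal sketch of that selection step only, not the paper's pruning test or index structure; all names are illustrative:

```python
# Minimal sketch: the "next best feature" criterion that correlation-based
# screening prunes against (illustrative, not the paper's algorithm).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized predictors
y = X[:, 3] - 2 * X[:, 7] + rng.standard_normal(100)

r = y - y.mean()                           # residual of the empty model
corr = np.abs(X.T @ r) / len(y)            # |correlation| with the residual
print(int(np.argmax(corr)))                # candidate next feature; screening
# discards variables whose correlation bound cannot exceed this maximum
```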
Krzysztof Rudaś, Szymon Jaroszewicz. Regularization for Uplift Regression. In Proc. of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD'23), pages 593-608, 2023.

Abstract:
We address the problem of regularization of linear regression models in uplift modeling and heterogeneous treatment effect estimation. We consider interaction models which are commonly used by statisticians in medicine and social sciences to estimate the causal effect of a treatment, and introduce a new type of such a model. We demonstrate the equivalence of all interaction models when no regularization is present, and that this is no longer the case when the model is regularized. Interaction terms introduce implicit correlations between treatment and control coefficients into the regularizer, a fact which has not been previously noted. The correlations depend on the type of interaction model, and by interpreting the regularizer as a prior distribution we were able to pinpoint cases when a given regularized interaction model is most appropriate. An interesting property of the proposed new interaction type is that it allows for smooth interpolation between two types of uplift regression models: the double model and the transformed target model. Our results are valid for both ridge ($L_2$) and Lasso ($L_1$) regularization. Experiments on synthetic data fully confirm our analyses. We also compare the usefulness of various regularization schemes on real data.
Szymon Jaroszewicz, Krzysztof Rudaś. Shrinkage Estimators for the Intercept in Linear and Uplift Regression. Scientific Annals of Computer Science, 33(1), pages 35-52, 2023.

Abstract:
Shrinkage estimators modify classical statistical estimators by scaling them towards zero in order to decrease their prediction error. We propose shrinkage estimators for linear regression models which explicitly take into account the presence of the intercept term, shrinking it independently from other coefficients. This is different from current shrinkage estimators, which treat the intercept just as an ordinary regression coefficient. We demonstrate that the proposed approach brings systematic improvements in prediction accuracy if the true intercept term differs in magnitude from other coefficients, which is often the case in practice. We then generalize the approach to uplift regression, which aims to predict the causal effect of a specific action on an individual with given characteristics. In this case the proposed estimators improve prediction accuracy over previously proposed shrinkage estimators and achieve impressive performance gains over the original models.
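
As a toy illustration of the idea (my own sketch under simplified assumptions, not the paper's estimator or its optimal shrinkage factor), one can shrink the OLS slope coefficients while deliberately leaving the intercept untouched:

```python
# Sketch: shrink slopes toward zero, keep the intercept unshrunk.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = 10.0 + X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.standard_normal(200)

Xc = np.column_stack([np.ones(len(y)), X])    # prepend intercept column
beta = np.linalg.lstsq(Xc, y, rcond=None)[0]  # [intercept, slopes...]

c = 0.9                                       # illustrative shrinkage factor
beta_shrunk = beta.copy()
beta_shrunk[1:] *= c                          # shrink slopes only;
# beta_shrunk[0], the intercept, is deliberately left alone
```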
Nevo Itzhak, Szymon Jaroszewicz, Robert Moskovitch. Continuous Prediction of a Time Intervals-Related Pattern's Completion. Knowledge and Information Systems, 65(11), pages 4797-4846, 2023.

Abstract:
In many daily applications, such as meteorology or patient data, the starting and ending times of events are stored in a database, resulting in time interval data. Discovering patterns from time interval data can reveal informative patterns in which the time intervals are related by temporal relations, such as before or overlaps. When multiple temporal variables are sampled in a variety of forms and frequencies, along with irregular events that may or may not have a duration, time interval patterns can be a powerful way to discover temporal knowledge, since these temporal variables can be transformed into a uniform format of time intervals. Predicting the completion of such patterns can be useful when the pattern ends with an event of interest, such as the recovery of a patient, or an undesirable event, such as a medical complication. In recent years, an increasing number of studies have been published on time intervals-related patterns (TIRPs), their discovery, and their use as features for classification. However, as far as we know, no study has investigated the prediction of the completion of a TIRP. The main challenge in such completion prediction occurs when the time intervals are coinciding but not yet finished, which introduces uncertainty into the evolving temporal relations, and thus into the TIRP's evolution process. To overcome this challenge, we propose a new structure to represent the TIRP's evolution process and calculate the TIRP's completion probabilities over time. We introduce two continuous prediction models (CPMs), the segmented continuous prediction model (SCPM) and the fully continuous prediction model (FCPM), to estimate the TIRP's completion probability. With the SCPM, the TIRP's completion probability changes only at the starting or ending points of the TIRP's time intervals. The FCPM incorporates, in addition, the duration between the starting and ending time points of the TIRP's time intervals. A rigorous evaluation on four real-life medical and non-medical datasets was performed. The FCPM outperformed the SCPM and the baseline models (random forest, artificial neural network, and recurrent neural network) for all datasets. However, there is a trade-off between prediction performance and earliness: as more of the TIRP's time intervals' starting and ending time points are revealed over time, the CPMs' prediction performance increases.
Nevo Itzhak, Szymon Jaroszewicz, Robert Moskovitch. Continuously Predicting the Completion of a Time Intervals Related Pattern. In The 27th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'23), pages 239-251, Osaka, Japan, May, 2023.

Abstract:
In various domains, such as meteorology or patient data, events' durations are stored in a database, resulting in symbolic time interval (STI) data. Additionally, using temporal abstraction techniques, time point series can be transformed into STI data. Mining STI data for frequent time intervals-related patterns (TIRPs) has been studied in recent decades. However, for the first time, we explore here how to continuously predict a TIRP's completion, which can potentially be applied to patterns that end with an event of interest, such as a medical complication, in order to predict that event. The main challenge in performing such completion prediction occurs when the time intervals are coinciding but not yet finished, which introduces uncertainty into the evolving temporal relations, and thus into the TIRP's evolution process. In this study, we introduce a new structure to overcome this challenge and several continuous prediction models (CPMs). In the segmented CPM (SCPM), the completion probability depends only on the pattern's STIs' starting and ending points, while a machine learning-based CPM (CPML) incorporates the duration between the beginning and end times of the pattern's STIs. Our experiments show that, overall, CPML based on an ANN performed better than the other CPMs, but CPML based on NB or RF provided the earliest predictions.
B. Żogała-Siudem, S. Jaroszewicz. Fast stepwise regression based on multidimensional indexes. Information Sciences, 549, pages 288-309, 2021.

Abstract:
We present an approach to efficiently construct stepwise regression models in a very high dimensional setting using a multidimensional index. The approach is based on an observation that the collections of available predictor variables often remain relatively stable and many models are built based on the same predictors. Example scenarios include data warehouses against which multiple ad hoc analytical models are built or collections of publicly available open data which remain relatively fixed and are used as a source of predictor variables for many models. We propose an approach where the user simply provides a target variable and the algorithm uses a pre-built multidimensional index to automatically select predictors from millions of available variables, yielding results identical to standard stepwise regression, but an order of magnitude faster. The algorithm has been tested on the large statistical database available from Eurostat, and has been demonstrated to produce interpretable and accurate models. We demonstrate experimentally that our approach produces results that are significantly better than other approaches to modeling with ultra-high dimensional data. Finally, we discuss potential pitfalls such as the presence of highly correlated variables, and show how they can be overcome.
R. M. Gubela, S. Lessmann, S. Jaroszewicz. Response transformation and profit decomposition for revenue uplift modeling. European Journal of Operational Research, 283(2), pages 647-661, 2020.

Abstract:
Uplift models support decision-making in marketing campaign planning. Estimating the causal effect of a marketing treatment, an uplift model facilitates targeting marketing actions to responsive customers and efficient allocation of marketing budget. Research into uplift models focuses on conversion models to maximize incremental sales. The paper introduces uplift models for maximizing incremental revenues. If customers differ in their spending behavior, revenue maximization is a more plausible business objective compared to maximizing conversions. The proposed methodology entails a transformation of the prediction target, customer-level revenues, that facilitates implementing a causal uplift model using standard machine learning algorithms. The distribution of campaign revenues is typically zero-inflated because of many non-buyers. Remedies to this modeling challenge are incorporated in the proposed revenue uplift strategies in the form of two-stage models. Empirical experiments using real-world e-commerce data confirm the merits of the proposed revenue uplift strategy over relevant alternatives, including uplift models for conversion and recently developed causal machine learning algorithms. To quantify the degree to which improved targeting decisions raise return on marketing, the paper develops a decomposition of campaign profit. Applying the decomposition to a digital coupon targeting campaign, the paper provides evidence that revenue uplift modeling, as well as causal machine learning, can improve campaign profit substantially.
Krzysztof Rudaś, Szymon Jaroszewicz. Shrinkage Estimators for Uplift Regression. In Proc. of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD'19), Würzburg, Germany, Sep, 2019.

Abstract:
Uplift modeling is an approach to machine learning which allows for predicting the net effect of an action (with respect to not taking the action). To achieve this, the training population is divided into two parts: the treatment group, which is subjected to the action, and the control group, on which the action is not taken. Our task is to construct a model which will predict the difference between outcomes in the treatment and control groups conditional on individual objects' features. When the group assignment is random, the model admits a causal interpretation. When we assume linear responses in both groups, the simplest way of estimating the net effect of the action on an individual is to build two separate linear ordinary least squares (OLS) regressions on the treatment and control groups and compute the difference between their predictions. In classical linear models, improvements in accuracy can be achieved through the use of so-called shrinkage estimators such as the well-known James-Stein estimator, which has a provably lower mean squared error than the OLS estimator. In this paper we investigate the use of shrinkage estimators in the uplift modeling problem. Unfortunately direct generalization of the James-Stein estimator does not lead to improved predictions, nor does shrinking treatment and control models separately. Therefore, we propose a new uplift shrinkage method where estimators in the treatment and control groups are shrunk jointly so as to minimize the error in the predicted net effect of the action. We prove that the proposed estimator does indeed improve on the double regression estimator.
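
For reference, the classical James-Stein estimator mentioned above has the following textbook form (this is the standard estimator the paper takes as a starting point, not the proposed joint uplift shrinkage method):

```python
# Classical James-Stein shrinkage for x ~ N(theta, sigma2 * I), p >= 3.
import numpy as np

def james_stein(x, sigma2):
    p = len(x)
    factor = 1.0 - (p - 2) * sigma2 / np.dot(x, x)
    # the "positive part" variant clips the factor at zero
    return max(factor, 0.0) * x
```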
Krzysztof Rudaś, Szymon Jaroszewicz. Linear regression for uplift modeling. Data Mining and Knowledge Discovery, 32(5), pages 1275-1305, Sep, 2018.

Abstract:
In many applications, the purpose of statistical modeling is to select targets for some action, such as a medical treatment or a marketing campaign. Unfortunately, classical machine learning algorithms are not well suited to this task, since they predict the results after the action, not its causal impact. The answer to this problem is uplift modeling, which, in addition to the usual training set containing objects on which the action was taken, uses an additional control group of objects not subjected to it. The predicted true effect of the action on a given individual is modeled as the difference between responses in the two groups. This paper analyzes two uplift modeling approaches to linear regression, one based on the use of two separate models and the other based on target variable transformation. Adapting the second estimator to the problem of regression is one of the contributions of the paper. We identify the situations when each model performs best and, contrary to several claims in the literature, show that the double model approach has favorable theoretical properties and often performs well in practice. Finally, based on our analysis we propose a third model which combines the benefits of both approaches and seems to be the model of choice for uplift linear regression. Experimental analysis confirms our theoretical results on both simulated and real data, clearly demonstrating good performance of the double model and the advantages of the proposed approach.
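
A minimal sketch of the double-model baseline analyzed in the paper (my own illustrative code, assuming responses y_t, y_c and features X_t, X_c for the treatment and control groups):

```python
# Double-model uplift regression: two separate OLS fits, uplift = difference.
import numpy as np

def fit_ols(X, y):
    Xc = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(Xc, y, rcond=None)[0]

def predict(beta, X):
    return np.column_stack([np.ones(len(X)), X]) @ beta

def uplift_double_model(X_t, y_t, X_c, y_c, X_new):
    beta_t = fit_ols(X_t, y_t)          # model of the treatment group
    beta_c = fit_ols(X_c, y_c)          # model of the control group
    return predict(beta_t, X_new) - predict(beta_c, X_new)
```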
Oskar Jarczyk, Szymon Jaroszewicz, Adam Wierzbicki, Kamil Pawlak, Michal Jankowski-Lorek. Surgical teams on GitHub: Modeling performance of GitHub project development processes. Information and Software Technology, 100, pages 32-46, 2018.

Abstract:
Context: Better methods of evaluating the process performance of OSS projects can benefit decision makers who consider adoption of OSS software in a company. This article studies the closure of issues (bugs and features) in GitHub projects, which is an important measure of OSS development process performance and of the quality of support that project users receive from the developer team. Objective: The goal of this article is a better understanding of the factors that affect issue closure rates in OSS projects. Methodology: The GHTorrent repository is used to select a large sample of mature, active OSS projects. Using survival analysis, we calculate short-term and long-term issue closure rates. We formulate several hypotheses regarding the impact of OSS project and team characteristics, such as measures of work centralization, measures that reflect internal project workflows, and developer social network measures, on issue closure rates. Based on the proposed features and several control features, a model is built that can predict issue closure rate. The model allows us to test our hypotheses. Results: We find that large teams with many project members have lower issue closure rates than smaller teams. Similarly, increased work centralization increases issue closure rates. While desirable social network characteristics have a positive impact on the number of commits in a project, they do not have a significant influence on issue closure. Conclusion: Overall, findings from the empirical analysis support Brooks's classic notion of the “surgical team” in the context of OSS project development process performance on GitHub. The model of issue closure rates proposed in this article is a first step towards an improved understanding and prediction of this important measure of OSS development process performance.
M. Sołtys, S. Jaroszewicz. Boosting algorithms for uplift modeling. CoRR, abs/1807.07909, 2018, arXiv preprint.

Abstract:
Uplift modeling is an area of machine learning which aims at predicting the causal effect of some action on a given individual. The action may be a medical procedure, marketing campaign, or any other circumstance controlled by the experimenter. Building an uplift model requires two training sets: the treatment group, where individuals have been subject to the action, and the control group, where no action has been performed. An uplift model then allows one to assess the gain resulting from taking the action on a given individual, such as the increase in probability of patient recovery or of a product being purchased. This paper describes an adaptation of the well-known boosting techniques to the uplift modeling case. We formulate three desirable properties which an uplift boosting algorithm should have. Since all three properties cannot be satisfied simultaneously, we propose three uplift boosting algorithms, each satisfying two of them. Experiments demonstrate the usefulness of the proposed methods, which often dramatically improve performance of the base models and are thus new and powerful tools for uplift modeling.
Ł. Zaniewicz, S. Jaroszewicz. Lp-Support vector machines for uplift modeling. Knowledge and Information Systems, 53(1), pages 269-296, Oct, 2017.

Abstract:
Uplift modeling is a branch of machine learning which aims to predict not the class itself, but the difference between the class variable behavior in two groups: treatment and control. Objects in the treatment group have been subjected to some action, while objects in the control group have not. By including the control group, it is possible to build a model which predicts the causal effect of the action for a given individual. In this paper, we present a variant of support vector machines designed specifically for uplift modeling. The SVM optimization task has been reformulated to explicitly model the difference in class behavior between two datasets. The model predicts whether a given object will have a positive, neutral or negative response to a given action, and by tuning a parameter of the model the analyst is able to influence the relative proportion of neutral predictions and thus the conservativeness of the model. Further, we extend $L_p$-SVMs to the case of uplift modeling and demonstrate that they allow for a more stable selection of the size of negative, neutral and positive groups. Finally, we present quadratic and convex optimization methods for efficiently solving the two proposed optimization tasks.
S. Jaroszewicz. Uplift Modeling. In Encyclopedia of Machine Learning and Data Mining, pages 1304-1309, Springer US, Boston, MA, 2017.

Abstract:
Uplift modeling is a machine learning technique which aims at predicting, on the level of individuals, the gain from performing a given action with respect to refraining from taking it. Examples include medical treatments and direct marketing campaigns where the rate of spontaneous recovery and the background purchase rate need to be taken into account to assess the true gains from taking an action. Uplift modeling addresses this problem by using two training sets: the treatment dataset containing data on objects on which the action has been taken and the control dataset containing data on objects left untreated. A model is then built which predicts the difference between outcomes after treatment and without it conditional on available predictor variables. An obvious approach to uplift modeling is to build two separate models on both training sets and subtract their predictions. In many cases, better results can be obtained with models which predict the difference in outcomes directly. A popular class of uplift models are decision trees with splitting criteria favoring tests which promote differences between treatment and control groups. Ensemble methods have proven to be particularly useful in uplift modeling, often leading to significant increases in performance over the base learners. Linear models, such as logistic regression and support vector machines, have also been adapted to this setting. Dedicated methods, such as uplift or qini curves, are necessary for evaluating uplift models. Application of the methodology to survival data and scenarios with more than one possible action have also been considered.
M. Jankowski-Lorek, S. Jaroszewicz, Ł. Ostrowski, A. Wierzbicki. Verifying social network models of Wikipedia knowledge community. Information Sciences, 339, pages 158-174, 2016.

Abstract:
The Wikipedia project has created one of the largest and best-known open knowledge communities. This community is a model for several similar efforts, both public and commercial, and even for the knowledge economy of the future e-society. For these reasons, issues of quality, social processes, and motivation within the Wikipedia knowledge community have attracted the attention of researchers. Research has often used Social Network Analysis applied to networks created from behavioral data available in the edit history of Wikipedia. This paper asks the following question: are the popular assumptions about the social interpretations of networks created from the edit history valid? We verify commonly assumed interpretations of four types of networks created from discussions on Wikipedia talk pages, co-edits and reverts in Wikipedia articles, and edits of articles in various topics, by comparing these networks with results from a survey of editors of the Polish Wikipedia community. The results indicate that while the behavioral networks are strongly related to the declarations of respondents, only in the case of the network created from talk pages and interpreted as acquaintance can we observe a near equivalence. The article then describes improved definitions of behavioral indicators obtained through machine learning. The improved networks are much closer to their declarative counterparts. The main contribution of the article is a validated model of an acquaintance network among Wikipedia editors that can be derived from behavioral data and validly interpreted as acquaintance. Other contributions are improved versions of behavioral networks based on editing behavior and discussion history on Wikipedia.
S. Jaroszewicz, Ł. Zaniewicz. Székely Regularization for Uplift Modeling. In Challenges in Computational Statistics and Data Mining, pages 135-154, Springer International Publishing, Cham, 2016.

Abstract:
Uplift modeling is a subfield of machine learning concerned with predicting the causal effect of an action at the level of individuals. This is achieved by using two training sets: treatment, containing objects which have been subjected to an action and control, containing objects on which the action has not been performed. An uplift model then predicts the difference between conditional success probabilities in both groups. Uplift modeling is best applied to training sets obtained from randomized controlled trials, but such experiments are not always possible, in which case treatment assignment is often biased. In this paper we present a modification of Uplift Support Vector Machines which makes them less sensitive to such a bias. This is achieved by including in the model formulation an additional term which penalizes models which score treatment and control groups differently. We call the technique Székely regularization since it is based on the energy distance proposed by Székely and Rizzo. An optimization algorithm based on stochastic gradient descent techniques has also been developed. We demonstrate experimentally that the proposed regularization term does indeed produce uplift models which are less sensitive to biased treatment assignment.
M. Sołtys, S. Jaroszewicz, P. Rzepakowski. Ensemble methods for uplift modeling. Data Mining and Knowledge Discovery, 29(6), pages 1531-1559, Nov, 2015.

Abstract:
Uplift modeling is a branch of machine learning which aims at predicting the causal effect of an action such as a marketing campaign or a medical treatment on a given individual by taking into account responses in a treatment group, containing individuals subject to the action, and a control group serving as a background. The resulting model can then be used to select individuals for whom the action will be most profitable. This paper analyzes the use of ensemble methods: bagging and random forests in uplift modeling. We perform an extensive experimental evaluation to demonstrate that the application of those methods often results in spectacular gains in model performance, turning almost useless single models into highly capable uplift ensembles. The gains are much larger than those achieved in the case of standard classification. We show that those gains are a result of high ensemble diversity, which in turn is a result of the differences between class probabilities in the treatment and control groups being harder to model than the class probabilities themselves. The feature of uplift modeling which makes it difficult thus also makes it amenable to the application of ensemble methods. As a result, bagging and random forests emerge from our evaluation as key tools in the uplift modeling toolbox.
L. Wyrwicz, S. Jaroszewicz, P. Rzepakowski, K. Bujko. Uplift modeling in selection of patients to either radiotherapy or radiochemotherapy in resectable rectal cancer: reassessment of data from the phase 3 study. Annals of Oncology, 26(suppl_4), pages iv107, 2015, conference abstract.

Abstract:
Uplift modeling is a machine learning technique which allows for estimating, at the level of individuals, the gain in success rate of a treatment with respect to an alternative treatment. The technique is becoming popular in marketing. For the current analysis we apply an uplift regression technique which directly models the difference between treatment and control success probabilities, but uses the correlation between the difference in the two treatment outcomes and general patient prospects to improve the model. Here we apply uplift modeling to clinical data to improve selection of the preoperative modality in resectable T3/T4 or N+ rectal cancer patients.
S. Jaroszewicz, P. Rzepakowski. Uplift modeling with survival data. In ACM SIGKDD Workshop on Health Informatics (HI-KDD'14), New York City, USA, August, 2014.

Abstract:
Uplift modeling is a subfield of machine learning which aims at predicting differences between class probabilities in treatment and control groups. This setting is frequently encountered in medical trials which are thus a natural application domain for the technique. Unfortunately, clinical trials usually involve survival data with censoring and most machine learning methods, including uplift modeling methods, do not directly allow for the use of such data. In this paper we demonstrate that, under reasonable assumptions, uplift modeling can be easily applied to survival data, while maintaining the correctness of decisions made by the model. This is in contrast to standard classification, where the use of censored training examples is more difficult. We test our approach on a publicly available clinical trial dataset and show that using an uplift model it is possible to obtain significant improvements in survival rates.
M. Korzeń, Szymon Jaroszewicz. PaCAL: A Python Package for Arithmetic Computations with Random Variables. Journal of Statistical Software, 57(10), 5, 2014.

Abstract:
In this paper we present PaCAL, a Python package for arithmetic computations on random variables. The package is capable of performing the four arithmetic operations: addition, subtraction, multiplication and division, as well as computing many standard functions of random variables. Summary statistics, random number generation, plots, and histograms of the resulting distributions can easily be obtained, and distribution parameter fitting is also available. The operations are performed numerically and their results interpolated, allowing for arbitrary arithmetic operations on random variables following virtually any probability distribution encountered in practice. The package is easy to use, as operations on random variables are performed just as they are on standard Python variables. Independence of random variables is, by default, assumed at each step, but some computations on dependent random variables are also possible. We demonstrate on several examples that the results are very accurate, often close to machine precision. Practical applications include statistics, physical measurements, and estimation of error distributions in scientific computations.
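
A short usage sketch in the package's documented style (class and method names as I recall them from the PaCAL documentation; verify against the current docs):

```python
from pacal import NormalDistr, UniformDistr

X = NormalDistr(0, 1)
Y = UniformDistr(0, 1)
S = X + Y                  # density of the sum, computed numerically
print(S.mean(), S.std())   # summary statistics of the resulting distribution
print(S.cdf(0.5))          # evaluate the CDF at a point
```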
O. Jarczyk, B. Gruszka, S. Jaroszewicz, L. Bukowski, A. Wierzbicki. GitHub Projects. Quality Analysis of Open-Source Software. In Proc. of the 6th International Conference on Social Informatics (SocInfo'14), pages 80-94, Barcelona, Spain, November, 2014, Best paper nominee.

Abstract:
Nowadays Open-Source Software is developed mostly by decentralized teams of developers cooperating on-line. The GitHub portal is an online social network that supports development of software by virtual teams of programmers. Since there is no central mechanism that governs the process of team formation, it is interesting to investigate whether there are any significant correlations between project quality and the characteristics of the team members. However, for such analysis to be possible, we need good metrics of project quality. This paper develops two such metrics: the first reflecting a project's popularity, and the second the quality of support offered by team members to users. The first metric is based on the number of 'stars' a project is given by other GitHub members; the second is obtained using survival analysis techniques applied to issues reported on the project by its users. After developing the metrics, we gathered characteristics of several GitHub projects and analyzed their influence on project quality using statistical regression techniques.
B. Żogała-Siudem, S. Jaroszewicz. Fast stepwise regression on Linked Data. In Proc. of the 1st Workshop on Linked Data for Knowledge Discovery (LD4KD) co-located with ECML/PKDD'14, pages 17-26, Nancy, France, September, 2014.

Abstract:
The main focus of research in machine learning and statistics is on building more advanced and complex models. However, in practice it is often much more important to use the right variables. One may hope that the recent popularity of open data would allow researchers to easily find relevant variables. However, current linked data methodology is not suitable for this purpose, since the number of matching datasets is often overwhelming. This paper proposes a method using correlation-based indexing of linked datasets which can significantly speed up feature selection based on the classical stepwise regression procedure. The technique is efficient enough to be applied at interactive speed to huge collections of publicly available linked open data.
Ł. Zaniewicz, S. Jaroszewicz. Support Vector Machines for Uplift Modeling. In The First IEEE ICDM Workshop on Causal Discovery (CD 2013), Dallas, Texas, December, 2013.

Abstract:
Uplift modeling is a branch of Machine Learning which aims to predict not the class itself, but the difference between the class variable behavior in two groups: treatment and control. Objects in the treatment group have been subject to some action, while objects in the control group have not. By including the control group it is possible to build a model which predicts the causal effect of the action for a given individual. In this paper we present a variant of Support Vector Machines designed specifically for uplift modeling. The SVM optimization task has been reformulated to explicitly model the difference in class behavior between two datasets. The model predicts whether a given object will have a positive, neutral or negative response to a given action, and by tuning a parameter of the model the analyst is able to influence the relative proportion of neutral predictions and thus the sensitivity of the model. We adapt the dual coordinate descent method to efficiently solve our optimization task. Finally the proposed method is compared experimentally with other uplift modeling approaches.
L. Bukowski, M. Jankowski-Lorek, S. Jaroszewicz, M. Sydow. What Makes a Good Team of Wikipedia Editors? A Preliminary Statistical Analysis. In Proc. of the 5th International Conference on Social Informatics (SocInfo'13), pages 14-28, Kyoto, Japan, November, 2013.

Abstract:
The paper concerns studying the quality of teams of Wikipedia authors with a statistical approach. We report the preparation of a dataset containing numerous behavioural and structural attributes and its subsequent analysis and use to predict team quality. We performed exploratory analysis using partial regression to remove the influence of attributes not related to the team itself. The analysis confirmed that the key factor significantly influencing an article's quality is the discussion between team members. The second part of the paper successfully uses machine learning models to predict good articles based on features of the teams that created them.
M. Korzeń, S. Jaroszewicz, P. Klęsk. Logistic regression with weight grouping priors. Computational Statistics & Data Analysis, 64, pages 281-298, August, 2013.

Abstract:
A generalization of the commonly used Maximum Likelihood based learning algorithm for the logistic regression model is considered. It is well known that using the Laplace prior ($L^1$ penalty) on model coefficients leads to a variable selection effect, where most of the coefficients vanish. It is argued that variable selection is not always desirable; it is often better to group correlated variables together and assign equal weights to them. Two new kinds of a priori distributions over weights are investigated: Gaussian Extremal Mixture (GEM) and Laplacian Extremal Mixture (LEM), which enforce grouping of model coefficients in a manner analogous to $L^1$ and $L^2$ regularization. An efficient learning algorithm is presented, which simultaneously finds model weights and the hyperparameters of those priors. Examples are shown in the experimental part where the proposed a priori distributions outperform Gaussian and Laplace priors as well as other methods which take coefficient grouping into account, such as the elastic net. Theoretical results on parameter shrinkage and sample complexity are also included.
P. Rzepakowski, S. Jaroszewicz. Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems, 32, pages 303-327, August, 2012.

Abstract:
Most classification approaches aim at achieving high prediction accuracy on a given dataset. However, in most practical cases, some action such as mailing an offer or treating a patient is to be taken on the classified objects, and we should model not the class probabilities themselves, but instead, the change in class probabilities caused by the action. The action should then be performed on those objects for which it will be most profitable. This problem is known as uplift modeling, differential response analysis, or true lift modeling, but has received very little attention in the machine learning literature. An important modification of the problem involves several possible actions, when for each object the model must also decide which action should be used in order to maximize profit. In this paper, we present tree-based classifiers designed for uplift modeling in both single and multiple treatment cases. To this end, we design new splitting criteria and pruning methods. The experiments confirm the usefulness of the proposed approaches and show significant improvement over previous uplift modeling techniques.
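
One of the paper's splitting criteria is based on the Kullback-Leibler divergence between treatment and control class distributions; the sketch below shows the bare idea of scoring a binary split by the divergence gain (my simplification: the paper's criteria include conditional variants and normalizations omitted here, and the code assumes the test leaves both branches non-empty in each group):

```python
# Divergence-gain sketch for uplift tree splits.
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def class_dist(y):
    return np.bincount(y, minlength=2) / len(y)

def kl_gain(y_t, y_c, mask_t, mask_c):
    """Increase in KL(treatment || control) from splitting; mask_t / mask_c
    select the left branch within the treatment / control group."""
    before = kl(class_dist(y_t), class_dist(y_c))
    n = len(y_t) + len(y_c)
    after = 0.0
    for mt, mc in [(mask_t, mask_c), (~mask_t, ~mask_c)]:
        w = (mt.sum() + mc.sum()) / n            # weight of the branch
        after += w * kl(class_dist(y_t[mt]), class_dist(y_c[mc]))
    return after - before
```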
M. Jaśkowski, S. Jaroszewicz. Uplift Modeling for Clinical Trial Data. In ICML 2012 Workshop on Machine Learning for Clinical Data Analysis, Edinburgh, Scotland, June, 2012.

Abstract:
Traditional classification methods predict the class probability distribution conditional on a set of predictor variables. Uplift modeling, in contrast, tries to predict the difference between class probabilities in the treatment group (on which some action has been taken) and the control group (not subjected to the action) such that the model predicts the net effect of the action. Such an approach seems to be well suited to analysis of clinical trial data and to allow for discovering groups of patients for which the treatment is most beneficial. One of the purposes of this paper is to verify this claim experimentally. Additionally, we present an approach to uplift modeling which allows for application of standard probabilistic classification models, such as logistic regression, in the uplift setting. Further, we extend the approach such that standard classification models built on the treatment and control datasets can be incorporated in a manner similar to semi-supervised learning in order to improve prediction accuracy. The usefulness of both approaches has been verified experimentally on publicly available clinical trial data.
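
The transformation described above takes a particularly simple form under balanced (50/50) random treatment assignment, which is the standard setting for this trick; a sketch under that assumption:

```python
# Class variable transformation sketch (assumes balanced random assignment).
import numpy as np

def transform_target(y, treated):
    """y in {0,1} is the response, treated in {0,1} the group indicator.
    z = 1 iff a treated object responds or a control object does not."""
    return (y == treated).astype(int)

def uplift_from_pz(p_z1):
    # Any probabilistic classifier fit to z gives P(z=1|x); then
    # uplift(x) = 2 * P(z=1|x) - 1 under the balanced-assignment assumption.
    return 2.0 * p_z1 - 1.0
```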
S. Jaroszewicz, M. Korzeń. Arithmetic Operations on Independent Random Variables: A Numerical Approach. SIAM Journal on Scientific Computing, 34, pages A1241-A1265, 2012.

Abstract:
Dealing with imprecise quantities is an important problem in scientific computation. Model parameters are known only approximately, typically in the form of probability density functions. Unfortunately there are currently no methods of taking uncertain parameters into account which would at the same time be easy to apply and highly accurate. An important special case is operations on independent random variables which occur frequently in obtaining confidence intervals for physical measurements and statistical estimators. In this paper we investigate the possibility of implementing arithmetic operations on independent random variables numerically. Equivalently, the problem can be viewed as propagating approximated probability density functions through arithmetic operations. We introduce a very broad family of distributions which is closed under the four arithmetic operations and taking powers. Furthermore we show that the densities of the distributions in the family can be effectively approximated using Chebyshev polynomials. A practical implementation is also provided, demonstrating the feasibility and usefulness of the approach. Several examples show applications in physical measurements, statistics, and probability theory, demonstrating very high numerical accuracy. These include an interesting problem related to combining independent measurements of a physical quantity, distributions of sample statistics of Hill's estimator for tail exponents, the generalized $\chi^2$ distribution, and others. The results are usually extremely accurate, in some cases more accurate than specialized solutions available in statistical packages, and can be achieved with very little effort.
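
The core representational idea, approximating a density by a Chebyshev interpolant on an interval, can be sketched in a few lines (this toy version ignores the paper's handling of unbounded supports, singularities, and the arithmetic operations themselves):

```python
# Chebyshev interpolation of a density on a finite interval.
import numpy as np
from numpy.polynomial import chebyshev as C

pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard normal

deg = 40
nodes = np.cos(np.pi * (np.arange(deg + 1) + 0.5) / (deg + 1))  # on [-1, 1]
coef = C.chebfit(nodes, pdf(8 * nodes), deg)      # interpolate on [-8, 8]

approx = lambda t: C.chebval(t / 8, coef)
t = np.linspace(-8, 8, 1001)
print(np.max(np.abs(approx(t) - pdf(t))))         # tiny error: Chebyshev
# coefficients of a smooth density decay very rapidly
```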
P. Rzepakowski, S. Jaroszewicz. Decision Trees for Uplift Modeling. In Proc. of the 10th International Conference on Data Mining (ICDM), pages 441-450, Sydney, Australia, December, 2010.

Abstract:
Most classification approaches aim at achieving high prediction accuracy on a given dataset. However, in most practical cases, some action, such as mailing an offer or treating a patient, is to be taken on the classified objects and we should model not the class probabilities themselves, but instead, the change in class probabilities caused by the action. The action should then be performed on those objects for which it will be most profitable. This problem is known as uplift modeling, differential response analysis or true lift modeling, but has received very little attention in Machine Learning literature. In the paper we present a tree based classifier tailored specifically to this task. To this end, we design new splitting criteria and pruning methods. The experiments confirm the usefulness of the proposed approach and show significant improvement over previous uplift modeling techniques.
S. Jaroszewicz. Using interesting sequences to interactively build Hidden Markov Models. Data Mining and Knowledge Discovery, 21(1), pages 186-220, 2010.

Abstract:
The paper presents a method of interactive construction of global Hidden Markov Models (HMMs) based on local sequence patterns discovered in data. The method is based on finding interesting sequences whose frequency in the database differs from that predicted by the model. The patterns are then presented to the user, who updates the model using their intelligence and their understanding of the modelled domain. It is demonstrated that such an approach leads to more understandable models than automated approaches. Two variants of the problem are considered: mining patterns occurring only at the beginning of sequences and mining patterns occurring at any position; both are practically meaningful. For each variant, algorithms have been developed allowing for efficient discovery of all sequences with a given minimum interestingness. Applications to modelling webpage visitors' behavior and to modelling protein secondary structure are presented, validating the proposed approach.
S. Jaroszewicz, T. Scheffer, D.A. Simovici. Scalable pattern mining with Bayesian networks as background knowledge. Data Mining and Knowledge Discovery, 18(1), pages 56-100, 2009.

Abstract:
We study a discovery framework in which background knowledge on variables and their relations within a discourse area is available in the form of a graphical model. Starting from an initial, hand-crafted or possibly empty graphical model, the network evolves in an interactive process of discovery. We focus on the central step of this process: given a graphical model and a database, we address the problem of finding the most interesting attribute sets. We formalize the concept of interestingness of attribute sets as the divergence between their behavior as observed in the data, and the behavior that can be explained given the current model. We derive an exact algorithm that finds all attribute sets whose interestingness exceeds a given threshold. We then consider the case of a very large network that renders exact inference infeasible, and a very large database or data stream. We devise an algorithm that efficiently finds the most interesting attribute sets with prescribed approximation bound and confidence probability, even for very large networks and infinite streams. We study the scalability of the methods in controlled experiments; a case-study sheds light on the practical usefulness of the approach.
S. Jaroszewicz. Interactive HMM Construction Based on Interesting Sequences. In Proc. of Local Patterns to Global Models (LeGo'08) Workshop at the 12th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'08), pages 82-91, Antwerp, Belgium, 2008.

Abstract:
The paper presents a method of interactive construction of global Hidden Markov Models based on local patterns discovered in sequence data. The method works by finding interesting sequences whose probability in data differs from that predicted by the model. The patterns are then presented to the user, who updates the model using their understanding of the modelled domain. It is argued that such an approach leads to more understandable models than automated approaches. An application to modelling webpage visitors' behavior is presented, showing the strengths of the proposed approach.
S. Jaroszewicz. Minimum Variance Associations -- Discovering Relationships in Numerical Data. In The Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 172-183, Osaka, Japan, 2008.

Abstract:
The paper presents minimum variance patterns: a new class of itemsets and rules for numerical data, which capture arbitrary continuous relationships between numerical attributes without the need for discretization. The approach is based on finding polynomials over sets of attributes whose variance, in a given dataset, is close to zero. Sets of attributes for which such functions exist are considered interesting. Further, two types of rules are introduced, which help extract understandable relationships from such itemsets. Efficient algorithms for mining minimum variance patterns are presented and verified experimentally.
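
In the linear special case, a "polynomial" with near-zero variance is simply the direction of smallest variance of the data, i.e., the eigenvector of the covariance matrix with the smallest eigenvalue; the general method searches over higher-degree polynomials, but the sketch below conveys the idea:

```python
# Linear minimum-variance pattern: smallest-eigenvalue direction.
import numpy as np

rng = np.random.default_rng(2)
a = rng.standard_normal(500)
b = 2 * a + 0.01 * rng.standard_normal(500)   # hidden relationship: b ≈ 2a
data = np.column_stack([a, b])

w, v = np.linalg.eigh(np.cov(data, rowvar=False))
print(w[0])        # near-zero variance along the first eigenvector
print(v[:, 0])     # coefficients proportional to (2, -1): 2a - b ≈ 0
```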
T. Calders, S. Jaroszewicz. Efficient AUC Optimization for Classification. In 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'07), pages 42-53, Warsaw, Poland, 2007, Best paper award.

Abstract:
In this paper we show an efficient method for inducing classifiers that directly optimize the area under the ROC curve. Recently, AUC has gained importance in the classification community as a means to compare the performance of classifiers. Because most classification methods do not optimize this measure directly, several classification learning methods are emerging that directly optimize the AUC. These methods, however, require many costly computations of the AUC, and hence do not scale well to large datasets. In this paper, we develop a method to increase the efficiency of computing AUC based on a polynomial approximation of the AUC. As a proof of concept, the approximation is plugged into the construction of a scalable linear classifier that directly optimizes AUC using a gradient descent method. Experiments on real-life datasets show a high accuracy and efficiency of the polynomial approximation.
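
For reference, the exact quantity being approximated: AUC equals the Mann-Whitney statistic, i.e., the fraction of (positive, negative) pairs that the classifier orders correctly, and recomputing it at every optimization step is what the polynomial approximation avoids. A naive pairwise implementation:

```python
# Exact AUC as the Mann-Whitney statistic (O(n_pos * n_neg) naive version;
# an O(n log n) variant sorts the scores instead).
import numpy as np

def auc(scores, labels):
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())  # ties = 1/2
```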
S. Jaroszewicz, L. Ivantysynova, T. Scheffer. Schema Matching on Streams with Accuracy Guarantees. Intelligent Data Analysis, 12(3), pages 253-270, 2008.

Abstract:
We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, exact calculation of these similarities requires processing of all database records--which is infeasible for data streams. We devise a fast matching algorithm that uses only a small sample of records, and is yet guaranteed to find a matching that is a close approximation of the matching that would be obtained if the entire stream were processed. The method can be applied to any given (combination of) similarity metrics that can be estimated from a sample with bounded error; we apply the algorithm to several metrics. We give a rigorous proof of the method's correctness and report on experiments using large databases.
S. Jaroszewicz, M. Korzeń. Approximating Representations for Large Numerical Databases. In 7th SIAM International Conference on Data Mining (SDM'07), pages 521-526, Minneapolis, MN, 2007.

Abstract:
The paper introduces a notion of support for real-valued functions. It is shown how to approximate supports of a large class of functions based on supports of so-called polynomial itemsets, which can be efficiently mined using an Apriori-style algorithm. An upper bound for the error of such an approximation can be reliably computed. The concept of an approximating representation is introduced, which extends the idea of concise representations to numerical data. It is shown that many standard statistical modelling tasks, such as nonlinear regression and least squares curve fitting, can be efficiently solved using only the approximating representation, without accessing the original data at all. Since many of those methods traditionally require several passes over the data, our approach makes it possible to use such methods with huge datasets and data streams where several repeated scans are very costly or outright impossible.
S. Jaroszewicz, L. Ivantysynova, T. Scheffer. Accurate Schema Matching on Streams. In 4th International Workshop on Knowledge Discovery from Data Streams at the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), pages 3-12, 2006.

Abstract:
We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, exact calculation of these similarities requires processing of all database records--which is infeasible for data streams. We devise a fast matching algorithm that uses only a small sample of records, and is yet guaranteed to match the most similar attributes with high probability. The method can be applied to any given (combination of) similarity metrics that can be estimated from a sample with bounded error; we apply the algorithm to several metrics. We give a rigorous proof of the method's correctness and report on experiments using large databases.
T. Calders, B. Goethals, S. Jaroszewicz. Mining Rank-Correlated Sets of Numerical Attributes. In Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), 2006.

Abstract:
We study the mining of interesting patterns in the presence of numerical attributes. Instead of the usual discretization methods, we propose the use of rank based measures to score the similarity of sets of numerical attributes. New support measures for numerical data are introduced, based on extensions of Kendall's tau, and Spearman's Footrule and rho. We show how these support measures are related. Furthermore, we introduce a novel type of pattern combining numerical and categorical attributes. We give efficient algorithms to find all frequent patterns for the proposed support measures, and evaluate their performance on real-life datasets.
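
The classical rank correlation underlying these support measures is readily available; for instance, Kendall's tau for two numerical columns via SciPy:

```python
import numpy as np
from scipy.stats import kendalltau

x = np.array([1.2, 2.4, 3.1, 4.8, 5.0])
y = np.array([2.0, 2.9, 3.5, 4.1, 6.7])
tau, p_value = kendalltau(x, y)
print(tau)   # 1.0 here: the two attributes are perfectly concordant
```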
S. Jaroszewicz. Polynomial Association Rules with Applications to Logistic Regression. In Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), 2006.

Abstract:
A new class of associations (polynomial itemsets and polynomial association rules) is presented which allows for discovering nonlinear relationships between numeric attributes without discretization. For binary attributes, proposed associations reduce to classic itemsets and association rules. Many standard association rule mining algorithms can be adapted to finding polynomial itemsets and association rules. We applied polynomial associations to add non-linear terms to logistic regression models. Significant performance improvement was achieved over stepwise methods, traditionally used in statistics, with comparable accuracy.
S. Jaroszewicz, M. Korzeń. Comparison of Information Theoretical Measures for Reduct Finding. In Proc. of the 8th International Conference on Artificial Intelligence and Soft Computing (ICAISC'06), pages 518-527, Zakopane, June, 2006.

Abstract:
The paper discusses the properties of an attribute selection criterion for building rough set reducts based on the discernibility matrix and compares it with the Shannon entropy and Gini index used for building decision trees. It has been shown theoretically and experimentally that entropy and the Gini index tend to work better if the reduct is later used for prediction of previously unseen cases, and the criterion based on the discernibility matrix tends to work better for learning functional relationships where generalization is not an issue.
D. A. Simovici, S. Jaroszewicz. A new metric splitting criterion for decision trees. Parallel Algorithms Appl., 21(4), pages 239-256, 2006.

Abstract:
We examine a new approach to building decision trees by introducing a geometric splitting criterion, based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users, and yields decision trees that have smaller sizes and fewer leaves than the trees built with standard methods, while having comparable or better accuracy.
D. A. Simovici, S. Jaroszewicz. Generalized Conditional Entropy and a Metric Splitting Criterion for Decision Trees. In 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD'06), pages 35-44, Singapore, 2006.

Abstract:
We examine a new approach to building decision trees by introducing a geometric splitting criterion, based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users, and yields decision trees that have smaller sizes and fewer leaves than the trees built with standard methods, while having comparable or better accuracy.
S. Jaroszewicz, T. Scheffer. Fast Discovery of Unexpected Patterns in Data, Relative to a Bayesian Network. In 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2005), pages 118-127, Chicago, IL, August, 2005.

Abstract:
We consider a model in which background knowledge on a given domain of interest is available in terms of a Bayesian network, in addition to a large database. The mining problem is to discover unexpected patterns: our goal is to find the strongest discrepancies between network and database. This problem is intrinsically difficult because it requires inference in a Bayesian network and processing the entire, potentially very large, database. A sampling-based method that we introduce is efficient and yet provably finds the approximately most interesting unexpected patterns. We give a rigorous proof of the method's correctness. Experiments shed light on its efficiency and practicality for large-scale Bayesian networks and databases.
S. Jaroszewicz, W. Kosiński. Machine Learning for Speech Recognition Researchers. In Proceedings of the Speech Analysis, Synthesis and Recognition Applications of Phonetics Conference, Kraków, Poland, September, 2005, Publication on CD-ROM.

Abstract:
In this paper we present an overview of machine learning and data mining methods which we believe are useful for researchers working in the area of speech recognition and analysis. We present selected algorithms for classification, clustering, sequence mining and rough sets based methods. Potential applications in speech recognition and analysis are also discussed. We hope that this paper will encourage speech recognition researchers to more frequently use machine learning and data mining algorithms.
M. Korzeń, S. Jaroszewicz. Finding Reducts Without Building the Discernibility Matrix. In Proceedings of the Fifth International Conference on Intelligent Systems Design and Applications (ISDA'05), pages 450-455, Wrocław, Poland, 2005.

Abstract:
We present algorithms for fast generation of short reducts which avoid building the discernibility matrix explicitly. We show how information obtained from this matrix can be derived based only on the distributions of attribute values. Since the size of the discernibility matrix is quadratic in the number of data records, not building the matrix explicitly gives a very significant speedup and makes it possible to find reducts even in very large databases. Algorithms are given for both absolute and relative reducts. Experiments show that our approach outperforms other reduct finding algorithms. Furthermore, it is shown that many heuristic reduct finding algorithms using the discernibility matrix in fact select attributes based on their Gini index. A new definition of conditional Gini index is presented, motivated by reduct finding heuristics.
S. Jaroszewicz, D. A. Simovici, I. Rosenberg. Measures on Boolean Polynomials and Their Applications in Data Mining. Discrete Applied Mathematics, 144(1-2), pages 123-139, November, 2004.

Abstract:
We characterize measures on free Boolean algebras and we examine the relationships that exist between measures and binary tables in relational databases. It is shown that these measures are completely defined by their values on positive conjunctions, and a formula that yields this value is obtained using the method of indicators. An extension of the notion of support that is well suited for tables with missing values is presented. Finally, we obtain Bonferroni-type inequalities that allow for approximate evaluation of these measures for several types of queries. An approximation algorithm and an analysis of the results it produces are also included.
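The reduction to positive conjunctions can be made concrete through inclusion-exclusion: for a measure $m$, $m(x \wedge \bar{y}) = m(x) - m(x \wedge y)$, and more generally $m\big(\bigwedge_{i \in I} x_i \wedge \bigwedge_{j \in J} \bar{y}_j\big) = \sum_{S \subseteq J} (-1)^{|S|} m\big(\bigwedge_{i \in I} x_i \wedge \bigwedge_{j \in S} y_j\big)$, which in the database reading expresses the support of an arbitrary conjunction of literals through supports of purely positive conjunctions.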
Szymon Jaroszewicz, Dan Simovici. Interestingness of Frequent Itemsets Using Bayesian Networks as Background Knowledge. In 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004), pages 178-186, Seattle, WA, August, 2004.

Abstract:
The paper presents a method for pruning frequent itemsets based on background knowledge represented by a Bayesian network. The interestingness of an itemset is defined as the absolute difference between its support estimated from data and from the Bayesian network. Efficient algorithms are presented for finding interestingness of a collection of frequent itemsets, and for finding all attribute sets with a given minimum interestingness. Practical usefulness of the algorithms and their efficiency have been verified experimentally.
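A minimal sketch of the ranking step (our own notation, not the paper's code; supports from the Bayesian network are assumed to come from an external inference routine):

    # Interestingness of an itemset as defined above: the absolute difference
    # between its support estimated from data and from the Bayesian network.
    def interestingness(supp_data, supp_bn):
        return abs(supp_data - supp_bn)

    def most_interesting(itemsets, supp_data, supp_bn, k=10):
        # supp_data and supp_bn map an itemset (e.g. a frozenset of
        # attribute-value pairs) to its support in [0, 1].
        ranked = sorted(itemsets,
                        key=lambda i: interestingness(supp_data[i], supp_bn[i]),
                        reverse=True)
        return ranked[:k]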
S. Jaroszewicz, D. A. Simovici, W. P. Kuo, L. Ohno-Machado. The Goodman-Kruskal Coefficient and its Applications in Genetic Diagnosis of Cancer. IEEE Transactions on Biomedical Engineering, 51(7), pages 1095-1102, July, 2004.

Abstract:
Increasing interest in new pattern recognition methods has been motivated by bioinformatics research. The analysis of gene expression data originated from microarrays constitutes an important application area for classification algorithms and illustrates the need for identifying important predictors. We show that the Goodman-Kruskal coefficient can be used for constructing minimal classifiers for tabular data, and we give an algorithm that can construct such classifiers.
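For reference, the Goodman-Kruskal lambda coefficient measures the relative reduction in prediction error for the class obtained by conditioning on a predictor; a small sketch computing it from a contingency table (our own illustration of the standard coefficient, not the paper's classifier-construction algorithm):

    import numpy as np

    def goodman_kruskal_lambda(table):
        """Goodman-Kruskal lambda for predicting the column variable (class)
        from the row variable (predictor), given a table of counts."""
        table = np.asarray(table, dtype=float)
        n = table.sum()
        # Error of always guessing the modal class (ignoring the predictor).
        err_base = 1.0 - table.sum(axis=0).max() / n
        # Error when guessing the modal class within each predictor value.
        err_cond = 1.0 - table.max(axis=1).sum() / n
        return (err_base - err_cond) / err_base if err_base > 0 else 0.0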
Dan A. Simovici, Szymon Jaroszewicz. Generalized Conditional Entropy and Decision Trees. In Journees francophones d'Extraction et de Gestion de Connaissances (EGC 2003), pages 369-380, Lyon, France, January, 2003.

Abstract:
We introduce an extension of the notion of Shannon conditional entropy to a more general form of conditional entropy that captures both the conditional Shannon entropy and a similar notion related to the Gini index. The proposed family of conditional entropies generates a collection of metrics over the set of partitions of finite sets, which can be used to construct decision trees. Experimental results suggest that by varying the parameter that defines the entropy it is possible to obtain smaller decision trees for certain databases without sacrificing accuracy.
S. Jaroszewicz, D. A. Simovici. Support Approximations Using Bonferroni-type Inequalities. In 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2002), pages 212-223, Helsinki, Finland, August, 2002.

Abstract:
The purpose of this paper is to examine the applicability of Bonferroni-type combinatorial inequalities to the estimation of the support of itemsets as well as of general Boolean expressions. Families of inequalities for various types of Boolean expressions are presented and evaluated experimentally.
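The inequalities in question are the classical alternating truncations of inclusion-exclusion: writing $S_k = \sum_{|I|=k} P\big(\bigcap_{i \in I} A_i\big)$, one has $S_1 - S_2 \le P\big(\bigcup_i A_i\big) \le S_1$ and $S_1 - S_2 + S_3 - S_4 \le P\big(\bigcup_i A_i\big) \le S_1 - S_2 + S_3$, and so on, so the support of a disjunction can be bracketed using only the supports of small conjunctions.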
S. Jaroszewicz, D. A. Simovici. Pruning Redundant Association Rules Using Maximum Entropy Principle. In Advances in Knowledge Discovery and Data Mining, 6th Pacific-Asia Conference, PAKDD'02, pages 135-147, Taipei, Taiwan, May, 2002.

Abstract:
Data mining algorithms produce huge sets of rules, practically impossible to analyze manually. It is thus important to develop methods for removing redundant rules from those sets. We present a solution to the problem using the Maximum Entropy approach. The problem of efficiency of Maximum Entropy computations is addressed by using closed form solutions for the most frequent cases. Analytical and experimental evaluation of the proposed technique indicates that it efficiently produces small sets of interesting association rules.
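Sketched in our own notation (the closed-form solutions derived in the paper are its computational contribution and are not reproduced here), the maximum entropy distribution is the solution of $\max_p -\sum_t p(t)\log p(t)$ subject to $\sum_{t \supseteq I} p(t) = supp(I)$ for every constraint itemset $I$; roughly, a rule is then pruned as redundant when the distribution built from the remaining rules already predicts its observed confidence.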
I. Rosenberg, D. A. Simovici, S. Jaroszewicz. On Functions Defined on Free Boolean Algebras. In IEEE International Symposium on Multiple-Valued Logic (ISMVL'02), Boston, MA, May, 2002.

Abstract:
We characterize measures on free Boolean algebras and we examine the relationships that exist between measures and binary tables in relational databases. It is shown that these measures are completely defined by their values on positive conjunctions, and an algorithm that constructs a measure starting from its values on positive conjunctions is also given, including a formula that allows the evaluation of the measure for arbitrary polynomials. Finally, we study pairs of measures generated by ternary tables, that is, by tables that contain missing or unknown values.
S. Jaroszewicz, D. A. Simovici, I. Rosenberg. An Inclusion-Exclusion Result for Boolean Polynomials and Its Applications in Data Mining. In Workshop on Discrete Mathematics and Data Mining (DM&DM), Second SIAM International Conference on Data Mining, Arlington, VA, April, 2002.

Abstract:
We characterize measures on free Boolean algebras and we examine the relationships that exist between measures and binary tables in relational databases. It is shown that these measures are completely defined by their values on positive conjunctions, and a formula that yields this value is obtained using the method of indicators. We also obtain Bonferroni-type inequalities that allow for approximate evaluation of these measures. Finally, we present a measure extending the notion of support that is well suited for tables with missing values.
D. A. Simovici, S. Jaroszewicz. An Axiomatization of Partition Entropy. IEEE Transactions on Information Theory, 48(7), pages 2138-2142, July, 2002.

Abstract:
The aim of this paper is to present an axiomatization of a generalization of Shannon's entropy starting from partitions of finite sets. The proposed axiomatization yields as special cases the Havrda-Charvat entropy, and thus, provides axiomatizations for the Shannon entropy, the Gini index, and for other types of entropy used in classification and data mining.
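For concreteness, the Havrda-Charvat ($\beta$-)entropy of a probability vector $p = (p_1, \ldots, p_m)$ can be written as $H_\beta(p) = \frac{1}{2^{1-\beta}-1}\big(\sum_i p_i^\beta - 1\big)$ for $\beta \ne 1$; the limit $\beta \to 1$ recovers the Shannon entropy $-\sum_i p_i \log_2 p_i$, while $\beta = 2$ gives $2(1 - \sum_i p_i^2)$, i.e., the Gini index up to normalization.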
S. Jaroszewicz, D. A. Simovici. A General Measure of Rule Interestingness. In 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2001), pages 253-265, 2001.

Abstract:
The paper presents a new general measure of rule interestingness. Many known measures such as chi-square, Gini gain or entropy gain can be obtained from this measure by setting some numerical parameters, including the amount of trust we have in the estimation of the probability distribution of the data. Moreover, we show that there is a continuum of measures having chi-square, Gini gain and entropy gain as boundary cases. Our measure therefore generalizes both conditional and unconditional classical measures of interestingness. Properties and an experimental evaluation of the new measure are also presented.
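The boundary cases are, in their standard forms for a rule relating attributes $X$ and $Y$: the chi-square statistic $N \sum_{x,y} (p(x,y) - p(x)p(y))^2 / (p(x)p(y))$, the Gini gain $gini(Y) - gini(Y \mid X)$, and the entropy gain $H(Y) - H(Y \mid X)$; the proposed measure reaches each of these at particular settings of its parameters.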
D. A. Simovici, S. Jaroszewicz. On Information-Theoretical Aspects of Relational Databases. In Finite versus Infinite, Springer Verlag, London, 2000.

Abstract:
We introduce the notion of entropy for a set of attributes of a table in a relational database starting from the notion of entropy for finite functions. We examine the connections that exist between conditional entropies of attribute sets and lossy decompositions of tables and explore the properties of the entropy of sets of attributes regarded as an outer measure on the set of subsets of a heading of a table. Finally, we suggest a generalization of functional dependencies based on conditional entropy.
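The generalization rests on a standard information-theoretic fact: a functional dependency $X \to Y$ holds in a table exactly when $H(Y \mid X) = 0$, so relaxing the requirement to $H(Y \mid X) \le \varepsilon$ for a small threshold $\varepsilon$ yields an approximate dependency (our phrasing; the paper develops the precise generalization).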
S. Jaroszewicz, D. A. Simovici. On Axiomatization of Conditional Entropy of Functions Between Finite Sets. In Proceedings of the 29th International Symposium on Multiple-Valued Logic, pages 24-28, Freiburg, Germany, 1999.

Abstract:
In this paper we present a new axiomatization of the notion of entropy of functions between finite sets and we introduce and axiomatize the notion of conditional entropy between functions. The results can be directly applied to logic functions, which can be regarded as functions between finite sets. Our axiomatizations are based on properties of entropy with regard to operations commonly applied to discrete functions and are related to the usage of entropy as a measure of the energy dissipated by circuits that implement discrete functions.
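In this setting the entropy of a function is essentially that of the partition it induces on its domain: for $f : A \to B$, $H(f) = -\sum_{b \in B} \frac{|f^{-1}(b)|}{|A|} \log_2 \frac{|f^{-1}(b)|}{|A|}$, and the conditional entropy weighs the entropy of $f$ restricted to each block of the partition induced by $g$: $H(f \mid g) = \sum_c \frac{|g^{-1}(c)|}{|A|} H\big(f|_{g^{-1}(c)}\big)$ (our rendering of the standard definitions; the paper characterizes these quantities axiomatically rather than by formula).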