WSKNN - Weighted Session-based K-NN recommender system

Users of e-commerce systems generate vast amounts of unstructured, sequential data streams. Each sequence is a varying-length list of directional (timestamped) user-product interactions. There are hidden patterns within those sequences. Users tend to interact with similar products, and interactions change over time. Based on this behavior, we can recommend the sequence of products that the user may be interested in.

The WSKNN recommender was designed to benchmark complex deep-learning architectures (Twardowski et al., 2021). During the research, it became clear that the performance of the k-NN model is comparable to, if not better than, that of neural network algorithms (see experimental comparison). Moreover, the literature on recommender systems shows that k-NN-based solutions perform well in different conditions (M. Ludewig & Jannach, 2018). This makes WSKNN a valuable benchmarking tool against novel algorithms and architectures, and a first-choice tool when starting to design a new recommender system. The package's algorithm can serve as a recommender for small and medium-sized datasets. During internal studies in the company, the algorithm performed well for small datasets (25k sessions; 3k items) and for bigger datasets; see the MovieLens 25M tutorial (Moliński, 2023). The model has its limitations, and the main drawback is that it is memory-hungry. As a memory-based method, it can grow to the point where its usage is unfeasible. This could be an issue for production environments where the memory costs may exceed potential benefits.
The package was created during a research project of the Sales Intelligence Sp. z o.o. company (Twardowski et al., 2021). The company owns the price comparison service Nokaut.pl and cooperates with multiple big stores across Poland. Thus, it has access to vast amounts of sequential data. Currently, the package is used for SMS and mailing recommendations for big customers.

Related work
A similar architecture can be found in a stand-alone repository (L. Ludewig Mauro, 2019) that does not appear to be actively maintained and is linked to a specific publication (Latifi et al., 2020). The main technical difference between WSKNN and the V-SKNN model from that repository is that the former is a ready-to-use package. The analytical difference is that WSKNN offers more session-weighting options, to the point where custom heuristics can be applied to the recommendations. The letter W in WSKNN indicates how it differs from the baseline V-SKNN algorithm: it utilizes external weighting factors (prices, weights applied to actions).
Another example of a repository with scripts rather than a package is (Baltrunas Hidasi Karatzoglou, 2015), a Python implementation of the GRU4Rec session-based recommender (Balázs Hidasi et al., 2015).

Package structure
The package is lightweight. It depends on the numpy (Harris et al., 2020), pandas (team, 2020), tqdm (Costa-Luis et al., 2023), more_itertools ("More Itertools Github Repository," 2023), and pyyaml (Simonov, 2023) libraries. It works with currently supported Python versions, starting from Python 3.8. It has two main functions:
• fit() builds a memory representation of a model as Python dictionaries with the session-items and item-sessions maps of varying sizes,
• predict() returns recommendations.
The recommendation strategy may be altered after fitting a model, which allows testing different weighting scenarios in parallel without retraining.
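As a rough illustration of the memory representation described above, the sketch below builds session-items and item-sessions dictionaries from a list of events. The helper name build_maps and the (session, item, timestamp) tuple layout are assumptions made for this example; they are not the package's API.

```python
from collections import defaultdict


def build_maps(events):
    """Build session-items and item-sessions maps.

    Hypothetical helper for illustration only; not part of WSKNN.
    `events` is an iterable of (session_id, item_id, timestamp) tuples.
    """
    session_items = defaultdict(list)
    item_sessions = defaultdict(set)
    # Sort by timestamp so each session keeps its items in interaction order.
    for session_id, item_id, timestamp in sorted(events, key=lambda e: e[2]):
        session_items[session_id].append((item_id, timestamp))
        item_sessions[item_id].add(session_id)
    return dict(session_items), dict(item_sessions)


events = [
    ("s1", "a", 10), ("s1", "b", 12),
    ("s2", "b", 11), ("s2", "c", 15),
]
session_items, item_sessions = build_maps(events)
```

Both dictionaries grow with the number of sessions and items, which is why a memory-based model of this kind can become large.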
The user may pass additional parameters to the predict() function as a dictionary to control model behavior on the fly. Those parameters are:
• the number of recommendations,
• the number of neighbors to choose items from (the closest neighbors),
• the sampling strategy of neighbors (common items, recent sessions, random subset, custom weights assigned to event types),
• the sample size (an initial subset of neighbors to look for the closest neighbors),
• a session similarity weighting function,
• an item ranking strategy,
• whether the algorithm should return items that already appear in the session receiving recommendations,
• whether any event (user action) must occur within a session for it to be included in the similarity map (for example, a transaction event),
• whether the algorithm should recommend random items if the set of neighbors' items is smaller than the number of recommendations.
The YAML file documenting the options is provided at the top level of the package repository as model_settings.yaml. The user may load those settings with pyyaml using the function parse_settings(). The resulting settings dictionary may then be passed to the predict() function.
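To make the role of the settings dictionary more concrete, the following toy session-based k-NN shows how a few of the parameters listed above (number of recommendations, number of closest neighbors, sample size, recent-sessions sampling, returning known items) could steer a prediction. It is a simplified sketch, not the WSKNN implementation: the function toy_predict, the settings keys, and the scoring rule are all assumptions chosen for this example, and the package's actual keys are those documented in model_settings.yaml.

```python
from collections import Counter

# Hypothetical settings keys chosen for this sketch.
settings = {
    "number_of_recommendations": 2,
    "number_of_closest_neighbors": 2,
    "sample_size": 3,
    "return_known_items": False,
}

# Toy model: session -> (items, timestamp of the last event).
session_items = {
    "s1": (["a", "b", "c"], 100),
    "s2": (["b", "d"], 200),
    "s3": (["a", "e"], 300),
    "s4": (["x", "y"], 400),
}
# Inverted map: item -> sessions that contain it.
item_sessions = {}
for sid, (items, _) in session_items.items():
    for item in items:
        item_sessions.setdefault(item, set()).add(sid)


def toy_predict(user_items, settings):
    # 1. Candidate neighbors: sessions sharing at least one item with the user.
    candidates = set()
    for item in user_items:
        candidates |= item_sessions.get(item, set())

    # 2. Sampling strategy "recent sessions": keep the newest candidates.
    sample = sorted(candidates, key=lambda s: session_items[s][1], reverse=True)
    sample = sample[: settings["sample_size"]]

    # 3. Closest neighbors, measured by the number of common items.
    def similarity(sid):
        return len(set(user_items) & set(session_items[sid][0]))

    neighbors = sorted(sample, key=similarity, reverse=True)
    neighbors = neighbors[: settings["number_of_closest_neighbors"]]

    # 4. Rank items by the summed similarity of the neighbors containing them.
    scores = Counter()
    for sid in neighbors:
        for item in session_items[sid][0]:
            if settings["return_known_items"] or item not in user_items:
                scores[item] += similarity(sid)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[: settings["number_of_recommendations"]]


recommendations = toy_predict(["a", "b"], settings)
```

Because the settings are consumed only at prediction time, the same fitted maps can be queried with different dictionaries to compare strategies without rebuilding anything.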
The sample flow and recommendations are presented in the repository (Moliński, 2022). The package has built-in evaluation metrics:
• the mean reciprocal rank of top k recommendations,
• the precision score of top k recommendations,
• the recall score of top k recommendations.
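The three metrics can be written down in a few lines. The functions below follow the common textbook definitions and are an independent sketch, not the package's built-in implementations; in particular, dividing precision by k (rather than by the number of returned items) is an assumption of this example.

```python
def reciprocal_rank_at_k(recommended, relevant, k):
    """Return 1 / rank of the first relevant item within the top k, else 0."""
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank_at_k(cases, k):
    """Average the reciprocal rank over (recommended, relevant) pairs."""
    if not cases:
        return 0.0
    return sum(reciprocal_rank_at_k(rec, rel, k) for rec, rel in cases) / len(cases)


def precision_at_k(recommended, relevant, k):
    """Share of the k recommendation slots filled with relevant items."""
    hits = sum(item in relevant for item in recommended[:k])
    return hits / k


def recall_at_k(recommended, relevant, k):
    """Share of the relevant items found within the top k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(item in relevant for item in recommended[:k])
    return hits / len(relevant)


# One evaluated session: ranked recommendations and the items actually seen.
recommended = ["c", "a", "e", "b"]
relevant = {"a", "b"}

rr = reciprocal_rank_at_k(recommended, relevant, k=3)
precision = precision_at_k(recommended, relevant, k=3)
recall = recall_at_k(recommended, relevant, k=3)
mrr = mean_reciprocal_rank_at_k([(recommended, relevant), (["z"], {"a"})], k=3)
```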
The package can process static JSON-lines files, gzipped JSON-lines files, and static CSV files with e-commerce events. For large datasets, the recommended approach is to pass a pandas DataFrame.
The primary data types are Items and Sessions. These classes store item-sessions and session-items mappings along with session-related attributes. They may be updated with new events.
In the near future, the package will introduce a TensorFlow (Abadi et al., 2015) version of the algorithm. This is internal work within the company. The Items and Sessions classes currently have metadata attributes that allow data transformation from the custom format into TensorFlow tensors.
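For readers who want to inspect such files before handing them to the package, a plain or gzipped JSON-lines file can be read with the standard library alone. The reader below is a generic sketch and does not reproduce the package's own parsers; the function name read_events and the field names session, item, and ts are assumptions for this example.

```python
import gzip
import json
import os
import tempfile


def read_events(path):
    """Yield one event dictionary per line of a plain or gzipped JSON-lines file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as stream:
        for line in stream:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Write a tiny gzipped sample so that the sketch is self-contained.
sample = [
    {"session": "s1", "item": "a", "ts": 10},
    {"session": "s1", "item": "b", "ts": 12},
]
path = os.path.join(tempfile.mkdtemp(), "events.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as stream:
    for event in sample:
        stream.write(json.dumps(event) + "\n")

events = list(read_events(path))
```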

Data Formats
The basic data type required by the algorithm is an event, which consists of:
• a session index or user index,
• a product with which the user interacts,
• a timestamp of the interaction,
• (optional) an action type,
• (optional) other information, for example, product price, quantity, and user type.
A group of events with the same session index or user index is a session. A session is a sequence of events whose length is not fixed.
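The event and session definitions above map naturally onto a small data structure. The sketch below is one possible representation; the Event class, its field names, and the function group_into_sessions are assumptions for illustration and do not mirror the package's Items and Sessions classes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Event:
    # Field names are assumptions for this sketch, not the package's schema.
    session: str                      # session index or user index
    item: str                         # product the user interacted with
    timestamp: int                    # time of the interaction
    action: Optional[str] = None      # optional action type
    extra: Dict[str, Any] = field(default_factory=dict)  # optional price, quantity, ...


def group_into_sessions(events: List[Event]) -> Dict[str, List[Event]]:
    """Group events by session index, ordered by timestamp within each session."""
    sessions: Dict[str, List[Event]] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        sessions.setdefault(event.session, []).append(event)
    return sessions


events = [
    Event("s1", "b", 12, action="purchase", extra={"price": 9.99}),
    Event("s2", "c", 11),
    Event("s1", "a", 10, action="view"),
]
sessions = group_into_sessions(events)
```

Here session s1 holds two events and s2 holds one, which reflects the varying session length mentioned above.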

Experiments
This section describes the performance of WSKNN. The table comes from internal experiments at Sales Intelligence Sp. z o.o. The algorithm was compared to Session Metric Learning algorithms (SML-RNN-*) (Twardowski et al., 2021), GRU4Rec (Baltrunas Hidasi Karatzoglou, 2015), a popularity-based recommender (POP), and a Markov model (MM). The comparison was performed on the RecSys-2015 dataset (Ben-Shimon et al., 2015); 90% of the oldest sessions were used as a training set, and the rest as a test set. The dataset contains 7 981 581 sessions (44% unique), 31 708 505 events, and 37 486 items. Monitored metrics are recall (REC@5, REC@20), mean reciprocal rank (MRR@5, MRR@20), mean average precision (MAP@20), hit rate (HR@20), training time, and latency, that is, the time a model needs to prepare recommendations for the 10% of newest sessions in the dataset.
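A time-ordered split of this kind can be sketched as follows. The text does not state which timestamp defines the age of a session, so ordering by the time of the first event is an assumption of this example, as is the function name temporal_split.

```python
def temporal_split(session_start, train_share=0.9):
    """Split session ids into the oldest `train_share` part and the newest rest.

    `session_start` maps a session id to the timestamp of its first event
    (an assumption of this sketch).
    """
    ordered = sorted(session_start, key=session_start.get)
    cut = int(len(ordered) * train_share)
    return ordered[:cut], ordered[cut:]


# Ten toy sessions that start at increasing timestamps.
session_start = {f"s{i}": i * 10 for i in range(10)}
train_ids, test_ids = temporal_split(session_start)
```

Splitting by time rather than at random keeps sessions from the future out of the training set, which is closer to how a deployed recommender is used.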

Table columns: Algorithm, MAP@20, REC@20, HR@20, MRR@20, REC@5, MRR@5.
While the performance of WSKNN on analytical metrics is comparable to RNN-based models, its response times are less optimal. A detailed comparison with more models and datasets is presented in (Twardowski et al., 2021).

Performance
The model's performance concerning the number of sessions and items in a set is presented in the package repository in the

Limitations
Like all machine learning systems, WSKNN has limitations:
• The model memorizes session-items and item-sessions maps. If the product base is large and sessions are collected over an extended period, the model may become too big to fit into memory. In this case, products can be categorized and a different model trained for each category. Benchmarking shows that model memory size is directly related to the number of sessions.
• Response time may be slower than that of other models, especially if there are many items to recommend. Benchmarking shows that the mean response time increases with the number of items used for training.
• There is additional overhead related to preparing the data structure for modeling. It can be done as a stand-alone step because the model uses Python dictionaries with session-items and item-sessions maps. WSKNN has a built-in preprocessing module and the Items and Sessions classes, which transform common event structures into the model's format and store them.
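The per-category workaround mentioned in the first limitation could be prepared along the following lines. This is only a sketch of the data partitioning step under stated assumptions: the item-to-category lookup item_category and the function split_by_category are hypothetical, and each resulting partition would then be used to fit its own, smaller model.

```python
from collections import defaultdict


def split_by_category(session_items, item_category):
    """Partition session-items maps so each category gets its own, smaller map.

    Hypothetical helper; `item_category` maps an item id to its category.
    """
    per_category = defaultdict(dict)
    for session_id, items in session_items.items():
        for item in items:
            category = item_category[item]
            per_category[category].setdefault(session_id, []).append(item)
    return dict(per_category)


session_items = {
    "s1": ["tv1", "phone1", "tv2"],
    "s2": ["phone2"],
}
item_category = {"tv1": "tv", "tv2": "tv", "phone1": "phone", "phone2": "phone"}
partitions = split_by_category(session_items, item_category)
```

The trade-off is that a session spanning several categories is cut into pieces, so cross-category patterns are no longer visible to any single model.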

Figure 1: Training time in relation to session length vs. number of items