Dataset- and Task-Independent Recommender System (DTIRS)

Recommender systems offer personalized experiences but often require extensive dataset- and task-specific configuration, which limits their reusability and accessibility. To this end, we propose the Dataset- and Task-Independent Recommender System (DTIRS), which minimizes manual intervention by standardizing dataset descriptions and task definitions through the novel Dataset Description Language (DsDL). DTIRS enables autonomous feature engineering, model selection, and optimization, reducing the need for constant reconfiguration while still allowing model retraining.

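We categorize recommendation tasks into four major types: binary classification, numeric prediction, ordered list recommendation (Top-N), and unordered list recommendation (tag prediction).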

Figure 1

Overview of the typical recommender system workflow (left) compared to our proposed DTIRS (right). The typical workflow often requires human expertise and manual effort for feature engineering, model development, and hyperparameter tuning across different datasets, creating a barrier to entry; the result is typically dataset- or task-specific code or pipelines, which reduces reusability. In contrast, with the help of DsDL (Section 1), DTIRS aims to eliminate the need for manual reconfiguration in many of these steps, lowering the barrier to entry and increasing reusability.

1. DsDL Specification

DsDL (Dataset Description Language) is used to define the structure of datasets. The format is as follows:

DsDL ::= "columns" ":" "[" ColumnList "]"
         [TimestampCol]
         "target" ":" "[" TargetList "]"

ColumnList ::= Column { "," Column }

Column ::= "{" "col_name" ":" String ","
               "type" ":" ColumnType "}"

ColumnType ::= "numeric" | "binary" |
               "categorical" | "ordinal" |
               "textual" | "url" |
               "list_of_numeric" | "list_of_binary" |
               "list_of_categorical" | "list_of_url" |
               "list_of_ordinal" | "list_of_textual"

TimestampCol ::= "timestamp_col" ":" String

TargetList ::= "{" "type" ":" TargetType ","
                   "label_col" ":" String ","
                   "key_col" ":" String
                   [ "," "list_size" ":" PositiveInt
                     "," "relevance_col" ":" String ] "}"

TargetType ::= "binary" | "numeric" |
               "ordered_list" | "unordered_list"

String ::= <any string>
PositiveInt ::= <any positive integer>
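
As an illustration, a description that follows the grammar above can be held as a nested mapping and checked before use. The sketch below assumes the DsDL file has already been loaded into a Python dict (for example from a YAML file). The column names, the `validate_dsdl` helper, and the rule that `timestamp_col`, `label_col`, and `key_col` must name declared columns are assumptions made for this example, not part of the DTIRS code base.

```python
# Column and target types taken from the DsDL grammar.
SCALAR_TYPES = ("numeric", "binary", "categorical", "ordinal", "textual", "url")
COLUMN_TYPES = set(SCALAR_TYPES) | {f"list_of_{t}" for t in SCALAR_TYPES}
TARGET_TYPES = {"binary", "numeric", "ordered_list", "unordered_list"}


def validate_dsdl(dsdl):
    """Return a list of problems found in a DsDL mapping (empty if none)."""
    errors = []
    names = set()
    for col in dsdl.get("columns", []):
        names.add(col.get("col_name"))
        if col.get("type") not in COLUMN_TYPES:
            errors.append(f"column {col.get('col_name')!r}: unknown type {col.get('type')!r}")

    # Assumption: the timestamp column must be one of the declared columns.
    ts = dsdl.get("timestamp_col")
    if ts is not None and ts not in names:
        errors.append(f"timestamp_col {ts!r} is not a declared column")

    for target in dsdl.get("target", []):
        if target.get("type") not in TARGET_TYPES:
            errors.append(f"target: unknown type {target.get('type')!r}")
        # Assumption: label_col and key_col refer to declared columns.
        for key in ("label_col", "key_col"):
            if target.get(key) not in names:
                errors.append(f"target: {key} {target.get(key)!r} is not a declared column")
        # The grammar makes list_size and relevance_col optional, but only as a pair.
        if ("list_size" in target) != ("relevance_col" in target):
            errors.append("target: list_size and relevance_col must appear together")
        size = target.get("list_size")
        if size is not None and not (isinstance(size, int) and size > 0):
            errors.append("target: list_size must be a positive integer")
    return errors


# Hypothetical description of a click-prediction dataset.
example = {
    "columns": [
        {"col_name": "user_id", "type": "categorical"},
        {"col_name": "item_id", "type": "categorical"},
        {"col_name": "ts", "type": "numeric"},
        {"col_name": "clicked", "type": "binary"},
    ],
    "timestamp_col": "ts",
    "target": [{"type": "binary", "label_col": "clicked", "key_col": "user_id"}],
}
errors = validate_dsdl(example)
print(errors)
```

Returning a list of messages rather than raising on the first problem lets a user see every issue in a description at once.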

Descriptions:

2. Code Example

2.1 Project Structure

├── DCN/                 # DCN example directory
│   ├── Criteo_tiny/     # Criteo tiny dataset
│   ├── dcn.py           # Script of the DCN model
│   ├── main.py          # Main script for running recommendation tasks with the DCN algorithm
│   ├── task.py          # Abstract class defining the task interface
│   ├── user_task.py     # Implementation of DTIRS with the DCN algorithm
│
├── real_dataset/        # Real-world dataset directory
│   ├── binary/          # Binary classification tasks
│   ├── numeric/         # Numeric prediction tasks
│   ├── ordered_list/    # Ordered list recommendation (Top-N)
│   ├── unordered_list/  # Unordered list recommendation (Tag Prediction)
│   ├── data_remove_label.py  # Script to remove labels from dataset
│
├── synthetic_dataset/          # Toy dataset directory for testing
│   ├── binary/          # Binary classification toy dataset
│   ├── numeric/         # Numeric prediction toy dataset
│   ├── ordered_list/    # Ordered list toy dataset
│   ├── unordered_list/  # Unordered list toy dataset
│
├── main.py               # Main script for running recommendation tasks
├── task.py               # Abstract class defining the task interface
├── user_task.py          # Implementation of different recommendation tasks
├── README.md             # Project documentation

GitLab link: https://gitlab.com/dtirs/dtirs.gitlab.io

2.2 Code Dependency

Before running the code, make sure the following Python dependencies are installed: pandas, numpy, pyyaml, scikit-learn, and surprise.

They can be installed manually with pip:

pip install pandas numpy pyyaml scikit-learn surprise

2.3 Running the Code

Execute main.py with the following parameters:

python main.py <train_data_path> <test_data_path> <dsdl_path> <output_path>

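Parameter description: <train_data_path> and <test_data_path> are the paths to the training and test data, <dsdl_path> is the path to the DsDL file that describes the dataset, and <output_path> is the location where the output is written.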

Example usage:

python main.py data/train.csv data/test.csv config/dsdl.yaml output/
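
For readers who want to see how such a four-argument command line might be handled, the sketch below uses Python's standard `argparse` module. It is not the project's actual main.py; the `build_parser` and `parse_cli` helpers and the help strings are assumptions, and only the argument names are taken from the command shown above.

```python
import argparse


def build_parser():
    """Build a parser for the four positional arguments of the command line."""
    parser = argparse.ArgumentParser(
        description="Run a recommendation task described by a DsDL file."
    )
    parser.add_argument("train_data_path", help="path to the training data")
    parser.add_argument("test_data_path", help="path to the test data")
    parser.add_argument("dsdl_path", help="path to the DsDL description")
    parser.add_argument("output_path", help="where the output is written")
    return parser


def parse_cli(argv=None):
    """Parse argv (defaults to sys.argv[1:] when argv is None)."""
    return build_parser().parse_args(argv)


# Parse the example invocation explicitly instead of reading sys.argv.
args = parse_cli(["data/train.csv", "data/test.csv", "config/dsdl.yaml", "output/"])
print(args.train_data_path, args.test_data_path, args.dsdl_path, args.output_path)
```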

2.4 Code Description

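main.py: Main script for running recommendation tasks.

task.py: Abstract class defining the task interface.

user_task.py: Implementation of the different recommendation tasks.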

2.5 Example using DCN algorithm

This section demonstrates how to use the Deep & Cross Network (DCN) model for binary classification tasks. The DCN model combines a deep network with a cross network that builds explicit feature crosses, allowing it to capture complex feature interactions.
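
To make the idea of explicit feature crossing concrete, the sketch below implements the cross layer from the original DCN formulation, x_{l+1} = x_0 (x_l · w_l) + b_l + x_l, in plain Python. It is not the repository's dcn.py; the function names are hypothetical and the weights are illustrative values rather than trained parameters.

```python
def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl . w) + b + xl."""
    if not (len(x0) == len(xl) == len(w) == len(b)):
        raise ValueError("x0, xl, w and b must have the same length")
    # Scalar projection of the current layer input onto the weight vector.
    scale = sum(xl_i * w_i for xl_i, w_i in zip(xl, w))
    # Cross with the original input, then add the bias and the residual term.
    return [x0_i * scale + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]


def cross_network(x0, weights, biases):
    """Stack cross layers; every layer crosses with the original input x0."""
    x = list(x0)
    for w, b in zip(weights, biases):
        x = cross_layer(x0, x, w, b)
    return x


x0 = [1.0, 2.0]
out = cross_network(x0, weights=[[0.5, 0.5]], biases=[[0.0, 0.0]])
print(out)  # [2.5, 5.0]
```

In the full model, the output of the cross network is combined with the output of the deep network before the final prediction layer.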

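a. File Paths and Dataset

The DCN example is located in ./DCN/, which contains the Criteo tiny dataset (Criteo_tiny/), the DCN model script (dcn.py), and its own main.py, task.py, and user_task.py.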

b. Task Description

The UserTask class in ./DCN/user_task.py defines the preprocessing, training, and prediction steps for the binary classification task. The DCN model is integrated during the training phase.
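
The three steps named above can be sketched as an abstract interface with one concrete task. The code below is only an illustration of that structure: the method names `preprocess`, `train`, and `predict` are assumed from the step names, and a majority-class baseline stands in for the DCN model so that the example stays self-contained.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Task(ABC):
    """Hypothetical task interface with the three steps described above."""

    @abstractmethod
    def preprocess(self, rows):
        """Clean or transform raw rows and return the result."""

    @abstractmethod
    def train(self, rows):
        """Fit the model on preprocessed training rows."""

    @abstractmethod
    def predict(self, rows):
        """Return one prediction per input row."""


class MajorityBinaryTask(Task):
    """Stand-in for a binary classification task (not the DCN model)."""

    def __init__(self, label_col):
        self.label_col = label_col
        self.majority = None

    def preprocess(self, rows):
        # Drop rows that contain a missing value.
        return [r for r in rows if all(v is not None for v in r.values())]

    def train(self, rows):
        # Remember the most frequent label in the training data.
        counts = Counter(r[self.label_col] for r in rows)
        self.majority = counts.most_common(1)[0][0]

    def predict(self, rows):
        # Predict the majority label for every row.
        return [self.majority for _ in rows]


train_rows = [
    {"f": 1, "clicked": 1},
    {"f": 2, "clicked": 0},
    {"f": 3, "clicked": 1},
    {"f": None, "clicked": 0},
]
task = MajorityBinaryTask(label_col="clicked")
task.train(task.preprocess(train_rows))
preds = task.predict([{"f": 5}, {"f": 6}])
print(preds)  # [1, 1]
```

Swapping in a different model then only requires a new subclass, while the calling script keeps using the same three methods.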

3. Resources

Example Dataset: https://gitlab.com/dtirs/dtirs.gitlab.io/-/tree/main/synthetic_dataset

Example Code: https://gitlab.com/dtirs/dtirs.gitlab.io/-/blob/main/user_task.py