### Introduction

The set is a logically discrete container, unlike the vector and list containers: its access indexes are discrete and need not be contiguous. When an element is inserted, it is placed in sorted order and its index is checked for duplicates.
The set container in varch keeps these basic properties of a set and is implemented on a **red-black tree**. Insertion and removal are more efficient than in the vector, and random access by index is supported in O(log n) time, far better than the linear traversal a list requires. As a linked structure, the set also supports iterators.

### Interface

#### Creation and Deletion of set Objects
```c
set_t set_create(int dsize);
void set_delete(set_t set);
#define set(type) // For more convenient use, a macro definition is wrapped around set_create
#define _set(set) // A macro definition is wrapped around set_delete, and the set is set to NULL after deletion
```
Here, **set_t** is the handle type of the set. The creation method returns an empty set object, or NULL if creation fails. The `dsize` parameter is the size of each data item. The deletion method deletes the set object passed in. Creation and deletion should be used in pairs: once created, a set object should be deleted when it is no longer in use.
```c
void test(void)
{
    set_t set = set(int); // Define and create a set of the int type
    _set(set); // Use them in pairs and delete it after use
}
```
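The comments above say only that the two macros wrap `set_create` and `set_delete`. One plausible shape for such wrappers is sketched below; the `toy_` names and the stand-in create and delete bodies are hypothetical and are not varch's definitions.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the opaque set object; only dsize is modeled. */
typedef struct TOY_SET { int dsize; } *toy_set_t;

static toy_set_t toy_set_create(int dsize)
{
    toy_set_t s;
    if (dsize <= 0) return NULL;            /* reject a meaningless data size */
    s = (toy_set_t)malloc(sizeof(*s));
    if (s) s->dsize = dsize;
    return s;                               /* NULL if the allocation failed */
}

static void toy_set_delete(toy_set_t s)
{
    free(s);                                /* free(NULL) is a no-op */
}

/* Assumed shape of the convenience wrappers: pass sizeof(type) as dsize,
   and clear the caller's handle after deletion. */
#define toy_set(type)  toy_set_create((int)sizeof(type))
#define toy_unset(s)   do { toy_set_delete(s); (s) = NULL; } while (0)

/* Returns 1 if creation used sizeof(double) and deletion cleared the handle. */
int demo_wrappers(void)
{
    toy_set_t s = toy_set(double);
    int ok;

    if (!s) return 0;
    ok = (s->dsize == (int)sizeof(double));
    toy_unset(s);
    return ok && (s == NULL);
}
```

Clearing the handle in the macro means a second deletion through the same variable becomes a harmless call on NULL, which is presumably why the real `_set` macro does the same.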

#### Insertion and Removal of set
```c
void* set_insert(set_t set, int index, void* data);
int set_erase(set_t set, int index);
```
The set is efficient at insertion and removal. There's no need to shift data; only the node pointers in the tree need to be relinked.
The insertion method adds a specified index and copies the data to this index (when `data` is passed as NULL, it only allocates space without assigning a value). During the process of inserting the index, duplicate checking will be performed to ensure the uniqueness of the index. If the insertion is successful, it returns the address of the inserted data; otherwise, it returns NULL. The removal method removes the data at the specified index. It returns 1 if successful and 0 if failed.
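The insertion contract described above (unique indexes, optional `data`, address of the stored data on success) can be sketched on a plain binary search tree. This is a minimal illustration with hypothetical `dd_` names; it does no red-black rebalancing and is not varch's implementation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical node: the payload is stored right after the header. */
typedef struct DD_NODE
{
    struct DD_NODE *left, *right;
    int index;
} DD_NODE;

typedef struct { DD_NODE *root; int size; int dsize; } DD_SET;

/* Insert `index`; returns the payload address, or NULL if the index already
   exists or allocation fails. A NULL `data` reserves space without copying
   (this toy zero-fills it via calloc). */
void *dd_insert(DD_SET *set, int index, const void *data)
{
    DD_NODE **link = &set->root;
    DD_NODE *node;

    while (*link)                                    /* ordered descent */
    {
        if (index == (*link)->index) return NULL;    /* duplicate index */
        link = (index < (*link)->index) ? &(*link)->left : &(*link)->right;
    }
    node = (DD_NODE *)calloc(1, sizeof(DD_NODE) + (size_t)set->dsize);
    if (!node) return NULL;
    node->index = index;
    if (data) memcpy(node + 1, data, (size_t)set->dsize);
    *link = node;                                    /* relink one pointer, no data is moved */
    set->size++;
    return node + 1;
}

static void dd_free(DD_NODE *n)
{
    if (!n) return;
    dd_free(n->left);
    dd_free(n->right);
    free(n);
}

/* Returns 1 if the duplicate was rejected and three nodes were stored. */
int demo_insert(void)
{
    DD_SET set = { NULL, 0, (int)sizeof(int) };
    int v, ok;
    int *first, *reserved;

    v = 5; first = (int *)dd_insert(&set, 5, &v);
    v = 3; dd_insert(&set, 3, &v);
    v = 7; ok = (dd_insert(&set, 5, &v) == NULL);    /* index 5 already exists */
    reserved = (int *)dd_insert(&set, 9, NULL);      /* space only */
    ok = ok && first && *first == 5 && reserved && set.size == 3;
    dd_free(set.root);
    return ok;
}
```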

#### Reading and Writing of set Data
```c
void* set_data(set_t set, int index);
void* set_error(set_t set);
#define set_at(set, type, i)
```
`set_data` returns the address of the data stored at the given index. On failure it returns `set_error()` rather than NULL. `set_at` builds on `set_data` by adding the data type, so the result can be read or assigned directly. This gives `set_data` a degree of read-write protection: if `set_at` is used with a wrong `i`, the write lands in the area that `set_error()` points to instead of crashing the program.
Random access in the set differs from that of arrays, whose addresses are contiguous, and of lists, whose addresses are not. An array can compute the address of a given index directly, while a list must follow links one node at a time. The set uses a **red-black tree** and reaches a given index quickly by binary search down the tree.
```c
void test(void)
{
    set_t set = set(int);
    int i;

    for (i = 0; i < 100; i++)
    {
        set_insert(set, i, &i);
    }
    i = -100; set_insert(set, i, &i);
    i = 1024; set_insert(set, i, &i);

    printf("set[6] = %d\r\n", set_at(set, int, 6));
    printf("set[-100] = %d\r\n", set_at(set, int, -100));
    printf("set[1024] = %d\r\n", set_at(set, int, 1024));

    set_at(set, int, 6) = 11111;
    printf("set[6] = %d\r\n", set_at(set, int, 6));

    _set(set);
}
```
**Results**:
```
set[6] = 6
set[-100] = -100
set[1024] = 1024
set[6] = 11111
```
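The protection that `set_error()` provides can be illustrated in isolation. The sketch below uses a fixed sorted table in place of the tree and hypothetical `slot_` names; it only shows the idea of returning a scratch address instead of NULL and is not how varch stores its data.

```c
#include <stddef.h>

#define SLOT_COUNT 3

static const int slot_index[SLOT_COUNT] = { -100, 6, 1024 };  /* sorted indexes */
static int slot_value[SLOT_COUNT] = { -100, 6, 1024 };        /* stored data */
static int slot_scratch;                                      /* plays the role of the error area */

/* Address handed back for every failed lookup. */
void *slot_error(void)
{
    return &slot_scratch;
}

/* Binary search over the sorted indexes; never returns NULL. */
void *slot_data(int index)
{
    int lo = 0, hi = SLOT_COUNT - 1;

    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;
        if (slot_index[mid] == index) return &slot_value[mid];
        if (slot_index[mid] < index) lo = mid + 1;
        else hi = mid - 1;
    }
    return slot_error();
}

/* Typed access on top of slot_data, usable as an lvalue. */
#define slot_at(type, i) (*(type *)slot_data(i))

/* A find can be expressed as "lookup did not return the error area". */
int slot_find(int index)
{
    return slot_data(index) != slot_error();
}

/* Returns 1 if a write through a wrong index left the stored data intact. */
int demo_slot(void)
{
    int ok = slot_find(6) && !slot_find(3);

    slot_at(int, 3) = 100;                   /* wrong index: write lands in the scratch slot */
    ok = ok && slot_at(int, 6) == 6;         /* stored data is untouched */
    ok = ok && *(int *)slot_error() == 100;  /* the scratch slot absorbed the write */
    return ok;
}
```

The trade-off of this design is that a wrong index fails silently, so callers that need to know whether an index exists should check first.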

#### Size of set and Data Size
```c
int set_size(set_t set);
int set_dsize(set_t set);
```
`set_size` returns the number of elements in the set, much like the length of an array. `set_dsize` returns the data size passed in at creation.
```c
void test(void)
{
    set_t set = set(int);
    int i;

    for (i = 0; i < 100; i++)
    {
        set_insert(set, i, &i);
    }
    printf("size = %d, data size = %d\r\n", set_size(set), set_dsize(set));

    _set(set);
}
```
**Results**:
```
size = 100, data size = 4
```

#### Set Finding
```c
int set_find(set_t set, int index);
```
This method is actually implemented by wrapping `set_data`. It returns 1 if the find is successful and 0 if it fails.

#### Set Iterator
```c
void set_it_init(set_t set, int orgin);
void* set_it_get(set_t set, int *out_index);
```
The set has a built-in iterator, used mainly for traversal. A list can be traversed by incrementing an index from 0, but the indexes of a set are discrete, so two iterator functions are provided instead.
`set_it_init` initializes the iterator. Passing `SET_HEAD` or `SET_TAIL` as `orgin` selects forward or reverse iteration respectively.
`set_it_get` returns the data at the current iteration position and then advances the iterator. The current index is written to `*out_index`; pass NULL if it is not needed.
The number of iterations is controlled by `set_size`.
```c
void test(void)
{
    set_t set = set(int);
    int i, index;
    void *data;

    i = -100; set_insert(set, i, &i);
    i = 1024; set_insert(set, i, &i);
    i = 0; set_insert(set, i, &i);
    i = 7; set_insert(set, i, &i);
    i = -2; set_insert(set, i, &i);
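    i = -2; set_insert(set, i, &i); // Duplicate index: rejected, the set is unchanged

    set_at(set, int, 3) = 100; // Index 3 does not exist: the write lands in the set_error() area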

    set_it_init(set, SET_HEAD);
    i = set_size(set);
    while (i--)
    {
        data = set_it_get(set, &index);
        printf("set[%d] = %d\r\n", index, *(int *)data);
    }

    _set(set);
}
```
**Results**:
```
set[-100] = -100
set[-2] = -2
set[0] = 0
set[7] = 7
set[1024] = 1024
```

### Source Code Analysis

#### set Structure

All structures of the set container are opaque: their members cannot be accessed directly. This keeps the module independent and safe, and prevents external code from modifying members and damaging the set's storage structure. The header file therefore contains only a single declaration of the set type, while the structure definitions live in the source file. Set objects can be operated on only through the methods the set container provides.
The declaration of the set type:
```c
typedef struct SET *set_t;
```
When using it, just use `set_t`.
```c
/* type of set */
typedef struct SET
{
    NODE* root;                          /* root node */
    NODE* nil;                          /* nil node */
    NODE* iterator;                      /* iterator of set */
    int orgin;                          /* iterator orgin */
    int size;                          /* set size */
    int dsize;                          /* data size */
} SET;
```
The `SET` structure contains 6 members: `root` (the root node of the red-black tree), `nil` (the nil node of the red-black tree), `iterator` (the node the iterator currently points to), `orgin` (the starting position of the iterator), `size` (the size of the set, that is, the total number of nodes in its red-black tree), and `dsize` (the size of each data item).
```c
/* set node type define */
typedef struct NODE
{
    struct NODE *parent;
    struct NODE *left;
    struct NODE *right;
    int color;
    int index;
} NODE;
#define data(node) ((node)+1) /* data of node */
```
In the `NODE` structure, the explicit members are `parent`, `left`, and `right`, the pointers that form the tree; `color`, the color of the red-black tree node; and `index`. As in the `list`, the data field is not an explicit member: it immediately follows the node header in memory, and its size is determined at allocation time.
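The layout behind the `data(node)` macro can be sketched as a single allocation that holds the header followed by the payload. The `blob_` names and the constructor below are hypothetical; only the `(node)+1` arithmetic is taken from the listing above.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header mirroring the fields listed above. */
typedef struct BLOB_NODE
{
    struct BLOB_NODE *parent, *left, *right;
    int color;
    int index;
} BLOB_NODE;

#define blob_data(node) ((void *)((node) + 1))   /* first byte after the header */

/* One allocation holds the header followed by dsize payload bytes. */
BLOB_NODE *blob_node_new(int index, const void *data, size_t dsize)
{
    BLOB_NODE *node = (BLOB_NODE *)malloc(sizeof(BLOB_NODE) + dsize);

    if (!node) return NULL;
    memset(node, 0, sizeof(BLOB_NODE));
    node->index = index;
    if (data) memcpy(blob_data(node), data, dsize);
    return node;
}

/* Returns 1 if the payload sits exactly sizeof(BLOB_NODE) bytes into the
   block and reads back unchanged. */
int demo_layout(void)
{
    int value = 1024, ok;
    BLOB_NODE *node = blob_node_new(7, &value, sizeof(value));

    if (!node) return 0;
    ok = ((char *)blob_data(node) - (char *)node == (ptrdiff_t)sizeof(BLOB_NODE))
      && (*(int *)blob_data(node) == 1024);
    free(node);
    return ok;
}
```

Because the payload starts right after the header, it is aligned only as strictly as the header itself; a payload type with a stricter alignment requirement would need extra padding.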

#### Red-Black Tree

The red-black tree is a binary search tree in which each node carries one extra storage bit for its color, Red or Black. By constraining how the nodes on any path from the root to a leaf may be colored, the red-black tree ensures that no such path is more than twice as long as any other, so the tree stays approximately balanced.
The storage structure of the set is completely based on the red-black tree. Here, we won't go into details about the red-black tree.
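For readers who want to see the coloring rules at work, the following sketch checks two of them on a small hand-built tree: a red node has no red child, and every path to a leaf passes through the same number of black nodes. The `RB_` types and color values are hypothetical, NULL stands in for the nil leaf, and the checker does not verify key order or the root's color.

```c
#include <stddef.h>

enum { RB_BLACK = 0, RB_RED = 1 };           /* hypothetical color encoding */

typedef struct RB_NODE
{
    struct RB_NODE *left, *right;
    int color;
    int index;
} RB_NODE;

/* Returns the black height of the subtree (the nil leaf counts as 1),
   or -1 if a red node has a red child or two paths disagree. */
int rb_black_height(const RB_NODE *n)
{
    int lh, rh;

    if (!n) return 1;
    if (n->color == RB_RED &&
        ((n->left && n->left->color == RB_RED) ||
         (n->right && n->right->color == RB_RED))) return -1;
    lh = rb_black_height(n->left);
    rh = rb_black_height(n->right);
    if (lh < 0 || rh < 0 || lh != rh) return -1;
    return lh + (n->color == RB_BLACK);
}

/* Shortest root-to-leaf path, counted in nodes. */
int rb_min_depth(const RB_NODE *n)
{
    int l, r;
    if (!n) return 0;
    l = rb_min_depth(n->left);
    r = rb_min_depth(n->right);
    return 1 + (l < r ? l : r);
}

/* Longest root-to-leaf path, counted in nodes. */
int rb_max_depth(const RB_NODE *n)
{
    int l, r;
    if (!n) return 0;
    l = rb_max_depth(n->left);
    r = rb_max_depth(n->right);
    return 1 + (l > r ? l : r);
}

/* Returns 1 if the sample tree is valid, its longest path is at most twice
   its shortest, and recoloring one node is detected as a violation. */
int demo_rb_check(void)
{
    RB_NODE n17 = { NULL, NULL, RB_RED,   17 };
    RB_NODE n5  = { NULL, NULL, RB_BLACK,  5 };
    RB_NODE n15 = { NULL, &n17, RB_BLACK, 15 };
    RB_NODE n30 = { NULL, NULL, RB_BLACK, 30 };
    RB_NODE n10 = { &n5,  &n15, RB_RED,   10 };
    RB_NODE n20 = { &n10, &n30, RB_BLACK, 20 };
    int ok;

    ok = rb_black_height(&n20) == 3;
    ok = ok && rb_max_depth(&n20) <= 2 * rb_min_depth(&n20);
    n17.color = RB_BLACK;                    /* adds a black node to one path only */
    ok = ok && rb_black_height(&n20) == -1;
    return ok;
}
```

In the sample tree the shortest path has 2 nodes and the longest has 4, which is exactly the factor-of-two bound.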

#### Iterator

As mentioned before, the set has a built-in iterator. So how does this iterator work?
Let's look at the source code:
```c
void set_it_init(set_t set, int orgin)
{
    if (!set) return;
    set->orgin = (orgin==SET_HEAD)?SET_HEAD:SET_TAIL;
    set->iterator = (set->orgin==SET_HEAD)?(NODE*)node_min(set, set->root):(NODE*)node_max(set, set->root); // According to the starting point, iterate the iterator to the minimum or maximum node of the root node
}
void* set_it_get(set_t set, int *out_index)
{
    NODE *node;
    if (!set) return NULL;
    node = set->iterator;
    set->iterator = (set->orgin==SET_HEAD)?node_next(set, set->iterator):node_prev(set, set->iterator); // According to the iteration direction, choose to iterate to the next or previous node
    if (out_index) *out_index = node->index; // Output the index of the iteration
    return data(node);
}
```
So how are `node_prev` and `node_next` implemented in the red-black tree?
In the red-black tree, the current node is larger than all the nodes in its left subtree and smaller than any node in its right subtree.
So when getting the next node, we need to find the smallest node that is larger than the current node. The priority for searching is as follows:
1. Nodes in the right subtree. When looking at the nodes in the right subtree, the smallest node in the right subtree is on the leftmost side of the right subtree.
2. Parent node. When looking at the parent node, we also need to distinguish whether the current node is the left or right child of the parent node. If it's the left child, then the parent node is larger than the current node, and we can directly return the parent node. If it's the right child, then the parent node is smaller than the current node, so the search continues upward; the listing below returns the "grandparent node" in this case. In general, the next node is the first ancestor that is reached from its left subtree, which is the grandparent only when the parent is itself a left child.
Similarly, the logic for `prev` is just the opposite.
How to end the iteration? We can control the number of iterations through the size of the set. If we keep iterating, the iterator will finally stay at the `nil` position.
```c
static NODE* node_next(set_t set, NODE* node)
{
    if (node->right!= set->nil) // There is a right subtree
    {
        node = node->right; // First, go to the right child node
        node = node_min(set, node); // Then, go to the leftmost node (the smallest node) of the right subtree
    }
    else // There is no right subtree
    {
        if (node == node->parent->left) node = node->parent; // If the current node is the left child of the parent node, return the parent node
        else node = node->parent->parent; // If the current node is the right child of the parent node, return the "grandparent node"
    }
    return node;
}

static NODE* node_prev(set_t set, NODE* node)
{
    if (node->left!= set->nil)
    {
        node = node->left;
        node = node_max(set, node);
    }
    else
    {
        if (node == node->parent->right) node = node->parent;
        else node = node->parent->parent;
    }
    return node;
}
```

These functions related to the iterator implement the traversal operation in the set based on the characteristics of the red-black tree, enabling us to access each element in the set in an orderly manner.
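The listing above handles the right-child case by stepping to the grandparent. A textbook in-order successor instead keeps climbing until it arrives at an ancestor from that ancestor's left side. The sketch below shows that general form on hypothetical `IT_` types with a shared `nil` sentinel; it is an illustration and not varch's code.

```c
#include <stddef.h>

typedef struct IT_NODE
{
    struct IT_NODE *parent, *left, *right;
    int index;
} IT_NODE;

/* Leftmost node of a subtree; `nil` is the shared sentinel leaf. */
static IT_NODE *it_min(IT_NODE *nil, IT_NODE *n)
{
    while (n != nil && n->left != nil) n = n->left;
    return n;
}

/* In-order successor; returns `nil` after the largest node. */
IT_NODE *it_next(IT_NODE *nil, IT_NODE *n)
{
    IT_NODE *p;

    if (n->right != nil) return it_min(nil, n->right);
    p = n->parent;
    while (p != nil && n == p->right)        /* climb while we came from the right */
    {
        n = p;
        p = p->parent;
    }
    return p;
}

/* Builds the tree 20(10(5, 15(-, 17)), 30), walks it in ascending order,
   stores the indexes in `out`, and returns how many were visited. */
int demo_walk(int *out, int cap)
{
    static const int idx[6] = { 20, 10, 30, 5, 15, 17 };
    IT_NODE nil, n[6];
    IT_NODE *cur;
    int i, count = 0;

    nil.parent = nil.left = nil.right = &nil;
    nil.index = 0;
    for (i = 0; i < 6; i++)
    {
        n[i].parent = n[i].left = n[i].right = &nil;
        n[i].index = idx[i];
    }
    n[0].left  = &n[1]; n[1].parent = &n[0]; /* 20 -> 10 */
    n[0].right = &n[2]; n[2].parent = &n[0]; /* 20 -> 30 */
    n[1].left  = &n[3]; n[3].parent = &n[1]; /* 10 -> 5  */
    n[1].right = &n[4]; n[4].parent = &n[1]; /* 10 -> 15 */
    n[4].right = &n[5]; n[5].parent = &n[4]; /* 15 -> 17 */

    for (cur = it_min(&nil, &n[0]); cur != &nil && count < cap; cur = it_next(&nil, cur))
    {
        out[count++] = cur->index;
    }
    return count;
}
```

Node 17 is the interesting case here: it is a right child whose parent 15 is also a right child, so the walk has to climb past 15 and 10 before it reaches 20.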