varch/doc/dict.en.md

372 lines
17 KiB
Markdown

## Introduction
The `dict` dictionary is a logically discrete container, which is quite similar to the `set` container. The `set` container exists in the form of `index - data`, while the `dict` container exists in the form of `key - value`. The `index` of the `set` is an integer number, and the `key` of the `dict` is a string (here it's different from Python's `dict`, as Python's `key` can be not only a string but also an integer or a tuple, etc. In varch, something similar to Python's `dict` is called a `map` mapping). The `data` of the `set` and the `value` of the `dict` are essentially the same thing.
The `dict` container in varch adopts a **hash table** in its underlying implementation. It enables quick lookups, supports random access, and occupies relatively small space. The `dict` can be traversed through an iterator.
## Interface
### Creating and Deleting dict Objects
```c
dict_t dict_create(int dsize);
void dict_delete(dict_t dict);
#define dict(type) // For more convenient use, a macro definition is wrapped around dict_create
#define _dict(dict) // A macro definition is wrapped around dict_delete, and the dict is set to NULL after deletion
```
Here, **dict_t** is the structure of `dict`. The creation method will return an empty `dict` object if successful, or NULL if it fails. The parameter `dsize` is used to pass in the size of the data. The deletion method is used to delete the input `dict` object. The creation and deletion methods should be used in pairs. Once created, it should be deleted when it's no longer in use.
```c
void test(void)
{
dict_t dict = dict(int); // Define and create a dict of type int
_dict(dict); // Use them in pairs and delete it after use
}
```
### Inserting and Removing in dict
```c
void* dict_insert(dict_t dict, const char *key, void *value);
int dict_erase(dict_t dict, const char *key);
```
This `dict` locates data through hash values and can quickly find the storage location of data based on hash values.
The insertion method is used to add a specified `key` and copy the data to this key (when `value` is passed in as NULL, only space will be allocated without assignment). During the process of inserting the `key`, duplicate checking will be performed to ensure the uniqueness of the `key`. If the insertion is successful, the address of the inserted data will be returned; otherwise, NULL will be returned. The removal method is used to remove the data of the specified key. It returns 1 if successful and 0 if it fails.
### Reading and Writing of dict Data
```c
void* dict_value(dict_t dict, const char *key);
void* dict_error(dict_t dict);
#define dict_at(dict, type, key)
```
The `dict_value` method is used to obtain the address of the data according to the key and returns the address of the specified data. `dict_error()` is used to indicate failure. The `dict_at` method adds a type on the basis of `dict_value`. The `dict_value` has a read-write protection mechanism. Since `dict_error()` is returned instead of NULL, when using the `dict_at` method and if the key is written incorrectly, the content pointed to by `dict_error()` will be modified instead of causing a crash.
The random access of `dict` is achieved by calculating the hash value of the key and locating the data in the hash table according to the hash value.
```c
void test(void)
{
dict_t dict = dict(int);
int value;
value = 100; dict_insert(dict, "hello", &value);
value = 1; dict_insert(dict, "ZhangSan", &value);
value = 2; dict_insert(dict, "LiSi", &value);
value = 3; dict_insert(dict, "WangWu", &value);
value = 4; dict_insert(dict, "SunLiu", &value);
value = 5; dict_insert(dict, "QianQi", &value);
printf("dict[hello] = %d\r\n", dict_at(dict, int, "hello"));
printf("dict[SunLiu] = %d\r\n", dict_at(dict, int, "SunLiu"));
_dict(dict);
}
```
Result:
```
dict[hello] = 100
dict[SunLiu] = 4
```
### Size and Data Size of dict
```c
int dict_size(dict_t dict);
int dict_dsize(dict_t dict);
```
The `size` of `dict` is similar to the size of an array, and `dsize` is the size of the data passed in during creation.
```c
void test(void)
{
dict_t dict = dict(int);
int value;
value = 100; dict_insert(dict, "hello", &value);
value = 1; dict_insert(dict, "ZhangSan", &value);
value = 2; dict_insert(dict, "LiSi", &value);
value = 3; dict_insert(dict, "WangWu", &value);
value = 4; dict_insert(dict, "SunLiu", &value);
value = 5; dict_insert(dict, "QianQi", &value);
printf("size = %d, value size = %d\r\n", dict_size(dict), dict_dsize(dict));
_dict(dict);
}
```
Result:
```
size = 6, value size = 4
```
### dict Search
```c
int dict_find(dict_t dict, const char *key);
```
This method is actually implemented by wrapping `dict_value`. It returns 1 if the search is successful and 0 if it fails.
### dict Iterator
```c
void dict_it_init(dict_t dict);
void* dict_it_get(dict_t dict, char **key);
```
The `dict` also supports a built-in iterator, which is mainly used for traversing. Since when traversing an array, we know that the keys start from 0 and can be traversed one by one in an increasing order. However, the `dict` has discrete keys, so it can't be traversed in this incremental way. Therefore, two iterator functions are provided here for traversing the `dict`.
The `dict_it_init` function initializes the iterator, and the `dict_it_get` function obtains the iteration and updates the iteration position. The `*key` is the output key (the current key, or NULL can be passed in if not needed), and the function returns the data at the iteration position. The number of iterations is controlled by `dict_size`.
```c
void test(void)
{
dict_t dict = dict(int);
int value;
char *key;
void *data;
int i;
value = 100; dict_insert(dict, "hello", &value);
value = 1; dict_insert(dict, "ZhangSan", &value);
value = 2; dict_insert(dict, "LiSi", &value);
value = 3; dict_insert(dict, "WangWu", &value);
value = 4; dict_insert(dict, "SunLiu", &value);
value = 5; dict_insert(dict, "QianQi", &value);
dict_it_init(dict, DICT_HEAD);
i = dict_size(dict);
while (i--)
{
data = dict_it_get(dict, &key);
printf("dict[%s] = %d\r\n", key, *(int *)data);
}
_dict(dict);
}
```
Result:
```
dict[LiSi] = 2
dict[QianQi] = 5
dict[SunLiu] = 4
dict[WangWu] = 3
dict[ZhangSan] = 1
dict[hello] = 100
```
## Source Code Analysis
### dict Structure
All the structures of the `dict` container are implicit, which means that the structure members cannot be directly accessed. This way ensures the independence and security of the module and prevents external calls from modifying the structure members and thus destroying the `dict` storage structure. Therefore, the `dict` parser only leaves a single `dict` declaration in the header file, and the definitions of the structures are in the source file. Only the methods provided by the `dict` container can be used to operate on `dict` objects.
The `dict` type declaration:
```c
typedef struct DICT *dict_t;
```
When using it, just use `dict_t`.
```c
/* dict type define */
typedef struct DICT
{
groove_t *base; /* base address for groove data */
void *error; /* error space */
int vsize; /* size of value */
unsigned int size; /* size of dict */
unsigned int capacity; /* capacity of dict */
unsigned int it; /* iterator index */
} DICT;
```
The `DICT` structure contains 6 members. `base` is the base address of the hash table, `error` is the error area of `dict` (when randomly accessing a non-existent key, the address of this error area will be returned, so that when using the `at` method, it won't operate on invalid memory addresses), `size` is the size (length) of `dict`, `vsize` is the size of each data, `capacity` is the capacity of the hash table, and `it` is used to record the current accessed hash table index during iterator traversal.
The main problems that the `dict` container needs to solve are hash collisions and the adjustment of the hash table capacity.
```c
/* dict node type define */
typedef struct
{
unsigned int hash; /* hash value */
char *key; /* key */
void *value; /* value */
} GROOVE, *groove_t;
```
In `dict`, data is stored through an array. Each array item only stores the address of each slot (groove), and each slot stores three parts of content: the hash value of the current slot in the current hash table (for faster lookup of key-value pairs with hash collisions), and the actual stored key-value pairs.
```
+------+------------------------+------------------------+
| hash | key | value |
+------+------------------------+------------------------+
|... | ... | ... |
+------+------------------------+------------------------+
|... | ... | ... |
+------+------------------------+------------------------+
```
### dict Creation and Deletion
```c
dict_t dict_create(int vsize)
{
dict_t dict;
if (vsize <= 0) return NULL;
dict = (dict_t)malloc(sizeof(DICT));
if (!dict) return NULL;
dict->error = malloc(vsize);
if (!dict->error) { free(dict); return NULL; }
dict->base = NULL;
dict->vsize = vsize;
dict->size = 0;
dict->capacity = 0;
dict->it = 0;
return dict;
}
```
The creation of `dict` only creates an empty `dict` and initializes the basic parameters. The deletion method is used to release the memory space allocated during the operation of `dict`.
### dict Insertion
```c
void* dict_insert(dict_t dict, const char *key, void *value)
{
groove_t groove = NULL;
unsigned int hash = 0, index;
int len = 0;
/*
Check the validity of the input parameters.
*/
if (!dict) return NULL;
if (!key) return NULL;
/*
Check if the size has exceeded 3/4 of the hash table capacity. If it exceeds 3/4, resize the hash table.
Because when it exceeds 3/4, it means the hash table is quite full and the lookup efficiency will be greatly reduced. So some free space in the hash table should be maintained.
The resizing here changes in powers of 2, such as growing as 4, 8, 16, 32...
*/
/* the current capacity affects the search rate and needs to be expanded */
if (dict->size >= ((dict->capacity >> 2) + (dict->capacity >> 1))) /* size exceeds 3/4 of capacity */
{
/*
During the resizing process, first allocate a new space with a new capacity,
then rehash and insert the members in the old hash table back into the new hash table one by one.
*/
/* allocate new hash table space */
if (!dict_resize(dict, dict->capacity < MIN_CAPACITY? MIN_CAPACITY : dict->capacity << 1)) return NULL;
}
/*
Calculate the hash value of the inserted key (the bkdr hash algorithm is used by default), and take the modulus of the hash value with the hash table capacity.
Then check if there is a hash collision. If there is a collision, use the open addressing method, linearly adding 1 to search for free space backward.
During the backward search, check the hash value in the corresponding slot. A hash value of -1 indicates that it has been removed by the `erase` method but the actual slot has not been deleted, and it is also regarded as an available slot.
*/
/* find a free groove */
len = strlen(key);
hash = hash_bkdr((void *)key, len) % dict->capacity;
index = hash;
while (dict->base[index] && dict->base[index]->hash!= -1)
{
index = (index + 1) % dict->capacity;
if (index == hash) return NULL;
}
/*
Allocate the space for the slot.
*/
/* space allocation */
groove = dict->base[index];
if (!groove) groove = (groove_t)malloc(sizeof(GROOVE));
if (!groove) return NULL;
groove->key = (char *)malloc(len + 1);
if (!groove->key) { if (!dict->base[index]) free(groove); return NULL; }
groove->value = malloc(dict->vsize);
if (!groove->value) { free(groove->key), groove->key = NULL; if (!dict->base[index]) free(groove); return NULL; }
/* assign */
groove->hash = hash; // Record the hash value of the current slot
strcpy(groove->key, key);
if (value) memcpy(groove->value, value, dict->vsize);
/* insert */
dict->base[index] = groove;
dict->size++;
return groove->value;
}
```
Actual storage example:
```c
value = 100; dict_insert(dict, "hello", &value);
value = 1; dict_insert(dict, "ZhangSan", &value);
value = 2; dict_insert(dict, "LiSi", &value);
value = 3; dict_insert(dict, "WangWu", &value);
value = 4; dict_insert(dict, "SunLiu", &value);
value = 5; dict_insert(dict, "QianQi", &value);
value = 8; dict_insert(dict, "WangBa", &value);
value = 9; dict_insert(dict, "LiuJiu", &value);
```
Hash table:
```
+------+------------------------+------------------------+
| hash | key | value |
+------+------------------------+------------------------+
| 0 | ZhangSan | 1 |
+------+------------------------+------------------------+
| 1 | QianQi | 5 |
+------+------------------------+------------------------+
| | | |
+------+------------------------+------------------------+
| | | |
+------+------------------------+------------------------+
| 4 | SunLiu | 4 |
+------+------------------------+------------------------+
| | | |
+------+------------------------+------------------------+
| 6 | WangBa | 8 |
+------+------------------------+------------------------+
| 7 | LiSi | 2 |
+------+------------------------+------------------------+
| | | |
+------+------------------------+------------------------+
| 9 | WangWu | 3 |
+------+------------------------+------------------------+
| | | |
+------+------------------------+------------------------+
| | | |
+------+------------------------+------------------------+
| | | |
+------+------------------------+------------------------+
| | | |
+------+------------------------+------------------------+
| 14 | hello | 100 |
+------+------------------------+------------------------+
| 14 | LiuJiu | 9 |
+------+------------------------+------------------------+
```
### Removal from the `dict`
```c
int dict_erase(dict_t dict, const char *key)
{
groove_t groove;
unsigned int index, next;
// Check if the dictionary pointer is NULL. If so, return 0 to indicate failure of the removal operation.
if (!dict) return 0;
// Check if the key pointer is NULL. If so, also return 0 as the operation can't proceed without a valid key.
if (!key) return 0;
/*
To remove an element, we first need to find the position of the key in the hash table.
The `find_index` function follows a similar logic as the insertion method described earlier. If the key is not found, it will return -1.
*/
index = find_index(dict, key);
// If the `find_index` function returns -1, meaning the key was not found in the hash table, return 0 to indicate failure of the removal.
if (index == -1) return 0;
groove = dict->base[index];
// If the `key` pointer in the `groove` structure is not NULL, free the memory occupied by the key string and set the pointer to NULL to avoid dangling pointers.
if (groove->key) { free(groove->key); groove->key = NULL; }
// Similarly, if the `value` pointer in the `groove` structure is not NULL, free the memory occupied by the value and set the pointer to NULL.
if (groove->value) { free(groove->value); groove->value = NULL; }
// Instead of actually freeing the entire `groove` structure, we just set its `hash` value to -1. A hash value of -1 indicates that this slot is logically removed (although the structure itself isn't deallocated yet). In the hash table, a hash value of -1 is not used for normal elements.
groove->hash = -1;
// Decrease the size of the dictionary by 1 to reflect the removal of one key-value pair.
dict->size--;
/*
Just like with the insertion method where the size and capacity need to be adjusted, the same applies to the removal operation.
When the number of elements in the dictionary (`size`) is less than one-fourth of the capacity, we resize the hash table by reducing its capacity to half. This way, the dictionary will occupy about half of its previous capacity.
*/
if (dict->capacity > MIN_CAPACITY && dict->size <= (dict->capacity >> 2)) /* size less than 1/4 of capacity */
{
dict_resize(dict, dict->capacity >> 1);
}
// If all the above operations are completed successfully, return 1 to indicate that the removal operation was successful.
return 1;
}
```
This function `dict_erase` is mainly responsible for removing a key-value pair from the `dict` data structure. It first locates the position of the key in the hash table. Then, it frees the memory associated with the key and value, marks the slot as logically removed, adjusts the size of the dictionary, and finally may resize the hash table if necessary to optimize memory usage and maintain the efficiency of the hash table structure.