varch/doc/dict.en.md

17 KiB

Introduction

The dict dictionary is a logically discrete container, which is quite similar to the set container. The set container exists in the form of index - data, while the dict container exists in the form of key - value. The index of the set is an integer number, and the key of the dict is a string (here it's different from Python's dict, as Python's key can be not only a string but also an integer or a tuple, etc. In varch, something similar to Python's dict is called a map mapping). The data of the set and the value of the dict are essentially the same thing. The dict container in varch adopts a hash table in its underlying implementation. It enables quick lookups, supports random access, and occupies relatively small space. The dict can be traversed through an iterator.

Interface

Creating and Deleting dict Objects

dict_t dict_create(int dsize);
void dict_delete(dict_t dict);
#define dict(type) // For more convenient use, a macro definition is wrapped around dict_create
#define _dict(dict) // A macro definition is wrapped around dict_delete, and the dict is set to NULL after deletion

Here, dict_t is the structure of dict. The creation method will return an empty dict object if successful, or NULL if it fails. The parameter dsize is used to pass in the size of the data. The deletion method is used to delete the input dict object. The creation and deletion methods should be used in pairs. Once created, it should be deleted when it's no longer in use.

void test(void)
{
    dict_t dict = dict(int); // Define and create a dict of type int
    _dict(dict); // Use them in pairs and delete it after use
}

Inserting and Removing in dict

void* dict_insert(dict_t dict, const char *key, void *value);
int dict_erase(dict_t dict, const char *key);

This dict locates data through hash values and can quickly find the storage location of data based on hash values. The insertion method is used to add a specified key and copy the data to this key (when value is passed in as NULL, only space will be allocated without assignment). During the process of inserting the key, duplicate checking will be performed to ensure the uniqueness of the key. If the insertion is successful, the address of the inserted data will be returned; otherwise, NULL will be returned. The removal method is used to remove the data of the specified key. It returns 1 if successful and 0 if it fails.

Reading and Writing of dict Data

void* dict_value(dict_t dict, const char *key);
void* dict_error(dict_t dict);
#define dict_at(dict, type, key)

The dict_value method is used to obtain the address of the data according to the key and returns the address of the specified data. dict_error() is used to indicate failure. The dict_at method adds a type on the basis of dict_value. The dict_value has a read-write protection mechanism. Since dict_error() is returned instead of NULL, when using the dict_at method and if the key is written incorrectly, the content pointed to by dict_error() will be modified instead of causing a crash. The random access of dict is achieved by calculating the hash value of the key and locating the data in the hash table according to the hash value.

void test(void)
{
    dict_t dict = dict(int);
	int value;

	value = 100; dict_insert(dict, "hello", &value);
	value = 1; dict_insert(dict, "ZhangSan", &value);
	value = 2; dict_insert(dict, "LiSi", &value);
	value = 3; dict_insert(dict, "WangWu", &value);
	value = 4; dict_insert(dict, "SunLiu", &value);
	value = 5; dict_insert(dict, "QianQi", &value);

	printf("dict[hello] = %d\r\n", dict_at(dict, int, "hello"));
	printf("dict[SunLiu] = %d\r\n", dict_at(dict, int, "SunLiu"));

	_dict(dict);
}

Result:

dict[hello] = 100
dict[SunLiu] = 4

Size and Data Size of dict

int dict_size(dict_t dict);
int dict_dsize(dict_t dict);

The size of dict is similar to the size of an array, and dsize is the size of the data passed in during creation.

void test(void)
{
    dict_t dict = dict(int);
	int value;

	value = 100; dict_insert(dict, "hello", &value);
	value = 1; dict_insert(dict, "ZhangSan", &value);
	value = 2; dict_insert(dict, "LiSi", &value);
	value = 3; dict_insert(dict, "WangWu", &value);
	value = 4; dict_insert(dict, "SunLiu", &value);
	value = 5; dict_insert(dict, "QianQi", &value);

	printf("size = %d, value size = %d\r\n", dict_size(dict), dict_dsize(dict));

	_dict(dict);
}

Result:

size = 6, value size = 4
int dict_find(dict_t dict, const char *key);

This method is actually implemented by wrapping dict_value. It returns 1 if the search is successful and 0 if it fails.

dict Iterator

void dict_it_init(dict_t dict);
void* dict_it_get(dict_t dict, char **key);

The dict also supports a built-in iterator, which is mainly used for traversing. Since when traversing an array, we know that the keys start from 0 and can be traversed one by one in an increasing order. However, the dict has discrete keys, so it can't be traversed in this incremental way. Therefore, two iterator functions are provided here for traversing the dict. The dict_it_init function initializes the iterator, and the dict_it_get function obtains the iteration and updates the iteration position. The *key is the output key (the current key, or NULL can be passed in if not needed), and the function returns the data at the iteration position. The number of iterations is controlled by dict_size.

void test(void)
{
    dict_t dict = dict(int);
	int value;
	char *key;
	void *data;
	int i;

	value = 100; dict_insert(dict, "hello", &value);
	value = 1; dict_insert(dict, "ZhangSan", &value);
	value = 2; dict_insert(dict, "LiSi", &value);
	value = 3; dict_insert(dict, "WangWu", &value);
	value = 4; dict_insert(dict, "SunLiu", &value);
	value = 5; dict_insert(dict, "QianQi", &value);

	dict_it_init(dict, DICT_HEAD);
	i = dict_size(dict);
	while (i--)
	{
		data = dict_it_get(dict, &key);
		printf("dict[%s] = %d\r\n", key, *(int *)data);
	}

	_dict(dict);
}

Result:

dict[LiSi] = 2
dict[QianQi] = 5
dict[SunLiu] = 4
dict[WangWu] = 3
dict[ZhangSan] = 1
dict[hello] = 100

Source Code Analysis

dict Structure

All the structures of the dict container are implicit, which means that the structure members cannot be directly accessed. This way ensures the independence and security of the module and prevents external calls from modifying the structure members and thus destroying the dict storage structure. Therefore, the dict parser only leaves a single dict declaration in the header file, and the definitions of the structures are in the source file. Only the methods provided by the dict container can be used to operate on dict objects. The dict type declaration:

typedef struct DICT *dict_t;

When using it, just use dict_t.

/* dict type define */
typedef struct DICT
{
	groove_t *base;				/* base address for groove data */
	void *error;				/* error space */
	int vsize;					/* size of value */
	unsigned int size;			/* size of dict */
	unsigned int capacity;		/* capacity of dict */
	unsigned int it;			/* iterator index */
} DICT;

The DICT structure contains 6 members. base is the base address of the hash table, error is the error area of dict (when randomly accessing a non-existent key, the address of this error area will be returned, so that when using the at method, it won't operate on invalid memory addresses), size is the size (length) of dict, vsize is the size of each data, capacity is the capacity of the hash table, and it is used to record the current accessed hash table index during iterator traversal. The main problems that the dict container needs to solve are hash collisions and the adjustment of the hash table capacity.

/* dict node type define */
typedef struct 
{
	unsigned int hash;			/* hash value */
	char *key;					/* key */
	void *value;				/* value */
} GROOVE, *groove_t;

In dict, data is stored through an array. Each array item only stores the address of each slot (groove), and each slot stores three parts of content: the hash value of the current slot in the current hash table (for faster lookup of key-value pairs with hash collisions), and the actual stored key-value pairs.

+------+------------------------+------------------------+
| hash |           key          |          value         |
+------+------------------------+------------------------+
|...  |          ...          |          ...          |
+------+------------------------+------------------------+
|...  |          ...          |          ...          |
+------+------------------------+------------------------+

dict Creation and Deletion

dict_t dict_create(int vsize)
{
	dict_t dict;
	if (vsize <= 0) return NULL;
	dict = (dict_t)malloc(sizeof(DICT));
	if (!dict) return NULL;
	dict->error = malloc(vsize);
	if (!dict->error) { free(dict); return NULL; }
	dict->base = NULL;
	dict->vsize = vsize;
	dict->size = 0;
	dict->capacity = 0;
	dict->it = 0;
	return dict;
}

The creation of dict only creates an empty dict and initializes the basic parameters. The deletion method is used to release the memory space allocated during the operation of dict.

dict Insertion

void* dict_insert(dict_t dict, const char *key, void *value)
{
	groove_t groove = NULL;
	unsigned int hash = 0, index;
	int len = 0;

	/*
	Check the validity of the input parameters.
	 */
	if (!dict) return NULL;
	if (!key) return NULL;

	/*
	Check if the size has exceeded 3/4 of the hash table capacity. If it exceeds 3/4, resize the hash table.
	Because when it exceeds 3/4, it means the hash table is quite full and the lookup efficiency will be greatly reduced. So some free space in the hash table should be maintained.
	The resizing here changes in powers of 2, such as growing as 4, 8, 16, 32...
	 */
	/* the current capacity affects the search rate and needs to be expanded */
	if (dict->size >= ((dict->capacity >> 2) + (dict->capacity >> 1))) /* size exceeds 3/4 of capacity */
	{
		/*
		During the resizing process, first allocate a new space with a new capacity,
		then rehash and insert the members in the old hash table back into the new hash table one by one.
		*/
		/* allocate new hash table space */
		if (!dict_resize(dict, dict->capacity < MIN_CAPACITY? MIN_CAPACITY : dict->capacity << 1)) return NULL;
	}

	/*
	Calculate the hash value of the inserted key (the bkdr hash algorithm is used by default), and take the modulus of the hash value with the hash table capacity.
	Then check if there is a hash collision. If there is a collision, use the open addressing method, linearly adding 1 to search for free space backward.
	During the backward search, check the hash value in the corresponding slot. A hash value of -1 indicates that it has been removed by the `erase` method but the actual slot has not been deleted, and it is also regarded as an available slot.
	 */
	/* find a free groove */
	len = strlen(key);
	hash = hash_bkdr((void *)key, len) % dict->capacity;
	index = hash;
	while (dict->base[index] && dict->base[index]->hash!= -1)
	{
		index = (index + 1) % dict->capacity;
		if (index == hash) return NULL;
	}

	/*
	Allocate the space for the slot.
	 */
	/* space allocation */
	groove = dict->base[index];
	if (!groove) groove = (groove_t)malloc(sizeof(GROOVE));
	if (!groove) return NULL;
	groove->key = (char *)malloc(len + 1);
	if (!groove->key) { if (!dict->base[index]) free(groove); return NULL; }
	groove->value = malloc(dict->vsize);
	if (!groove->value) { free(groove->key), groove->key = NULL; if (!dict->base[index]) free(groove); return NULL; }

	/* assign */
	groove->hash = hash; // Record the hash value of the current slot
	strcpy(groove->key, key);
	if (value) memcpy(groove->value, value, dict->vsize);

	/* insert */
	dict->base[index] = groove;
	dict->size++;

	return groove->value;
}

Actual storage example:

value = 100; dict_insert(dict, "hello", &value);
value = 1; dict_insert(dict, "ZhangSan", &value);
value = 2; dict_insert(dict, "LiSi", &value);
value = 3; dict_insert(dict, "WangWu", &value);
value = 4; dict_insert(dict, "SunLiu", &value);
value = 5; dict_insert(dict, "QianQi", &value);
value = 8; dict_insert(dict, "WangBa", &value);
value = 9; dict_insert(dict, "LiuJiu", &value);

Hash table:

+------+------------------------+------------------------+
| hash |           key          |          value         |
+------+------------------------+------------------------+
|   0  |         ZhangSan       |            1           |
+------+------------------------+------------------------+
|   1  |          QianQi        |            5           |
+------+------------------------+------------------------+
|      |                        |                        |
+------+------------------------+------------------------+
|      |                        |                        |
+------+------------------------+------------------------+
|   4  |          SunLiu        |            4           |
+------+------------------------+------------------------+
|      |                        |                        |
+------+------------------------+------------------------+
|   6  |          WangBa        |            8           |
+------+------------------------+------------------------+
|   7  |           LiSi         |            2           |
+------+------------------------+------------------------+
|      |                        |                        |
+------+------------------------+------------------------+
|   9  |          WangWu        |            3           |
+------+------------------------+------------------------+
|      |                        |                        |
+------+------------------------+------------------------+
|      |                        |                        |
+------+------------------------+------------------------+
|      |                        |                        |
+------+------------------------+------------------------+
|      |                        |                        |
+------+------------------------+------------------------+
|  14  |          hello         |           100          |
+------+------------------------+------------------------+
|  14  |         LiuJiu         |            9           |
+------+------------------------+------------------------+

Removal from the dict

int dict_erase(dict_t dict, const char *key)
{
    groove_t groove;
    unsigned int index, next;

    // Check if the dictionary pointer is NULL. If so, return 0 to indicate failure of the removal operation.
    if (!dict) return 0;
    // Check if the key pointer is NULL. If so, also return 0 as the operation can't proceed without a valid key.
    if (!key) return 0;

    /*
    To remove an element, we first need to find the position of the key in the hash table.
    The `find_index` function follows a similar logic as the insertion method described earlier. If the key is not found, it will return -1.
    */
    index = find_index(dict, key);
    // If the `find_index` function returns -1, meaning the key was not found in the hash table, return 0 to indicate failure of the removal.
    if (index == -1) return 0;
    groove = dict->base[index];

    // If the `key` pointer in the `groove` structure is not NULL, free the memory occupied by the key string and set the pointer to NULL to avoid dangling pointers.
    if (groove->key) { free(groove->key); groove->key = NULL; }
    // Similarly, if the `value` pointer in the `groove` structure is not NULL, free the memory occupied by the value and set the pointer to NULL.
    if (groove->value) { free(groove->value); groove->value = NULL; }
    // Instead of actually freeing the entire `groove` structure, we just set its `hash` value to -1. A hash value of -1 indicates that this slot is logically removed (although the structure itself isn't deallocated yet). In the hash table, a hash value of -1 is not used for normal elements.
    groove->hash = -1;
    // Decrease the size of the dictionary by 1 to reflect the removal of one key-value pair.
    dict->size--;

    /*
    Just like with the insertion method where the size and capacity need to be adjusted, the same applies to the removal operation.
    When the number of elements in the dictionary (`size`) is less than one-fourth of the capacity, we resize the hash table by reducing its capacity to half. This way, the dictionary will occupy about half of its previous capacity.
    */
    if (dict->capacity > MIN_CAPACITY && dict->size <= (dict->capacity >> 2)) /* size less than 1/4 of capacity */
    {
        dict_resize(dict, dict->capacity >> 1);
    }

    // If all the above operations are completed successfully, return 1 to indicate that the removal operation was successful.
    return 1;
}

This function dict_erase is mainly responsible for removing a key-value pair from the dict data structure. It first locates the position of the key in the hash table. Then, it frees the memory associated with the key and value, marks the slot as logically removed, adjusts the size of the dictionary, and finally may resize the hash table if necessary to optimize memory usage and maintain the efficiency of the hash table structure.