esProc External-Memory Computing: Indexed Files

Indexed Sequences in esProc discusses how to create an index of a table sequence or a text file in esProc to greatly speed up the query on desired fields. But how to deal with a data-intensive file whose index table cannot be loaded into the memory? Here let’s look at how to increase the speed of big data handling by creating a suitable index for a big file.

With esProc only the binary files can use external index files, whose creation will be explained in the following through the binary file PersonnelInfo used by esProc External-Memory Computing: Binary Files. The file contains personnel data, including 100,000 records in 6 fields.

f.index(f_i,k_i,…) function is used to create an index file f_ifor a binary file f on fields k_i,…. For example, you can create an index file PIndex-N on PersonnelInfo’s Name field using the following code:

	A
1	=file(“PersonnelInfo”)
2	=file(“PIndex-N”)
3	=A1.index(A2,Name)

After an index is created, A3 returns true. If the original file changes, the index file needs to be re-created. If data is only added to the original file, you can use @a option when calling f.index() to append new index to the original index file by handling only the newly-added data. In this way, efficiency will increase.

It is convenient to get the number of records according to an index file using f_i.icount() function. Without an index file, you need to use the groups function to perform the aggregate by traversal, which is much slower:

	A	B	C
1	=file(“PersonnelInfo”)	=file(“PIndex-N”)	=now()
2	=A1.cursor@b()
3	=A2.groups(;count(~):Count)	=interval@ms(C1,now())	=now()
4	=B1.icount()	=interval@ms(C3,now())

Both A3 and A4 get the same result of 100,000 records, while B3 and B4 compare the time (milliseconds) they spend in achieving that:

esProc_external_file_index_3

Actually you can directly obtain the number of records from an index file, without traversing all the records.

After an index file on names is created, you can use it to facilitate the query on Name field. To retrieve data satisfying the condition x from the data file f using f.icursor(f_i,x; F_i,…) function, use the index file f_iduring the query. If no field F_i is specified for retrieval, cs.fetch() will return the complete records. Now let’s compare the performance of query with an index file and that without one:

	A	B	C
1	=file(“PersonnelInfo”)	=file(“PIndex-N”)	=B1.icount()
2	=A1.cursor@b()	>A2.skip(rand(C1))
3	=A2.fetch@x(1)		=now()
4	=A1.cursor@b()	=A4.select(Name==A3.Name)	=B4.fetch()
5	=interval@ms(C3,now())		=now()
6	=A1.icursor(B1,Name==A3.Name)	=A6.fetch()
7	=interval@ms(C5,now())

First, skip some records randomly in B2 to get a random record in A3:

esProc_external_file_index_5

Then find all records with the same name from the data file using the two methods. In A3, the randomness of the data has little impact on the query performance because all records need to be traversed to return the eligible records. The 4^th line filters the cursor by the name. The 6^th line finds out the eligible records using the icursor function according to the index file. C4 and B6 get the same result after execution:

esProc_external_file_index_6

A5 and A7 respectively calculate the time (milliseconds) the two methods spend:

esProc_external_file_index_7

As you can see the index-based query runs extremely faster. Note: As the index file PIndex-N is created on Name field, you can only perform the index-based query by Name field.

As you may find, there are many persons who have the same name in the data file PersonnelInfo. If you want to find a certain person, you need to find the person according to other information, such as birthday, except for the name. And if you make the query according to both the name and the birthday, you need to create a multi-field index. For example, you could create a new index file PIndex-NB:

	A
1	=file(“PersonnelInfo”)
2	=file(“PIndex-NB”)
3	=A1.index(A2,Name,Birthday)

Next let’s take a look at the performance of finding persons according to both name and birthday with an index file and that without one. Since there’s a narrow chance that persons have the same name and birthday at the same time, it’s reasonable to return the record of the first found person only. To eliminate the randomness from the result, the query will be repeated 100 times by the following program:

	A	B	C
1	=file(“PersonnelInfo”)	=file(“PIndex-NB”)	=B1.icount()
2	=create(Name,Birthday)
3	for 100	=A1.cursor@b()	>B3.skip(rand(C1))
4		=B3.fetch@x(1)	>A2.insert(0,B4.Name,B4.Birthday)
5		[]	=now()
6	for A2	=A1.cursor@b()	=B6.select(Name==A6.Name && Birthday==A6.Birthday)
7		>B5=B5\|C6.fetch@x(1)
8	=interval@ms(C5,now())	[]	=now()
9	for A2	=A1.icursor(B1,Name==A9.Name && Birthday==A9.Birthday)
10		>B8=B8\|B9.fetch@x(1)
11	=interval@ms(C8,now())

The 3^rd and 4^th lines perform the query for 100 times, skipping some records randomly each time. After that A2 gets the names and birthdays of 100 persons:

esProc_external_file_index_10

The 6^th and 7^th lines find the records of persons whose names and birthdays meet the specified condition. The 9^th and 10^th line do the filtering with icursor function based on the index file PIndex-NB just created. B5 and B8 get the same result:

esProc_external_file_index_11

A8 and A11 calculate the time (milliseconds) the two methods spends:

esProc_external_file_index_12

As in the preceding comparison, an index has significantly increased the query efficiency.

The index file PIndex-NB indexes two fields – Name and Birthday. But you can use only one of them for the query. For example:

	A	B	C
1	=file(“PersonnelInfo”)	=file(“PIndex-N”)	=file(“PIndex-NB”)
2	=A1.cursor@b()	=B1.icount()	>A2.skip(rand(B2))
3	=A2.fetch(1)	>A2.close()	=now()
4	=A1.cursor@b()	=A4.select(Name==A3.Name)	=B4.fetch()
5	=interval@ms(C3,now())		=now()
6	=A1.icursor(B1,Name==A3.Name)	=A6.fetch()
7	=interval@ms(C5,now())		=now()
8	=A1.icursor(C1,Name==A3.Name)	=A8.fetch()
9	=interval@ms(C7,now())

Three methods are used to issue the query on data: common select query, single-key index PIndex-N and two-key index PIndex-NB. A5, A7 and A9 respectively calculate the time the three methods spend:

esProc_external_file_index_14

As you can see, the query using an index finds the eligible records much faster than one without using an index. Though only one or more certain fields in a multi-key index can be used in a query, it’s not as efficient as an index with fewer keys. The fact is that, in the above example, a query using the index file PIndex-NB is slower if Birthday field is used to find the desired data. On the other hand, a multi-key index file has a larger size. In short, you need to be flexible in creating an index for different scenarios.

Besides, you can use f.ncursor(f_i,a:b; F_i,…) function to find data whose values are within the range of [a,b] based on an index. For example:

	A	B	C
1	=file(“PersonnelInfo”)	=file(“PIndex-N”)
2	=A1.ncursor(B1,”Sa”:”Sz”;ID,Name,Birthday,City,State)	=A2.fetch()

A2 finds the information of persons whose names begin from Sa to Se, using certain fields of the index file. Below is the result of B2:

esProc_external_file_index_16

Note that the order of a:b in ncursor function should be consistent with their order in the index. By the way, if you don’t want a closed interval, use @l or @ r options, which represent a left half-open interval and a right half-open interval.

Hi，You need to fill in the Username and Email！