esProc File Computing: Cascaded Foreign Key Relationships

There are some computational tasks of processing structured data that involves multi-table data sources joined through cascaded foreign key relationships. For example, you want to find particular orders from the order management system, where the following structure of cascaded foreign key relationships is involved. Computational target: Find orders in which the suppliers and customers are from the same region; group these orders by customers and compute total order amount and number of orders in each group.

Data structure is shown in the following figure:

esProc_file_foreign_key_1

To perform the computation with the database, you use the following SQL script:

select cid,count(ordered),sum(price*quantity) from orders

left join customer on orders.cid=customer.cid

left join region a1 on customer.city=a1.city

left join product on orders.pid=product.pid

left join supplier on product.pid=supplier.sid

left join region a2 on supplier.city=a2.city

where a1.region=a2.region and to_char(orderdate,’YYYY’)=?

Group by cid

Now let’s look at how it is solved using esProc file computing approach.

Define the cellet parameter year and perform related computing with cascaded foreign keys relationships using the following esProc script orders.dfx:

	A
1	=file(“D:/files/orders.b”).cursor@b()
2	=A1.select(string(orderdate, “yyyy”)==string(year))
3	=file(“D:/files/customer.b”).import@b()
4	=file(“D:/files/product.b”).import@b()
5	=file(“D:/files/supplier.b”).import@b()
6	=file(“D:/files/region.b”).import@b()
7	=A3.switch(city,A6:city)
8	=A5.switch(city,A6:city)
9	=A4.switch(sid,A5:sid)
10	=A2.switch(pid,A9:pid)
11	=A10.switch(cid,A7:cid)
12	=A11.select(pid.sid.city.region==cid.city.region)
13	=A12.groups(cid.cid:cid;count(oid):count,sum(price*quantity):amount)
14	result A13

A1: Create a file cursor for local Orders table to process data in segments and avoid memory overflow.

A2: Find orders for a certain year. year is a pre-specified cellset parameter.

A3-A6: Import dimension tables from local machine – Customer, Product, Supplier and Region.

A7: Switch values of the foreign key “city” in the table sequence Customer into corresponding records references in the table sequence Region.

A8: Switch values of the foreign key “city” in the table sequence Supplier into corresponding records references in the table sequence Region.

A9: Switch values of the foreign key “sid” in the table sequence Product into corresponding records references in the table sequence Supplier.

A10: Switch values of the foreign key “pid” in the table sequence Orders into corresponding records references in the table sequence Product.

A11: Switch values of the foreign key “cid” in the table sequence Orders into corresponding records references in the table sequence Customer.

A12: Filter records of the table sequence Orders by the criterion. With switches from A7 to A11, you can write the filtering criterion as pid.sid.city.region==cid.city.region.

A13: Perform group and aggregate based on the filtering result.

A14: Return A13 to an external program.

If there is the great volume of data in the fact table, Orders, you can use esProc multithreaded-parallel-processing approach to increase processing performance. As for the dimension tables, they are small enough to be entirely loaded into memory. All dimension tables can be accessed by every thread.

esProc script for doing this:

	A	B	C
1	=file(“D:/files/customer.b”).import@b()
2	=file(“D:/files/product.b”).import@b()
3	=file(“D:/files/supplier.b”).import@b()
4	=file(“D:/files/region.b”).import@b()
5	=A1.switch(city,A4:city)	=A3.switch(city,A4:city)	=A2.switch(sid,A3:sid)
6	4
7	fork to(A6)	=file(“D:/files/orders.b”).cursor@bz(;,A7:A6)
8		=B7.switch(pid,C5:pid)
9		=B8.switch(cid,A5:cid)
10		=B9.select(string(orderdate, “yyyy”)==string(year) && pid.sid.city.region==cid.city.region)
11		=B10.groups(cid.cid:cid;count(oid):count,sum(price*quantity):amount)
12		result B11
13	=A7.conj()
14	=A13.groups(cid;sum(count):count,sum(amount):amount)
15	result A14

A1-A4: Import dimension tables from local machine – Customer, Product, Supplier and Region.

A5: Switch values of the foreign key “city” in the table sequence Customer into corresponding records references in the table sequence Region.

B5: Switch values of the foreign key “city” in the table sequence Supplier into corresponding records references in the table sequence Region.

C5: Switch values of the foreign key “sid” in the table sequence Product into corresponding records references in the table sequence Supplier.

A6: Set the number of parallel tasks.

A7: Execute parallel processing according to A6.

B7: Create a file cursor for local Orders table, with each thread processing a part of the data.

B8: Switch values of the foreign key “pid” in the table sequence Orders into corresponding records references in the table sequence Product.

B9: Switch values of the foreign key “cid” in the table sequence Orders into corresponding records references in the table sequence Customer.

B10: Filter records of the table sequence Orders by the specified criterion and year. With all those switches, you can write the filtering criterion as pid.sid.city.region==cid.city.region.

B11: Data grouping and aggregation.

B12: Each thread returns its result got in B11.

A13: Merge A7’s results.

A14: Group and aggregate the merged data.

A15: Return A14 to an external program.

One point to note is that you can import data from either a database or a file using esProc script. So you can decide where the fact table and dimension tables should be stored as needed. For example, store tables whose data is manipulated little in the file system and put others with frequent data manipulation in the database. The more the data is stored in the file system, the higher the performance and the less the pressure on the database.

Hi，You need to fill in the Username and Email！