Grouping text rows in equal-sized groups
asked 8 hours ago by @qa-h7sl8pmg39zkiyewusxm 0 rep · 151 views
I have a table containing a lot of typed/named items and I want to create a (materialized?) view to group them together so that a UI can later make it easier for users to select.
Description
The goal is:
If there are N elements of the same type, make sqrt(N) groups each of size sqrt(N). Instead of having e.g. 10k elements to scroll through, the UI would show 100 lines (the groups), each of which can be expanded into 100 lines (the items inside each group). 1 more click to make but a lot less scrolling.
The UI will display group names as the range containing the named elements inside it.
For instance, if a group contains all the items whose names start with A and B (but not C and beyond), then the group should be named A-B.
If a group contains all the elements starting with C until e.g. EG, but elements starting with EH are the the next group, the group should be name C-EG.
The idea is:
Instead of comparing the texts and deciding where the lists should be split into groups, I am using a function to score the items. Why I do that will make sense in a minute.
Anyway, a heuristics turns that whole mess into a relatively usual problem to solve, that is searching for the local minimums of a function.
Here is the twist (that explains why I am building a function):
I could make it so that each group has exactly the same size (ignoring the rounding of sqrt(N) to the nearest integer) but instead, I want to find a compromise between group size and length of the group name. As groups are named with the pattern <start of group> - <end of group>, I ideally want <start of group> and <end of group> to be of length 3 at most.
Since I am fine with counts being "reasonably" different from sqrt(N), I am also fine with them being slightly out of date. My idea therefore is to feed the result of this query into a materialized view that is refreshed e.g. every night.
The function works that way:
- For each item, count how many first letters it has in common with the item immediately before it.
- Take a SIN squared function, with parameters so that it oscillate sqrt(N) times in the range [1-N].
- Insert parametrization into the 2 steps above (add constants, multiply by constant, raise to some power ... whatever seems to give good results once I figure the problem out).
- Multiply the 2 together to get a "score"
- Look for a minimum score, it will act as the end of the group.
If that helps, here is what it looks like for 100 items (names generated randomly with the below queries), with SIN in blue and the score in black.
If I want, e.g. shorter group names at the cost of consistent group sizes, I can tweak the score so that the SIN is not just squared anymore but instead raised to a higher power: that will flatten the troughs of the sin-wave / sharpen the crests.
By doing so, more items will be place inside the troughs, therefore I will get more chances that a shorter but off boundary will be preferred over a longer boundary that lands on sqrt(N).
Here is a zoomed in version of the 2 charts:
Here, the heuristics determins the list should be split around rtetx.
The group right before this split will be named [...] - RC and the group after this split will be named RT - [...].
Raising the SIN function to a higher power (note: the software I used to draw the chart interpolates the points into a smooth curve. The curve is only correct on the points, not between them) increases the tolerance, like I said. Now, the group can equally be split around rcEYu or slkkE.
Now, the group names will be either:
[...] - QandR - [...]- Or
[...] - RandS - [...]
Now, 1 letter is required for the common boundary of both groups whereas 2 letters were required before.
Queries
I tried the following query
- Building the data sample with random data
CREATE TABLE Test (
ItemType INTEGER,
ItemName VARCHAR
);
WITH Chars(chars) AS (
SELECT ARRAY[
'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
]
)
INSERT INTO Test
SELECT t, string_agg(chars[random(0, 165)], '')
FROM chars
CROSS JOIN generate_series(1, 3) t
CROSS JOIN generate_series(1, 5) l
CROSS JOIN generate_series(1, 400) i
GROUP BY t, i;
- Finding the text to use in group names (query takes < 5sec, suggests it needs a materialized view).
WITH Items(ItemType, ItemName) AS (
SELECT ItemType, TRIM(BOTH FROM upper(ItemName)) FROM Test
), stats1(ItemType, NameStarts) AS (
SELECT DISTINCT Items.ItemType, TRIM(BOTH FROM upper(left(ItemName, generate_series(1, 3)))) AS btrim
FROM Items
), stats2(ItemType, NameStarts, NameCount) AS (
SELECT s1.ItemType, s1.NameStarts,
count(*) FILTER (WHERE left(i.ItemName, length(s1.NameStarts)) <= s1.NameStarts) AS count
FROM stats1 s1
JOIN items i ON s1.ItemType = i.ItemType
GROUP BY s1.ItemType, s1.NameStarts
), stats3(ItemType, NameStarts, NameCount, totalnamecount) AS (
SELECT stats2.ItemType,
min(stats2.NameStarts) AS min,
stats2.NameCount,
max(stats2.NameCount) OVER (PARTITION BY stats2.ItemType) AS max
FROM stats2
GROUP BY stats2.ItemType, stats2.NameCount
), stats4(ItemType, NameStarts, NameCount, score, groupnumber) AS (
SELECT stats3.ItemType,
stats3.NameStarts,
stats3.NameCount,
length(stats3.NameStarts) * (.05 + .95 * power(sin(pi() * stats3.NameCount / sqrt(stats3.totalnamecount)::integer), 2)),
((stats3.NameCount - sqrt(stats3.totalnamecount) / 2) / sqrt(stats3.totalnamecount))::integer AS int4
FROM stats3
WHERE stats3.NameCount > 4 AND (stats3.totalnamecount - stats3.NameCount) > 4 AND stats3.totalnamecount > 50
)
SELECT s1.ItemType, s1.NameStarts, s1.NameCount
FROM stats4 s1
JOIN (
SELECT stats4.ItemType, min(stats4.score) AS bestscore, stats4.groupnumber
FROM stats4
GROUP BY stats4.ItemType, stats4.groupnumber
) s2 ON s1.ItemType = s2.ItemType AND s1.groupnumber = s2.groupnumber
WHERE s1.score = s2.bestscore
Problem
In the above query, I have added the NameCount column to track how many items belong to 1 group. As the sample data consists in 3 groups of 400 items each, I expect 3 types, each of 20 groups, each group having around 20 items (thus this 3rd columns should show 0, 20, 40, 60, ...).
However, while I get the correct rows, I also get additional groups with 1-2 items in it. For instance, I will see:
- For the first group:
NameCount= 20 - For the second group:
NameCount= 39 - For the third group:
NameCount= 41
When this happens, that means the function calculates 2 group boundaries with equal value; one of them needs to go.
I can't seem to find an easy way to remove those extra records. Has anyone got an idea?
Edit: To clarify what I am after (some of which will be a repetition, bear with me):
- An application displays the result of the query in its UI. It can (and probably will) process the output of the query.
- Ultimately, my goal is to decrease the amount of scrolling required to view any item in the list by grouping them into a 2-level hierarchy.
- for N items, groups should have sqrt(N) groups, each with around sqrt(N) items. The number of item in each group depends on the heuristic being strict (aims at sqrt(N)) or tolerant (aims at a compromise between sqrt(N) and the length of the group name).
- sqrt(N) is not guaranteed to be an integer.
For 10000, I expect exactly 100 groups of 100 items.
For 10001, I can accept having 100 or 101 groups. If you choose to create 101 groups, I do not mind items being spread onto all those groups.
The approach in my query (at least in 1 version before I try to fix its issues) was to ignore the issue completely: 100 items per group in the first 100 group, 1 item in the final group (it is only fair the final group is smaller as it appears last and requires more scrolling to reach than any of the other groups). - Data comes from real-world name so the chances I ever get a list of items all sharing the same first 3 letters is basically 0. Even if there was e.g. 3 groups worth of items sharing the same first 3 letters, returning 1 group is OK. The application will deal with the rest.
- There are 3 reasons why I currently save data in a materialized view:
- It is faster for the application.
- New data is not inserted often enough in the table for it to matter (think about 10 records inserted in a 10k table, that will often not change the group).
- Users visiting the same list several times in a day will be guaranteed to get the same grouping. It is only after they go to sleep that the grouping will be different in the UI.
- Because of the delay between data insertion and grouping update, the query does not need to (and probably should not) return both the lower and upper bound of each group.
For example, if 2 consecutive groups areA-BandE-F, the application will show contiguous segments, such asA-DandE-F. If it did not, then items starting withCwould not be findable.