Skip to content
Learn Netverks
7

Grouping text rows in equal-sized groups

asked 8 hours ago by @qa-h7sl8pmg39zkiyewusxm 0 rep · 151 views

sql postgresql

I have a table containing a lot of typed/named items and I want to create a (materialized?) view to group them together so that a UI can later make it easier for users to select.

Description

The goal is:

If there are N elements of the same type, make sqrt(N) groups each of size sqrt(N). Instead of having e.g. 10k elements to scroll through, the UI would show 100 lines (the groups), each of which can be expanded into 100 lines (the items inside each group). 1 more click to make but a lot less scrolling.

The UI will display group names as the range containing the named elements inside it.
For instance, if a group contains all the items whose names start with A and B (but not C and beyond), then the group should be named A-B.
If a group contains all the elements starting with C until e.g. EG, but elements starting with EH are the the next group, the group should be name C-EG.

The idea is:

Instead of comparing the texts and deciding where the lists should be split into groups, I am using a function to score the items. Why I do that will make sense in a minute.

Anyway, a heuristics turns that whole mess into a relatively usual problem to solve, that is searching for the local minimums of a function.

Here is the twist (that explains why I am building a function):

I could make it so that each group has exactly the same size (ignoring the rounding of sqrt(N) to the nearest integer) but instead, I want to find a compromise between group size and length of the group name. As groups are named with the pattern <start of group> - <end of group>, I ideally want <start of group> and <end of group> to be of length 3 at most.
Since I am fine with counts being "reasonably" different from sqrt(N), I am also fine with them being slightly out of date. My idea therefore is to feed the result of this query into a materialized view that is refreshed e.g. every night.

The function works that way:

  1. For each item, count how many first letters it has in common with the item immediately before it.
  2. Take a SIN squared function, with parameters so that it oscillate sqrt(N) times in the range [1-N].
  3. Insert parametrization into the 2 steps above (add constants, multiply by constant, raise to some power ... whatever seems to give good results once I figure the problem out).
  4. Multiply the 2 together to get a "score"
  5. Look for a minimum score, it will act as the end of the group.

If that helps, here is what it looks like for 100 items (names generated randomly with the below queries), with SIN in blue and the score in black.

If I want, e.g. shorter group names at the cost of consistent group sizes, I can tweak the score so that the SIN is not just squared anymore but instead raised to a higher power: that will flatten the troughs of the sin-wave / sharpen the crests.
By doing so, more items will be place inside the troughs, therefore I will get more chances that a shorter but off boundary will be preferred over a longer boundary that lands on sqrt(N).

Here is a zoomed in version of the 2 charts:

Here, the heuristics determins the list should be split around rtetx. The group right before this split will be named [...] - RC and the group after this split will be named RT - [...].

Raising the SIN function to a higher power (note: the software I used to draw the chart interpolates the points into a smooth curve. The curve is only correct on the points, not between them) increases the tolerance, like I said. Now, the group can equally be split around rcEYu or slkkE. Now, the group names will be either:

  • [...] - Q and R - [...]
  • Or [...] - R and S - [...]

Now, 1 letter is required for the common boundary of both groups whereas 2 letters were required before.


Queries

I tried the following query

  1. Building the data sample with random data
CREATE TABLE Test (
 ItemType INTEGER,
 ItemName VARCHAR
);

WITH Chars(chars) AS (
 SELECT ARRAY[
     'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'
 ]
)
INSERT INTO Test
SELECT t, string_agg(chars[random(0, 165)], '')
FROM chars
CROSS JOIN generate_series(1, 3) t
CROSS JOIN generate_series(1, 5) l
CROSS JOIN generate_series(1, 400) i
GROUP BY t, i;
  1. Finding the text to use in group names (query takes < 5sec, suggests it needs a materialized view).
WITH Items(ItemType, ItemName) AS (
 SELECT ItemType, TRIM(BOTH FROM upper(ItemName)) FROM Test
), stats1(ItemType, NameStarts) AS (
 SELECT DISTINCT Items.ItemType, TRIM(BOTH FROM upper(left(ItemName, generate_series(1, 3)))) AS btrim
 FROM Items
 ), stats2(ItemType, NameStarts, NameCount) AS (
SELECT s1.ItemType, s1.NameStarts,
 count(*) FILTER (WHERE left(i.ItemName, length(s1.NameStarts)) <= s1.NameStarts) AS count
 FROM stats1 s1
 JOIN items i ON s1.ItemType = i.ItemType
 GROUP BY s1.ItemType, s1.NameStarts
), stats3(ItemType, NameStarts, NameCount, totalnamecount) AS (
SELECT stats2.ItemType,
 min(stats2.NameStarts) AS min,
 stats2.NameCount,
 max(stats2.NameCount) OVER (PARTITION BY stats2.ItemType) AS max
 FROM stats2
 GROUP BY stats2.ItemType, stats2.NameCount
), stats4(ItemType, NameStarts, NameCount, score, groupnumber) AS (
 SELECT stats3.ItemType,
 stats3.NameStarts,
 stats3.NameCount,
     length(stats3.NameStarts) * (.05 + .95 * power(sin(pi() * stats3.NameCount / sqrt(stats3.totalnamecount)::integer), 2)),
 ((stats3.NameCount - sqrt(stats3.totalnamecount) / 2) / sqrt(stats3.totalnamecount))::integer AS int4
 FROM stats3
 WHERE stats3.NameCount > 4 AND (stats3.totalnamecount - stats3.NameCount) > 4 AND stats3.totalnamecount > 50
)
SELECT s1.ItemType, s1.NameStarts, s1.NameCount
FROM stats4 s1
JOIN (
 SELECT stats4.ItemType, min(stats4.score) AS bestscore, stats4.groupnumber
 FROM stats4
 GROUP BY stats4.ItemType, stats4.groupnumber
) s2 ON s1.ItemType = s2.ItemType AND s1.groupnumber = s2.groupnumber
WHERE s1.score = s2.bestscore

Problem

In the above query, I have added the NameCount column to track how many items belong to 1 group. As the sample data consists in 3 groups of 400 items each, I expect 3 types, each of 20 groups, each group having around 20 items (thus this 3rd columns should show 0, 20, 40, 60, ...).
However, while I get the correct rows, I also get additional groups with 1-2 items in it. For instance, I will see:

  • For the first group: NameCount = 20
  • For the second group: NameCount = 39
  • For the third group: NameCount = 41

When this happens, that means the function calculates 2 group boundaries with equal value; one of them needs to go.

I can't seem to find an easy way to remove those extra records. Has anyone got an idea?


Edit: To clarify what I am after (some of which will be a repetition, bear with me):

  1. An application displays the result of the query in its UI. It can (and probably will) process the output of the query.
  2. Ultimately, my goal is to decrease the amount of scrolling required to view any item in the list by grouping them into a 2-level hierarchy.
  3. for N items, groups should have sqrt(N) groups, each with around sqrt(N) items. The number of item in each group depends on the heuristic being strict (aims at sqrt(N)) or tolerant (aims at a compromise between sqrt(N) and the length of the group name).
  4. sqrt(N) is not guaranteed to be an integer.
    For 10000, I expect exactly 100 groups of 100 items.
    For 10001, I can accept having 100 or 101 groups. If you choose to create 101 groups, I do not mind items being spread onto all those groups.
    The approach in my query (at least in 1 version before I try to fix its issues) was to ignore the issue completely: 100 items per group in the first 100 group, 1 item in the final group (it is only fair the final group is smaller as it appears last and requires more scrolling to reach than any of the other groups).
  5. Data comes from real-world name so the chances I ever get a list of items all sharing the same first 3 letters is basically 0. Even if there was e.g. 3 groups worth of items sharing the same first 3 letters, returning 1 group is OK. The application will deal with the rest.
  6. There are 3 reasons why I currently save data in a materialized view:
    1. It is faster for the application.
    2. New data is not inserted often enough in the table for it to matter (think about 10 records inserted in a 10k table, that will often not change the group).
    3. Users visiting the same list several times in a day will be guaranteed to get the same grouping. It is only after they go to sleep that the grouping will be different in the UI.
  7. Because of the delay between data insertion and grouping update, the query does not need to (and probably should not) return both the lower and upper bound of each group.
    For example, if 2 consecutive groups are A-B and E-F, the application will show contiguous segments, such as A-D and E-F. If it did not, then items starting with C would not be findable.

Comments on this question (0)

Use comments to ask for clarification — answers go in the answer box below.

Log in to comment on this question.

1

I've included some performance improvements:

  • materialise the TRIM(UPPER()) and then index on it
  • avoid cross join on generate(1,3)
  • avoid joins and aggregations, by using window functions instead
  • match the window clause to the index where possible

The two main logical issues it addresses are

  • Avoiding the break-points being close to each other
  • Avoiding selecting two minima in a single "group"

The first issue is addressed by offsetting everything by "half a page", and not breaking in the first or last "pages".

string old_group old break point new group new break point aaa 0 0 bbb 0 0 ccc 0 0 ddd 0 1 eee 0 yes 1 fff 1 yes 1 yes ggg 1 1 hhh 1 1 iii 1 2 jjj 1 2

(These a total fictional values to illustrate the change.)

Old method breaks two pages once each, and yielding three pages, one being tiny:

  • aaa -> eee
  • fff -> fff
  • ggg -> jjj

New method breaks one fewer page, and biases the break point the the centre:

  • aaa -> fff
  • ggg -> jjj

The second issue is addressed using ROW_NUMBER() = 1 instead of score = MIN(score).

CREATE TABLE Items (
  item_type       INTEGER,
  item_name       VARCHAR(8),
  item_name_upper VARCHAR(8) GENERATED ALWAYS AS (TRIM(BOTH FROM UPPER(item_name))) STORED
);
CREATE INDEX ix_test_type_nameUpper ON items(item_type, item_name_upper);

SET LOCAL SEED = 0.5;

WITH Chars(chars) AS (
 SELECT ARRAY[
     'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
     'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
     '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
     'a', 'b', 'c', 'd', 'e',
     'a', 'b', 'c', 'd', 'e',
     'a', 'b', 'c', 'd', 'e',
     'a', 'b', 'c', 'd', 'e',
     'a', 'b', 'c', 'd', 'e'
 ]
)
INSERT INTO items
SELECT t, string_agg(chars[random(0, 300)], '')
FROM chars
CROSS JOIN generate_series(1, 3) t
CROSS JOIN generate_series(1, 5) l
CROSS JOIN generate_series(1, 421) i
GROUP BY t, i;
CREATE VIEW
  items_paginated AS
WITH
  look_around(item_type, item_name, type_row_count, rn, prev_name)
AS
(
  SELECT
    item_type,
    item_name_upper,
    COUNT(*) OVER (PARTITION BY item_type),
    ROW_NUMBER()                OVER (PARTITION BY item_type ORDER BY item_name_upper),
    LAG(item_name_upper, 1, '') OVER (PARTITION BY item_type ORDER BY item_name_upper)
  FROM
    items
),
  prefixed(item_type, item_name, type_row_count, rn, prev_name, name_prefix)
AS
(
  SELECT
    *,
    prefix
  FROM
    look_around
  CROSS JOIN LATERAL
  (
    SELECT
      CASE
        WHEN LEFT(item_name, 3) = LEFT(prev_name, 3) THEN NULL
        WHEN LEFT(item_name, 2) = LEFT(prev_name, 2) THEN LEFT(item_name, 3)
        WHEN LEFT(item_name, 1) = LEFT(prev_name, 1) THEN LEFT(item_name, 2)
                                                     ELSE LEFT(item_name, 1)
      END
  )
    AS prefix(prefix)
),
  paginated_and_scored(item_type, item_name, type_row_count, rn, name_prefix, group_count, group_size, theta, naive_group_id, score, new_group_marker, min_score)
AS
(
  SELECT
    item_type,
    item_name,
    type_row_count,
    rn,
    name_prefix,
    group_count,
    group_size,
    theta,
    page.id,
    page.score,
    -- Mark as new page start if lowest score in prelimary group
    -- > But never mark a new page start in first or last group
    CASE
      WHEN page.id = 0                                                                     THEN 0
      WHEN page.id = group_count                                                           THEN 0
      WHEN 1 = ROW_NUMBER() OVER (PARTITION BY item_type, page.id ORDER BY page.score)     THEN 1
                                                                                           ELSE 0
    END,
    MIN(page.score) OVER (PARTITION BY item_type, page.id)
  FROM
    prefixed
  CROSS JOIN LATERAL
  (
    -- Calculate the integer number of expected output groups
    -- > a group either exists, or it doesn't, so it has to be an integer value
    SELECT
      CASE
        WHEN type_row_count < 20
        THEN 1
        ELSE SQRT(type_row_count)
     END::INT
  )
    AS group_count(group_count)
  CROSS JOIN LATERAL
  (
    -- Calculate the average number of rows per group
    -- > FLOAT so that different groups can be different sizes such as 2,2,3
    -- > Later code takes care of "carrying the remainder" to augment subsequent group sizes
    SELECT
      type_row_count::FLOAT / group_count
  )
    AS group_size(group_size)
  CROSS JOIN LATERAL
  (
    SELECT
      -- Calculate the phase angle for the current row
      -- > Used by scoring function
      -- > start at pi()
      -- > complete one cycle (2*pi()) every `group_size` rows
      pi() * 2 * (rn-1) / group_size + pi()
  )
    AS theta(theta)
  CROSS JOIN LATERAL
  (
    SELECT
      -- Calculate the provisional page id
      -- > page 0 = first (group size / 2) rows
      -- > page 1 = next (group size) rows
      -- > page n = last (group size / 2) rows
      FLOOR(theta / pi())::INT / 2,
  
      -- arbitrary scoring
      -- > lower is better
      -- > biased towards every `group_size` rows, through cosine function
      -- > biased towards shorter prefixes by mutiplying by prefix length
      (LENGTH(name_prefix) + 1)
      *
      (POWER((COS(theta) + 1) / 2, 2) + 0.5) -- `+ 1` and `/ 2` to make cosine give values 0..1
  )
    AS page(id, score)
)
SELECT
  *,
  -- cumulative sum to derive new page_id
  SUM(new_group_marker) OVER (PARTITION BY item_type ORDER BY item_name)  AS page_id
FROM
  paginated_and_scored
/*
 * Summary info, for debugging
 */
SELECT
  item_type,
  page_id,
  MIN(name_prefix)           AS first_prefix,
  MAX(name_prefix)           AS last_prefix,
  COUNT(*)                   AS page_size
FROM
  items_paginated
GROUP BY
  item_type,
  page_id
ORDER BY
  1, 2
;
item_type page_id first_prefix last_prefix page_size 1 0 0 0V 22 1 1 1 28 18 1 2 29 37W 20 1 3 38 4AE 20 1 4 4B 5CE 20 1 5 5G 7H 20 1 6 7I 8W 17 1 7 9 AYJ 27 1 8 B BP 17 1 9 C CZ 19 1 10 D DZ 18 1 11 E FJ 23 1 12 G HZ 20 1 13 I JY 17 1 14 K LY 21 1 15 M NN 21 1 16 O PS 16 1 17 Q ST 27 1 18 T UZ 17 1 19 V WW 20 1 20 X ZY 21 2 0 0 0ZE 18 2 1 1 1Z 19 2 2 2 3Z 26 2 3 4 4M 17 2 4 4N 5Y 21 2 5 6 6Z 16 2 6 7 7Z 24 2 7 8 8ZZ 16 2 8 9 9V 19 2 9 A B5 24 2 10 BA CC 21 2 11 CF DH 20 2 12 DL ET 24 2 13 F GP 13 2 14 H IW 19 2 15 J LR 23 2 16 M PX 23 2 17 Q RS 15 2 18 S UX 21 2 19 V XZ 24 2 20 Y ZQ 18 3 0 0 0X8 18 3 1 1 25 22 3 2 27 3C 20 3 3 3F 47 19 3 4 49 5D 21 3 5 5G 6X 23 3 6 7 7S 18 3 7 8 8Z 17 3 8 9 A7 21 3 9 AB B8S 22 3 10 BA CYP 23 3 11 D DY 17 3 12 E EY 15 3 13 F HU 24 3 14 I LE 25 3 15 M OY 19 3 16 P QZ 17 3 17 R TY 21 3 18 U VW 16 3 19 W WZ 20 3 20 X ZT 23
/*
 * Final output for actual use.
 */

SELECT
  item_type,
  page_id,
  name_prefix   AS first_prefix
FROM
  items_paginated
WHERE
  new_group_marker = 1
ORDER BY
  1, 2
;
item_type page_id first_prefix 1 1 1 1 2 29 1 3 38 1 4 4B 1 5 5G 1 6 7I 1 7 9 1 8 B 1 9 C 1 10 D 1 11 E 1 12 G 1 13 I 1 14 K 1 15 M 1 16 O 1 17 Q 1 18 T 1 19 V 1 20 X 2 1 1 2 2 2 2 3 4 2 4 4N 2 5 6 2 6 7 2 7 8 2 8 9 2 9 A 2 10 BA 2 11 CF 2 12 DL 2 13 F 2 14 H 2 15 J 2 16 M 2 17 Q 2 18 S 2 19 V 2 20 Y 3 1 1 3 2 27 3 3 3F 3 4 49 3 5 5G 3 6 7 3 7 8 3 8 9 3 9 AB 3 10 BA 3 11 D 3 12 E 3 13 F 3 14 I 3 15 M 3 16 P 3 17 R 3 18 U 3 19 W 3 20 X

fiddle

Alex Patel · 0 rep · 8 hours ago

Your answer