SELECT id
FROM my_table
WHERE id IN (
SELECT id
FROM my_table
WHERE criteria_a = 19
ORDER BY create_when DESC
LIMIT 1000
);
This is the pattern I am looking for, but I need the criteria_a to be repeated for every value of criteria_a with the important focus being the LIMIT 1000 for any single value of criteria_a. There is no need to put a total LIMIT on the query, just to limit to the 1000 per criteria_a with the specific ORDER BY at that point. Put another way…
SELECT id
FROM my_table
WHERE id IN (
SELECT id
FROM my_table
WHERE criteria_a = 19
ORDER BY create_when DESC
LIMIT 1000
)
OR id IN (
SELECT id
FROM my_table
WHERE criteria_a = 20
ORDER BY create_when DESC
LIMIT 1000
);
Where I desire 2000 total rows. I could turn this into programming code (even a PostgreSQL FUNCTION) that loops over every value of criteria_a and replaces 19 in the example.
I don’t care of it is a JOIN or an IN, I’m more stuck on how to repeat the inner SELECT with the LIMIT 1000 based on sort and criteria_a. Can I do it without looping and/or UNION? Thank you.
I think you can use RANK as a window function and specify TOP 1000
Ok, I’m doing some reading: https://medium.com/@amulya349/how-to-select-top-n-rows-from-each-category-in-postgresql-39e3cfebb020
Good results with this approach. I hadn’t considered the RANK OVER PARTITION BY criteria_a values and it works like a champ. It moves the ORDER BY into the realm of focus (criteria_a) and performance seems decent enough… and it isn’t difficult to read the short statement.
SELECT COUNT(ranked_recency.*) AS post_row_count FROM ( SELECT id, post_id, community_id, published, rank() OVER ( PARTITION BY community_id ORDER BY published DESC, id DESC ) FROM post_aggregates) ranked_recency WHERE rank <= 1000 ;
Gives me the expected results over the 5+ million test rows I ran it against.
If you could elaborate on your idea of TOP, please do. I’m hoping there might be a way to wall the LIMIT 1000 into the inner query and not have the outer query need to WHERE filter rank on so many results?
Glad this is working for you. Using TOP probably was a bad idea and I think the way you used RANK <=1000 is a better approach.
If there was a way to safely exclude any of the records - like if you knew that when published was older than X days/months/years it would never make it into the final results, you could filter them out before ranking them. That might squeeze a little more performance out of the query, but could be risky if the data isn’t predictable enough.
Thank you