
So I have a table with about 30 Billion Rows.

I needed to insert 20 billion more.
I have 2 indexes. One is on a HUGE field (80 characters), and the inserted values are random.
The other is on a simple BIGINT tied to a sequence...

The question becomes: "Should I drop my indexes?" So I played with this.
It's educational. These are NOT SSDs; these are iSCSI drives over a Gigabit network.

Inserting with both indexes: 1 Million Rows Per HOUR.
Without the indexes: 20 Million Rows Per MINUTE (1,200 times faster).
With ONLY the SEQUENCE indexed: ~5 Million Rows Per MINUTE (and it took 18 hrs to rebuild that index after dropping it, LOL).

Even considering how long it takes to rebuild the SEQUENCE index, I should have dropped it for the inserts.
Effectively I have to do 20 hrs worth of inserts (about 1 hr per billion rows), then 18 hrs for the SEQUENCE index plus ??? hrs to rebuild the other index.

That is 38 hrs of work plus however long the second index build adds to this. My guess is about 10 extra hrs (so ~28 hrs to create that index, and I will run the two builds at the same time). I will write an update when I find out.

But the loads with one index vary between 3.1 and 3.6 hours per file (some of this could be network demand).
So, loading 20 files that way is AT LEAST 65 hrs, plus the time to build the last index.
Versus dropping both, spending 20 hrs loading, and then ~28 hrs building the indexes again.
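
For anyone wanting to replicate this, the drop-both / load / rebuild plan looks roughly like the following. The table and index names here are hypothetical, just a sketch:

-- Hypothetical names; drop both indexes before the bulk load.
DROP INDEX IF EXISTS big_table_text_idx;  -- index on the 80-character field
DROP INDEX IF EXISTS big_table_seq_idx;   -- index on the sequence-backed BIGINT

-- ... bulk load the 20 files here (COPY, see the discussion below) ...

-- Rebuild both afterwards, from two separate sessions so they run at the same time:
CREATE INDEX big_table_seq_idx ON big_table (seq_id);
CREATE INDEX big_table_text_idx ON big_table (big_key);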

Luckily this is not something that is in use, and downtime is readily available.
The table is NOT partitioned. The problem with partitioning it is that the pseudo-random key makes the most sense as the partition key, since it is the most queried. Maybe if I split the 50 billion rows into 4 partitions, it won't matter that much.
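
If I ever do partition it, hash partitioning on that pseudo-random key would look something like this (a sketch with hypothetical names and types, 4 partitions):

CREATE TABLE big_table (
    seq_id  BIGINT NOT NULL,
    big_key VARCHAR(80) NOT NULL
) PARTITION BY HASH (big_key);

CREATE TABLE big_table_p0 PARTITION OF big_table FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE big_table_p1 PARTITION OF big_table FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE big_table_p2 PARTITION OF big_table FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE big_table_p3 PARTITION OF big_table FOR VALUES WITH (MODULUS 4, REMAINDER 3);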

Anyway, it's always faster to load data without indexes (and certainly without triggers).
Without dropping the indexes, this could have taken 20,000 hours (833 days)... and it would be 99% just thrashing the disk drives! Instead it will take a week or so, squeezed in between the holidays!
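
If your table does have triggers, the usual trick is to disable them around the load (a sketch, same hypothetical table name, and it needs sufficient privileges):

ALTER TABLE big_table DISABLE TRIGGER USER;
-- ... bulk load ...
ALTER TABLE big_table ENABLE TRIGGER USER;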

BTW... How do you learn?
You DO things. Specifically HARD things...

4 replies

79 views

Very interesting. Thanks for sharing @CaptainKirkFan

Thanx.. :)

Did you try "COPY"?

Kirk - Original poster
Mateus Freitas
Did you try "COPY"?

Of course. COPY (or \copy via psql) is simply the fastest way I know to load data. It is the format pg_dump/pg_restore uses by default, and it requires very little parsing, unlike parsing SQL for every insert. I can't fathom how long 1 BILLION insert statements would take... just on the parsing time ALONE. COPY should be the first consideration for bulk loading data.
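
For reference, both forms look like this (table, column, and file names are hypothetical):

-- Server-side COPY: the file must be readable by the PostgreSQL server process.
COPY big_table (seq_id, big_key) FROM '/data/load/file01.csv' WITH (FORMAT csv);

-- Client-side from psql: the file is read by psql and streamed over the connection.
\copy big_table (seq_id, big_key) from 'file01.csv' with (format csv)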
