Mastodon database hiccup
########################

:date: 2020-04-23T11:17Z
:category: blog
:tags: postgresql,mastodon
:url: blog/mastodon-database-hickup/
:save_as: blog/mastodon-database-hickup/index.html
:status: published
:author: Gergely Polonkai

After being offline for almost a year, my `Mastodon instance <https://social.polonkai.eu>`_ is back
up. However, getting there wasn’t as straightforward as I wanted it to be.

About a year ago my Mastodon server went down. I messed up my Docker images big time and the
then-new version 2.8.4 didn’t want to start up. I left it there, and as I had a pretty rough year
(in the good sense) I didn’t have time to bring it up again.

One of the constant problems I had with this instance was the size of the attachment directory. It
easily grew to an enormous size in no time, so I had to clean things up constantly. Plus, it took
precious space away from my other projects. So my first move was moving all these files to a
DigitalOcean Space; with the ``awscli`` Python package it was easy as pie. If you want to do it,
too, I followed `this article
<https://angristan.xyz/2018/05/moving-mastodon-media-files-to-wasabi-object-storage/>`_.

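The move itself boils down to a couple of ``awscli`` invocations. A minimal sketch, assuming a
Space named ``mastodon-media`` in the ``ams3`` region and the stock ``public/system`` media
directory (your Space name, region, and paths will differ):

```shell
# Configure a profile with the Space's access key pair first:
#   aws configure --profile digitalocean
#
# Then copy the media directory into the Space. --endpoint-url points
# awscli at DigitalOcean instead of AWS; --acl public-read makes the
# attachments publicly fetchable, which Mastodon expects.
aws s3 sync public/system/ s3://mastodon-media/ \
    --profile digitalocean \
    --endpoint-url https://ams3.digitaloceanspaces.com \
    --acl public-read
```

``aws s3 sync`` only uploads files that are missing or changed on the remote side, so it is safe
to re-run until everything is across.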
Next came the database.

I tried to load the database dump, created with ``pg_dump -f mastodon.sql mastodon``, using
``psql -f mastodon.sql``. Everything went fine and all the records were created, except for this
error message:

.. code-block:: text

   psql:mastodon-dump.sql:1691661: ERROR: could not create unique index "index_tags_on_name"
   DETAIL: Key (name)=(opensource) is duplicated.

First I tried to ignore it, but running ``rails db:migrate`` failed when it tried to manipulate
this index.

So I issued this query:

.. code-block:: sql

   SELECT COUNT(*) AS count, name FROM tags GROUP BY name HAVING COUNT(*) > 1;

And the result was this:

.. code-block:: text

   count |    name
   ------+-------------
       2 | opensource
       2 | xmpp
       2 | fdroid
       2 | socialmedia
       2 | blogging
       2 | c64
       2 | blog

Then I collected the IDs of these duplicate tag records using this query:

.. code-block:: sql

   SELECT id, name, created_at
     FROM tags
    WHERE name IN ('opensource',
                   'xmpp',
                   'fdroid',
                   'socialmedia',
                   'blogging',
                   'c64',
                   'blog')
    ORDER BY name, created_at;

.. code-block:: text

       id | name        | created_at
   -------+-------------+----------------------------
      159 | blog        | 2018-02-24 19:56:01.710702
    15941 | blog        | 2019-04-03 22:05:56.438488
      158 | blogging    | 2018-02-24 19:56:01.721354
    16006 | blogging    | 2019-04-09 12:13:20.852976
     5441 | c64         | 2018-07-31 00:50:56.172468
    16036 | c64         | 2019-04-14 19:56:40.692197
      924 | fdroid      | 2018-04-14 19:39:21.261817
    15947 | fdroid      | 2019-04-04 19:10:50.190317
      237 | opensource  | 2018-03-05 21:50:52.609723
    15929 | opensource  | 2019-04-03 11:49:21.772961
      251 | socialmedia | 2018-03-06 23:02:33.573775
    16034 | socialmedia | 2019-04-14 19:07:37.081635
      519 | xmpp        | 2018-03-28 16:02:05.023784
    15988 | xmpp        | 2019-04-07 08:06:59.429965

I wanted to see the ``created_at`` field just to know when things went bad. As it turns out, it
was around April 2019, so it might have happened with the 2.8.0 upgrade. Whether it was me or
some problem with the migrations remains a mystery.

I also wanted to know about all the tables referencing ``tags``:

.. code-block:: sql

   SELECT
       tc.table_name, kcu.column_name
   FROM
       information_schema.table_constraints AS tc
       JOIN information_schema.key_column_usage AS kcu
         ON tc.constraint_name = kcu.constraint_name
       JOIN information_schema.constraint_column_usage AS ccu
         ON ccu.constraint_name = tc.constraint_name
   WHERE constraint_type = 'FOREIGN KEY' AND ccu.table_name = 'tags';

.. code-block:: text

    table_name        | column_name
   -------------------+-------------
    statuses_tags     | tag_id
    account_tag_stats | tag_id

And last, but not least, I had to massage the dump a bit. After creating a backup, I opened the
file in Vim and looked for the place where it loads the data into the ``tags`` table. It looked
like this:

.. code-block:: sql

   COPY public.tags (name, created_at, updated_at, id) FROM stdin;
   blogging	2018-02-24 19:56:01.710702	2018-02-24 19:56:01.710702	158
   blog	2018-02-24 19:56:01.721354	2018-02-24 19:56:01.721354	159
   opensource	2018-03-05 21:50:52.609723	2018-03-05 21:50:52.609723	237
   socialmedia	2018-03-06 23:02:33.573775	2018-03-06 23:02:33.573775	251
   xmpp	2018-03-28 16:02:05.023784	2018-03-28 16:02:05.023784	519
   fdroid	2018-04-14 19:39:21.261817	2018-04-14 19:39:21.261817	924
   c64	2018-07-31 00:50:56.172468	2018-07-31 00:50:56.172468	5441
   opensource	2019-04-03 11:49:21.772961	2019-04-03 11:49:21.772961	15929
   blog	2019-04-03 22:05:56.438488	2019-04-03 22:05:56.438488	15941
   fdroid	2019-04-04 19:10:50.190317	2019-04-04 19:10:50.190317	15947
   xmpp	2019-04-07 08:06:59.429965	2019-04-07 08:06:59.429965	15988
   blogging	2019-04-09 12:13:20.852976	2019-04-09 12:13:20.852976	16006
   socialmedia	2019-04-14 19:07:37.081635	2019-04-14 19:07:37.081635	16034
   c64	2019-04-14 19:56:40.692197	2019-04-14 19:56:40.692197	16036

For every tag, I removed the line with the latest date and noted its ID, then went on to the
``statuses_tags`` table, which looked like this (I won’t include all the rows here, as that would
take a lot of space):

.. code-block:: sql

   COPY public.statuses_tags (status_id, tag_id) FROM stdin;
   101862091116861098	15929

Here, I had to look for the IDs noted before in the last column (i.e. at the end of each line),
and for every such line I changed the newer ID to the old one (15929 became 237, and so on).

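This last-column rewrite is mechanical enough to script instead of doing it by hand in Vim. A
small sketch of the same remapping (the ID pairs come from the ``tags`` listing above; the helper
name is my own invention):

```python
# Map each newer duplicate tag ID to the original one, as collected
# from the tags listing above (newer ID on the left, original on the
# right).
ID_MAP = {
    "15929": "237",   # opensource
    "15988": "519",   # xmpp
    "15947": "924",   # fdroid
    "16034": "251",   # socialmedia
    "16006": "158",   # blogging
    "16036": "5441",  # c64
    "15941": "159",   # blog
}


def remap_tag_id(line):
    """Rewrite the tag_id (last, tab-separated column) of a COPY row."""
    columns = line.rstrip("\n").split("\t")
    columns[-1] = ID_MAP.get(columns[-1], columns[-1])
    return "\t".join(columns)
```

Running every data row of the ``statuses_tags`` COPY section through this helper performs exactly
the substitution described above, leaving rows with non-duplicate IDs untouched.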
Finally, the ``account_tag_stats`` table:

.. code-block:: sql

   COPY public.account_tag_stats (id, tag_id, accounts_count, hidden, created_at, updated_at) FROM stdin;
   1	15929	0	f	2019-04-03 11:49:21.810564	2019-04-03 11:49:21.810564

Here, the ID I was looking for is in the second column, so I used the magical Vim regex
``^[0-9]\+\t\(15929\|15941\)`` (but with all the other IDs in the parentheses, too), and again
changed the new IDs to the old ones.

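The same substitution can be expressed with Python’s ``re`` module. A sketch equivalent to the
Vim pattern above (only two of the ID pairs are shown for brevity; the function name is made up):

```python
import re

# Newer-duplicate-to-original ID pairs; extend with the rest of the
# duplicates found earlier.
ID_MAP = {"15929": "237", "15941": "159"}

# Like the Vim pattern ^[0-9]\+\t\(15929\|15941\): a numeric id column,
# a tab, then one of the duplicate tag IDs followed by another tab.
PATTERN = re.compile(r"^([0-9]+)\t(%s)\t" % "|".join(ID_MAP))


def fix_stats_row(line):
    """Rewrite the tag_id (second column) of an account_tag_stats row."""
    return PATTERN.sub(
        lambda m: "%s\t%s\t" % (m.group(1), ID_MAP[m.group(2)]), line
    )
```

Requiring the trailing tab in the pattern keeps a duplicate ID from matching as a prefix of a
longer, unrelated ID.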
I still can’t believe I didn’t mess up the database too much, but it loaded just fine, and the
``db:migrate`` command executed without errors.

I also had some problems with the nginx reverse proxying, but it all turned out well, and my
instance is back online after a year. Come follow me at `@gergely@polonkai.eu
<https://social.polonkai.eu/@gergely>`_!