Multithreaded Rockstar AMP

Even though this blog does what it's supposed to do (I hope), I can't help but keep messing with it. I guess with my day job being mostly backend work on internet shopping websites, this is my way of venting. Sometimes, it gives me an idea of what is going on beneath the abstractions I work with, like search indexes. Other times, I want to toy around with visual design.

About 2 years ago, Google decided to do something about abysmal page loading times on phones. They created Accelerated Mobile Pages (AMP). The idea was that people would see a lightning bolt next to search results, and know that the page would (probably) load fast. The nitty-gritty is that Google caches these pages, and each page may only use certain HTML tags; anything else is forbidden. This seemed OK for my blog and I would have done it (as if it would need it), except it involved redoing my image tags.

A few months ago, I took a stab at it again. All it involved was searching for images in my posts, looking at each image to get its height and width, and inserting those into a specialized page. I've heard that having height and width present on an image is good for other reasons, so I figured "why not". I also needed AMP-specific CSS. I'm hesitant to say "file", because it's included directly on the page, and it has to be less than 50 KB. My CSS is over that (due to embedded base64 fonts), so I trimmed it down for the AMP pages only. Other pages get the full CSS.
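Here is a minimal sketch of that image step, not the blog's actual code. The class name, method name and example path are made up for illustration; it assumes the image bytes are already in hand and that the src value needs no HTML escaping. It reads the real dimensions with ImageIO and writes them into an amp-img tag.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper: the names here are illustrative, not taken from the blog's code.
class AmpImageTag {

    /** Decodes the image and returns an amp-img tag carrying its real width and height. */
    static String toAmpImg(String src, byte[] imageBytes) throws IOException {
        BufferedImage image = ImageIO.read(new ByteArrayInputStream(imageBytes));
        if (image == null) {
            // ImageIO.read returns null when no registered reader understands the data.
            throw new IOException("Unsupported or corrupt image: " + src);
        }
        // Assumption: src is already safe to place inside an HTML attribute.
        return "<amp-img src=\"" + src + "\" width=\"" + image.getWidth()
                + "\" height=\"" + image.getHeight() + "\" layout=\"responsive\"></amp-img>";
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway 640x480 PNG in memory so the sketch runs without any files.
        BufferedImage sample = new BufferedImage(640, 480, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(sample, "png", out);
        System.out.println(toAmpImg("/content/example.png", out.toByteArray()));
    }
}
```

Decoding the whole image just to learn its size is the simple route; for very large images, reading only the header through an ImageReader would be lighter.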

I took the existing code I had for article pages, and tweaked it as necessary to meet AMP's requirements. They live on a different servlet. Replace "/article/" in the URL with "/amp/" for AMP formatted articles. This works pretty well. I use an extension that looks for the AMP page link in every page, and provides a way to switch between AMP and not-AMP pages. For people who don't have that luxury, I have a link on each AMP page to get to the "true" experience.

After a few weeks, my AMP pages started appearing in Google searches, but only on my phone. The cyborgs running the place are prejudiced against desktop users. In the end, I don't think that AMP has sped up my pages. They are so light that adding a bunch of JavaScript libraries to speed them up more doesn't work.

Back when I implemented the search feature, it had a huge problem in that it would only accept exact spellings. After all this time and research, I found a solution: trigrams. It works by dividing up a word into multiple 2 and 3 letter combinations, and looking over other words to see if they have a good amount of those combinations. I needed to do more pre-processing on my articles:

Before an article hits the database, a version of its text that doesn't have any links (or other things like symbols/URLs/images) is created. HTML tags are stripped with <.+?>, and !?\\[\"?(.+?)\"?\\]\\(\\S+?(?:\\s\"?(.+?)\"?)?\\) strips markdown links and images.
When I post an article, the existing words from all articles are deleted. These live in their own table, separate from the articles: TRUNCATE TABLE toilet.articlewords;
The words are recreated from the link-free article text, and the trigram index on that table is rebuilt: INSERT INTO toilet.articlewords (SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', searchabletext) FROM toilet.article') ORDER BY word); ANALYZE toilet.articlewords;
On a search, the query is broken into words. Each word is searched against the article word table using trigram similarity: SELECT array_to_string(array_agg(word),' | ') AS word FROM toilet.articlewords WHERE (word % ?query) = TRUE;
The words that match each query word become the new query. That goes into the regular search function.
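To show what the % operator is roughly doing, here is a small Java sketch of trigram similarity in the style of PostgreSQL's pg_trgm. It is an illustration, not the blog's code (the real work happens in the database), and the class and method names are made up. Each word is lowercased and padded with two spaces in front and one behind, then every 3-character window is collected. The padding is probably why some combinations look like they only have one or two letters. Similarity is the number of shared trigrams divided by the number of distinct trigrams across both words, and pg_trgm's default threshold for a match is 0.3.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of pg_trgm-style matching; not the blog's code.
class TrigramSketch {

    /** Pads the word (two spaces before, one after) and collects every 3-character window. */
    static Set<String> trigrams(String word) {
        String padded = "  " + word.toLowerCase(Locale.ROOT) + " ";
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            result.add(padded.substring(i, i + 3));
        }
        return result;
    }

    /** Shared trigrams divided by all distinct trigrams across both words (0.0 to 1.0). */
    static double similarity(String a, String b) {
        Set<String> left = trigrams(a);
        Set<String> right = trigrams(b);
        Set<String> shared = new HashSet<>(left);
        shared.retainAll(right);
        Set<String> all = new HashSet<>(left);
        all.addAll(right);
        return (double) shared.size() / all.size();
    }

    public static void main(String[] args) {
        // A misspelled query still scores well against the correctly spelled word.
        System.out.println(similarity("rockstar", "rokstar"));
        System.out.println(similarity("rockstar", "toilet"));
    }
}
```

"rockstar" and "rokstar" share 6 of 11 distinct trigrams, so the misspelling clears the 0.3 threshold comfortably, while an unrelated word shares none.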

I experimented with Java EE's ManagedExecutorPools many years ago, but something didn't work right. I started over, researching and toying around. Before I knew it, I was refactoring. (Spoilers: use @Resource ManagedExecutorService managedExecutorService; in a managed bean on a Java EE 7+ server.) After a few missteps (and lots of log entries), as is common for multi-threaded programming, I got it. I have extra CPU threads lying around not doing anything, especially since I upgraded my server to that i7. So let's put those threads to work!

When a post or comment is made, the RSS feeds need to be rebuilt. However, the result is not important for what you see (the page). Another thread updates them, and they can finish whenever they please.
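A rough sketch of that fire-and-forget pattern, under some assumptions: a plain java.util.concurrent thread pool stands in for the container's ManagedExecutorService (which extends ExecutorService, so submit works the same way), and savePost and rebuildFeeds are hypothetical placeholders rather than the blog's real methods.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; method names are placeholders, not the blog's code.
class FeedRefreshSketch {
    // Stand-in for the injected ManagedExecutorService on a Java EE 7+ server.
    private final ExecutorService executor;
    final AtomicInteger feedsBuilt = new AtomicInteger();

    FeedRefreshSketch(ExecutorService executor) {
        this.executor = executor;
    }

    /** Stores the post, then queues the feed rebuild without waiting for it. */
    String savePost(String title) {
        // ... the post would be written to the database here ...
        executor.submit(this::rebuildFeeds);
        return "saved: " + title; // the response does not depend on the feeds
    }

    /** Placeholder for regenerating the RSS feeds on another thread. */
    void rebuildFeeds() {
        feedsBuilt.incrementAndGet();
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        FeedRefreshSketch blog = new FeedRefreshSketch(pool);
        System.out.println(blog.savePost("Multithreaded Rockstar AMP"));
        // Only the demo waits for the background work; a server would just carry on.
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("feeds rebuilt: " + blog.feedsBuilt.get());
    }
}
```

One thing to keep in mind with this pattern: an exception thrown inside the submitted task is captured in the Future that submit returns, so if nobody looks at it, the failure only shows up if the task logs it itself.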

The biggest improvement comes from the restore backup process. It needs to parse those RSS feeds into posts and comments, and insert files into the database. I also use this to process and update every article (like for AMP). That update converts everything from markdown, runs several regexes, inserts the post into the database, and refreshes the search indexes. Those steps need the files (images) inserted into the database first, and each update takes considerable time. However, each file is independent of other files, and each post is independent of other posts. That screams "MULTI-THREAD ME PLEASE!" to me. It is much faster, down to about 30 or so seconds. I'm positive that it uses more memory, but since I have several gobs worth that has always sat around, I'm not worried in the least.
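The ordering constraint (all files first, then all posts, with everything inside each phase running in parallel) can be sketched like this. Again, this is an assumption-laden illustration: a plain thread pool stands in for the managed executor, the string concatenation is a placeholder for the real insert and update work, and the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of a two-phase parallel restore; not the blog's code.
class RestoreSketch {

    /** Runs one task per item in parallel and returns the results in input order. */
    static List<String> runAll(ExecutorService executor, List<String> items, String prefix)
            throws InterruptedException, ExecutionException {
        List<Callable<String>> tasks = new ArrayList<>();
        for (String item : items) {
            tasks.add(() -> prefix + item); // placeholder for the real insert or update
        }
        List<String> results = new ArrayList<>();
        for (Future<String> done : executor.invokeAll(tasks)) { // blocks until every task completes
            results.add(done.get()); // rethrows any failure from that task
        }
        return results;
    }

    /** Phase 1 stores every file; phase 2 starts only after phase 1 has fully finished. */
    static List<String> restore(ExecutorService executor, List<String> files, List<String> posts)
            throws InterruptedException, ExecutionException {
        List<String> log = new ArrayList<>(runAll(executor, files, "file stored: "));
        log.addAll(runAll(executor, posts, "post updated: "));
        return log;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            restore(pool, Arrays.asList("a.png", "b.png"), Arrays.asList("post-1", "post-2"))
                    .forEach(System.out::println);
        } finally {
            pool.shutdown();
        }
    }
}
```

invokeAll makes a natural barrier between the two phases, because it does not return until every task in the batch is done.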

I've upgraded my password hashing from scrypt to Argon2. Password checks take a bit longer now, but that is worth it for the added security.

Screenshot of Firefox inspecting my homepage.

Earlier this week, I found a new font: reey regular. You like? It makes my blog look like a rockstar autographed poster. There are lots of alternate letter forms going on, and many glyphs (font letters) go beyond the rectangle they should fit in. If a letter is repeated, that second letter looks a little different from the first. It looks very natural.

I've made an update to how comments work. They live inside an iframe. My idea is that the page outside the frame can be cached heavily (since it shouldn't change), but the comments are supposed to update, so they shouldn't be as heavily cached. This has the neat effect of simulating an AJAX call, since only the iframe updates when you comment, not the entire page. I have some strict CSRF protection, and I hope it plays nicely inside an iframe (it seems to). As an added bonus, the comments form is on the second request to the server, so if a new visitor comes directly to a page, they already have a cookie by the time the form loads, and that person should not be a spambot!

I've gotten around to updating the git repository with all the code behind this blog. Wow, that was woefully out of date!