Drupal to WordPress: Migrating Database, URLs, Generating Posts…
The long and winding road of (low-level) migrating content from a Drupal website to WP. Workplan: MySQL and SFTP → SQL → CSV → scripting → import plugin → theme hacking…
Dump the old
- SQL dump, the usual way
- Copy (SFTP, scp, whatever) files: /srv/www/vhosts/$SITE/drupal/{mysql_dump,pdf_files,cover_photos}
What to migrate?
- Site (I'm helping a friend with) is the online "archive" of an academic quarterly.
- So far, it contains just (separate) PDFs of papers, and photos of issues' covers. (Later, we'll consider converting those PDFs to… HTML-based hypermedia documents — papers typically embed lots of sciency media.)
Information architecture: site specifications
- Papers (should be) grouped by the (printed, bound) journal (aka issue) they appeared in, which are assigned volume and sub-volume (quarter) number, as well as publication date (just year, actually), eg, "volume 63, number 4, 2008".
URLs, navigation
Content | Canonical URL (old/new) | Notes
---|---|---
Posts: papers, etc — content that’s added over time | ||
Paper | /2018/03/of_human_loss_and_rbc_survival/ | NB trailing slash |
Issue (category? post? date archive!) | Had no canonical URL. Maybe use date archive, eg /2018/03/, hacked to also show cover photos? | Cover photos, table of contents, intro/editorial? See notes below. |
PDF (attachment.php?) | Old: /sites/default/files/pozzi_1.pdf | Attachment to paper’s post |
Date archive (ie, multiple as well as single issue?) | /2018/, or /2018/03/ | |
RSS (auto discovery?) | Old: /taxonomy/term/19/0/feed, and /rss.xml ? WP: /feed/, /feed/comments/, /tag/foo/feed/ | Latest papers, issues, etc. See notes. |
Pages: about us, etc — relatively unchanging content | ||
Front page (page) | / | Currently same as about, but should be: Intro, and probably list latest papers or issue(s) |
About | /about | Long |
Contact form | /contact | Sends with SMTP |
Instructions for authors, submitting (page) | Old: /Instruction [sic] | Fix: /instructions, or /submitting |
Misc | ||
Media (images, video…) | ? | |
Author (search by) | /author/dolittle/ | Lists matching papers. NB trailing slash. |
Author’s profile? Index of authors? | ? | |
Tag (taxonomy) index? | /tag/antibiotics/ | |
Search results (full text search) | /?s=… |
WP's data model
- Posts are (as should be) HTML. PDFs will be media attachments, and need to be hyperlinked from posts.
- Attachments are (optionally) associated with posts via their post_parent field/property.
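A minimal sketch of that parent link, with made-up IDs and titles (real rows live in WP's wp_posts table; this is just the shape of the relation):

```python
# Two illustrative wp_posts-style rows (IDs and titles are made up):
paper = {"ID": 101, "post_type": "post", "post_title": "Of human loss and RBC survival"}
pdf = {
    "ID": 102,
    "post_type": "attachment",
    "post_mime_type": "application/pdf",
    "post_parent": 101,  # points back at the paper's post ID
}

def attachments_of(post, rows):
    """All rows whose post_parent points at the given post."""
    return [r for r in rows if r.get("post_parent") == post["ID"]]
```

So attachments_of(paper, [paper, pdf]) yields just the PDF row, which is all the theme needs to link paper and file.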
How to migrate?
- (No easy (plugins, readymade) solutions. Obviously, I'd rather avoid PHP and SQL, as much as possible. And XML/WXR. Cf bookmarks…)
- CSV seems easy enough to output from MySQL, maybe manipulate as plain text, and "import" into WP — pick from several plugins with different feature sets. Really Simple CSV Importer seems useful.
- What fields (aka columns) to generate for the CSV import?
- post_title
- post_content: something like "Currently only available in <a href="$LINK">PDF</a>.", where $LINK is "/wp-content/uploads/$FILENAME"
- post_date (string, passed to PHP's strtotime which presumably can handle ISO 8601 timestamps): where journal's title is like “Journal VOL.64 | NO.3, 2009”, we'll map into dates like "2009-09-01"
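The post_date mapping can be sketched already (the regexp assumes the uniform title format shown above; issue N of a quarterly lands on month 3N, with the day pinned to the 1st):

```python
import re

# Assumes titles uniformly look like "Journal VOL.64 | NO.3, 2009".
TITLE_RE = re.compile(r"Journal VOL\.\d+ \| NO\.(\d+), (\d+)")

def title_to_date(title):
    """Map e.g. 'Journal VOL.64 | NO.3, 2009' to '2009-09-01'."""
    m = TITLE_RE.fullmatch(title)
    if m is None:
        raise ValueError("unexpected title: %r" % title)
    issue, year = int(m.group(1)), m.group(2)
    return "%s-%02d-01" % (year, 3 * issue)

print(title_to_date("Journal VOL.64 | NO.3, 2009"))  # 2009-09-01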
(SSH)
- Add "ServerAliveInterval 120" to /etc/ssh/ssh_config (client config) to prevent frequent disconnections.
- SFTP acted up, frequently, on a tweaked Debian Live I sometimes used — SSHFS to the rescue (even feels quicker in Thunar): $ mkdir ~/site; sshfs user@site.ddns.net:/srv/www/vhosts/site/ ~/site
Drupal's schema
- files.{fid, filepath, filemime}: filemime is either {application/pdf, image/png, image/jpeg}
- filename isn't unique, and doesn't match files in filesystem, which somehow got renamed to make them unique, eg:
(365, 1, 'pozzi.pdf', 'sites/default/files/pozzi.pdf', 'application/pdf', 1551860, 1, 1394553205), (483, 1, 'pozzi.pdf', 'sites/default/files/pozzi_0.pdf', 'application/pdf', 1489159, 1, 1459098492), (574, 1, 'pozzi.pdf', 'sites/default/files/pozzi_1.pdf', 'application/pdf', 1653356, 1, 1513451716)
Verified with:> select filename,count(filename) from files group by filename having count(filename)>1; 21 rows in set > select filepath from files group by filepath having count(filepath)>1; Empty set
- filename isn't unique, and doesn't match files in filesystem, which somehow got renamed to make them unique, eg:
- upload.{fid, nid}: fid is key into files, nid is key into node
- node.{nid, type, title}: where type=journal, title is like “Journal VOL.64 | NO.3, 2009”; where type=ipaper, title is paper’s title
- search_node_links.{sid, nid}: maps papers to journals (both IDs should be keys into node, sid for journal, nid for ipaper)
- simplenews_subscriptions.{mail, activated}: subscribers’ emails; activated={0,1}
SQL
- How to generate the CSV import? For each record in table "upload" we need these:
- post_title: title from node, where node.nid=upload.nid
- post_content: something like "Currently only available in <a href="$LINK">PDF</a>.", with $LINK set to "/wp-content/uploads/$FILENAME" and $FILENAME is files.filename
- post_date (string, passed to PHP's strtotime which presumably can handle ISO 8601 timestamps): where upload.nid=search_node_links.nid, search_node_links.sid points to node.nid that has node.type=journal and node.title is like “Journal VOL.64 | NO.3, 2009”, which we'll map into dates like "2009-09-01".
- This is the query I came up with (after quite a bit of debugging, because I've been all about NoSQL for decades now ;o):
> select files.filepath,paper.title,journal.title -> from upload -> inner join files -> on files.fid=upload.fid -> inner join node as paper -> on upload.nid=paper.nid -> inner join search_node_links as map -> on map.sid=upload.nid -> inner join node as journal -> on journal.nid=map.nid; 380 rows in set (0.01 sec)
- "explain" says all five queries used primary keys (except upload — entire table iterated), so efficient enough.
- (Should I have used left outer joins, instead, to look for inconsistencies?)
CSV
- (After getting error, "server is running with --secure-file-priv", got safe path to use, with "show variables like 'secure_file_priv';".)
-
> select ... into outfile '/var/lib/mysql-files/result.csv' fields terminated by ',' enclosed by '"' escaped by '\\' from ...
sends the output to a CSV file, easily. ~59KB.
AWKward
- Now, to translate journal titles like "Journal VOL.64 | NO.3, 2009" into dates like "2009-09-01".
- Regexps are safe enough for this (tiny, uniform, throwaway) data set:
$ grep --invert-match --extended-regexp '"Journal VOL.+ \| NO.([[:digit:]]+), ([[:digit:]]+)"' result.csv | wc --lines 0
- But, we wanted to convert issue numbers to months (by multiplying by 3), so this won't do (simply):
$ sed --regexp-extended 's/"Journal VOL.+ \| NO.([[:digit:]]+), ([[:digit:]]+)"/\2,\1/' result.csv
- awk (mawk? gawk?) should be able to, but… ugh! Painfully different parsing model, impedance mismatch…
- So, alternatives: Perl? Python? CoffeeScript (ie, Node)? Bash?
- Bash's builtin "read" is already line oriented ("-r" disables backslash processing):
while read -r; do echo $REPLY; done
But, parsing CSV lines with regexp (enhanced, with look behind, to deal with escaped commas?), and substring substitutions? POSIX (E)REs suck! - Perl can surely do it with a one liner, but, Pathologically…
- Python has CSV batteries included! ;o) Tho more verbose than we might have otherwise, at least no format/safety pains (intentionally rigid matching and conversions so would crash if mismatched, which it didn't).
Executable pseudocode:#!/usr/bin/python3 # Usage: ./fix_csv_to_import.py < result.csv > import.csv # Output CSV. out=csv.writer(sys.stdout,quoting=csv.QUOTE_NONNUMERIC,escapechar='\\',doublequote=False) # Headers. out.writerow(["post_content","post_title","post_date","post_type"]) # Crunch. for row in csv.reader(iter(sys.stdin.readline,'')): # Column 1: convert PDF's path to post_content, HTML. url=row[0].replace('sites/default/files','/wp-content/uploads',1) content='Currently only available as <a href="%s">PDF</a>.'%(url) # Column 3: convert to date string (ISO 8601, partial), mapping issue number (1-4) to month (3,6,9,12). def fn(match): return '%s-%02d-01'%(match.group(2),3*int(match.group(1))) created=r.sub(fn,row[2],1) out.writerow([content,row[1],created,'post'])
- Bash's builtin "read" is already line oriented ("-r" disables backslash processing):
Really Simple CSV Importer
Too easy. ;o)
Cover photos
- cover_french_bull_dog_september_2011.jpg, coverjune2010.png… Mess. Renamed (alas, manually) all files to uniform "cover_67_2_2012.{jpg,png}".
- PNGs should pro'ly be conveted to JPEGs. And shrunk?
- Future uploads — which URL will WP assign them?
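In hindsight, much of the renaming could have been scripted. A sketch, assuming volume = year - 1945 (which matches both "volume 63, number 4, 2008" and cover_67_2_2012 in these notes) and the quarterly month names mapping to issue numbers 1-4; both assumptions would need checking against the full file list:

```python
import re

# Hypothetical helpers: month names -> issue number, volume = year - 1945.
MONTHS = {"march": 1, "june": 2, "september": 3, "december": 4}
PAT = re.compile(r"(march|june|september|december).*?(\d{4})\.(jpg|png)$")

def uniform_name(filename):
    """Map a messy cover filename to cover_VOL_NO_YEAR.ext, or None if unparseable."""
    m = PAT.search(filename.lower())
    if m is None:
        return None  # leave the oddballs for manual renaming
    month, year, ext = m.group(1), int(m.group(2)), m.group(3)
    return "cover_%d_%d_%d.%s" % (year - 1945, MONTHS[month], year, ext)

print(uniform_name("cover_french_bull_dog_september_2011.jpg"))  # cover_66_3_2011.jpg
```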
Displaying cover images in date archives (and elsewhere?)
…
Redirections
…
Subscribers
-
> select mail from simplenews_subscriptions where activated=0; Empty set
> select mail into outfile '/var/lib/mysql-files/subscribers.list' from simplenews_subscriptions; Query OK, 604 rows affected (0.00 sec)
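Before feeding that list to whatever handles the newsletter on the WP side, a crude triage of the addresses can't hurt (hypothetical helper; the regexp only catches obvious junk, it is nowhere near RFC 5322):

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # crude, good enough for triage

def triage(lines):
    """Split an exported address list into plausible and suspect entries."""
    ok, suspect = [], []
    for line in lines:
        addr = line.strip()
        if not addr:
            continue
        (ok if EMAIL_RE.match(addr) else suspect).append(addr)
    return ok, suspect
```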
Bookmarks
…--
The real world is a special case