Drupal to WordPress: Migrating Database, URLs, Generating Posts…

The long and winding road of (low-level) migrating content from a Drupal website to WP. Workplan: MySQL and SFTP → SQL → CSV → scripting → import plugin → theme hacking…

Dump the old

  1. SQL dump, the usual way
  2. Copy (SFTP, scp, whatever) files: /srv/www/vhosts/$SITE/drupal/{mysql_dump,pdf_files,cover_photos}

What to migrate?

  1. The site (I'm helping a friend with) is the online "archive" of an academic quarterly.
  2. So far, it contains just (separate) PDFs of papers, and photos of issues' covers. (Later, we'll consider converting those PDFs to… HTML-based hypermedia documents — papers typically embed lots of sciency media.)

Information architecture: site specifications

  1. Papers (should be) grouped by the (printed, bound) journal (aka issue) they appeared in; issues are assigned a volume and sub-volume (quarter) number, as well as a publication date (just the year, actually), eg, "volume 63, number 4, 2008".

URLs, navigation

(Each entry: content type; canonical URL (old/new); notes.)

Posts: papers, etc — content that’s added over time

  1. Paper: /2018/03/of_human_loss_and_rbc_survival/ (NB trailing slash)
  2. Issue (category? post? date archive!): had no canonical URL. Maybe use the date archive, eg /2018/03/, hacked to also show cover photos? Cover photos, table of contents, intro/editorial? See notes below.
  3. PDF (attachment.php?): old /sites/default/files/pozzi_1.pdf; new: attachment to the paper’s post
  4. Date archive (ie, multiple issues as well as a single one?): /2018/, or /2018/03/
  5. RSS (auto-discovery?): old /taxonomy/term/19/0/feed, and /rss.xml?; WP: /feed/, /feed/comments/, /tag/foo/feed/ (latest papers, issues, etc; see notes)

Pages: about us, etc — relatively unchanging content

  1. Front page (page): / (currently the same as About, but should be an intro, probably listing the latest papers or issue(s))
  2. About: /about (long)
  3. Contact form: /contact (sends with SMTP)
  4. Instructions for authors, submitting (page): old /Instruction [sic]; fix: /instructions, or /submitting

Misc

  1. Media (images, video…): ?
  2. Author (search by): /author/dolittle/ (lists matching papers; NB trailing slash)
  3. Author’s profile? Index of authors?: ?
  4. Tag (taxonomy) index?: /tag/antibiotics/
  5. Search results (full-text search): /?s=…
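
(WP note: the paper and date-archive URLs above should fall out of a custom permalink structure of /%year%/%monthnum%/%postname%/ under Settings → Permalinks, so no rewrite trickery should be needed for those.)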

WP's data model

  1. Posts are (as should be) HTML. PDFs will be media attachments, and need to be hyperlinked from posts.
  2. Attachments are (optionally) associated with posts via their post_parent field/property.
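  3. For illustration (IDs hypothetical; attachments live in wp_posts alongside posts), re-parenting an imported attachment is a one-field update:
    > update wp_posts set post_parent=101 where ID=102 and post_type='attachment';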

How to migrate?

  1. (No easy, readymade solutions (plugins). Obviously, I'd rather avoid PHP and SQL, as much as possible. And XML/WXR. Cf bookmarks…)
  2. CSV seems easy enough to output from MySQL, maybe manipulate as plain text, and "import" into WP — pick from several plugins with different feature sets. Really Simple CSV Importer seems useful.
  3. What fields (aka columns) to generate for the CSV import?
    1. post_title
    2. post_content: something like "Currently only available as <a href="$LINK">PDF</a>.", where $LINK is "/wp-content/uploads/$FILENAME"
    3. post_date (string, passed to PHP's strtotime, which presumably can handle ISO 8601 timestamps): where the journal's title is like “Journal VOL.64 | NO.3, 2009”, we'll map it to a date like "2009-09-01"
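  4. For example (hypothetical paper; the exact columns and quoting the script below produces), the generated import.csv might start like:
    "post_content","post_title","post_date","post_type"
    "Currently only available as <a href=\"/wp-content/uploads/pozzi_1.pdf\">PDF</a>.","Of human loss and RBC survival","2018-03-01","post"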

(SSH)

  1. Add "ServerAliveInterval 120" to /etc/ssh/ssh_config (client config) to prevent frequent disconnections.
  2. SFTP acted up, frequently, on a tweaked Debian Live I sometimes used — SSHFS to the rescue (even feels quicker in Thunar): $ mkdir ~/site; sshfs user@site.ddns.net:/srv/www/vhosts/site/ ~/site
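  3. (ServerAliveInterval can also go per host, no root needed, in ~/.ssh/config:)
    Host site.ddns.net
        ServerAliveInterval 120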

Drupal's schema

  1. files.{fid, filename, filepath, filemime}: filemime is either {application/pdf, image/png, image/jpeg}
    1. filename isn't unique, and doesn't match the files in the filesystem, which somehow got renamed to make them unique, eg:
      (365, 1, 'pozzi.pdf', 'sites/default/files/pozzi.pdf', 'application/pdf', 1551860, 1, 1394553205),
      (483, 1, 'pozzi.pdf', 'sites/default/files/pozzi_0.pdf', 'application/pdf', 1489159, 1, 1459098492),
      (574, 1, 'pozzi.pdf', 'sites/default/files/pozzi_1.pdf', 'application/pdf', 1653356, 1, 1513451716)
      
      Verified with:
      > select filename,count(filename) from files group by filename having count(filename)>1;
      21 rows in set
      > select filepath from files group by filepath having count(filepath)>1;
      Empty set
  2. upload.{fid, nid}: fid is key into files, nid is key into node
  3. node.{nid, type, title}: where type=journal, title is like “Journal VOL.64 | NO.3, 2009”; where type=ipaper, title is paper’s title
  4. search_node_links.{sid, nid}: maps papers to journals (both IDs should be keys into node; per the working query below, sid is the paper (ipaper) and nid the journal)
  5. simplenews_subscriptions.{mail, activated}: subscribers’ emails; activated={0,1}
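  6. (By the way, a quick way to take stock of a schema like this is counting nodes per content type; a throwaway sketch against the same dump:)
    > select type,count(*) from node group by type;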

SQL

  1. How to generate the CSV import? For each record in table "upload" we need these:
    1. post_title: title from node, where node.nid=upload.nid
    2. post_content: something like "Currently only available as <a href="$LINK">PDF</a>.", with $LINK set to "/wp-content/uploads/$FILENAME", where $FILENAME is files.filename
    3. post_date (string, passed to PHP's strtotime, which presumably can handle ISO 8601 timestamps): where upload.nid=search_node_links.sid, search_node_links.nid points to the node.nid that has node.type=journal and node.title like “Journal VOL.64 | NO.3, 2009”, which we'll map to a date like "2009-09-01".
  2. This is the query I came up with (after quite a bit of debugging, because I've been all about NoSQL for decades now ;o):
    > select files.filepath,paper.title,journal.title
    -> from upload
    -> inner join files
    -> on files.fid=upload.fid
    -> inner join node as paper
    -> on upload.nid=paper.nid
    -> inner join search_node_links as map
    -> on map.sid=upload.nid
    -> inner join node as journal
    -> on journal.nid=map.nid;
    380 rows in set (0.01 sec)
    
  3. "explain" says all five queries used primary keys (except upload — entire table iterated), so efficient enough.
  4. (Should I have used left outer joins, instead, to look for inconsistencies?)
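    For the record, such a check might look like this (a sketch, using the same join directions as the query above), listing papers never mapped to any journal:
    > select paper.nid,paper.title
    -> from node as paper
    -> left outer join search_node_links as map
    -> on map.sid=paper.nid
    -> left outer join node as journal
    -> on journal.nid=map.nid and journal.type='journal'
    -> where paper.type='ipaper' and journal.nid is null;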

CSV

  1. (After getting the error "server is running with --secure-file-priv", I got the safe path to write to with "show variables like 'secure_file_priv';".)
  2. > select ... into outfile '/var/lib/mysql-files/result.csv' fields terminated by ',' enclosed by '"' escaped by '\\' from ...
    
    sends the output to a CSV file, easily. ~59KB.

AWKward

  1. Now, to translate journal titles like "Journal VOL.64 | NO.3, 2009" into dates like "2009-09-01".
  2. Regexps are safe enough for this (tiny, uniform, throwaway) data set:
    $ grep --invert-match --extended-regexp '"Journal VOL.+ \| NO.([[:digit:]]+), ([[:digit:]]+)"' result.csv | wc --lines
    0
    
  3. But we want to convert issue numbers to months (by multiplying by 3), and sed can't do arithmetic, so this won't do (simply):
    $ sed --regexp-extended 's/"Journal VOL.+ \| NO.([[:digit:]]+), ([[:digit:]]+)"/\2,\1/' result.csv
  4. awk (mawk? gawk?) should be able to, but… ugh! Painfully different parsing model, impedance mismatch…
  5. So, alternatives: Perl? Python? CoffeeScript (ie, Node)? Bash?
    1. Bash's builtin "read" is already line oriented ("-r" disables backslash processing):
      while read -r; do echo "$REPLY"; done
      But parsing CSV lines with regexps (enhanced with lookbehind, to deal with escaped commas?), plus substring substitutions? POSIX (E)REs suck!
    2. Perl can surely do it with a one-liner, but, Pathologically
    3. Python has CSV batteries included! ;o) Tho more verbose than the alternatives, there are at least no format/safety pains (matching and conversions are intentionally rigid, so the script would crash on any mismatch; it didn't).
      Executable pseudocode:
      #!/usr/bin/python3
      # Usage: ./fix_csv_to_import.py < result.csv > import.csv
      import csv
      import re
      import sys
      # Journal titles look like "Journal VOL.64 | NO.3, 2009".
      r=re.compile(r'Journal VOL.+ \| NO\.(\d+), (\d+)')
      # Output CSV.
      out=csv.writer(sys.stdout,quoting=csv.QUOTE_NONNUMERIC,escapechar='\\',doublequote=False)
      # Headers.
      out.writerow(["post_content","post_title","post_date","post_type"])
      # Crunch. Input CSV matches the mysql outfile format: enclosed by '"', escaped by '\\'.
      for row in csv.reader(iter(sys.stdin.readline,''),escapechar='\\',doublequote=False):
      	# Column 1: convert PDF's path to post_content, HTML.
      	url=row[0].replace('sites/default/files','/wp-content/uploads',1)
      	content='Currently only available as <a href="%s">PDF</a>.'%(url)
      	# Column 3: convert to date string (ISO 8601, partial), mapping issue number (1-4) to month (3,6,9,12).
      	def fn(match):
      		return '%s-%02d-01'%(match.group(2),3*int(match.group(1)))
      	created=r.sub(fn,row[2],1)
      	out.writerow([content,row[1],created,'post'])
      

Really Simple CSV Importer

Too easy. ;o)

Cover photos

  1. cover_french_bull_dog_september_2011.jpg, coverjune2010.png… Mess. Renamed (alas, manually) all files to the uniform pattern "cover_67_2_2012.{jpg,png}".
  2. PNGs should pro'ly be converted to JPEGs. And shrunk?
  3. Future uploads — which URL will WP assign them?

Displaying cover images in date archives (and elsewhere?)

Redirections

Subscribers

  1. > select mail from simplenews_subscriptions where activated=0;
    Empty set
  2. > select mail into outfile '/var/lib/mysql-files/subscribers.list' from simplenews_subscriptions;
    Query OK, 604 rows affected (0.00 sec)
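  3. (Worth the same duplicate check as for the filenames, eg:)
    > select mail,count(mail) from simplenews_subscriptions group by mail having count(mail)>1;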

Bookmarks

