
DOMDocument of class attribute to pick up innerhtml of an html file

How do I use the class attribute to pick up the innerHTML of an element in an HTML file at a given URL? I’ve been looking all over the internet for a clear explanation.

Where am I going wrong? (I’m not used to “->” and “=>”, and I don’t know what they represent or do.)

[code]
<?php
//should come back to here
function walkDOMForTagAndClass($element, $tagName, $class, $callback) {
    if ($element->nodeType !== 1) return false; // invalid element
    // we force case as XML vs. SGML are inconsistent on this
    $tagName = strtoupper($tagName);
    if ($walk = $element->firstChild) do {
        if (
            ($walk->nodeType == 1) &&
            (strtoupper($walk->nodeName) == $tagName) &&
            ($walk->attributes->getNamedItem('class') == $class)
        ) $callback($walk);
    } while (
        $walk = $walk->firstChild || $walk->nextSibling || (
            $walk->parentNode == $element ? false : $walk->parentNode.nextSibling
        )
    );
}
$file = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H1&t=KJV";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
walkDOMForTagAndClass(
    $doc,
    'div',
    //'columns tablet-8 small-10 tablet-order-3 small-order-2',
    'nocrumbs',
    function($file) {
        // do whatever it is you want with the matches here.
    }
);

/*$html = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H1&t=KJV";

$dom = new DOMDocument();
$dom->loadHTML($html);*/

//Evaluate Anchor tag in HTML
$xpath = new DOMXPath($doc);
$hrefs = $xpath->evaluate("/html/body//a");

for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');

    //remove and set target attribute
    $href->removeAttribute('target');
    $href->setAttribute("target", "_blank");

    $newURL = $url."/newurl";

    //remove and set href attribute
    $href->removeAttribute('href');
    $href->setAttribute("href", $newURL);
}

// save html
$file = $doc->saveHTML();

echo $file;
?>
[/code]

@Sempervivum, Jul 08, 2018: It seems to me this can be done much more easily with getElementsByTagName. Get the elements with the required tag name, then loop through them and filter by class:
<?php
//should come back to here
function walkDOMForTagAndClass($element, $tagName, $class, $callback)
{
    $elems = $element->getElementsByTagName($tagName);
    foreach ($elems as $ele) {
        if ($ele->getAttribute("class") == $class) {
            $callback($ele);
        }
    }
}
$file = "thread59.html";
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
walkDOMForTagAndClass(
    $doc,
    'div',
    'nocrumbs',
    function ($ele) {
        echo $ele->ownerDocument->saveHTML($ele);
    }
);
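On the two operators the question asks about, here is a minimal sketch. The sample markup and the $entry array are made up for illustration. `->` accesses a method or property of an object, and `=>` pairs a key with a value in an array or a foreach loop.

```php
<?php
// Hypothetical sample data, for illustration only.
$doc = new DOMDocument();
$doc->loadHTML('<div class="nocrumbs">Hello</div>'); // -> calls a method on the object in $doc

$div = $doc->getElementsByTagName('div')->item(0);   // calls can be chained with ->
$className = $div->getAttribute('class');            // method call on the element
$text = $div->nodeValue;                             // -> also reads a property

// => pairs a key with a value when building an array ...
$entry = ['strongs' => 'H1', 'lang' => 'Hebrew'];

// ... and names the key and value variables in a foreach loop.
foreach ($entry as $key => $value) {
    echo "$key: $value\n";
}
echo "$className / $text\n";
```

In short, `->` belongs to objects and `=>` belongs to arrays, so the two never substitute for each other.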
@gilgalbiblewhee (author), Jul 08, 2018: OK, so it shows the entire page. I wanted only the div whose class is "nocrumbs":
<?php
//should come back to here
function walkDOMForTagAndClass($element, $tagName, $class, $callback)
{
    $elems = $element->getElementsByTagName($tagName);
    foreach ($elems as $ele) {
        if ($ele->getAttribute("class") == $class) {
            $callback($ele);
        }
    }
}
$file = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H1&t=KJV"; //thread59.html
$doc = new DOMDocument();
$doc->loadHTMLFile($file);
walkDOMForTagAndClass(
    $doc,
    'div',
    'nocrumbs',
    function ($ele) {
        echo $ele->ownerDocument->saveHTML($ele);
    }
);
?>
@Sempervivum, Jul 08, 2018: The script works fine, but that div has a lot of HTML in it, including many other divs. The echo statement outputs all of this: a large part of the page, but not the complete one.
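To illustrate why the output looks like most of the page, here is a small sketch with made-up markup: saveHTML() on a node serializes that node together with everything nested inside it, while textContent returns only the text.

```php
<?php
// Hypothetical markup: an outer div with nested elements.
$doc = new DOMDocument();
$doc->loadHTML(
    '<div class="nocrumbs"><div class="bubHead"><h1>Lexicon</h1></div><p>More</p></div>'
);

$outer = $doc->getElementsByTagName('div')->item(0);

// Serializes the outer div and all of its descendants as markup.
$markup = $doc->saveHTML($outer);

// Concatenated text of all descendants, without any tags.
$text = $outer->textContent;

echo $markup, "\n";
echo $text, "\n";
```

Narrowing the output therefore means selecting a more deeply nested node before calling saveHTML().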
@Sempervivum, Jul 09, 2018: I received an email notification of a new post by the thread opener (TO) but can't find it here. Did the TO delete it after posting?
@root, Jul 09, 2018: It is easier to get permission from the site owner to use a feed, such as an Atom feed, that you can pull data from.

Generally, scraping a site is theft: you are neither the content owner nor do you have permission to use that data. Even if the site got the data from a legitimate source, you still do not have permission when you take it. You should really be asking for permission rather than just taking.
@gilgalbiblewhee (author), Jul 09, 2018: I'm trying to merge the code above with what you provided. You don't need to answer the previous post. But I do wonder how to select an inner class:
$file_link = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=".$let.$s."&t=KJV";
//$file_link = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H1&t=KJV"; //thread59.html
$doc = new DOMDocument();
$doc->loadHTMLFile($file_link);
walkDOMForTagAndClass(
    $doc,
    'div',
    'nocrumbs',
    function ($ele) {
        //echo $ele->ownerDocument->saveHTML($ele);
    }
);
walkDOMForTagAndClass(
    $doc,
    'div',
    'bubHead',
    function ($ele) {
        echo $ele->ownerDocument->saveHTML($ele);
    }
);

So if the first class is "nocrumbs", the second one is "bubHead". In the following steps I will need to look into the h1, span, and em tags.
@Sempervivum, Jul 09, 2018: Not sure if I understand correctly: class "bubHead" is nested inside "nocrumbs", and other "bubHead" elements outside of "nocrumbs" should not be taken into account?
@gilgalbiblewhee (author), Jul 09, 2018: @Sempervivum#1593736 yes exactly because I want to narrow down the search.
@Sempervivum, Jul 09, 2018: I see, now I understand that the code in your initial posting makes sense, including the naming of the parameter, $element instead of $doc.

I assume that a recursive search will be necessary, i.e. .bubHead is not always a direct child of .nocrumbs?

Note that getElementsByTagName is not limited to the complete document: DOMElement has it too, so it can be applied to a specific node. A recursive walk still gives more control over where the search stops.
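For reference, DOMElement has its own getElementsByTagName, which searches only the descendants of that element. The following sketch uses made-up markup and collects only the .bubHead divs nested inside a .nocrumbs div. Unlike a hand-written walk, it returns every matching descendant at any depth.

```php
<?php
// Hypothetical markup: one .bubHead outside and one inside .nocrumbs.
$html = '<body>'
    . '<div class="bubHead">outside</div>'
    . '<div class="nocrumbs"><div class="somediv"><div class="bubHead">inside</div></div></div>'
    . '</body>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$found = [];
foreach ($doc->getElementsByTagName('div') as $outer) {
    if ($outer->getAttribute('class') !== 'nocrumbs') {
        continue;
    }
    // Called on the element, so only descendants of this .nocrumbs are searched.
    foreach ($outer->getElementsByTagName('div') as $inner) {
        if ($inner->getAttribute('class') === 'bubHead') {
            $found[] = $inner->textContent;
        }
    }
}

print_r($found);
```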
@Sempervivum, Jul 09, 2018: First approach, check if it fits your needs:
<?php
ini_set('display_errors', '1');
error_reporting(E_ALL);
function walkDOMForTagAndClass($element, $tagName, $class, $callback)
{
    $children = $element->childNodes;
    foreach ($children as $child) {
        if ($child->nodeType == XML_ELEMENT_NODE) {
            if ($child->getAttribute("class") == $class
                && $child->tagName == $tagName) {
                // Matching element found, call callback function
                $callback($child);
            } else {
                walkDOMForTagAndClass($child, $tagName, $class, $callback);
            }
        }
    }
}
$file = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H1&t=KJV";
$file = "thread59.html"; // local test file, overrides the URL above
$doc = new DOMDocument();
set_error_handler(function () { /* ignore errors */});
$doc->loadHTMLFile($file);
restore_error_handler();
$body = $doc->getElementsByTagName('body')->item(0);
walkDOMForTagAndClass($body, 'div', 'nocrumbs',
    function ($ele) {
        walkDOMForTagAndClass($ele, 'div', 'bubHead', function($ele) {
            echo '.bubHead found, content: ' . $ele->nodeValue . '<br>';
        });
    }
);
(I set the error handler to an empty function so that I can use my debugger without trouble.)
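As an alternative to the empty error handler, libxml can collect parse errors internally instead of raising PHP warnings. This is only a sketch with deliberately sloppy, made-up markup; which messages appear depends on the installed libxml version.

```php
<?php
// Hypothetical sloppy markup that may provoke parser messages.
$html = '<div class="nocrumbs"><nonsense>text</nonsense></div><p>unclosed';

// Collect parse errors internally instead of emitting warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Fetch whatever was collected, then reset the buffer and the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

echo $loaded ? "loaded\n" : "failed\n";
foreach ($errors as $error) {
    echo trim($error->message), " (line {$error->line})\n";
}
```

Unlike a blanket error handler, this leaves other PHP errors visible and keeps the parser messages available for inspection.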

HTML for testing:
<!doctype html>
<html>

<head>
<meta charset="utf-8">
<title>Untitled Document</title>
</head>

<body>

<div class="nocrumbs">
    <div class="somediv">
        <div class="bubHead">Content of 1st bubHead</div>
    </div>
</div>
<span class="nocrumbs">Element 4 tag not matching</span>
<div class="somediv">
    <div class="nocrumbs">
        <div class="somediv">
            <div class="bubHead">Content of 2nd bubHead</div>
        </div>
    </div>
</div>
<div class="bla">Element 5 class not matching</div>
<div class="nocrumbs">
    <div class="bubHead">Content of 3rd bubHead</div>
</div>
</body>

</html>
@gilgalbiblewhee (author), Jul 09, 2018: Well eventually:

<div class="bubHead">
<div>
<h1>Lexicon :: Strong's H0 - <em>Not Available</em>
</h1>

The h1 tag will be selected, H0 extracted from it as the Strong's number, and the content of the em tag taken as the transliteration.
@Sempervivum, Jul 10, 2018: A few modifications should do the job:
<?php
ini_set('display_errors', '1');
error_reporting(E_ALL);
function walkDOMForTagAndClass($element, $tagName, $class, $callback)
{
    $children = $element->childNodes;
    foreach ($children as $child) {
        if ($child->nodeType == XML_ELEMENT_NODE) {
            // An empty $class matches any element with the given tag name
            if (($child->getAttribute("class") == $class || $class == '')
                && $child->tagName == $tagName) {
                // Matching element found, call callback function
                $callback($child);
            } else {
                walkDOMForTagAndClass($child, $tagName, $class, $callback);
            }
        }
    }
}
$file = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H1&t=KJV";
$file = "thread59.html"; // local test file, overrides the URL above
$doc = new DOMDocument();
set_error_handler(function () { /* ignore errors */});
$doc->loadHTMLFile($file);
restore_error_handler();
$body = $doc->getElementsByTagName('body')->item(0);
walkDOMForTagAndClass($body, 'div', 'nocrumbs',
    function ($ele) {
        walkDOMForTagAndClass($ele, 'div', 'bubHead', function($ele) {
            walkDOMForTagAndClass($ele, 'h1', '', function($ele) {
                $html = $ele->ownerDocument->saveHTML($ele);
                $html = str_replace(PHP_EOL, '', $html);
                echo 'h1 found, HTML: ' . $html . '<br>';
                preg_match('/:: Strong\'s (..).*<em>(.*)<\/em>/', $html, $matches);
                var_dump($matches);
            });
        });
    }
);
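The same two values can also be read from the DOM directly rather than by running a regular expression over serialized HTML. The sketch below uses the h1 sample from this thread; the pattern [HG] followed by digits is an assumption about how Strong's numbers are written, and unlike a two-character capture it also covers multi-digit numbers.

```php
<?php
// h1 sample from the thread, wrapped in its .bubHead container.
$html = '<div class="bubHead"><div>'
    . '<h1>Lexicon :: Strong\'s H0 - <em>Not Available</em></h1>'
    . '</div></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$h1 = $doc->getElementsByTagName('h1')->item(0);

// The transliteration is the text of the em element inside the h1.
$em = $h1->getElementsByTagName('em')->item(0);
$transliteration = $em !== null ? trim($em->textContent) : null;

// Assumed format: the letter H or G followed by one or more digits.
$strongs = preg_match('/Strong\'s\s+([HG]\d+)/', $h1->textContent, $m) === 1
    ? $m[1]
    : null;

echo "strongs: $strongs\n";
echo "transliteration: $transliteration\n";
```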
HTML used for testing:
<div class="nocrumbs">
    <div class="somediv">
        <div class="bubHead">Content of 1st bubHead</div>
    </div>
</div>
<span class="nocrumbs">Element 4 tag not matching</span>
<div class="somediv">
    <div class="nocrumbs">
        <div class="somediv">
            <div class="bubHead">
                <div>
                    <h1>Lexicon :: Strong's H0 -
                        <em>Not Available</em>
                    </h1>
                </div>
            </div>
        </div>
    </div>
    <div class="bla">Element 5 class not matching</div>
    <div class="nocrumbs">
        <div class="bubHead">Content of 3rd bubHead</div>
    </div>
</div>
@gilgalbiblewhee (author), Jul 10, 2018: It says:
Notice: Trying to get property of non-object in C:...update_outlinetest_domdocument19.php on line 6

Warning: Invalid argument supplied for foreach() in C:...test_domdocument19.php on line 7

Lines 6 and 7 are:
$children = $element->childNodes;
foreach ($children as $child) {

It doesn't write anything else on the page.
@Sempervivum, Jul 10, 2018: Works fine for me, even with your original site.

Which PHP version do you use? I have 7.09.
@gilgalbiblewhee (author), Jul 10, 2018: XAMPP for Windows 5.6.15 on my laptop
@Sempervivum, Jul 10, 2018:
- phpinfo(); will display the PHP version.
- Try adding some debug info:
var_dump($element);
$children = $element->childNodes;
var_dump($children);
foreach ($children as $child) {
@gilgalbiblewhee (author), Jul 10, 2018: @Sempervivum#1593764 PHP Version 5.6.15 is what I have.

<?php
ini_set('display_errors', '1');
error_reporting(E_ALL);
function walkDOMForTagAndClass($element, $tagName, $class, $callback)
{
    var_dump($element);
    $children = $element->childNodes;
    foreach ($children as $child) {
        if ($child->nodeType == XML_ELEMENT_NODE) {
            if (($child->getAttribute("class") == $class || $class == '')
                && $child->tagName == $tagName) {
                // Matching element found, call callback function
                $callback($child);
            } else {
                walkDOMForTagAndClass($child, $tagName, $class, $callback);
            }
        }
    }
}
//$file = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H1&t=KJV";
$file = "test_thishtml7.html";
$doc = new DOMDocument();
set_error_handler(function () { /* ignore errors */});
$doc->loadHTMLFile($file);
restore_error_handler();
$body = $doc->getElementsByTagName('body')->item(0);
walkDOMForTagAndClass($body, 'div', 'nocrumbs',
    function ($ele) {
        walkDOMForTagAndClass($ele, 'div', 'bubHead', function($ele) {
            walkDOMForTagAndClass($ele, 'h1', '', function($ele) {
                $html = $ele->ownerDocument->saveHTML($ele);
                $html = str_replace(PHP_EOL, '', $html);
                echo 'h1 found, HTML: ' . $html . '<br>';
                preg_match('/:: Strong\'s (..).*<em>(.*)<\/em>/', $html, $matches);
                var_dump($matches);
            });
        });
    }
);
phpinfo();
?>

    var_dump($element); showed NULL
@Sempervivum, Jul 11, 2018: Then it's no wonder that it doesn't work. Does your HTML have a page structure including body? As visible in the code, body is the starting point for the search.
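A guard before the walk turns this case into a clear message instead of a notice deep inside the function. The helper name findBody is hypothetical. One possible reason for the NULL, which the thread does not confirm, is that the file did not load at all: the empty error handler would hide that warning.

```php
<?php
/**
 * Hypothetical helper: returns the <body> element, or null if there is none.
 */
function findBody(DOMDocument $doc)
{
    $body = $doc->getElementsByTagName('body')->item(0);
    return $body instanceof DOMElement ? $body : null;
}

// Nothing loaded, as after a failed loadHTMLFile().
$empty = new DOMDocument();

// A fragment without body tags; the HTML parser adds html and body itself.
$loaded = new DOMDocument();
$loaded->loadHTML('<div class="nocrumbs">x</div>');

foreach (['empty' => $empty, 'loaded' => $loaded] as $label => $doc) {
    $body = findBody($doc);
    echo $label, ': ', $body === null ? 'no body, skipping walk' : 'body found', "\n";
}
```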
@gilgalbiblewhee (author), Jul 11, 2018: No, you're right. This test page doesn't have any body tags. I need to add them.
<?php
ini_set('display_errors', '1');
error_reporting(E_ALL);
function walkDOMForTagAndClass($element, $tagName, $class, $callback)
{
    var_dump($element);
    $children = $element->childNodes;
    foreach ($children as $child) {
        if ($child->nodeType == XML_ELEMENT_NODE) {
            if (($child->getAttribute("class") == $class || $class == '')
                && $child->tagName == $tagName) {
                // Matching element found, call callback function
                $callback($child);
            } else {
                walkDOMForTagAndClass($child, $tagName, $class, $callback);
            }
        }
    }
}
//$file = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H1&t=KJV";
$file = "test_thishtml7.html";
$doc = new DOMDocument();
set_error_handler(function () { /* ignore errors */});
$doc->loadHTMLFile($file);
restore_error_handler();
$body = $doc->getElementsByTagName('body')->item(0);
walkDOMForTagAndClass($body, 'div', 'nocrumbs',
    function ($ele) {
        walkDOMForTagAndClass($ele, 'div', 'bubHead', function($ele) {
            walkDOMForTagAndClass($ele, 'h1', '', function($ele) {
                $html = $ele->ownerDocument->saveHTML($ele);
                $html = str_replace(PHP_EOL, '', $html);
                echo 'h1 found, HTML: ' . $html . '<br>';
                preg_match('/:: Strong\'s (..).*<em>(.*)<\/em>/', $html, $matches);
                var_dump($matches);
            });
        });
    }
);
//phpinfo();
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title></title>
</head>

<body>
</body>
</html>

@Sempervivum, Jul 12, 2018: And did it work after adding the body?
@gilgalbiblewhee (author), Jul 12, 2018: @Sempervivum#1593847 When I switched the link to the actual site it did work. But I'm trying to retrieve the rest of the info, such as:

The Hebrew or Greek word found in the <h6 class="lexTitleHb">אָב</h6> tags (if it's Greek the class might be lexTitleGk);

Root Word (Etymology), which is the description found in:
<div class="columns small-12 table-styles">
<div class="lexicon-label">Root Word (Etymology)</div>
<div> A root </div>
</div>

and the Outline of Biblical Usage:
<div id="outlineBiblical" class="__hidden"><div> <ol><li><p>father of an individual</li><li><p>of God as father of his people</li><li><p>head or founder of a household, group, family, or clan</li><li><p>ancestor<ol><li><p>grandfather, forefathers &#8212; of person</li><li><p>of people</li></ol></li><li><p>originator or patron of a class, profession, or art</li><li><p>of producer, generator (fig.)</li><li><p>of benevolence and protection (fig.)</li><li><p>term of respect and honour</li><li><p>ruler or chief (spec.)</li></ol></div></div>

And the Strong’s Definitions:
<div class="lexStrongsDef"> <span class="Hb">אָב</span> <span class="strgtrans">ʼâb,</span> awb; a primitive word; father, in a literal and immediate, or figurative and remote application:&#8212;chief, (fore-) father(-less), <span class="strongsEcks"><a class="lexpop" rel="lexicon.strongsLegend">&#215;</a></span> patrimony, principal. Compare names in &#39;Abi-&#39;.</div>

This is what I have:
$file_link = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=".$let.$s."&t=KJV";
//$file_link = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H1&t=KJV"; //thread59.html
$doc = new DOMDocument();
set_error_handler(function () { /* ignore errors */});
$doc->loadHTMLFile($file_link);
restore_error_handler();
$body = $doc->getElementsByTagName('body')->item(0);
walkDOMForTagAndClass($body, 'div', 'nocrumbs',
    function ($ele) {
        walkDOMForTagAndClass($ele, 'div', 'bubHead', function($ele) {
            walkDOMForTagAndClass($ele, 'h1', '', function($ele) {
                $html = $ele->ownerDocument->saveHTML($ele);
                $html = str_replace(PHP_EOL, '', $html);
                echo 'h1 found, HTML: ' . $html . '<br>';
                preg_match('/:: Strong\'s (..).*<em>(.*)<\/em>/', $html, $matches);
                //var_dump($matches);
                $blbstrongs = $matches[1];
                $blbtransliteration = $matches[2];
                echo "blbstrongs: ".$blbstrongs."<br />\n";
                echo "blbtransliteration: ".$blbtransliteration."<br />\n";
                //array_push($blbgroup, array($blbstrongs, $blbtransliteration));
            });
        });
    }
);

//var_dump($blbgroup);
// to pick up the rest of the information
walkDOMForTagAndClass($body, 'div', 'nocrumbs',
    function ($ele) {
        walkDOMForTagAndClass($ele, 'div', 'columns small-12 table-styles', function($ele) {
            walkDOMForTagAndClass($ele, 'div', '', function($ele) {
                $html = $ele->ownerDocument->saveHTML($ele);
                $html = str_replace(PHP_EOL, '', $html);
                echo 'div found, HTML: ' . $html . '<br>';
                preg_match_all("#<\b(div)\b[^>]*>(.*?)</\b(div)\b>#si", $html, $divmatches, PREG_SET_ORDER);
                //preg_match('<div>(.*)</div>/', $html, $divmatches);
                echo "-----<br /><br />\n";
                echo "divmatches <pre style=\"color: blue; font-weight: bold;\">";
                var_dump($divmatches);
                echo "</pre> ";
                echo "<br /><br />\n";
                echo "--------------------<br /><br />\n";

                $blbdescr = $divmatches[0][2];
                //$blbtransliteration = $matches[2];
                echo "<span style=\"color: purple; font-weight: bold;\">blbdescr: ".$blbdescr."</span><br />\n";
                //echo "blbtransliteration: ".$blbtransliteration."<br />\n";
                //array_push($blbgroup, array($blbstrongs, $blbtransliteration));
            });
        });
    }
);

//var_dump($blbgroup);
@Sempervivum, Jul 12, 2018: The Hebrew/Greek word seems to be easy, as the corresponding h6 occurs only once in the complete document:
walkDOMForTagAndClass($body, 'h6', 'lexTitleGk',
    function ($ele) {
        echo 'h6 found, content greek: ' . $ele->nodeValue . '<br>';
    });
walkDOMForTagAndClass($body, 'h6', 'lexTitleHb',
    function ($ele) {
        echo 'h6 found, content hebrew: ' . $ele->nodeValue . '<br>';
    });


For the root word and Strong's definition I worked out this:
walkDOMForTagAndClass($body, 'div', 'nocrumbs',
    function ($ele) {
        walkDOMForTagAndClass($ele, 'div', 'columns small-12 table-styles', function ($ele) {
            $html = $ele->ownerDocument->saveHTML($ele);
            var_dump($html);
            // check for root word
            preg_match('/<div class="lexicon-label">Root Word \(Etymology\)<\/div>.*<div>(.*)<\/div>/isU', $html, $matches);
            var_dump($matches);
            if (count($matches) == 2) {
                echo 'Root Word: ' . $matches[1];
            }
            // check for strongs definition
            preg_match('/<div class="lexStrongsDef">(.*)<\/div>/isU', $html, $matches);
            var_dump($matches);
            if (count($matches) == 2) {
                echo 'strongs definition: ' . $matches[1];
            }

        });
    })
Check if it fits your needs.
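The root word can also be located with DOMXPath, which the first post already uses for anchors. This is a sketch against the sample markup quoted earlier in the thread; the real page may be structured differently. The query selects the first div that follows the label div.

```php
<?php
// Sample markup quoted earlier in the thread.
$html = '<div class="columns small-12 table-styles">'
    . '<div class="lexicon-label">Root Word (Etymology)</div>'
    . '<div> A root </div>'
    . '</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Find the label div by class and text, then take its next sibling div.
$nodes = $xpath->query(
    '//div[@class="lexicon-label"][normalize-space(.)="Root Word (Etymology)"]'
    . '/following-sibling::div[1]'
);

$rootWord = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

echo "Root Word: $rootWord\n";
```

Because the query works on the parsed tree, it does not depend on how the HTML happens to be serialized or line-wrapped.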
@gilgalbiblewhee (author), Jul 12, 2018: @Sempervivum#1593856 the last double brackets seem to need a semicolon, otherwise it throws an error.

I think it does work well, although what worries me is the part of my posting above that is displayed in blue or purple in span tags (divmatches).

"columns small-12 table-styles" occurs several times. The 1st, 3rd and 4th occurrences have to be included (leaving out the one with class "columns small-12 table-styles __hidden" in between them).

I also wonder whether it would be a good idea to store the tags in the db table, because the Outline of Biblical Usage contains ul/ol and li tags, and the description has a tags with interesting title attributes defining the word.
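If the markup is to be kept, an "innerHTML" can be assembled by serializing each child of the selected node. The helper name innerHTML is hypothetical and the markup is a shortened version of the outline sample above. Whether to store the tags at all remains a design decision for the database.

```php
<?php
/**
 * Hypothetical helper: serializes the children of $node, i.e. its innerHTML,
 * without the tag of $node itself.
 */
function innerHTML(DOMNode $node)
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Shortened version of the outline markup from the thread.
$doc = new DOMDocument();
$doc->loadHTML(
    '<div id="outlineBiblical"><ol><li>father of an individual</li><li>ancestor</li></ol></div>'
);

$outline = $doc->getElementsByTagName('div')->item(0);
$markup = innerHTML($outline);

echo $markup, "\n";
```

The serializer may insert line breaks between block elements, so the result should be compared or stored with that in mind.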
@gilgalbiblewhee (author), Jul 15, 2018: https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H1&t=KJV

I want to extract:

- the transliteration (found together with the Strong's number, inside the em tag; handled in the code below);

- the root word etymology (which is the description);

- the Outline of Biblical Usage.

I'm not sure where to go from here:

    [php]

//should come back to here
/*function walkDOMForTagAndClass($element, $tagName, $class, $callback){
    $elems = $element->getElementsByTagName($tagName);
    foreach ($elems as $ele) {
        if ($ele->getAttribute("class") == $class) {
            $callback($ele);
        }
    }
}*/

function walkDOMForTagAndClass($element, $tagName, $class, $callback)
{
    /* echo "-----<br /><br />\n";
    echo "element <pre style=\"color: red; font-weight: bold;\">";
    var_dump($element);
    echo "</pre> ";
    echo "<br /><br />\n";
    echo "-----<br /><br />\n";*/

    $children = $element->childNodes;
    foreach ($children as $child) {
        if ($child->nodeType == XML_ELEMENT_NODE) {
            if (($child->getAttribute("class") == $class || $class == '')
                && $child->tagName == $tagName) {
                // Matching element found, call callback function
                $callback($child);
            } else {
                walkDOMForTagAndClass($child, $tagName, $class, $callback);
            }
        }
    }
}

/*****************************************************************/
$total_strongs = array();
$total_transliteration = array();
$total_outline = array();
$total_etymologyTransliteration = array();
$total_description = array();
$total_etymologyStrongs = array();
$total_etymologyOutlines = array();
$total_etym_desc = array();
$total_heb_word = array();
$total_etym_heb_word = array();
$blbgroup = array();

for($s=$startstrong;$s<($lim[1]+1);$s++){
$all_etymologyTransliteration = array();
$all_etymologyStrongs = array();
$all_greek_blb = array();
$all_description_blb = array();
$all_etymoutlines = array();

$file_link = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=".$let.$s."&t=KJV";
//$file_link = "https://www.blueletterbible.org/lang/lexicon/lexicon.cfm?Strongs=H1&t=KJV"; //thread59.html
$doc = new DOMDocument();
set_error_handler(function () { /* ignore errors */});
$doc->loadHTMLFile($file_link);
restore_error_handler();
$body = $doc->getElementsByTagName('body')->item(0);
walkDOMForTagAndClass($body, 'div', 'nocrumbs',
    function ($ele) {
        walkDOMForTagAndClass($ele, 'div', 'bubHead', function($ele) {
            walkDOMForTagAndClass($ele, 'h1', '', function($ele) {
                $html = $ele->ownerDocument->saveHTML($ele);
                $html = str_replace(PHP_EOL, '', $html);
                //echo 'h1 found, HTML: ' . $html . '<br>';
                preg_match('/:: Strong\'s (..).*<em>(.*)<\/em>/', $html, $matches);
                //var_dump($matches);
                $blbstrongs = $matches[1];
                $blbtransliteration = $matches[2];
                echo "blbstrongs: ".$blbstrongs."<br />\n";
                echo "blbtransliteration: ".$blbtransliteration."<br />\n";
                //array_push($blbgroup, array($blbstrongs, $blbtransliteration));
            });
        });
    }
);

//var_dump($blbgroup);
// to pick up the rest of the information
walkDOMForTagAndClass($body, 'div', 'nocrumbs',
    function ($ele) {
        walkDOMForTagAndClass($ele, 'div', 'columns small-12 table-styles', function($ele) {
            walkDOMForTagAndClass($ele, 'div', '', function($ele) {
                $html = $ele->ownerDocument->saveHTML($ele);
                $html = str_replace(PHP_EOL, '', $html);
                //echo 'div found, HTML: ' . $html . '<br>';
                //"#<\b(div)\b[^>]*>(.*?)</\b(div)\b>#si"
                preg_match_all("#<\b(div)\b[^>]*>(.*?)</\b(div)\b>#si", $html, $divmatches, PREG_SET_ORDER);
                //preg_match('<div>(.*)</div>/', $html, $divmatches);
                echo "-----<br /><br />\n";
                echo "divmatches: <pre style=\"color: blue; font-weight: bold;\">";
                var_dump($divmatches);
                echo "</pre> ";
                echo "<br /><br />\n";
                echo "--------------------<br /><br />\n";
                $blbdescr = $divmatches[0][2];
                //$blbtransliteration = $matches[2];
                echo "<span style=\"color: purple; font-weight: bold;\">blbdescription: ".$blbdescr."</span><br />\n";
                //echo "blbtransliteration: ".$blbtransliteration."<br />\n";
                //array_push($blbgroup, array($blbstrongs, $blbtransliteration));
            });
        });
    }
);

//var_dump($blbgroup);
walkDOMForTagAndClass($body, 'h6', 'lexTitleGk',
    function ($ele) {
        echo 'h6 found, content greek: ' . $ele->nodeValue . '<br>';
        echo "<hr /><br /><br />\n";
    });
walkDOMForTagAndClass($body, 'h6', 'lexTitleHb',
    function ($ele) {
        echo 'h6 found, content hebrew: ' . $ele->nodeValue . '<br>';
        echo "<hr /><br /><br />\n";
    });
walkDOMForTagAndClass($body, 'div', 'nocrumbs',
    function ($ele) {
        walkDOMForTagAndClass($ele, 'div', 'columns small-12 table-styles', function ($ele) {
            $html = $ele->ownerDocument->saveHTML($ele);
            var_dump($html);
            // check for root word
            preg_match('/<div class="lexicon-label">Root Word \(Etymology\)<\/div>.*<div>(.*)<\/div>/isU', $html, $etymmatches);
            echo "<pre style=\"color: red;\">";
            var_dump($etymmatches);
            echo "</pre>";
            if (count($etymmatches) == 2) {
                echo "Root Word (etymmatches): <span style=\"font-weight: bold;\">".$etymmatches[1]."</span>";
                echo "<hr /><br /><br />\n";
            }
            // check for strongs definition
            preg_match('/<div class="lexStrongsDef">(.*)<\/div>/isU', $html, $defmatches);
            echo "<pre style=\"color: green;\">";
            var_dump($defmatches);
            echo "</pre>";
            if (count($defmatches) == 2) {
                echo "strongs definition (defmatches): <span style=\"font-weight: bold;\">".$defmatches[1]."</span>";
                echo "<hr /><br /><br />\n";
            }
        });
    });

    [/php]
@root, Jul 16, 2018: Instead of stealing the information, have you tried contacting the data owner to see if they would be willing to allow you to access that data, rather than the usual scrape and run?
@gilgalbiblewhee (author), Jul 16, 2018: The Strong's concordance is public domain. Many Bible websites use the Strong's Concordance.
@root, Jul 16, 2018: You do not seem to understand the meaning of data ownership.

If someone or a group or a company spent thousands of hours of time and effort to produce a public domain resource... THEY OWN THE DATA RIGHTS because they put the work into archiving it.

In a previous job, one of many office desk jobs I did for a local authority (as in local government), we ran an archive and I used to get many research requests. Accessing various resources and databases, EVEN internally, COST MONEY.

Daft as it sounds, when one department provides a service to another, there is a recharge fee which transfers from that department's budget into the department that did the admin work or provided the service, to help pay for that service to be maintained.

Despite it being a "PUBLIC RESOURCE" and all records being public domain, the owners of the data attached a search fee to everything.

It doesn't matter what the resource is: if that site were offering its data freely, it would have a feed for others to access that would return the data being requested, much like blogs that have Atom feeds.

So ask. It's NOT your data; despite being public domain, that data has an owner.
@gilgalbiblewhee (author), Jul 16, 2018: I see your point. Maybe it's not worth the hassle to get it from a site then. After all they do offer free Strong's elsewhere:

http://hopeinjesus.com.au/wp-content/uploads/2014/01/Strongs-Exhaustive-Concordance.pdf

But I don't think there were that many out there at some point.
@root, Jul 16, 2018: I gave a friend an example of how to get something for free when his searches only returned paid sites.

By searching carefully and using Google tools to craft a filter, I was able to obtain the desired item for free.

The power of a search engine is not in making a simple search and then reading through the returned results; it is in using filters to remove the stuff you don't want to see.

So the harder you look, the more likely you are to get the thing you want, for free and without anyone getting hurt.

Some public domain material is easily accessible from archives that were set up to be free to start with. Often this data is used and amalgamated with other data sources, and that archivist will likely broker the data for a fee.

I often found sites that wanted to sell me information and data that the government provides for free online as a matter of public record. Not all data has to be paid for; some of it has to be available as a matter of public record.

@gilgalbiblewhee (author), Jul 16, 2018: @root#1593939 I find that's kind of a dilemma with the internet. Let's say you write a book and want to sell it because you put effort into it, but there might be a free PDF of your book somewhere. In a way the info you provide might be useful, and the free ebook might be a blessing to someone. But it's hard to make money these days when it comes to information and the internet.
@root, Jul 16, 2018: Nope, copyright infringement is in play there.

Data has no copyright when it is public domain material. It is much like the myth people have about privacy and death records: once you are gone, everything about you in public records is available to people, as there are no privacy concerns once you're dead and buried.

Data theft is still theft, even if it is public domain material, if you have not sourced and paid for that data from the originator or controller of the source, which in many cases will be your local council or government or the data agency appointed to regulate it.

    Copyright is a civil matter.
@gilgalbiblewhee (author), Jul 17, 2018: How then can I obtain copyright on what I published?
@Sempervivum, Jul 17, 2018: I agree with root: you should consult the owner of this dictionary and ask for permission to grab it the way you do. And in my opinion you should offer a donation.