How to extract content between two delimiters in PHP

Hi,

Here’s a function that is useful when you need to extract the content between two delimiters. For instance, you may need to extract content from a page that a robot connects to.

<?php
/*
Credits: Bit Repository
URL: http://www.bitrepository.com/web-programming/php/extracting-content-between-two-delimiters.html
*/

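function extract_unit($string, $start, $end)
{
    // Position of the first (case-insensitive) occurrence of $start
    $pos = stripos($string, $start);

    if ($pos === false) {
        return ''; // $start was not found
    }

    $str = substr($string, $pos);

    $str_two = substr($str, strlen($start));

    $second_pos = stripos($str_two, $end);

    if ($second_pos === false) {
        return ''; // $end was not found after $start
    }

    $str_three = substr($str_two, 0, $second_pos);

    $unit = trim($str_three); // remove surrounding whitespace

    return $unit;
}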

// Usage example:

$text = 'PHP is an acronym for "PHP: Hypertext Preprocessor".';

$unit = extract_unit($text, 'an', 'for');

// Outputs: acronym
echo $unit;
?>

How does it work?

First, we use stripos() to find the numeric position of the first occurrence of the needle ($start) in the haystack ($string); the search is case-insensitive. In our example, 7 characters precede ‘an’, so $pos is 7.

$pos = stripos($string, $start);
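
A detail worth keeping in mind at this step: stripos() ignores case and returns false (not -1) when nothing is found, while 0 is a legitimate position. The short sketch below illustrates this on the example sentence. The helper contains_ci() is hypothetical and introduced only for illustration; it is not part of the original snippet.

```php
<?php
$text = 'PHP is an acronym for "PHP: Hypertext Preprocessor".';

var_dump(stripos($text, 'an'));   // int(7)
var_dump(stripos($text, 'php'));  // int(0): a match at the very start
var_dump(strpos($text, 'php'));   // bool(false): strpos() is case-sensitive
var_dump(stripos($text, 'xyz'));  // bool(false): not found

// Hypothetical helper: makes the "found or not" check explicit.
// Comparing with !== false matters because position 0 is falsy.
function contains_ci($haystack, $needle)
{
    return stripos($haystack, $needle) !== false;
}
```

Because position 0 evaluates to false in a loose comparison, a test such as `if (stripos(...))` would wrongly treat a match at the start of the string as "not found".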

Next, we use $pos to take the part of $string that runs from that character to the end:

an acronym for "PHP: Hypertext Preprocessor".

$str = substr($string, $pos);

Remove ‘an’ from the start of the newly created string, which leaves a leading space:

 acronym for "PHP: Hypertext Preprocessor".

$str_two = substr($str, strlen($start));

Determine the number of characters from the beginning of $str_two until ‘for’ (9 in this case):

$second_pos = stripos($str_two, $end);

Now use this number to get the content from the beginning of the string up to ‘for’:

$str_three = substr($str_two, 0, $second_pos);

The last variable now equals ‘ acronym ’. Finally, let’s strip the whitespace from the beginning and end of the string:

acronym

$unit = trim($str_three); // remove whitespaces
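
The steps above can also be condensed by working with offsets instead of intermediate strings. The following is only a sketch under assumptions: the name extract_unit_compact() and the choice to return null when a delimiter is missing are illustrative and do not come from the original snippet.

```php
<?php
// Hypothetical compact variant of the walkthrough above.
// Returns null when either delimiter is missing.
function extract_unit_compact($string, $start, $end)
{
    $pos = stripos($string, $start);
    if ($pos === false) {
        return null;
    }

    // Index of the first character after the opening delimiter.
    $from = $pos + strlen($start);

    // Look for the closing delimiter only after the opening one.
    $to = stripos($string, $end, $from);
    if ($to === false) {
        return null;
    }

    return trim(substr($string, $from, $to - $from));
}

$text = 'PHP is an acronym for "PHP: Hypertext Preprocessor".';

echo extract_unit_compact($text, 'an', 'for'), "\n";     // acronym
var_dump(extract_unit_compact($text, 'an', 'missing'));  // NULL
```

Returning null rather than an empty string lets the caller tell "delimiter not found" apart from "nothing between the delimiters".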

If you have any comments or suggestions regarding this snippet, please post them.


Comments

  1. Flo says

    Hi, nice script!
    But I have an additional question… What if I want all instances of $unit in an array, for example when my string contains more than one occurrence of $start and $end?
    example:
    $string = “This is a string“;
    $start = ““;
    $end = ““;

    desired output:
    array (“This”, “string”);

  2. Flo says

    The bold tags were stripped from my comment…
    I meant:
    $string = “[bold]This is a string[/bold]“;
    $start = “[bold]“;
    $end = “[/bold]“;

  3. Gabriel says

    Here’s the solution:

    <?php
    $string = "<b>This</b> is a <b>string</b>";
    $start = "<b>";
    $end = "<\/b>";
    
    // Regexp for the extractor
    $regexp = '/'.$start.'(.*)'.$end.'/Ui';
    
    preg_match_all($regexp, $string, $out, PREG_PATTERN_ORDER);
    
    $desired = $out[1];
    
    echo "<pre>"; print_r($desired); echo "</pre>";
    
    /*
    Output:
    
    Array
    (
        [0] => This
        [1] => string
    )
    */
    ?>

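    Notice the backslash before the slash in the closing bold tag: <\/b>. The slash is also the regex delimiter here, so it has to be escaped; preg_quote() can do that for you.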

    Helpful address: http://www.php.net/manual/ro/function.preg-quote.php

  4. Flo says

    Wow, this is a great solution. I hoped that it would work without the use of a regular expression, but it works!
    Now I want to create a sort of database in an associative array. For example, data between tags A and B needs to go in the same record as data extracted from between tags C and D, which should result in something like this:
    Array (“data from A and B” => “from C and D”,
    “data from A and B” => “from C and D”,
    and so on…
    )

    I now have the following:

    <?php
    function test($string, $start1, $end1, $start2, $end2)
    {
        // Regexp for the extractor
        $regexp1 = '/'.$start1.'(.*)'.$end1.'/Ui';
        $regexp2 = '/'.$start2.'(.*)'.$end2.'/Ui';

        preg_match_all($regexp1, $string, $out1, PREG_PATTERN_ORDER);
        preg_match_all($regexp2, $string, $out2, PREG_PATTERN_ORDER);

        $combined = array_combine($out1[1], $out2[1]);

        echo ""; print_r($combined); echo "";

        // checking one row
        print_r(explode(",", $combined[3]));
    }
    ?>

    But unfortunately it doesn’t work… Thanks in advance for your help! It’s appreciated big time!

  5. Gabriel says

    Array (”data from A and B” => “from C and D”,
    “data from A and B” => “from C and D”,

    Are you sure that your example (with dupes) is a good one?

  6. Flo says

    yeah for example:

    $string = “<b>Street 23</b> – Paris
    <b>Street 43</b> – Berlin
    <b>Street 453</b> – London”;

    $string_1 = “<b>”;
    $string_1 = “”;
    $string_3 = “”;
    $string_4 = “”;

    required result:
    Array (“Street 23” => “Paris”,
    “Street 43” => “Berlin”,
    “Street 453” => “London”)

    Thanks

  7. Gabriel says

    This is a way to get the desired result:

    <?php
    $string = '<b>Street 23</b> - Paris
    <b>Street 43</b> - Berlin
    <b>Street 453</b> - London';
    
    $array = explode("<b>", $string);
    
    $desired_array = array();
    
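    foreach ($array as $value)
    {
        $value = trim(strip_tags($value));

        // skip empty chunks and chunks without a delimiter
        if ($value !== '' && strpos($value, "-") !== false)
        {
            // hyphen is our delimiter; the limit of 2 keeps any later hyphens in the city
            list($street, $city) = explode("-", $value, 2);
            $desired_array[trim($street)] = trim($city); // remove whitespaces
        }
    }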
    
    echo "<pre>"; print_r($desired_array); echo "</pre>";
    
    /*
    Array
    (
        [Street 23] => Paris
        [Street 43] => Berlin
        [Street 453] => London
    )
    */
    ?>

    PS: Consider using &lt; for < and &gt; for > when you write a new comment.

  8. Maximus says

    My question: I use the following:

    $q = 0;
    function extract_unit($string, $start, $end)
    {
    $pos = stripos($string, $start);
    $str = substr($string, $pos);
    $str_two = substr($str, strlen($start));
    $second_pos = stripos($str_two, $end);
    $str_three = substr($str_two, $q, $second_pos);
    $unit = trim($str_three); // remove whitespaces
    return $unit;
    }

    $h2 = extract_unit($pagina, ”, ”);

    to extract the h2 header from a webpage. $pagina is the string that holds the source code of the webpage. The only problem right now is that it extracts only the first h2 header from the page. I tried a for loop, but then I get the same h2 header multiple times.

    I have a page with 5 h2 headers, and I want to extract all of them and put them in an array, for example.

    Does anybody know how?

    Maximus
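
One possible way to collect every match, which is what the last question and Flo's first question both ask for, is to resume the search right after the previous closing delimiter by using the third (offset) argument of stripos(). The sketch below makes several assumptions: the function names extract_all_units() and extract_all_units_regex(), the sample $page string and the h2 delimiters are all illustrative and not taken from the original post. The second function shows a regex-based equivalent in which preg_quote() handles the escaping.

```php
<?php
// Hypothetical helper: collects every chunk between $start and $end
// without regular expressions.
function extract_all_units($string, $start, $end)
{
    $units = array();

    if ($start === '' || $end === '') {
        return $units; // nothing sensible to search for
    }

    $offset = 0;

    while (($pos = stripos($string, $start, $offset)) !== false) {
        $from = $pos + strlen($start);   // first character after $start
        $to = stripos($string, $end, $from);

        if ($to === false) {
            break;                       // opening delimiter without a closing one
        }

        $units[] = trim(substr($string, $from, $to - $from));
        $offset = $to + strlen($end);    // resume after the closing delimiter
    }

    return $units;
}

// Hypothetical regex-based equivalent. preg_quote() escapes regex
// metacharacters and, given the second argument, the '/' delimiter too.
function extract_all_units_regex($string, $start, $end)
{
    $regexp = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/is';
    preg_match_all($regexp, $string, $out, PREG_PATTERN_ORDER);

    return array_map('trim', $out[1]);
}

// Sample input invented for this example.
$page = '<h2>First</h2><p>text</p><h2> Second </h2>';

print_r(extract_all_units($page, '<h2>', '</h2>'));
print_r(extract_all_units_regex('[bold]This[/bold] is a [bold]string[/bold]', '[bold]', '[/bold]'));
```

Calling the single-match function in a loop returns the same header each time because every call starts searching from the beginning of the string; advancing the offset is what makes the loop move on to the next match.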
